The UNIXsystem has many utilities (e.g. grep, awk, lex, egrep, fgrep, ...) to search through files of text, but most of them are based on a linear scan through the entire file, using some deterministic automaton. This memorandum discusses a program which uses inverted indexes %A D. Knuth %T The Art of Computer Programming: Vol. 3, Sorting and Searching %I Addison-Wesley %C Reading, Mass. %D 1977 %O See section 6.5. and can thus be used on much larger data bases.
As with any indexing system, of course, there are some disadvantages; once an index is made, the files that have been indexed can not be changed without remaking the index. Thus applications are restricted to those making many searches of relatively stable data. Furthermore, these programs depend on hashing, and can only search for exact matches of whole keywords. It is not possible to look for arithmetic or logical expressions (e.g. ``date greater than 1970'') or for regular expression searching such as that in lex. lex lesk cstr
Currently there are two uses of this software, the refer preprocessor to format references, and the lookall command to search through all text files on the UNIXsystem.
The remaining sections of this memorandum discuss the searching programs and their uses. Section 2 explains the operation of the searching algorithm and describes the data collected for use with the lookall command. The more important application, refer has a user's description in section 3. Section 4 goes into more detail on reference files for the benefit of those who wish to add references to data bases or write new troff macros for use with refer. The options to make refer collect identical citations, or otherwise relocate and adjust references, are described in section 5. The UNIXmanual sections for refer, lookall, and associated commands are attached as appendices.