- We create a secondary index (using
- We index by word; the value is a list of hits. Each hit is a record ID (the primary key) and a list of fields, and each field entry carries the word's offsets within that field.
- We probably need a reverse index of record ID to the list of unique words (and maybe a count). We can use this to handle updated/deleted records.
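The two indexes above can be sketched as plain in-memory structures. This is a minimal illustration, not the actual implementation; the names (`Hit`, `word_index`, `record_words`) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    record_id: int                # primary key of the record
    fields: dict[str, list[int]]  # field name -> word offsets within that field

# Forward (secondary) index: word -> list of hits.
word_index: dict[str, list[Hit]] = {}

# Reverse index: record ID -> set of unique words in that record,
# used when handling updated/deleted records.
record_words: dict[int, set[str]] = {}
```

In a real system these would live in persistent storage keyed by word and by record ID, but the shape of the data is the same.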
- Extract all words from the new version of the record. We end up with a list of unique words; for each word, a list of fields/offsets.
- Look up the record to find its current list of words (the old version). NOTE: We don't need the old version of the record itself, just the record-to-words index.
- For all words that are in the old version but not the new version, delete this record from the word index.
- For all words in the new version, insert or update this record in the word index.
- Update the record-to-words index.
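The update steps above can be sketched as one function. Hits are represented as `(record_id, fields)` tuples for brevity; the function name and argument shapes are assumptions, not from the source:

```python
def update_record(record_id, new_words, word_index, record_words):
    """new_words maps word -> {field_name: [offsets]}, extracted from
    the new version of the record."""
    # Old word set comes from the record-to-words index,
    # not from the old version of the record itself.
    old_words = record_words.get(record_id, set())
    # Words in the old version but not the new: delete this record's hits.
    for word in old_words - set(new_words):
        word_index[word] = [h for h in word_index[word] if h[0] != record_id]
        if not word_index[word]:
            del word_index[word]
    # Words in the new version: insert or update this record's hit.
    for word, fields in new_words.items():
        hits = [h for h in word_index.get(word, []) if h[0] != record_id]
        hits.append((record_id, fields))
        word_index[word] = hits
    # Finally, update the record-to-words index.
    record_words[record_id] = set(new_words)
```

The same function covers first-time inserts, since a record with no entry in `record_words` simply has an empty old word set.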
- To keep the index in sync, we need a list of records that were updated or deleted.
- For all deleted records, we look up the record in the record-to-words index and remove each of its words from the word index.
- For all updated records, follow the update procedure above.
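Deletion is the same removal loop without the insert phase. A minimal sketch, again using `(record_id, fields)` tuples as hits and an assumed function name:

```python
def delete_record(record_id, word_index, record_words):
    # The reverse index gives us the record's unique words; remove the
    # record's hits from each word's posting list.
    for word in record_words.pop(record_id, set()):
        remaining = [h for h in word_index[word] if h[0] != record_id]
        if remaining:
            word_index[word] = remaining
        else:
            del word_index[word]  # no records left containing this word
```

Note that the reverse index is what makes this cheap: without it we would have to scan every word's posting list looking for the deleted record ID.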
NOTE: When running a search, we generate a new index (sorted by relevance) of the matching records. We cache this per search term and allow future requests to reuse it. We invalidate the cache when the search index is updated (or after a certain period of inactivity).
- Look up each word in the search term.
- We generate a relevance value for each record.
- When a word matches, we assign it a value from 1 to 100 based on how common the word is in the corpus. E.g., if a word appears in 100% of the records, we rank it 1; if it appears in 1% of the records, we rank it 100.
- For each matching record, we add up the importance values of its matching words.
- We add extra points for closeness. E.g., if the search is "full house" and the two words "full" and "house" appear next to each other in a record, that record scores higher.
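The scoring steps above can be sketched end to end. Hits are `(record_id, fields)` tuples as in the earlier sketches; the 1-100 mapping follows the rule stated above, and `PROXIMITY_BONUS` is an illustrative constant since the notes don't specify a value:

```python
def word_importance(doc_freq: int, total_records: int) -> int:
    # A word in 100% of records scores 1; one in 1% (or fewer) scores 100.
    frac = doc_freq / total_records
    return min(100, max(1, round(1 / frac)))

PROXIMITY_BONUS = 50  # assumed value for the closeness bonus

def search(terms, word_index, total_records):
    scores: dict[int, int] = {}
    # (record_id, word) -> {field_name: set of offsets}, kept for proximity.
    hit_offsets: dict[tuple[int, str], dict[str, set[int]]] = {}
    for word in terms:
        hits = word_index.get(word, [])
        if not hits:
            continue
        importance = word_importance(len(hits), total_records)
        for record_id, fields in hits:
            # Sum the importance of each matching word per record.
            scores[record_id] = scores.get(record_id, 0) + importance
            for fname, offsets in fields.items():
                hit_offsets.setdefault((record_id, word), {}) \
                           .setdefault(fname, set()).update(offsets)
    # Closeness: consecutive search words adjacent in the same field earn a bonus.
    for a, b in zip(terms, terms[1:]):
        for record_id in scores:
            fa = hit_offsets.get((record_id, a), {})
            fb = hit_offsets.get((record_id, b), {})
            for fname, offs in fa.items():
                if any(o + 1 in fb.get(fname, set()) for o in offs):
                    scores[record_id] += PROXIMITY_BONUS
                    break
    # Record IDs sorted by relevance, highest first: this is the result
    # index we would cache per search term.
    return sorted(scores, key=scores.get, reverse=True)
```

For "full house", a record where the two words are adjacent outranks one that only contains "house", since it collects both word values plus the proximity bonus.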