Overview

  • We create a secondary index (using the CAeonView class).
  • We index by word; the data for each word is a list of hits. Each hit consists of a record ID (primary key) and a list of fields; each field entry carries the offsets of the word within that field.
  • We probably need a reverse index from record ID to the list of unique words in the record (and possibly a count). We can use this to handle updated/deleted records. (See the data-structure sketch below.)
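
Below is a minimal sketch of the two structures described above, using standard C++ containers rather than the actual CAeonView storage; the WordHit and TextIndex names are hypothetical.

    // Hypothetical in-memory shapes for the word index and the reverse index.
    #include <cstddef>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // One hit: where a single word appears inside a single record.
    struct WordHit {
        std::string recordId;                                     // primary key of the record
        std::map<std::string, std::vector<std::size_t>> fieldOffsets;  // field name -> offsets of the word in that field
    };

    struct TextIndex {
        // Word index: word -> list of hits, one per record containing the word.
        std::map<std::string, std::vector<WordHit>> wordIndex;

        // Reverse index: record ID -> unique words in the record, used to
        // clean up the word index when the record is updated or deleted.
        std::map<std::string, std::set<std::string>> recordToWords;
    };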

Algorithms

Insert/Update/Delete

  1. Extract all words from the new version of the record. We end up with a list of unique words; for every word, we have a list of fields/offsets.
  2. Look up the record to find its current list of words (old version). NOTE: We don't need the old version of the record itself; we just need its entry in the record-to-words index.
  3. For all words that are in the old version but not the new version, delete the record's entry from the word index.
  4. For all words in the new version, insert or update the record's entry in the word index.
  5. Update the record-to-words index with the new word list. (A sketch of this flow follows the list.)
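
The following is a minimal, self-contained sketch of steps 1-5. It repeats the hypothetical WordHit/TextIndex shapes from the Overview sketch; word extraction itself is assumed to happen elsewhere, and its result is passed in as newWords.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    struct WordHit {
        std::string recordId;
        std::map<std::string, std::vector<std::size_t>> fieldOffsets;
    };

    struct TextIndex {
        std::map<std::string, std::vector<WordHit>> wordIndex;       // word -> hits
        std::map<std::string, std::set<std::string>> recordToWords;  // record ID -> unique words
    };

    // Step 1 result: unique words of the new version, each with its fields/offsets.
    using ExtractedWords = std::map<std::string, std::map<std::string, std::vector<std::size_t>>>;

    // Removes a single record's hit from one word's entry in the word index.
    static void RemoveRecordFromWord(TextIndex &ix, const std::string &word, const std::string &recordId) {
        auto it = ix.wordIndex.find(word);
        if (it == ix.wordIndex.end())
            return;
        auto &hits = it->second;
        hits.erase(std::remove_if(hits.begin(), hits.end(),
                                  [&](const WordHit &h) { return h.recordId == recordId; }),
                   hits.end());
        if (hits.empty())
            ix.wordIndex.erase(it);        // no record uses this word any more
    }

    // Insert/update a record: newWords is the extraction result for the new version.
    void IndexRecord(TextIndex &ix, const std::string &recordId, const ExtractedWords &newWords) {
        // Step 2: current (old) word list for this record; empty if the record is new.
        std::set<std::string> oldWords = ix.recordToWords[recordId];

        // Step 3: words in the old version but not the new one are removed from the word index.
        for (const auto &word : oldWords)
            if (newWords.find(word) == newWords.end())
                RemoveRecordFromWord(ix, word, recordId);

        // Step 4: every word in the new version gets a fresh hit for this record.
        std::set<std::string> updatedWords;
        for (const auto &entry : newWords) {
            RemoveRecordFromWord(ix, entry.first, recordId);          // drop any stale hit first
            ix.wordIndex[entry.first].push_back(WordHit{recordId, entry.second});
            updatedWords.insert(entry.first);
        }

        // Step 5: update the record-to-words index.
        ix.recordToWords[recordId] = std::move(updatedWords);
    }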

Batch Update

  1. We need the list of updated/deleted records.
  2. For each deleted record, we look up the record in the record-to-words index and remove the record from each of its words in the word index.
  3. For each updated record, we follow the Insert/Update/Delete procedure above. (See the sketch after this list.)
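
A minimal sketch of the batch path, reusing the hypothetical TextIndex, ExtractedWords, RemoveRecordFromWord, and IndexRecord from the sketches above; DeleteRecord shows how the record-to-words index drives cleanup for deleted records.

    // Deleting a record: the reverse index tells us which word entries to clean up.
    void DeleteRecord(TextIndex &ix, const std::string &recordId) {
        auto it = ix.recordToWords.find(recordId);
        if (it == ix.recordToWords.end())
            return;                                  // record was never indexed

        for (const auto &word : it->second)          // step 2: remove each word's hit
            RemoveRecordFromWord(ix, word, recordId);

        ix.recordToWords.erase(it);
    }

    // Batch update: purge deleted records, then re-index updated ones (step 3).
    void ApplyBatch(TextIndex &ix,
                    const std::vector<std::string> &deletedIds,
                    const std::map<std::string, ExtractedWords> &updatedRecords) {
        for (const auto &id : deletedIds)
            DeleteRecord(ix, id);
        for (const auto &rec : updatedRecords)
            IndexRecord(ix, rec.first, rec.second);
    }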

Search

NOTE: When running a search, we generate a new index (sorted by relevance) for the record results. We cache this result index per search term so that future requests for the same term can reuse it. We invalidate the cache entry when the search index is updated (or after a certain period of inactivity).
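
As a sketch of that caching scheme, one possible shape is shown below; the names and the version-counter approach to invalidation are assumptions, and the inactivity-based expiry is only hinted at via a lastUsed timestamp.

    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct CachedSearch {
        std::uint64_t indexVersion;                          // search-index version when this was built
        std::chrono::steady_clock::time_point lastUsed;      // for expiring idle entries
        std::vector<std::string> recordIds;                  // record IDs sorted by relevance
    };

    struct SearchCache {
        std::uint64_t currentIndexVersion = 0;               // bumped on every index update
        std::map<std::string, CachedSearch> byTerm;          // search term -> cached result index

        // Returns the cached, relevance-sorted results, or nullptr if missing or stale.
        const std::vector<std::string> *Lookup(const std::string &term) {
            auto it = byTerm.find(term);
            if (it == byTerm.end() || it->second.indexVersion != currentIndexVersion)
                return nullptr;
            it->second.lastUsed = std::chrono::steady_clock::now();
            return &it->second.recordIds;
        }
    };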

  1. Look up each word in the search term.
  2. We generate a relevance value for each record.
  3. When a word matches, we assign it a value from 1-100 based on how common the word is in the corpus (rarer words score higher). E.g., if a word is in 100% of the records, we rank it 1; if it is in 1% of the records, we rank it 100.
  4. For each matching record, we add up the importance values of its matching words.
  5. We add extra points for proximity. E.g., if the search is "full house" and the two words "full" and "house" appear next to each other in a record, that record gets a bonus. (A scoring sketch follows this list.)
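
A minimal sketch of the scoring steps, assuming the hypothetical WordHit/TextIndex shapes from the Overview sketch. The exact importance formula, the treatment of offsets as word positions, and the size of the proximity bonus are assumptions rather than settled decisions.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Assumes WordHit / TextIndex as defined in the Overview sketch.
    std::map<std::string, double> ScoreSearch(const TextIndex &ix,
                                              const std::vector<std::string> &searchWords,
                                              std::size_t totalRecords) {
        std::map<std::string, double> scores;                    // record ID -> relevance (step 2)
        std::map<std::string, const WordHit *> prevHitByRecord;  // previous search word's hit, per record

        for (const auto &word : searchWords) {
            auto it = ix.wordIndex.find(word);                   // step 1: look up each word
            if (it == ix.wordIndex.end() || it->second.empty())
                continue;

            // Step 3: importance 1-100 by rarity; a word in every record scores 1,
            // a word in 1% of the records scores 100.
            const auto &hits = it->second;
            double importance = std::clamp(double(totalRecords) / double(hits.size()), 1.0, 100.0);

            for (const WordHit &hit : hits) {
                scores[hit.recordId] += importance;              // step 4: sum word importance

                // Step 5: proximity bonus when this word appears right after the
                // previous search word in the same field (offsets treated as word
                // positions; the bonus of 50 is an assumed value).
                auto prev = prevHitByRecord.find(hit.recordId);
                if (prev != prevHitByRecord.end()) {
                    for (const auto &field : hit.fieldOffsets) {
                        auto p = prev->second->fieldOffsets.find(field.first);
                        if (p == prev->second->fieldOffsets.end())
                            continue;
                        for (std::size_t a : field.second)
                            for (std::size_t b : p->second)
                                if (a == b + 1)
                                    scores[hit.recordId] += 50.0;
                    }
                }
                prevHitByRecord[hit.recordId] = &hit;
            }
        }
        return scores;
    }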