AUX_RECORD_BLANK
Represents an empty element in an auxiliary dictionary entry record. The element is 10 bytes long.
Data structure used to store entries of the form: word id, index shard generation, posting list offset, and length of posting list. It has entries for all words stored in a given IndexArchiveBundle. There might be multiple entries for a given word_id if it occurs in more than one index shard in the given IndexArchiveBundle.
In terms of file structure, a dictionary is stored in a folder consisting of 256 subfolders. Each subfolder is used to store the word_ids beginning with a particular character. Within a subfolder are files of various tier levels representing the data stored. As crawling proceeds, words from a shard are added to the dictionary in files of tier level 0, with either suffix A or suffix B. If both an A and a B file of a given tier level are detected, the contents of these two files are merged into a new file one tier level up. The old files are then deleted. This process is applied recursively until there is at most an A file on each level.
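The tier scheme described above can be illustrated with a small in-memory simulation. This is a minimal sketch, not the class's implementation: it is written in Python for brevity, the names `add_shard` and `slots` are hypothetical, and sorted lists of words stand in for the tier files of a single prefix subfolder.

```python
def add_shard(slots, words):
    """Add one shard's words at tier 0, then merge upward while a tier
    holds both an A and a B entry.

    slots maps (tier, suffix) -> sorted list of words. It is a hypothetical
    stand-in for the tier files on disk.
    """
    tier = 0
    # Use slot A at tier 0 if it is free, otherwise slot B.
    suffix = "A" if (tier, "A") not in slots else "B"
    slots[(tier, suffix)] = sorted(words)
    # Whenever both A and B exist on a tier, merge them one tier up
    # and delete the old entries (pop removes them).
    while (tier, "A") in slots and (tier, "B") in slots:
        merged = sorted(slots.pop((tier, "A")) + slots.pop((tier, "B")))
        tier += 1
        out = "A" if (tier, "A") not in slots else "B"
        slots[(tier, out)] = merged
    return slots


slots = {}
for shard_words in (["b", "a"], ["c"], ["d"]):
    add_shard(slots, shard_words)
# At this point tier 1 holds the first two shards and tier 0 holds the third.
```

Adding a fourth shard in this simulation triggers two successive merges and leaves a single entry at tier 2, which matches the rule that at most an A file remains on each level.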
PREFIX_ITEM_SIZE
Size of an item in the prefix index used to look up words.
If the sub-dir were 65 (ASCII A), and the second char were also ASCII 65, then the corresponding prefix record would be the offset to the first word_id beginning with AA, followed by the number of such AA records.
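A lookup through the prefix index might look like the following sketch. It is in Python for brevity and rests on assumptions not stated above: that a prefix record is 8 bytes (a 4-byte offset followed by a 4-byte count, both big-endian unsigned ints), and that the record for second character c sits at byte position c * PREFIX_ITEM_SIZE within the index. The helper names are hypothetical.

```python
import struct

PREFIX_ITEM_SIZE = 8  # assumed value: 4-byte offset + 4-byte count


def make_prefix_record(offset, count):
    """Pack an offset and a count into one prefix record (assumed layout)."""
    return struct.pack(">II", offset, count)


def lookup_prefix(prefix_index, second_char):
    """Return (offset, count) for words whose second char is second_char."""
    start = second_char * PREFIX_ITEM_SIZE
    return struct.unpack(">II", prefix_index[start:start + PREFIX_ITEM_SIZE])


# Toy index for sub-dir 65 ("A") in which only words starting with "AA" exist:
# three such records, the first located at byte offset 640 (made-up numbers).
prefix_index = bytearray(256 * PREFIX_ITEM_SIZE)
prefix_index[65 * PREFIX_ITEM_SIZE:66 * PREFIX_ITEM_SIZE] = make_prefix_record(640, 3)

aa_offset, aa_count = lookup_prefix(prefix_index, 65)
```

With a table like this, finding the AA words needs only one fixed-size read from the prefix index followed by a read of `aa_count` records starting at `aa_offset`.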
addShardDictionary(object $index_shard, object $callback = null)
Adds the words in the provided IndexShard to the dictionary.
Merges tiers as needed.
object | $index_shard | the shard whose words are to be added to the dictionary |
object | $callback | object with join function to be called if process is taking too long |
mergeTier(integer $tier, string $out_slot)
For each first-letter subdirectory, merges the pair of tier $tier files of dictionary words. The output is stored in $out_slot.
integer | $tier | tier level to perform the merge of files at |
string | $out_slot | either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results. |
mergeTierFiles(integer $prefix, integer $tier, string $out_slot)
For a fixed prefix directory, merges the pair of tier $tier files of dictionary words. The output is stored in $out_slot.
integer | $prefix | which prefix directory to perform the merge of files in |
integer | $tier | tier level to perform the merge of files at |
string | $out_slot | either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results. |
combineDictionaryRecord(string $record_a, string $record_b, integer $prefix_bit) : string
Used to combine the dictionary records for a given word_id that come from two different tier files
string | $record_a | a dictionary record, including auxiliary records, from the 'a'th file of the given tier |
string | $record_b | a dictionary record, including auxiliary records, from the 'b'th file of the given tier |
integer | $prefix_bit | either 0 or 32768. The first bit of an auxiliary record should be the negation of the high-order bit of the given prefix letter used by the tier file. |
a single record with merged strings, making use of auxiliary records as needed, containing (generation, posting list offset, length) information.
decodeAuxRecord(string $record_string, string $offset) : array
Used to decode an auxiliary dictionary record associated with a given word_id
string | $record_string | string in which dictionary records occur |
string | $offset | a byte offset into $record_string |
an array of up to three strings
makePrefixRecord(integer $offset, integer $count) : string
Makes a prefix record string out of an offset and count (packs and concatenates).
integer | $offset | byte offset into words for the prefix record |
integer | $count | number of words with that prefix |
the packed record
mergeAllTiers(object $callback = null, integer $max_tier = -1, boolean $fast_merge_all = false)
For each tier and for each first-letter subdirectory, merges the pair of (A and B) files of dictionary words at that tier. If max_tier has not been reached but only one of the two tier files is present, then that file is renamed with a name one tier higher. The output in all cases is stored in a file ending with A or B one tier up. B is used if an A file is already present.
object | $callback | object with join function to be called if process is taking too long |
integer | $max_tier | the maximum tier to merge up to; if not set, $this->max_tier is used. Otherwise, one would typically set it to a value bigger than $this->max_tier |
boolean | $fast_merge_all | if true then merge away B slots but don't merge everything to a top tier |
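The merge-or-rename rule described for mergeAllTiers can be sketched as an in-memory simulation. This is only an illustration in Python, under assumptions: the function name `merge_all_tiers` and the `slots` dictionary are hypothetical, lists of words stand in for files in one prefix subdirectory, the $fast_merge_all option is not modeled, and each tier is assumed to hold at most an A entry when the merge starts.

```python
def merge_all_tiers(slots, max_tier):
    """Push every tier's entries upward until only tier max_tier is occupied.

    slots maps (tier, suffix) -> sorted list of words (a hypothetical
    stand-in for the A and B tier files).
    """
    for tier in range(max_tier):
        a = slots.pop((tier, "A"), None)
        b = slots.pop((tier, "B"), None)
        if a is None and b is None:
            continue  # nothing stored at this tier
        if a is not None and b is not None:
            merged = sorted(a + b)  # both present: merge the pair
        else:
            merged = a if a is not None else b  # only one present: "rename" it
        # Store one tier up; B is used if an A entry is already present.
        out = "A" if (tier + 1, "A") not in slots else "B"
        slots[(tier + 1, out)] = merged
    return slots


slots = {(0, "A"): ["x"], (1, "A"): ["y"], (2, "A"): ["z"]}
merge_all_tiers(slots, 3)
```

Here max_tier is set one higher than the highest occupied tier, in line with the note on $max_tier above, so that everything ends up in a single entry at tier 3.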
getWordInfo(string $word_id, boolean $raw = false, integer $threshold = -1, integer $start_generation = -1, integer $num_distinct_generations = -1, boolean $with_remaining_total = false) : mixed
For each index shard generation a word occurred in, returns, as part of an array, an entry of the form: generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
string | $word_id | id of the word or phrase one wants to look up |
boolean | $raw | whether the id is our version of base64 encoded or not |
integer | $threshold | if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches |
integer | $start_generation | which index shard in inverted index to start search from |
integer | $num_distinct_generations | how many shards to consider after $start_generation |
boolean | $with_remaining_total | whether to also return the estimated total number of docs |
an array of entries of the form generation, first offset, last offset, count, matched_key. If $with_remaining_total is true, then a pair is returned, with the second element as above and the first element the estimated total number of docs
getWordInfoTier(string $word_id, boolean $raw, integer $tier, integer $threshold = -1, integer $start_generation = -1, integer $num_distinct_generations = -1) : mixed
This method facilitates query processing of an ongoing crawl.
During an ongoing crawl, the dictionary is arranged into tiers as per the logarithmic merge algorithm, rather than into just one tier as in a crawl that has been stopped. Word info for more recently crawled pages will tend to be in lower tiers than data that was crawled earlier. getWordInfoTier gets word info data for a specific tier in the index dictionary. Each tier will have word info for a specific, disjoint set of shards, so the format of how to look up posting lists in a shard can be the same regardless of the tier: an array entry is of the form generation, first offset, last offset, and number of documents the word occurred in for this shard.
string | $word_id | id of the word one wants to look up |
boolean | $raw | whether the id is our version of base64 encoded or not |
integer | $tier | which tier to get word info from |
integer | $threshold | if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches |
integer | $start_generation | if positive the first generation to return information about |
integer | $num_distinct_generations | if positive, determines the number of generations after the starting generation to return information about |
a triple (total_count, max_found_generation, an array of entries of the form generation, first offset, last offset, count, matched_key) or false if no data
addAuxInfoRecords(string $id, integer $file_num, integer $num_aux_records, \seekquarry\yioop\library\int& $total_count, integer $threshold, \seekquarry\yioop\library\array& $info, \seekquarry\yioop\library\int& $previous_generation, \seekquarry\yioop\library\int& $num_generations, integer $offset, integer $num_distinct_generations, \seekquarry\yioop\library\int& $max_retained_generation, \seekquarry\yioop\library\array& $id_info)
Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.
A typical dictionary entry consists of a 20-byte word id, followed by three 4-byte ints: the generation, offset, and length of the posting lists in that generation. If the high bit of the prefix character in the word id is flipped, it indicates the presence of auxiliary records for that word id. In that case, bytes 1 and 2 of the generation encode the number of auxiliary records there will be for this word id. An auxiliary record is 32 bytes long. It begins with a bit of the current high prefix letter, followed by a 15-bit code saying which aux record this is in the sequence of aux records for this word id, followed by three 10-byte records, each made up of a 2-byte generation, a 4-byte offset, and a 4-byte length.
string | $id | word id to add aux records for |
integer | $file_num | which prefix file to read from (always reads a file at the max_tier level) |
integer | $num_aux_records | |
\seekquarry\yioop\library\int& | $total_count | |
integer | $threshold | |
\seekquarry\yioop\library\array& | $info | |
\seekquarry\yioop\library\int& | $previous_generation | |
\seekquarry\yioop\library\int& | $num_generations | |
integer | $offset | |
integer | $num_distinct_generations | |
\seekquarry\yioop\library\int& | $max_retained_generation | |
\seekquarry\yioop\library\array& | $id_info |
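The 32-byte auxiliary record layout described for addAuxInfoRecords can be decoded along the following lines. This is a sketch in Python, not the class's decodeAuxRecord: big-endian byte order is assumed, the bytes chosen for the blank element are an assumption (the real AUX_RECORD_BLANK may differ), and the sketch returns tuples of integers rather than strings.

```python
import struct

AUX_RECORD_SIZE = 32
AUX_RECORD_BLANK = b"\xff" * 10  # assumed blank marker, 10 bytes long


def decode_aux_record(data, offset=0):
    """Split one auxiliary record into (flag_bit, sequence_number, entries)."""
    (header,) = struct.unpack_from(">H", data, offset)
    flag_bit = header >> 15            # the leading prefix-letter bit
    sequence_number = header & 0x7FFF  # 15-bit position among the aux records
    entries = []
    for i in range(3):
        start = offset + 2 + 10 * i
        chunk = data[start:start + 10]
        if chunk == AUX_RECORD_BLANK:
            continue  # skip empty elements
        # 2-byte generation, 4-byte offset, 4-byte length
        entries.append(struct.unpack(">HII", chunk))
    return flag_bit, sequence_number, entries


# A made-up record: flag bit set, sequence number 5, one blank element.
record = (
    struct.pack(">H", 0x8000 | 5)
    + struct.pack(">HII", 7, 1000, 24)
    + AUX_RECORD_BLANK
    + struct.pack(">HII", 9, 2000, 48)
)
decoded = decode_aux_record(record)
```

The 2-byte header plus three 10-byte elements accounts for the full 32 bytes, and blank elements let a record carry fewer than three (generation, offset, length) triples.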
formatWordInfo(\seekquarry\yioop\library\int& $total_count, integer $max_retained_generation, array $info) : array
Auxiliary method that takes the input triple ($total_count, $max_retained_generation, $info), filters blank entries from $info, and returns the resulting triple
\seekquarry\yioop\library\int& | $total_count | |
integer | $max_retained_generation | |
array | $info |
resulting triple
addLookedUpEntry(string $id, string $word_id, array $record, \seekquarry\yioop\library\array& $info, \seekquarry\yioop\library\int& $total_count, \seekquarry\yioop\library\int& $previous_generation, \seekquarry\yioop\library\int& $previous_id, \seekquarry\yioop\library\int& $num_generations, integer $num_distinct_generations, \seekquarry\yioop\library\int& $max_retained_generation, \seekquarry\yioop\library\array& $id_info)
This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also updates $total_count, as well as the $previous_generation and $previous_id info for the previous matching record.
string | $id | id of a row to compare $word_id against |
string | $word_id | the word id of a term or phrase we are computing the quadruple array for |
array | $record | current record from dictionary that we may or may not add to info |
\seekquarry\yioop\library\array& | $info | quadruple array we are adding to |
\seekquarry\yioop\library\int& | $total_count | count of items in $info |
\seekquarry\yioop\library\int& | $previous_generation | last generation added to $info |
\seekquarry\yioop\library\int& | $previous_id | last exact id added to $info |
\seekquarry\yioop\library\int& | $num_generations | |
integer | $num_distinct_generations | |
\seekquarry\yioop\library\int& | $max_retained_generation | |
\seekquarry\yioop\library\array& | $id_info |
getDictSubstring(integer $file_num, integer $offset, integer $len) : string
Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary
integer | $file_num | which prefix file to read from (always reads a file at the max_tier level) |
integer | $offset | byte offset to start reading from |
integer | $len | number of bytes to read |
data from that location in the shard
readBlockDictAtOffset(integer $file_num, integer $bytes) : \seekquarry\yioop\library\&string
Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes
integer | $file_num | which dictionary file (given by first letter prefix) to read from |
integer | $bytes | byte offset to start reading from |
data from IndexShard file