\seekquarry\yioop\library\IndexDictionary

Data structure used to store entries of the form: word id, index shard generation, posting list offset, and length of posting list. It has entries for all words stored in a given IndexArchiveBundle. There may be multiple entries for a given word_id if it occurs in more than one index shard in the given IndexArchiveBundle.

In terms of file structure, a dictionary is stored in a folder consisting of 256 subfolders. Each subfolder is used to store the word_ids beginning with a particular character. Within a subfolder are files of various tier levels representing the data stored. As crawling proceeds, words from a shard are added to the dictionary in files of tier level 0 with suffix either A or B. If both an A and a B file of a given tier level are detected, the contents of these two files are merged into a new file one tier level up, and the old files are deleted. This process is applied recursively until there is at most an A file on each level.
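The merge rule above can be illustrated with a small in-memory simulation. This is only a sketch in Python, not the library's PHP code: the `tiers` dictionary, the `add_shard` helper, and the use of sorted lists in place of on-disk files are all assumptions made for illustration.

```python
def add_shard(tiers, words):
    """Add one shard's words at tier 0, then merge upward while needed.

    `tiers` maps a tier level to a dict of slots, e.g. {"A": [...], "B": [...]}.
    Sorted lists stand in for the on-disk tier files (an assumption).
    """
    level = 0
    # New data goes to slot A of tier 0, or slot B if A is already taken.
    slot = "B" if "A" in tiers.setdefault(0, {}) else "A"
    tiers[0][slot] = sorted(words)
    # While a tier holds both an A and a B file, merge them one tier up.
    while "A" in tiers.get(level, {}) and "B" in tiers[level]:
        merged = sorted(tiers[level].pop("A") + tiers[level].pop("B"))
        up = tiers.setdefault(level + 1, {})
        up["B" if "A" in up else "A"] = merged
        level += 1
    return tiers


tiers = {}
for shard in (["cat"], ["ant"], ["dog"], ["bee"]):
    add_shard(tiers, shard)
```

After four single-word shards, tiers 0 and 1 are empty and tier 2 holds one A file with all four words, which mirrors the "at most an A file on each level" end state.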

Summary

Methods
__construct()
calculateActiveTiers()
makePrefixLetters()
addShardDictionary()
mergeTier()
mergeTierFiles()
combineDictionaryRecord()
decodeAuxRecord()
extractPrefixRecord()
makePrefixRecord()
mergeAllTiers()
getWordInfo()
getWordInfoTier()
addAuxInfoRecords()
formatWordInfo()
addLookedUpEntry()
getDictSubstring()
readBlockDictAtOffset()

Properties
$dir_name
$fhs
$file_lens
$blocks
$max_tier
$read_tier
$active_tiers
$shard_doc_lens
$parent_archive_bundle

Constants
AUX_RECORD_BLANK
AUX_RECORD_LEN
SEGMENT_SIZE
DICT_BLOCK_SIZE
DICT_BLOCK_POWER
PREFIX_ITEM_SIZE
NUM_PREFIX_LETTERS
PREFIX_HEADER_SIZE
MAX_DICT_FILE_HANDLES

No protected or private methods, properties, or constants found

Constants

AUX_RECORD_BLANK

AUX_RECORD_BLANK

Represents an empty element, 10 bytes long, in an auxiliary dictionary entry record

AUX_RECORD_LEN

AUX_RECORD_LEN

Represents the length of an element in a dictionary aux record

SEGMENT_SIZE

SEGMENT_SIZE

When merging two files on a given dictionary tier, this is the maximum number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN.)

DICT_BLOCK_SIZE

DICT_BLOCK_SIZE

Size in bytes of one block in IndexDictionary

DICT_BLOCK_POWER

DICT_BLOCK_POWER

Disk block size is 1 << this power
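As a small illustration of the power-of-two relationship, the sketch below derives a block size from a power and rounds a byte offset down to the start of its block. The value 12 is an assumed example, not a value stated on this page, and `block_start` is a hypothetical helper.

```python
DICT_BLOCK_POWER = 12                     # assumed example value
DICT_BLOCK_SIZE = 1 << DICT_BLOCK_POWER   # 2 ** 12 bytes


def block_start(byte_offset):
    """Round a byte offset down to the start of the block containing it."""
    return (byte_offset >> DICT_BLOCK_POWER) << DICT_BLOCK_POWER
```

Using shifts rather than division keeps block arithmetic cheap, which is the usual reason for storing the power alongside the size.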

PREFIX_ITEM_SIZE

PREFIX_ITEM_SIZE

Size of an item in the prefix index used to look up words.

If the sub-dir was 65 (ASCII A), and the second char was also ASCII 65, then the corresponding prefix record would be the offset to the first word_id beginning with AA, followed by the number of such AA records.

NUM_PREFIX_LETTERS

NUM_PREFIX_LETTERS

Number of possible prefix records (number of possible values for second char of a word id)

PREFIX_HEADER_SIZE

PREFIX_HEADER_SIZE

One dictionary file represents the words whose ids begin with a fixed char. Amongst these ids, the prefix index gives the offsets at which ids with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.
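To make the prefix index concrete, here is a sketch that builds one (offset, count) pair per possible second byte from a sorted list of word ids. The 32-byte `WORD_ITEM_LEN`, the helper name `build_prefix_index`, and the choice to store the running offset for empty prefixes are assumptions for illustration only.

```python
from collections import Counter

NUM_PREFIX_LETTERS = 256   # one record per possible second byte
WORD_ITEM_LEN = 32         # assumed length in bytes of one dictionary record


def build_prefix_index(sorted_word_ids):
    """Return a list of (byte_offset, count) pairs indexed by second byte."""
    counts = Counter(word_id[1] for word_id in sorted_word_ids)
    index, offset = [], 0
    for second in range(NUM_PREFIX_LETTERS):
        count = counts.get(second, 0)
        index.append((offset, count))
        offset += count * WORD_ITEM_LEN
    return index


# Three toy ids from the "A" sub-directory: two start with AA, one with AB.
ids = [b"AAx", b"AAy", b"ABz"]
index = build_prefix_index(ids)
```

Here `index[65]` describes the AA records and `index[66]` the AB records, so a lookup only has to scan the records sharing its first two bytes.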

MAX_DICT_FILE_HANDLES

MAX_DICT_FILE_HANDLES

Maximum number of simultaneously open file handles

Properties

$dir_name

$dir_name : string

Folder name to use for this IndexDictionary

Type

string

$fhs

$fhs : resource

Array of file handles for files in the dictionary. Members are used to read files when looking up words.

Type

resource

$file_lens

$file_lens : integer

Array of file lengths for files in the dictionary. Used so that reads do not try to seek past the end of a file.

Type

integer

$blocks

$blocks : array

A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.

Type

array

$max_tier

$max_tier : integer

The highest tiered index in the IndexDictionary

Type

integer

$read_tier

$read_tier : integer

Tier currently being used to read dictionary data from

Type

integer

$active_tiers

$active_tiers : array

Tiers which currently have data for reading

Type

array

$shard_doc_lens

$shard_doc_lens : array

Length of the doc strings for each of the shards that have been added to the dictionary.

Type

array

$parent_archive_bundle

$parent_archive_bundle : object

If not null, then the parent IndexArchiveBundle this dictionary belongs to

Type

object

Methods

__construct()

__construct(string  $dir_name, object  $parent_archive_bundle = null) 

Makes an index dictionary with the given name

Parameters

string $dir_name

the directory name to store the index dictionary in

object $parent_archive_bundle

parent index archive bundle this dictionary is for

calculateActiveTiers()

calculateActiveTiers() : array

Based on the current set of tiers in the 0 prefix sub-folder, determines an array of active dictionary tiers.

Returns

array —

active dictionary tiers which may be added to by an ongoing crawl

makePrefixLetters()

makePrefixLetters(string  $dir_name) 

Makes dictionary sub-directories for each of the 256 possible first hash characters that crawlHash in raw mode could output.

Parameters

string $dir_name

base directory in which these sub-directories should be made

addShardDictionary()

addShardDictionary(object  $index_shard, object  $callback = null) 

Adds the words in the provided IndexShard to the dictionary.

Merges tiers as needed.

Parameters

object $index_shard

the shard whose words are to be added to the dictionary

object $callback

object with join function to be called if process is taking too long

mergeTier()

mergeTier(integer  $tier, string  $out_slot) 

For each first-letter subdirectory, merges the pair of tier $tier files of dictionary words. The output is stored in $out_slot.

Parameters

integer $tier

tier level to perform the merge of files at

string $out_slot

either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

mergeTierFiles()

mergeTierFiles(integer  $prefix, integer  $tier, string  $out_slot) 

For a fixed prefix directory, merges the pair of tier $tier files of dictionary words. The output is stored in $out_slot.

Parameters

integer $prefix

which prefix directory to perform the merge of files

integer $tier

tier level to perform the merge of files at

string $out_slot

either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

combineDictionaryRecord()

combineDictionaryRecord(string  $record_a, string  $record_b, integer  $prefix_bit) : string

Used to combine the dictionary records for a given word_id that come from two different tier files

Parameters

string $record_a

a dictionary record including auxiliary records from the 'a'th file of the given tier

string $record_b

a dictionary record including auxiliary records from the 'b'th file of the given tier

integer $prefix_bit

either 0 or 32768. The first bit of an auxiliary record should be the negation of the high-order bit of the given prefix letter used by the tier file.

Returns

string —

a single record with merged strings, making use of auxiliary records as needed, containing (generation, posting list offset, length) information.
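The $prefix_bit rule can be sketched as follows: take the high-order bit of the prefix letter (a byte value from 0 to 255), negate it, and place it in bit 15 of a 16-bit value, which yields either 0 or 32768. The function name `aux_prefix_bit` is hypothetical and the code is a Python illustration, not the library's own.

```python
def aux_prefix_bit(prefix_letter):
    """Return 0 or 32768: the negated high bit of the prefix byte, in bit 15."""
    high_bit = (prefix_letter >> 7) & 1
    return (1 - high_bit) << 15
```

For example, prefix letter 65 has a clear high bit and gives 32768, while prefix letter 200 has its high bit set and gives 0.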

decodeAuxRecord()

decodeAuxRecord(string  $record_string, string  $offset) : array

Used to decode an auxiliary dictionary record associated with a given word_id

Parameters

string $record_string

string in which dictionary records occur

string $offset

a byte offset into $record_string

Returns

array —

of up to three strings

extractPrefixRecord()

extractPrefixRecord(string  $prefix_string, integer  $record_num) : array

Returns the $record_num'th prefix record from $prefix_string

Parameters

string $prefix_string

string to get record from

integer $record_num

which record to extract

Returns

array —

$offset, $count array

makePrefixRecord()

makePrefixRecord(integer  $offset, integer  $count) : string

Makes a prefix record string out of an offset and count (packs and concatenates).

Parameters

integer $offset

byte offset into words for the prefix record

integer $count

number of words with that prefix

Returns

string —

the packed record
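A round trip between makePrefixRecord() and extractPrefixRecord() might look like the sketch below. It assumes each prefix record is two big-endian unsigned 32-bit integers (8 bytes in total); the actual field widths and byte order are not stated on this page, and the Python function names are stand-ins for the PHP methods.

```python
import struct

PREFIX_ITEM_SIZE = 8  # assumed: 4-byte offset followed by 4-byte count


def make_prefix_record(offset, count):
    """Pack an offset and a count into one fixed-width record."""
    return struct.pack(">II", offset, count)


def extract_prefix_record(prefix_string, record_num):
    """Return the (offset, count) pair stored in record number record_num."""
    start = record_num * PREFIX_ITEM_SIZE
    return struct.unpack(">II", prefix_string[start:start + PREFIX_ITEM_SIZE])


header = make_prefix_record(0, 3) + make_prefix_record(96, 5)
```

Because every record has the same width, the n'th record is found by multiplying n by the record size, with no scanning needed.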

mergeAllTiers()

mergeAllTiers(object  $callback = null, integer  $max_tier = -1, boolean  $fast_merge_all = false) 

For each tier and each first-letter subdirectory, merges the pair of (A and B) files of dictionary words at that tier. If max_tier has not been reached but only one of the two tier files is present, then that file is renamed with a name one tier higher. In all cases the output is stored in a file ending with A or B one tier up. B is used if an A file is already present.

Parameters

object $callback

object with join function to be called if process is taking too long

integer $max_tier

the maximum tier to merge up to; if not set, then $this->max_tier is used. Otherwise, one would typically set this to a value bigger than $this->max_tier

boolean $fast_merge_all

if true then merge away B slots but don't merge everything to a top tier
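The behaviour described for mergeAllTiers() can be sketched with in-memory lists standing in for tier files. This illustration covers only the default path: merge a pair if both slots exist, otherwise move the lone file up. It does not model $fast_merge_all or the callback, and all names are hypothetical.

```python
def merge_all_tiers(tiers, max_tier):
    """Push all data up to max_tier.

    `tiers` maps a tier level to a dict of slots such as {"A": [...], "B": [...]}.
    """
    for level in range(max_tier):
        slots = tiers.get(level, {})
        if not slots:
            continue
        if "A" in slots and "B" in slots:
            data = sorted(slots["A"] + slots["B"])   # merge the pair
        else:
            data = next(iter(slots.values()))        # lone file: move it up
        tiers[level] = {}
        up = tiers.setdefault(level + 1, {})
        up["B" if "A" in up else "A"] = data         # B only if A is taken
    return tiers


tiers = {0: {"A": ["dog"]}, 1: {"A": ["ant", "cat"]}}
merge_all_tiers(tiers, max_tier=2)
```

In this run the lone tier 0 file is moved into slot B of tier 1, which then triggers a real merge into slot A of tier 2.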

getWordInfo()

getWordInfo(string  $word_id, boolean  $raw = false, integer  $threshold = -1, integer  $start_generation = -1, integer  $num_distinct_generations = -1, boolean  $with_remaining_total = false) : mixed

For each index shard generation a word occurred in, return as part of array, an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.

Parameters

string $word_id

id of the word or phrase one wants to look up

boolean $raw

whether the id is our version of base64 encoded or not

integer $threshold

if greater than zero, the number of posting list results to collect from the dictionary before the search for more matches stops

integer $start_generation

which index shard in inverted index to start search from

integer $num_distinct_generations

how many shards to consider after $start_generation

boolean $with_remaining_total

Returns

mixed —

an array of entries of the form generation, first offset, last offset, count, matched_key. If $with_remaining_total is true, a pair is returned instead, whose second element is as above and whose first element is the estimated total number of docs

getWordInfoTier()

getWordInfoTier(string  $word_id, boolean  $raw, integer  $tier, integer  $threshold = -1, integer  $start_generation = -1, integer  $num_distinct_generations = -1) : mixed

This method facilitates query processing of an ongoing crawl.

During an ongoing crawl, the dictionary is arranged into tiers as per the logarithmic merge algorithm rather than just one tier as in a crawl that has been stopped. Word info for more recently crawled pages will tend to be in lower tiers than data that was crawled earlier. getWordInfoTier gets word info data for a specific tier in the index dictionary. Each tier will have word info for a specific, disjoint set of shards, so the format of how to look up posting lists in a shard can be the same regardless of the tier: an array entry is of the form generation, first offset, last offset, and number of documents the word occurred in for this shard.

Parameters

string $word_id

id of the word one wants to look up

boolean $raw

whether the id is our version of base64 encoded or not

integer $tier

which tier to get word info from

integer $threshold

if greater than zero, the number of posting list results to collect from the dictionary before the search for more matches stops

integer $start_generation

if positive the first generation to return information about

integer $num_distinct_generations

if positive, the number of generations after the starting generation to return information about

Returns

mixed —

a triple (total_count, max_found_generation, an array of entries of the form generation, first offset, last offset, count, matched_key), or false if no data
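Because each tier covers a disjoint set of shards, per-tier results can in principle be combined by simple concatenation. The sketch below shows one way a caller might do that. It is not a description of how getWordInfo() is implemented: `lookup_all_tiers`, the `lookup_tier` callable, and the sample data are all hypothetical.

```python
def lookup_all_tiers(active_tiers, lookup_tier, word_id):
    """Combine per-tier (total_count, max_generation, entries) results."""
    total, max_gen, entries = 0, -1, []
    for tier in active_tiers:
        result = lookup_tier(word_id, tier)
        if result is False:          # this tier has no data for the word
            continue
        count, tier_max_gen, info = result
        total += count
        max_gen = max(max_gen, tier_max_gen)
        entries.extend(info)
    if not entries:
        return False
    entries.sort(key=lambda entry: entry[0])   # order by generation
    return total, max_gen, entries


# Made-up per-tier answers: tier -> (count, max generation, entries).
fake = {
    0: (2, 5, [(5, 0, 16, 2, "w")]),
    2: (7, 3, [(1, 0, 8, 3, "w"), (3, 40, 64, 4, "w")]),
}
result = lookup_all_tiers([2, 0, 1], lambda w, t: fake.get(t, False), "w")
```

Since no generation appears in more than one tier, no deduplication step is needed in this sketch.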

addAuxInfoRecords()

addAuxInfoRecords(string  $id, integer  $file_num, integer  $num_aux_records, integer  &$total_count, integer  $threshold, array  &$info, integer  &$previous_generation, integer  &$num_generations, integer  $offset, integer  $num_distinct_generations, integer  &$max_retained_generation, array  &$id_info) 

Adds auxiliary records for a given word id when, after merging, the info for that word id cannot be stored in a single record.

A typical dictionary entry consists of a 20-byte word id followed by three 4-byte ints: the generation, the offset, and the length of the posting list in that generation. If the high bit of the prefix character in the word id is flipped, it indicates the presence of auxiliary records for that word id, in which case bytes 1 and 2 of the generation encode the number of auxiliary records for this word id. An auxiliary record is 32 bytes long. It begins with one bit tied to the high bit of the current prefix letter, followed by a 15-bit code giving this record's position in the sequence of aux records for the word id, followed by three 10-byte triples, each made up of a 2-byte generation, a 4-byte offset, and a 4-byte length.
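The byte layout described above can be checked with a short packing sketch. Big-endian byte order and the function names are assumptions made for illustration; only the field widths (20 + 3 × 4 bytes for an entry, 2 + 3 × 10 bytes for an auxiliary record) come from the description.

```python
import struct


def pack_entry(word_id, generation, offset, length):
    """20-byte word id followed by three 4-byte unsigned ints (32 bytes)."""
    if len(word_id) != 20:
        raise ValueError("word id must be 20 bytes")
    return word_id + struct.pack(">III", generation, offset, length)


def pack_aux_record(prefix_bit, index, triples):
    """2-byte header (1 bit + 15-bit index), then three 10-byte triples."""
    header = struct.pack(">H", prefix_bit | (index & 0x7FFF))
    body = b"".join(struct.pack(">HII", g, o, n) for g, o, n in triples)
    return header + body


def unpack_aux_record(record):
    """Return (prefix_bit, index, [(generation, offset, length), ...])."""
    (header,) = struct.unpack(">H", record[:2])
    triples = [struct.unpack(">HII", record[2 + 10 * i:12 + 10 * i])
               for i in range(3)]
    return header & 0x8000, header & 0x7FFF, triples


entry = pack_entry(b"w" * 20, 1, 64, 10)
aux = pack_aux_record(32768, 2, [(1, 0, 5), (2, 8, 7), (3, 16, 9)])
```

Both record kinds come out at 32 bytes, which is what allows auxiliary records to sit in the same fixed-width sequence as ordinary entries.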

Parameters

string $id

word id to add aux records for

integer $file_num

which prefix file to read from (always reads a file at the max_tier level)

integer $num_aux_records
integer &$total_count
integer $threshold
array &$info
integer &$previous_generation
integer &$num_generations
integer $offset
integer $num_distinct_generations
integer &$max_retained_generation
array &$id_info

formatWordInfo()

formatWordInfo(integer  &$total_count, integer  $max_retained_generation, array  $info) : array

Auxiliary method that takes the input triple ($total_count, $max_retained_generation, $info), filters blank entries from $info, and returns the resulting triple

Parameters

integer &$total_count
integer $max_retained_generation
array $info

Returns

array —

resulting triple

addLookedUpEntry()

addLookedUpEntry(string  $id, string  $word_id, array  $record, array  &$info, integer  &$total_count, integer  &$previous_generation, integer  &$previous_id, integer  &$num_generations, integer  $num_distinct_generations, integer  &$max_retained_generation, array  &$id_info) 

This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also updates $total_count, as well as the $previous_generation and $previous_id info for the previous matching record.

Parameters

string $id

of a row to compare $word_id against

string $word_id

the word id of a term or phrase we are computing the quadruple array for

array $record

current record from dictionary that we may or may not add to info

array &$info

quadruple array we are adding to

integer &$total_count

count of items in $info

integer &$previous_generation

last generation added to $info

integer &$previous_id

last exact id added to $info

integer &$num_generations
integer $num_distinct_generations
integer &$max_retained_generation
array &$id_info

getDictSubstring()

getDictSubstring(integer  $file_num, integer  $offset, integer  $len) : string

Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary

Parameters

integer $file_num

which prefix file to read from (always reads a file at the max_tier level)

integer $offset

byte offset to start reading from

integer $len

number of bytes to read

Returns

string —

data from that location in the dictionary file

readBlockDictAtOffset()

readBlockDictAtOffset(integer  $file_num, integer  $bytes) : &string

Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes

Parameters

integer $file_num

which dictionary file (given by first letter prefix) to read from

integer $bytes

byte offset to start reading from

Returns

&string —

data from IndexShard file
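The combination of block-sized reads and the $blocks cache can be pictured with the sketch below, which reads a fixed-size block at a given offset and keeps it in memory so that repeated lookups do not touch the file again. The `BlockReader` class, the 4096-byte block size, and the in-memory test file are assumptions for illustration, not the library's code.

```python
import io

DICT_BLOCK_SIZE = 4096  # assumed value, for illustration only


class BlockReader:
    def __init__(self, files):
        self.files = files    # file_num -> seekable binary file object
        self.blocks = {}      # (file_num, offset) -> cached block bytes
        self.disk_reads = 0   # counts reads that actually hit a file

    def read_block(self, file_num, offset):
        """Return DICT_BLOCK_SIZE bytes starting at offset, using the cache."""
        key = (file_num, offset)
        if key not in self.blocks:
            fh = self.files[file_num]
            fh.seek(offset)
            self.blocks[key] = fh.read(DICT_BLOCK_SIZE)
            self.disk_reads += 1
        return self.blocks[key]


# An in-memory 10240-byte "prefix file" for prefix letter 65.
reader = BlockReader({65: io.BytesIO(bytes(range(256)) * 40)})
first = reader.read_block(65, 0)
again = reader.read_block(65, 0)
```

The second call returns the cached object, so only one read reaches the underlying file.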