\seekquarry\yioop\library\IndexShard

Data structure used to store one generation worth of the word document index (inverted index).

This data structure consists of three main components: word entries, word_doc entries, and document entries.

Word entries are described in the documentation for the words field.

Word-doc entries are described in the documentation for the word_docs field.

Document entries are described in the documentation for the doc_infos field.

IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can also be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state, its methods for adding data can be used.

Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.

Summary

Methods

__construct()
load()
save()
checkSave()
addDocumentWords()
getWordInfo()
getWordString()
getPostingsSlice()
numDocsOrLinks()
makeItem()
weightedCount()
computeProximity()
docStats()
getPostingAtOffset()
getDocIndexOfPostingAtOffset()
nextPostingOffsetDocOffset()
gallopPostingOffsetDocOffset()
docOffsetFromPostingOffset()
getPostingsSliceById()
appendIndexShard()
mergeWordPostingsToString()
changeDocumentOffsets()
prepareWordsAndPrefixes()
packWords()
outputPostingLists()
unpackWordDocs()
getWordDocsSubstring()
getWordDocsWord()
getDocInfoSubstring()
getShardSubstring()
getShardWord()
readBlockShardAtOffset()
getShardHeader()
packDoclenNum()
unpackDoclenNum()
getWordInfoFromString()
headerToShardFields()
makeWords()

Properties

$filename
$unsaved_operations
$save_frequency
$doc_infos
$docids_len
$word_docs
$word_docs_len
$words
$words_len
$prefixes
$prefixes_len
$generation
$num_docs_per_generation
$num_docs
$num_docs_word
$num_link_docs
$len_all_docs
$len_all_link_docs
$fh
$blocks
$read_only_from_disk
$word_docs_packed
$file_len
$last_flattened_words_count
$word_postings

Constants

DEFAULT_SAVE_FREQUENCY
FLATTEN_FREQUENCY
WORD_POSTING_COPY_LEN
LINK_FLAG
SHARD_BLOCK_POWER
SHARD_BLOCK_SIZE
HEADER_LENGTH
WORD_DATA_LEN
WORD_KEY_LEN
DOC_KEY_LEN
DOC_ID_LEN
POSTING_LEN
BLANK
HALF_BLANK
STORE_FLAG


File: src/library/IndexShard.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\PersistentStructure

\seekquarry\yioop\library\IndexShard
Implements: \seekquarry\yioop\library\CrawlConstants

Constants

DEFAULT_SAVE_FREQUENCY

DEFAULT_SAVE_FREQUENCY

If not specified in the constructor, this will be the number of operations between saves

FLATTEN_FREQUENCY

FLATTEN_FREQUENCY

Fraction of NUM_DOCS_PER_GENERATION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)

WORD_POSTING_COPY_LEN

WORD_POSTING_COPY_LEN

Number of bytes of temporary string allowed during flattening

LINK_FLAG

LINK_FLAG

Used to keep track of whether a record in document infos is for a document or for a link

SHARD_BLOCK_POWER

SHARD_BLOCK_POWER

The shard block size is 1 << this power (that is, 2 raised to this power)

SHARD_BLOCK_SIZE

SHARD_BLOCK_SIZE

Size in bytes of one block in IndexShard
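
The relationship between SHARD_BLOCK_POWER and SHARD_BLOCK_SIZE, and the way a byte offset maps onto a block boundary, can be sketched as below. The power of 12 is a made-up value, since this page does not list the real one, and the constant and function names are hypothetical.

```php
<?php
// Hypothetical value: the real SHARD_BLOCK_POWER is not given on this page.
const EXAMPLE_BLOCK_POWER = 12;
// Block size is 1 shifted left by the power, here 4096 bytes.
const EXAMPLE_BLOCK_SIZE = 1 << EXAMPLE_BLOCK_POWER;

/**
 * Returns [start of the block containing $offset, position within the block].
 */
function blockStartAndRemainder(int $offset): array
{
    // Shifting right then left clears the low EXAMPLE_BLOCK_POWER bits.
    $block_start = ($offset >> EXAMPLE_BLOCK_POWER) << EXAMPLE_BLOCK_POWER;
    return [$block_start, $offset - $block_start];
}

[$start, $within] = blockStartAndRemainder(10000);
echo "$start $within\n"; // 8192 1808
```

Using a power of two lets the block start be computed with shifts rather than division.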

HEADER_LENGTH

HEADER_LENGTH

Header Length of an IndexShard (sum of its non-variable length fields)

WORD_DATA_LEN

WORD_DATA_LEN

Length of the data portion of a word entry in bytes in the shard

WORD_KEY_LEN

WORD_KEY_LEN

Length of a word entry's key in bytes

DOC_KEY_LEN

DOC_KEY_LEN

Length of a key in a DOC ID.

DOC_ID_LEN

DOC_ID_LEN

Length of DOC ID.

POSTING_LEN

POSTING_LEN

Length of one posting (a doc offset, occurrence pair) in a posting list

BLANK

BLANK

Represents an empty prefix item

HALF_BLANK

HALF_BLANK

Flag used to indicate that a word item should not be packed or unpacked

STORE_FLAG

STORE_FLAG

Represents an empty prefix item

Properties

$filename

$filename : string

Name of the file in which to store the PersistentStructure

Type

string

$unsaved_operations

$unsaved_operations : integer

Number of operations since the last save

Type

integer

$save_frequency

$save_frequency : integer

Number of operations between saves. If this equals -1, checkSave never saves.

Type

integer

$doc_infos

$doc_infos : string

Stores document ids and link-to-document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.

In the case of a document, the first doc key string is a hash of the url and the second is a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".

Type

string
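
The record layout described for $doc_infos can be illustrated with a small encode/decode sketch. It assumes big-endian integers (the byte order is not stated on this page), and the helper names and sample keys are hypothetical rather than part of IndexShard.

```php
<?php
const EXAMPLE_DOC_KEY_LEN = 8; // the 8 byte doc key strings described above

/** Builds one record: 4 byte offset, 3 byte length, 1 byte key count, keys. */
function encodeDocInfoRecord(int $summary_offset, int $doc_len, array $keys): string
{
    // Assumption: big-endian storage; the 3 byte length and 1 byte key count
    // are packed together as one 32 bit value.
    $len_and_count = (($doc_len & 0xFFFFFF) << 8) | count($keys);
    return pack("N", $summary_offset) . pack("N", $len_and_count) .
        implode("", $keys);
}

/** Decodes the record that starts at byte $pos of $doc_infos. */
function decodeDocInfoRecord(string $doc_infos, int $pos): array
{
    $fields = unpack("Noffset/Nlen_and_count", substr($doc_infos, $pos, 8));
    $num_keys = $fields["len_and_count"] & 0xFF;
    $key_bytes = substr($doc_infos, $pos + 8, $num_keys * EXAMPLE_DOC_KEY_LEN);
    return [
        "summary_offset" => $fields["offset"],
        "doc_len" => $fields["len_and_count"] >> 8,
        "keys" => str_split($key_bytes, EXAMPLE_DOC_KEY_LEN),
    ];
}

// A document has two keys, so its record is 4 + 3 + 1 + 2 * 8 = 24 bytes.
$record = encodeDocInfoRecord(1024, 300, ["URLHASH1", "DOCHASH2"]);
$decoded = decodeDocInfoRecord($record, 0);
echo strlen($record), " ", $decoded["doc_len"], "\n"; // 24 300
```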

$docids_len

$docids_len : integer

Length of $doc_infos as a string

Type

integer

$word_docs

$word_docs : string

This string is non-empty when shard is loaded and in its packed state.

It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.

Type

string

$word_docs_len

$word_docs_len : integer

Length of $word_docs as a string

Type

integer

$words

$words : array

Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.

Type

array

$words_len

$words_len : integer

Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode

Type

integer

$prefixes

$prefixes : array

An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.

Type

array
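
The idea behind the prefix index can be sketched with plain PHP arrays. This is only an illustration under the assumption that word ids are available in lexicographic order; the shard itself stores packed offsets, and the function name and sample ids are hypothetical.

```php
<?php
/**
 * Given word ids in lexicographic order, maps each two byte prefix to the
 * index of the first word id that starts with it.
 */
function buildPrefixIndex(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $index => $word_id) {
        $prefix = substr($word_id, 0, 2);
        // Only the first occurrence of each prefix is recorded.
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $index;
        }
    }
    return $prefixes;
}

$word_ids = ["aardvark", "aaron", "abacus", "zebra", "zen"];
print_r(buildPrefixIndex($word_ids));
```

A lookup for a word id can then jump to the first entry sharing its two byte prefix instead of scanning the whole dictionary.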

$prefixes_len

$prefixes_len : integer

Length of the prefix index into the dictionary of the shard

Type

integer

$generation

$generation : integer

This is supposed to hold the number of earlier shards, prior to the current shard.

Type

integer

$num_docs_per_generation

$num_docs_per_generation : integer

This is supposed to hold the number of documents that a given shard can hold.

Type

integer

$num_docs

$num_docs : integer

Number of documents (not links) stored in this shard

Type

integer

$num_docs_word

$num_docs_word : array

Keeps track of the number of documents a word is in

Type

array

$num_link_docs

$num_link_docs : integer

Number of links (not documents) stored in this shard

Type

integer

$len_all_docs

$len_all_docs : integer

Number of words stored in total in all documents in this shard

Type

integer

$len_all_link_docs

$len_all_link_docs : integer

Number of words stored in total in all links in this shard

Type

integer

$fh

$fh : resource

File handle for a shard if we are going to use it in read mode and not completely load it.

Type

resource

$blocks

$blocks : array

A cached array of disk blocks for an index shard that has not been completely loaded into memory.

Type

array

$read_only_from_disk

$read_only_from_disk : boolean

Flag used to determine if this shard is going to be largely kept on disk and be in read only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.

Type

boolean

$word_docs_packed

$word_docs_packed : boolean

Keeps track of the packed/unpacked state of the word_docs list

Type

boolean

$file_len

$file_len : integer

Keeps track of the length of the shard as a file

Type

integer

$last_flattened_words_count

$last_flattened_words_count :

Number of document inserts since the last time word data was flattened to the word_postings string.

Type

$word_postings

$word_postings : string

Used to hold word_id, posting_len, posting triples as a memory efficient string

Type

string

Methods

__construct()

__construct(string  $fname, integer  $generation, integer  $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, boolean  $read_only_from_disk = false)

Makes an index shard with the given file name and generation offset

Parameters

string	$fname	filename to store the index shard with
integer	$generation	when returning documents from the shard pretend there are this many earlier documents
integer	$num_docs_per_generation	the number of documents that a given shard can hold.
boolean	$read_only_from_disk	used to determine if this shard is going to be largely kept on disk and be in read only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.


load()

load(string  $fname, \seekquarry\yioop\library\string&  $data = null) : object

Load an IndexShard from a file or string

Parameters

string	$fname	the name of the file to load the IndexShard from
\seekquarry\yioop\library\string&	$data	stringified shard data to load shard from. If null then the data is loaded from the $fname if possible

Returns

object —

the IndexShard loaded

save()

save(boolean  $to_string = false, boolean  $with_logging = false) : string

Save the IndexShard to its filename

Parameters

boolean	$to_string	whether output should be written to a string rather than the default file location
boolean	$with_logging	whether log messages should be written as the shard save progresses

Returns

string —

serialized shard if output was to string else empty string

checkSave()

checkSave()

Add one to the unsaved_operations count. If this goes above the save_frequency then save the PersistentStructure to secondary storage
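
The counting behaviour described for checkSave() can be sketched with a stand-in class. ExampleSaver is hypothetical; resetting the counter after a save is an assumption, and the -1 check follows the note on $save_frequency.

```php
<?php
/** Hypothetical stand-in for a structure that saves every so many operations. */
class ExampleSaver
{
    public $unsaved_operations = 0;
    public $save_frequency;
    public $saves = 0;

    public function __construct(int $save_frequency)
    {
        $this->save_frequency = $save_frequency;
    }

    public function save(): void
    {
        $this->saves++; // a real implementation would write to disk here
    }

    public function checkSave(): void
    {
        $this->unsaved_operations++;
        // Assumption: -1 disables saving; the counter is reset after a save.
        if ($this->save_frequency != -1 &&
            $this->unsaved_operations > $this->save_frequency) {
            $this->save();
            $this->unsaved_operations = 0;
        }
    }
}

$saver = new ExampleSaver(3);
for ($i = 0; $i < 7; $i++) {
    $saver->checkSave();
}
echo $saver->saves, "\n"; // 1 (the fourth call exceeds the frequency of 3)
```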

addDocumentWords()

addDocumentWords(string  $doc_keys, integer  $summary_offset, array  $word_lists, array  $meta_ids = array(), array  $materialized_metas = array(), boolean  $is_doc = false, mixed  $rank = false) : boolean

Add a new document to the index shard with the given summary offset.

Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.

Parameters

string	$doc_keys	a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.
integer	$summary_offset	offset into the word archive in which the document's data is stored
array	$word_lists	(word => array of word positions in doc)
array	$meta_ids	meta words to be associated with the document; an example meta word would be filetype:pdf for a PDF document.
array	$materialized_metas
boolean	$is_doc	flag used to indicate if what is being stored is a document or a link to a document
mixed	$rank	either false if not used, or a 4 bit estimate of the rank of this document item

Returns

boolean —

success or failure of performing the add

getWordInfo()

getWordInfo(string  $word_id, boolean  $raw = false, integer  $shift, string  $mask = "") : array

Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.

Parameters

string	$word_id	id of the word one wants to look up
boolean	$raw	whether the id is our version of base64 encoded or not
integer	$shift	how many low order bits to drop from $word_id's when checking for a match
string	$mask	if $hash is for a word, after the 9th byte what meta word mask should be applied to the 20 byte hash

Returns

array —

first offset, last offset, count, exact matching id (recall that a match can ignore the low order shift bits)

getWordString()

getWordString(boolean  $is_disk, integer  $start, integer  $location, integer  $word_item_len)

Return a word record (word key + posting lookup data) from the shard posting list

Parameters

boolean	$is_disk	whether the shard is on disk or in memory
integer	$start	offset to start of the dictionary
integer	$location	index of record to extract from dictionary
integer	$word_item_len	length of a word + data record

getPostingsSlice()

getPostingsSlice(integer  $start_offset, \seekquarry\yioop\library\int&  $next_offset, integer  $last_offset, integer  $len) : array

Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.

Parameters

integer	$start_offset	start offset of the current posting list for the query term; used in calculating BM25F.
\seekquarry\yioop\library\int&	$next_offset	where to start in word docs
integer	$last_offset	offset at which to stop by
integer	$len	number of documents desired

Returns

array —

desired list of doc's and their info

numDocsOrLinks()

numDocsOrLinks(integer  $start_offset, integer  $last_offset, float  $avg_posting_len = 4) : integer

An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.

Parameters

integer	$start_offset	starting location in posting list
integer	$last_offset	ending location in posting list
float	$avg_posting_len	number of bytes in an average posting

Returns

integer —

number of docs or links

makeItem()

makeItem(string  $posting, integer  $num_doc_or_links, integer  $occurs) : array

Return (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word, and the number of occurrences of the word in the doc.

Returns the doc_id of the document

Parameters

string	$posting	a posting entry from some word's posting list
integer	$num_doc_or_links	number of documents or links doc appears in
integer	$occurs	number of occurrences of the current word in the document. If nonzero, this overrides the number of occurrences in various parts of a document that would be determined by its position list. Typically, would only override for meta words.

Returns

array —

($doc_id, posting_stats_array) for posting

weightedCount()

weightedCount(array  $position_list, boolean  $is_doc) : array

Used to sum over the occurrences in a position list, counting with a weight based on term location in the document

Parameters

array	$position_list	positions of term in item
boolean	$is_doc	whether the item is a document or a link

Returns

array —

associative array of document_part => weighted count of occurrences of the term in that part

computeProximity()

computeProximity(array  $position_list, boolean  $is_doc) : integer

Returns a proximity score for a single term based on its location in doc.

Parameters

array	$position_list	locations of term within item
boolean	$is_doc	whether the item is a document or not

Returns

integer —

a score for proximity

docStats()

docStats(\seekquarry\yioop\library\array&  $item, integer  $occurrences, integer  $doc_len, integer  $num_doc_or_links, float  $average_doc_len, integer  $num_docs, integer  $total_docs_or_links, float  $type_weight)

Computes BM25F relevance and a score for the supplied item based on the supplied parameters.

Parameters

\seekquarry\yioop\library\array&	$item	doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed
integer	$occurrences	number of occurrences of the term in the item
integer	$doc_len	number of words in doc item represents
integer	$num_doc_or_links	number of links or docs containing the term
float	$average_doc_len	average length of items in corpus
integer	$num_docs	either number of links or number of docs depending if item represents a link or a doc.
integer	$total_docs_or_links	number of docs or links in corpus
float	$type_weight	BM25F weight for this component (doc or link) of score
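
For orientation, the general shape of a BM25-style term score is sketched below. This is the textbook single-field BM25 formula with conventional constants (k1 = 1.2, b = 0.75), not necessarily what docStats() computes; BM25F additionally applies a per-component weight such as $type_weight. The function name and parameter defaults are assumptions.

```php
<?php
/**
 * Textbook BM25 score of one term in one item (illustrative only).
 */
function bm25TermScore(
    int $occurrences,
    int $doc_len,
    float $average_doc_len,
    int $num_docs_with_term,
    int $total_docs,
    float $k1 = 1.2,
    float $b = 0.75
): float {
    // Rarer terms get a larger inverse document frequency.
    $idf = log(1 + ($total_docs - $num_docs_with_term + 0.5) /
        ($num_docs_with_term + 0.5));
    // Items longer than average are penalised, shorter ones boosted.
    $length_norm = 1 - $b + $b * ($doc_len / $average_doc_len);
    return $idf * ($occurrences * ($k1 + 1)) /
        ($occurrences + $k1 * $length_norm);
}

// More occurrences raise the score; a longer item lowers it.
printf("%.3f %.3f %.3f\n",
    bm25TermScore(1, 100, 100.0, 10, 1000),
    bm25TermScore(3, 100, 100.0, 10, 1000),
    bm25TermScore(3, 400, 100.0, 10, 1000));
```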

getPostingAtOffset()

getPostingAtOffset(integer  $current, \seekquarry\yioop\library\int&  $posting_start, \seekquarry\yioop\library\int&  $posting_end) : string

Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so that they are the indexes of the start and end of the posting.

Parameters

integer	$current	an index into the word_docs string; corresponds to a start search location of $current * self::POSTING_LEN
\seekquarry\yioop\library\int&	$posting_start	after function call will be index of start of nearest posting to current
\seekquarry\yioop\library\int&	$posting_end	after function call will be index of end of nearest posting to current

Returns

string —

the substring of word_docs corresponding to the posting

getDocIndexOfPostingAtOffset()

getDocIndexOfPostingAtOffset(integer  $current) : integer

Returns the document index of the posting at offset $current in word_docs

Parameters

integer	$current	an offset into the posting lists (word_docs)

Returns

integer —

the doc index of the pointed to posting

nextPostingOffsetDocOffset()

nextPostingOffsetDocOffset(integer  $start_offset, integer  $end_offset, integer  $doc_offset) : array

Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until the target is passed, then binary search).

Parameters

integer	$start_offset	first posting to consider
integer	$end_offset	last posting to consider before giving up
integer	$doc_offset	document offset we want to be greater than or equal to

Returns

array —

(int offset to next posting, doc_offset for this post)

gallopPostingOffsetDocOffset()

gallopPostingOffsetDocOffset(\seekquarry\yioop\library\int&  $current, integer  $doc_index, integer  $end) : integer

Performs a galloping search (double the forward jump distance at each failed step) forward in a posting list from position $current until either $end is reached or a posting with document index bigger than or equal to $doc_index is found.

Parameters

\seekquarry\yioop\library\int&	$current	current posting offset into posting list
integer	$doc_index	document index we want to be bigger than or equal to
integer	$end	last index of posting list

Returns

integer —

document index bigger than or equal to $doc_index. Since $current points at the posting where this occurs (if found), failure is indicated by $current > $end.
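
The galloping technique named above can be sketched on a plain sorted array of document indexes. This is a simplified illustration: the shard works on offsets into the word_docs string rather than on a PHP array, and the function name and the -1 failure value are assumptions.

```php
<?php
/**
 * Returns the smallest position in [$start, $end] whose value is >= $target
 * in the sorted array $doc_indexes, or -1 if there is none.
 */
function gallopSearch(array $doc_indexes, int $start, int $end, int $target): int
{
    if ($start > $end) {
        return -1;
    }
    if ($doc_indexes[$start] >= $target) {
        return $start;
    }
    $low = $start; // invariant: $doc_indexes[$low] < $target
    $step = 1;
    $high = $low + $step;
    // Gallop: double the jump distance after every failed probe.
    while ($high <= $end && $doc_indexes[$high] < $target) {
        $low = $high;
        $step *= 2;
        $high = $low + $step;
    }
    if ($high > $end) {
        $high = $end;
        if ($doc_indexes[$high] < $target) {
            return -1;
        }
    }
    // Binary search in ($low, $high], where the answer must lie.
    while ($high - $low > 1) {
        $mid = intdiv($low + $high, 2);
        if ($doc_indexes[$mid] < $target) {
            $low = $mid;
        } else {
            $high = $mid;
        }
    }
    return $high;
}

$docs = [2, 5, 9, 14, 20, 31, 47];
echo gallopSearch($docs, 0, 6, 14), "\n"; // 3
```

Galloping is attractive when the target is usually close to the starting position, since it needs only a few probes before the binary search starts.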

docOffsetFromPostingOffset()

docOffsetFromPostingOffset(integer  $offset) : integer

Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.

Parameters

integer	$offset	byte/char offset into the word_docs string

Returns

integer —

a document byte/char offset into the doc_infos string

getPostingsSliceById()

getPostingsSliceById(string  $word_id, integer  $len) : array

Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)

Parameters

string	$word_id	key to look up documents for
integer	$len	number of documents

Returns

array —

desired list of doc's and their info

appendIndexShard()

appendIndexShard(object  $index_shard)

Adds the contents of the supplied $index_shard to the current index shard

Parameters

object	$index_shard	the shard to append to the current shard

mergeWordPostingsToString()

mergeWordPostingsToString(boolean  $replace = false)

Used to flatten the words associative array to a more memory efficient word_postings string.

$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.

$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ... wordids are lex ordered. This is more memory efficient as the former relies on the more wasteful php implementation of associative arrays.

mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier times mergeWordPostingsToString was called, in which case the behavior is controlled by $replace.

Parameters

boolean	$replace	whether to overwrite existing word_id postings (true) or to append (false)
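
The flattening from an associative array to a wordid/len/postings string can be sketched as follows. It assumes fixed-length word ids and a 4 byte big-endian length field, neither of which is specified on this page, and it ignores the merge with existing word_postings data controlled by $replace. The function names are hypothetical.

```php
<?php
/**
 * Flattens word_id => postings into word_id . length . postings records,
 * with word ids in lexicographic order.
 */
function flattenWords(array $words): string
{
    ksort($words, SORT_STRING);
    $flat = "";
    foreach ($words as $word_id => $postings) {
        // Assumption: the length is stored as a 4 byte big-endian integer.
        $flat .= $word_id . pack("N", strlen($postings)) . $postings;
    }
    return $flat;
}

/** Reads the flattened string back into an associative array. */
function unflattenWords(string $flat, int $word_id_len): array
{
    $words = [];
    $pos = 0;
    $total = strlen($flat);
    while ($pos < $total) {
        $word_id = substr($flat, $pos, $word_id_len);
        $len = unpack("N", substr($flat, $pos + $word_id_len, 4))[1];
        $words[$word_id] = substr($flat, $pos + $word_id_len + 4, $len);
        $pos += $word_id_len + 4 + $len;
    }
    return $words;
}

$words = ["zebra000" => "PPPP", "apple000" => "QQQQQQQQ"];
$flat = flattenWords($words);
echo strlen($flat), "\n"; // 36 = (8 + 4 + 4) + (8 + 4 + 8)
```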

changeDocumentOffsets()

changeDocumentOffsets(array  $docid_offsets)

Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, that documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).

Parameters

array	$docid_offsets	a set of doc_id associated with a new_doc_offset.

prepareWordsAndPrefixes()

prepareWordsAndPrefixes(boolean  $with_logging = false)

Computes the prefix string index for the current words array.

This index gives offsets of the first occurrences of the lead two chars of a word_id in the words array. This method assumes that the word data is already in $this->word_postings.

Parameters

boolean	$with_logging	whether log messages should be written as the computation progresses

packWords()

packWords(resource  $fh = null, boolean  $with_logging = false)

Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into a string held by word_postings. packWords separates words from postings.

After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets refer to integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.

Parameters

resource	$fh	a file handle to write the dictionary to, if desired
boolean	$with_logging	whether to write progress log messages every 30 seconds

outputPostingLists()

outputPostingLists(resource  $fh = null, boolean  $with_logging = false)

Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.

Parameters

resource	$fh	a filehandle to write to
boolean	$with_logging	whether to log progress

unpackWordDocs()

unpackWordDocs()

Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.

This method is memory expensive as it briefly has essentially two copies of what's in word_docs.

getWordDocsSubstring()

getWordDocsSubstring(  $offset,   $len) : string

From disk gets $len many bytes starting from $offset in the word_docs string

Parameters

	$offset	byte offset to begin getting data out of disk-based word_docs
	$len	number of bytes to get

Returns

string —

desired string

getWordDocsWord()

getWordDocsWord(integer  $offset)

Reads a 32 bit word as an unsigned int from the offset given in the word_docs string in the shard

Parameters

integer	$offset	a byte offset into the word_docs string

getDocInfoSubstring()

getDocInfoSubstring(  $offset,   $len) : string

From disk gets $len many bytes starting from $offset in the doc_infos string

Parameters

	$offset	byte offset to begin getting data out of disk-based doc_infos
	$len	number of bytes to get

Returns

string —

desired string

getShardSubstring()

getShardSubstring(integer  $offset, integer  $len, boolean  $cache = true) : string

Gets from disk $len many bytes of data beginning at $offset from the current IndexShard

Parameters

integer	$offset	byte offset to start reading from
integer	$len	number of bytes to read
boolean	$cache	whether to cache disk blocks read from disk

Returns

string —

data from that location in the shard

getShardWord()

getShardWord(integer  $offset) : integer

Reads 32 bit word as an unsigned int from the offset given in the shard

Parameters

integer	$offset	a byte offset into the shard

Returns

integer —

desired word or false
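
Reading a 32 bit unsigned word at a byte offset can be sketched with PHP's unpack(). The sketch works on an in-memory string rather than a shard file, assumes big-endian byte order, and uses a hypothetical function name.

```php
<?php
/**
 * Reads 4 bytes at $offset of $data as an unsigned 32 bit integer, or
 * returns false if fewer than 4 bytes are available there.
 */
function readWordAt(string $data, int $offset)
{
    $bytes = substr($data, $offset, 4);
    if (strlen($bytes) < 4) {
        return false;
    }
    // Assumption: big-endian ("N") byte order.
    return unpack("N", $bytes)[1];
}

$data = pack("N", 7) . pack("N", 123456);
var_dump(readWordAt($data, 4)); // int(123456)
var_dump(readWordAt($data, 6)); // bool(false)
```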

readBlockShardAtOffset()

readBlockShardAtOffset(integer  $bytes, boolean  $cache = true) : \seekquarry\yioop\library\&string

Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes

Parameters

integer	$bytes	byte offset to start reading from
boolean	$cache	whether to cache disk blocks that have been read to RAM

Returns

\seekquarry\yioop\library\&string —

data from IndexShard file
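
Block-wise reading with an optional cache, in the spirit of getShardSubstring() and readBlockShardAtOffset(), can be sketched as below. ExampleBlockReader, its 16 byte block size, and the use of an in-memory stream are all assumptions made to keep the demo small; this is not the class's actual code.

```php
<?php
const DEMO_BLOCK_SIZE = 16; // tiny hypothetical block size for the demo

class ExampleBlockReader
{
    private $fh;
    private $blocks = [];
    public $disk_reads = 0;

    public function __construct($fh)
    {
        $this->fh = $fh;
    }

    /** Reads one block starting at byte offset $bytes, using the cache. */
    public function readBlockAt(int $bytes, bool $cache = true): string
    {
        if (isset($this->blocks[$bytes])) {
            return $this->blocks[$bytes];
        }
        fseek($this->fh, $bytes);
        $block = (string) fread($this->fh, DEMO_BLOCK_SIZE);
        $this->disk_reads++;
        if ($cache) {
            $this->blocks[$bytes] = $block;
        }
        return $block;
    }

    /** Assembles $len bytes starting at $offset from consecutive blocks. */
    public function substring(int $offset, int $len): string
    {
        $out = "";
        $block_start = $offset - ($offset % DEMO_BLOCK_SIZE);
        $pos = $offset - $block_start;
        while (strlen($out) < $len) {
            $block = $this->readBlockAt($block_start);
            if ($pos >= strlen($block)) {
                break; // ran past the end of the data
            }
            $out .= substr($block, $pos, $len - strlen($out));
            $pos = 0;
            $block_start += DEMO_BLOCK_SIZE;
        }
        return $out;
    }
}

$fh = fopen("php://memory", "w+");
fwrite($fh, "0123456789abcdefghijklmnopqrstuvwxyz");
$reader = new ExampleBlockReader($fh);
echo $reader->substring(10, 12), "\n"; // abcdefghijkl (spans two blocks)
```

A repeated request for the same range is then served from the cached blocks without touching the stream again.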

getShardHeader()

getShardHeader() : boolean

If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc)

Returns

boolean —

whether the header could be read in or not

packDoclenNum()

packDoclenNum(integer  $doc_len, integer  $num_keys) : string

Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)

Parameters

integer	$doc_len	number of words in the document
integer	$num_keys	number of keys that are used to make up its doc_id

Returns

string —

packed int string representing these two values

unpackDoclenNum()

unpackDoclenNum(integer  $doc_info) : array

Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id

Parameters

integer	$doc_info	integer to unpack

Returns

array —

pair (number of words in the document, number of keys that are used to make up its doc_id)

getWordInfoFromString()

getWordInfoFromString(string  $str, boolean  $include_generation = false) : array

Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.

Parameters

string	$str
boolean	$include_generation

Returns

array —

of these three or four int's

headerToShardFields()

headerToShardFields(string  $header, object  $shard)

Splits a header string into a shard's field variables

Parameters

string	$header	a string with packed shard header data
object	$shard	IndexShard to put data into

makeWords()

makeWords(\seekquarry\yioop\library\string&  $value, integer  $key, object  $shard)

Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.

Parameters

\seekquarry\yioop\library\string&	$value	the word_key . word_info string
integer	$key	index in array (unused)
object	$shard	IndexShard to add the entry to word table for
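
The splitting done by such a callback can be sketched as follows. The parameter order of makeWords() matches what PHP's array_walk() passes to its callback, so the sketch drives it that way, but that usage is an assumption. ExampleShard, exampleMakeWords, and the key length of 20 are hypothetical stand-ins.

```php
<?php
const EXAMPLE_WORD_KEY_LEN = 20; // hypothetical; see WORD_KEY_LEN

/** Minimal stand-in for a shard with a words table. */
class ExampleShard
{
    public $words = [];
}

/**
 * Splits a word_key . word_info string and stores the info under the key.
 * $key is the array index supplied by array_walk and is not used.
 */
function exampleMakeWords(string &$value, int $key, ExampleShard $shard): void
{
    $word_key = substr($value, 0, EXAMPLE_WORD_KEY_LEN);
    $shard->words[$word_key] = substr($value, EXAMPLE_WORD_KEY_LEN);
}

$shard = new ExampleShard();
$records = [
    str_repeat("a", EXAMPLE_WORD_KEY_LEN) . "INFO1",
    str_repeat("b", EXAMPLE_WORD_KEY_LEN) . "INFO22",
];
array_walk($records, "exampleMakeWords", $shard);
echo count($shard->words), "\n"; // 2
```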