DEFAULT_SAVE_FREQUENCY
If not specified in the constructor, this will be the number of operations between saves.
Data structure used to store one generation worth of the word document index (inverted index).
This data structure consists of three main components: word entries, word_doc entries, and document entries.
Word entries are described in the documentation for the words field.
Word-doc entries are described in the documentation for the word_docs field.
Document entries are described in the documentation for the doc_infos field.
IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can also be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state, its methods for adding data can be used.
Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.
$doc_infos : string
Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.
In the case of a document, the first doc key string has a hash of the url, the second a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".
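The record layout described above can be illustrated with a minimal Python sketch. This is not Yioop's PHP code: the helper names `pack_doc_info` and `unpack_doc_info` are hypothetical, and big-endian byte order is an assumption made for the example.

```python
import struct

DOC_KEY_LEN = 8  # each doc key string is 8 bytes, per the description above


def pack_doc_info(summary_offset, doc_len, doc_keys):
    """Build one record: 4-byte offset, 3-byte length, 1-byte key count, keys."""
    if len(doc_keys) % DOC_KEY_LEN != 0:
        raise ValueError("doc_keys must be a whole number of 8-byte keys")
    num_keys = len(doc_keys) // DOC_KEY_LEN
    # Big-endian byte order is an assumption of this sketch.
    return (struct.pack(">I", summary_offset)
            + doc_len.to_bytes(3, "big")
            + bytes([num_keys])
            + doc_keys)


def unpack_doc_info(record):
    """Split a record back into (summary_offset, doc_len, list of keys)."""
    (summary_offset,) = struct.unpack(">I", record[:4])
    doc_len = int.from_bytes(record[4:7], "big")
    num_keys = record[7]
    keys = [record[8 + i * DOC_KEY_LEN: 8 + (i + 1) * DOC_KEY_LEN]
            for i in range(num_keys)]
    return summary_offset, doc_len, keys
```

Under these assumptions a document record (2 keys) occupies 24 bytes and a link record (3 keys) occupies 32 bytes.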
$word_docs : string
This string is non-empty when shard is loaded and in its packed state.
It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.
$words : array
Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.
__construct(string $fname, integer $generation, integer $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, boolean $read_only_from_disk = false)
Makes an index shard with the given file name and generation offset
string | $fname | filename to store the index shard with |
integer | $generation | when returning documents from the shard pretend there are this many earlier documents |
integer | $num_docs_per_generation | the number of documents that a given shard can hold. |
boolean | $read_only_from_disk | used to determine if this shard is going to be largely kept on disk and be in read only mode. Otherwise, the shard is assumed to be completely held in memory and be read/writable. |
load(string $fname, string& $data = null) : object
Load an IndexShard from a file or string
string | $fname | the name of the file to read the IndexShard from/write it to |
string& | $data | stringified shard data to load shard from. If null then the data is loaded from the $fname if possible |
the IndexShard loaded
save(boolean $to_string = false, boolean $with_logging = false) : string
Save the IndexShard to its filename
boolean | $to_string | whether output should be written to a string rather than the default file location |
boolean | $with_logging | whether log messages should be written as the shard save progresses |
serialized shard if output was to string else empty string
checkSave()
Add one to the unsaved_operations count. If this goes above the save_frequency then save the PersistentStructure to secondary storage.
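The counting behaviour can be sketched in Python as follows. The class name `SaveCounter`, the `save_fn` callback, the placeholder default of 1000, and the reset of the counter after a save are all assumptions for illustration; only the "increment, then save once the count goes above the frequency" rule comes from the description.

```python
class SaveCounter:
    """Hypothetical stand-in for a structure that saves itself periodically."""

    def __init__(self, save_frequency=1000, save_fn=None):
        # 1000 is a placeholder, not the real DEFAULT_SAVE_FREQUENCY value.
        self.save_frequency = save_frequency
        self.unsaved_operations = 0
        self.save_fn = save_fn if save_fn is not None else (lambda: None)

    def check_save(self):
        """Count one operation; save once the count exceeds the frequency."""
        self.unsaved_operations += 1
        if self.unsaved_operations > self.save_frequency:
            self.save_fn()
            # Resetting the counter after a save is an assumption.
            self.unsaved_operations = 0
```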
addDocumentWords(string $doc_keys, integer $summary_offset, array $word_lists, array $meta_ids = array(), array $materialized_metas = array(), boolean $is_doc = false, mixed $rank = false) : boolean
Add a new document to the index shard with the given summary offset.
Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.
string | $doc_keys | a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document. |
integer | $summary_offset | the document's offset into the word archive in which its data is stored |
array | $word_lists | (word => array of word positions in doc) |
array | $meta_ids | meta words to be associated with the document; an example meta word would be filetype:pdf for a PDF document. |
array | $materialized_metas | |
boolean | $is_doc | flag used to indicate if what is being stored is a document or a link to a document |
mixed | $rank | either false if not used, or a 4 bit estimate of the rank of this document item |
success or failure of performing the add
getWordInfo(string $word_id, boolean $raw = false, integer $shift, string $mask = "") : array
Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
string | $word_id | id of the word one wants to look up |
boolean | $raw | whether the id is our version of base64 encoded or not |
integer | $shift | how many low order bits to drop from $word_id's when checking for a match |
string | $mask | if $hash is for a word, after the 9th byte what meta word mask should be applied to the 20 byte hash |
first offset, last offset, count, exact matching id (recall that the match can ignore low order shift bits)
getWordString(boolean $is_disk, integer $start, integer $location, integer $word_item_len)
Returns a word record (word key + posting lookup data) from the shard posting list
boolean | $is_disk | whether the shard is on disk or in memory |
integer | $start | offset to start of the dictionary |
integer | $location | index of record to extract from dictionary |
integer | $word_item_len | length of a word + data record |
getPostingsSlice(integer $start_offset, int& $next_offset, integer $last_offset, integer $len) : array
Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
integer | $start_offset | starting offset of the current posting list for the query term; used in calculating BM25F. |
int& | $next_offset | where to start in word docs |
integer | $last_offset | offset at which to stop |
integer | $len | number of documents desired |
desired list of doc's and their info
numDocsOrLinks(integer $start_offset, integer $last_offset, float $avg_posting_len = 4) : integer
An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
integer | $start_offset | starting location in posting list |
integer | $last_offset | ending location in posting list |
float | $avg_posting_len | number of bytes in an average posting |
number of docs or links
makeItem(string $posting, integer $num_doc_or_links, integer $occurs) : array
Returns (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
Returns the doc_id of the document
string | $posting | a posting entry from some word's posting list |
integer | $num_doc_or_links | number of documents or links doc appears in |
integer | $occurs | number of occurrences of the current word in the document. If nonzero, this overrides the number of occurrences in various parts of a document that would be determined by its position list. Typically, would only override for meta words. |
($doc_id, posting_stats_array) for posting
weightedCount(array $position_list, boolean $is_doc) : array
Used to sum over the occurrences in a position list, counting with weight based on term location in the document
array | $position_list | positions of term in item |
boolean | $is_doc | whether the item is a document or a link |
associative array of document_part => weighted count of occurrences of term in that part
computeProximity(array $position_list, boolean $is_doc) : integer
Returns a proximity score for a single term based on its location in doc.
array | $position_list | locations of term within item |
boolean | $is_doc | whether the item is a document or not |
a score for proximity
docStats(array& $item, integer $occurrences, integer $doc_len, integer $num_doc_or_links, float $average_doc_len, integer $num_docs, integer $total_docs_or_links, float $type_weight)
Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
array& | $item | doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed |
integer | $occurrences | |
integer | $doc_len | number of words in doc item represents |
integer | $num_doc_or_links | number of links or docs containing the term |
float | $average_doc_len | average length of items in corpus |
integer | $num_docs | either number of links or number of docs depending if item represents a link or a doc. |
integer | $total_docs_or_links | number of docs or links in corpus |
float | $type_weight | BM25F weight for this component (doc or link) of score |
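The exact formula and constants used by docStats are not given here. As a rough illustration only, the sketch below computes a textbook BM25 term score for one component (doc or link) and scales it by the type weight. The function name `bm25_term_score`, the constants k1 = 1.2 and b = 0.75, and the smoothed idf form are assumptions; the parameter names mirror those of docStats.

```python
import math


def bm25_term_score(occurrences, doc_len, average_doc_len,
                    num_doc_or_links, total_docs_or_links,
                    type_weight=1.0, k1=1.2, b=0.75):
    """Textbook BM25 score of one term for one item (not Yioop's exact code)."""
    # Smoothed inverse document frequency; the "1 +" keeps it non-negative.
    idf = math.log(
        1.0 + (total_docs_or_links - num_doc_or_links + 0.5)
        / (num_doc_or_links + 0.5))
    # Term-frequency part, normalised by item length relative to the average.
    length_norm = 1.0 - b + b * (doc_len / average_doc_len)
    tf = occurrences * (k1 + 1.0) / (occurrences + k1 * length_norm)
    return type_weight * idf * tf
```

With this form, more occurrences, a shorter item, or a rarer term each raise the score, and the type weight scales the component linearly, which is the general behaviour one expects from a BM25F-style combination of doc and link scores.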
getPostingAtOffset(integer $current, int& $posting_start, int& $posting_end) : string
Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.
integer | $current | an index into the word_docs string; corresponds to a start search location of $current * self::POSTING_LEN |
int& | $posting_start | after function call will be index of start of nearest posting to current |
int& | $posting_end | after function call will be index of end of nearest posting to current |
the substring of word_docs corresponding to the posting
getDocIndexOfPostingAtOffset(integer $current) : integer
Returns the document index of the posting at offset $current in word_docs
integer | $current | an offset into the posting lists (word_docs) |
the doc index of the pointed to posting
nextPostingOffsetDocOffset(integer $start_offset, integer $end_offset, integer $doc_offset) : array
Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until a larger value is reached, then binary search).
integer | $start_offset | first posting to consider |
integer | $end_offset | last posting to consider before giving up |
integer | $doc_offset | document offset we want to be greater than or equal to |
(int offset to next posting, doc_offset for this post)
gallopPostingOffsetDocOffset(int& $current, integer $doc_index, integer $end) : integer
Performs a galloping search (doubling the forward jump distance at each failed step) forward in a posting list from position $current until either $end is reached or a posting with document index bigger than or equal to $doc_index is found
int& | $current | current posting offset into posting list |
integer | $doc_index | document index we want to be bigger than or equal to |
integer | $end | last index of posting list |
document index bigger than or equal to $doc_index. Since, if found, $current points at the posting this occurs for, lack of success can be detected by checking whether $current > $end.
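The galloping idea used by this method and by nextPostingOffsetDocOffset can be sketched in Python over a plain sorted list of document offsets. This is a simplified illustration: real postings are variable-length encoded records rather than list entries, and the function name `next_posting_at_least` is hypothetical.

```python
def next_posting_at_least(doc_offsets, start, end, target):
    """Smallest index i in [start, end] with doc_offsets[i] >= target, else None.

    doc_offsets must be sorted in non-decreasing order.
    """
    if start > end:
        return None
    if doc_offsets[start] >= target:
        return start
    # Gallop: double the jump while the probed entry is still too small.
    low = start   # invariant: doc_offsets[low] < target
    step = 1
    while low + step <= end and doc_offsets[low + step] < target:
        low += step
        step *= 2
    high = min(low + step, end)
    if doc_offsets[high] < target:
        return None  # ran off the end without reaching the target
    # Binary search in (low, high], where doc_offsets[high] >= target.
    while high - low > 1:
        mid = (low + high) // 2
        if doc_offsets[mid] < target:
            low = mid
        else:
            high = mid
    return high
```

Galloping is attractive when the wanted posting is usually close to the current position, because the number of probes grows with the logarithm of the distance jumped rather than with the length of the whole posting list.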
docOffsetFromPostingOffset(integer $offset) : integer
Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
integer | $offset | byte/char offset into the word_docs string |
a document byte/char offset into the doc_infos string
getPostingsSliceById(string $word_id, integer $len) : array
Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
string | $word_id | key to look up documents for |
integer | $len | number of documents |
desired list of doc's and their info
appendIndexShard(object $index_shard)
Adds the contents of the supplied $index_shard to the current index shard
object | $index_shard | the shard to append to the current shard |
mergeWordPostingsToString(boolean $replace = false)
Used to flatten the words associative array to a more memory efficient word_postings string.
$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.
$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ... where the wordids are lex ordered. This is more memory efficient, as the former relies on the more wasteful PHP implementation of associative arrays.
mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier times mergeWordPostingsToString was called, in which case the behavior is controlled by $replace.
boolean | $replace | whether to overwrite existing word_id postings (true) or to append (false) |
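The flattening step can be sketched in Python as below. The function name `merge_word_postings`, the fixed key length, and the 4-byte big-endian length field are assumptions made for the example; only the wordid/len/postings ordering, the lexicographic sort, and the replace-versus-append behaviour come from the description above.

```python
import struct


def merge_word_postings(words, word_postings=b"", key_len=20, replace=False):
    """Fold a {word_id: postings} dict into a flat, sorted byte string.

    Layout per entry (assumed): word_id (key_len bytes),
    4-byte big-endian length, then the postings bytes.
    """
    # Parse whatever was flattened on earlier calls.
    merged = {}
    pos = 0
    while pos < len(word_postings):
        word_id = word_postings[pos:pos + key_len]
        start = pos + key_len + 4
        (length,) = struct.unpack(">I", word_postings[pos + key_len:start])
        merged[word_id] = word_postings[start:start + length]
        pos = start + length
    # Overwrite or append the new postings, depending on `replace`.
    for word_id, postings in words.items():
        if replace or word_id not in merged:
            merged[word_id] = postings
        else:
            merged[word_id] += postings
    # Emit entries in lexicographic order of word id.
    return b"".join(word_id + struct.pack(">I", len(postings)) + postings
                    for word_id, postings in sorted(merged.items()))
```

A caller mimicking the method would follow this by emptying its `words` dictionary, since all of its content now lives in the returned string.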
changeDocumentOffsets(array $docid_offsets)
Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).
array | $docid_offsets | a set of doc_id associated with a new_doc_offset. |
prepareWordsAndPrefixes(boolean $with_logging = false)
Computes the prefix string index for the current words array.
This index gives offsets of the first occurrences of the lead two chars of a word_id in the words array. This method assumes that the word data is already in $this->word_postings.
boolean | $with_logging | whether log messages should be written as progresses |
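The idea of a two-character prefix index can be sketched in Python as follows. This illustration records list positions rather than byte offsets, and the function name `build_prefix_index` is hypothetical; it only shows how the first occurrence of each two-byte prefix is found in a sorted sequence of word ids.

```python
def build_prefix_index(sorted_word_ids):
    """Map each two-byte prefix to the position of its first word id."""
    index = {}
    for position, word_id in enumerate(sorted_word_ids):
        # setdefault keeps the first position seen for a prefix.
        index.setdefault(word_id[:2], position)
    return index
```

Because the word ids are sorted, all ids sharing a prefix are contiguous, so a lookup only has to search from that prefix's entry up to the next prefix's entry.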
packWords(resource $fh = null, boolean $with_logging = false)
Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into a string held by word_postings. packWords separates words from postings.
After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets are integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.
resource | $fh | a file handle to write the dictionary to, if desired |
boolean | $with_logging | whether to write progress log messages every 30 seconds |
outputPostingLists(resource $fh = null, boolean $with_logging = false)
Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.
resource | $fh | a filehandle to write to |
boolean | $with_logging | whether to log progress |
unpackWordDocs()
Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.
This method is memory expensive as it briefly has essentially two copies of what's in word_docs.
getWordDocsSubstring(integer $offset, integer $len) : string
From disk gets $len many bytes starting from $offset in the word_docs strings
integer | $offset | byte offset to begin getting data out of disk-based word_docs |
integer | $len | number of bytes to get |
desired string
getWordDocsWord(integer $offset)
Reads a 32 bit word as an unsigned int from the offset given in the word_docs string in the shard
integer | $offset | a byte offset into the word_docs string |
getDocInfoSubstring(integer $offset, integer $len) : string
From disk gets $len many bytes starting from $offset in the doc_infos strings
integer | $offset | byte offset to begin getting data out of disk-based doc_infos |
integer | $len | number of bytes to get |
desired string
getShardSubstring(integer $offset, integer $len, boolean $cache = true) : string
Gets from disk $len many bytes of data beginning at $offset from the current IndexShard
integer | $offset | byte offset to start reading from |
integer | $len | number of bytes to read |
boolean | $cache | whether to cache disk blocks read from disk |
data from that location in the shard
getShardWord(integer $offset) : integer
Reads a 32 bit word as an unsigned int from the offset given in the shard
integer | $offset | a byte offset into the shard |
desired word or false
readBlockShardAtOffset(integer $bytes, boolean $cache = true) : &string
Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes
integer | $bytes | byte offset to start reading from |
boolean | $cache | whether to cache disk blocks that have been read to RAM |
data from IndexShard file
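How getShardSubstring might be built on top of block reads like readBlockShardAtOffset can be sketched in Python. The class `BlockReader`, its method names, the `disk_reads` counter, and the block size of 4096 are assumptions for illustration; the source only says that SHARD_BLOCK_SIZE bytes are read at a time and that blocks may be cached in RAM.

```python
import io

SHARD_BLOCK_SIZE = 4096  # assumed value; the real constant is not given here


class BlockReader:
    """Reads byte ranges from a file handle via cached, block-aligned reads."""

    def __init__(self, fh, block_size=SHARD_BLOCK_SIZE):
        self.fh = fh
        self.block_size = block_size
        self.blocks = {}      # block start offset -> bytes
        self.disk_reads = 0   # counts real reads, for demonstration only

    def read_block(self, offset, cache=True):
        """Return the block starting at a block-aligned byte offset."""
        if offset in self.blocks:
            return self.blocks[offset]
        self.fh.seek(offset)
        data = self.fh.read(self.block_size)
        self.disk_reads += 1
        if cache:
            self.blocks[offset] = data
        return data

    def get_substring(self, offset, length, cache=True):
        """Return `length` bytes starting at `offset`, stitching blocks."""
        first = (offset // self.block_size) * self.block_size
        chunks = []
        block_offset = first
        while block_offset < offset + length:
            block = self.read_block(block_offset, cache)
            if not block:
                break  # reached end of file
            chunks.append(block)
            block_offset += self.block_size
        start = offset - first
        return b"".join(chunks)[start:start + length]
```

For example, with `reader = BlockReader(io.BytesIO(data), block_size=16)`, a call to `reader.get_substring(10, 12)` touches two blocks, and a later request that falls inside those blocks is served from the cache without another read.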
getShardHeader() : boolean
If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc)
whether it was able to read in the header or not
packDoclenNum(integer $doc_len, integer $num_keys) : string
Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
integer | $doc_len | number of words in the document |
integer | $num_keys | number of keys that are used to make up its doc_id |
packed int string representing these two values
unpackDoclenNum(integer $doc_info) : array
Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
integer | $doc_info | integer to unpack |
pair (number of words in the document, number of keys that are used to make up its doc_id)
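The packing and unpacking can be sketched with shifts and masks in Python. Placing the document length in the upper 24 bits and the key count in the low 8 bits matches the "3 bytes, then 1 byte" record layout described for doc_infos, but that bit placement and the big-endian byte order are assumptions of this sketch; the function names are Python-style stand-ins for the PHP methods.

```python
import struct


def pack_doclen_num(doc_len, num_keys):
    """Pack a 24-bit document length and an 8-bit key count into 4 bytes."""
    if not 0 <= doc_len < (1 << 24):
        raise ValueError("doc_len must fit in 24 bits")
    if not 0 <= num_keys < (1 << 8):
        raise ValueError("num_keys must fit in 8 bits")
    return struct.pack(">I", (doc_len << 8) | num_keys)


def unpack_doclen_num(doc_info):
    """Split a 32-bit unsigned int into (doc_len, num_keys)."""
    return doc_info >> 8, doc_info & 0xFF
```

Note that the unpacking function takes the integer form, so a 4-byte string read from the shard would first be converted with something like `struct.unpack(">I", packed)[0]`.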
getWordInfoFromString(string $str, boolean $include_generation = false) : array
Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
string | $str | |
boolean | $include_generation | |
array of these three or four ints
headerToShardFields(string $header, object $shard)
Splits a header string into a shard's field variables
string | $header | a string with packed shard header data |
object | $shard | IndexShard to put data into |
makeWords(string& $value, integer $key, object $shard)
Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.
string& | $value | the word_key . word_info string |
integer | $key | index in array - we don't use |
object | $shard | IndexShard to add the entry to word table for |