\seekquarry\yioop\library\IndexShard

Data structure used to store one generation worth of the word document index (inverted index).

This data structure consists of three main components: word entries, word-doc entries, and document entries.

Word entries are described in the documentation for the words field.

Word-doc entries are described in the documentation for the word_docs field.

Document entries are described in the documentation for the doc_infos field.

IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can also be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state, the methods for adding data can be used.

Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.

Summary

Methods

__construct()
load()
save()
checkSave()
addDocumentWords()
getWordInfo()
getWordString()
getPostingsSlice()
numDocsOrLinks()
makeItem()
weightedCount()
computeProximity()
docStats()
getPostingAtOffset()
getDocIndexOfPostingAtOffset()
nextPostingOffsetDocOffset()
gallopPostingOffsetDocOffset()
docOffsetFromPostingOffset()
getPostingsSliceById()
appendIndexShard()
mergeWordPostingsToString()
changeDocumentOffsets()
prepareWordsAndPrefixes()
packWords()
outputPostingLists()
unpackWordDocs()
getWordDocsSubstring()
getWordDocsWord()
getDocInfoSubstring()
getShardSubstring()
getShardWord()
readBlockShardAtOffset()
getShardHeader()
packDoclenNum()
unpackDoclenNum()
getWordInfoFromString()
headerToShardFields()
makeWords()

Properties

$filename
$unsaved_operations
$save_frequency
$doc_infos
$docids_len
$word_docs
$word_docs_len
$words
$words_len
$prefixes
$prefixes_len
$generation
$num_docs_per_generation
$num_docs
$num_docs_word
$num_link_docs
$len_all_docs
$len_all_link_docs
$fh
$blocks
$read_only_from_disk
$word_docs_packed
$file_len
$last_flattened_words_count
$word_postings

Constants

DEFAULT_SAVE_FREQUENCY
FLATTEN_FREQUENCY
WORD_POSTING_COPY_LEN
LINK_FLAG
SHARD_BLOCK_POWER
SHARD_BLOCK_SIZE
HEADER_LENGTH
WORD_DATA_LEN
WORD_KEY_LEN
DOC_KEY_LEN
DOC_ID_LEN
POSTING_LEN
BLANK
HALF_BLANK
STORE_FLAG

Constants

DEFAULT_SAVE_FREQUENCY

DEFAULT_SAVE_FREQUENCY

If not specified in the constructor, this will be the number of operations between saves

FLATTEN_FREQUENCY

FLATTEN_FREQUENCY

Fraction of NUM_DOCS_PER_GENERATION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)

WORD_POSTING_COPY_LEN

WORD_POSTING_COPY_LEN

Number of bytes of temporary string allowed during flattening

SHARD_BLOCK_POWER

SHARD_BLOCK_POWER

The shard block size is 1 << this power, that is, 2 raised to this power

SHARD_BLOCK_SIZE

SHARD_BLOCK_SIZE

Size in bytes of one block in IndexShard
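
As a rough illustration of how a power-of-two block size is commonly used, the Python sketch below maps a byte offset in a shard file to a block number and a position inside that block. The value 12 for the power and the helper name block_position are assumptions made only for this example; they are not taken from IndexShard.

```python
SHARD_BLOCK_POWER = 12  # hypothetical value, chosen only for this sketch
SHARD_BLOCK_SIZE = 1 << SHARD_BLOCK_POWER  # 4096 bytes under that assumption


def block_position(offset):
    """Split a byte offset into (block number, offset inside the block)."""
    block_number = offset >> SHARD_BLOCK_POWER
    offset_in_block = offset & (SHARD_BLOCK_SIZE - 1)
    return block_number, offset_in_block
```

Because the size is a power of two, the division and remainder reduce to a shift and a mask.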

HEADER_LENGTH

HEADER_LENGTH

Header Length of an IndexShard (sum of its non-variable length fields)

WORD_DATA_LEN

WORD_DATA_LEN

Length of the data portion of a word entry in bytes in the shard

WORD_KEY_LEN

WORD_KEY_LEN

Length of a word entry's key in bytes

DOC_KEY_LEN

DOC_KEY_LEN

Length of a key in a DOC ID.

DOC_ID_LEN

DOC_ID_LEN

Length of DOC ID.

POSTING_LEN

POSTING_LEN

Length of one posting (a doc offset, occurrence pair) in a posting list

BLANK

BLANK

Represents an empty prefix item

HALF_BLANK

HALF_BLANK

Flag used to indicate that a word item should not be packed or unpacked

STORE_FLAG

STORE_FLAG

Represents an empty prefix item

Properties

$filename

$filename : string

Name of the file in which to store the PersistentStructure

Type

string

$unsaved_operations

$unsaved_operations : integer

Number of operations since the last save

Type

integer

$save_frequency

$save_frequency : integer

Number of operations between saves. If this equals -1, checkSave never saves.

Type

integer

$doc_infos

$doc_infos : string

Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.

In the case of a document, the first doc key string has a hash of the url, the second a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".

Type

string
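
The record layout described for $doc_infos can be sketched in Python as follows. This is only an illustration of the stated field widths (4 byte offset, 3 byte document length, 1 byte key count, then 8 byte keys). The big-endian byte order and the function names pack_doc_info and unpack_doc_info are assumptions of this sketch, not details confirmed by the documentation.

```python
import struct

DOC_KEY_LEN = 8  # the text states that doc key strings are 8 bytes long


def pack_doc_info(summary_offset, doc_len, doc_keys):
    """Build one record: offset (4 bytes), doc length (3), key count (1), keys."""
    num_keys, remainder = divmod(len(doc_keys), DOC_KEY_LEN)
    if remainder or not 0 < num_keys < 256:
        raise ValueError("doc_keys must hold 1 to 255 keys of 8 bytes each")
    if not 0 <= doc_len < (1 << 24):
        raise ValueError("doc_len must fit in 3 bytes")
    # The 3-byte length and 1-byte key count share one 32-bit word.
    # Big-endian order is an assumption of this sketch.
    return struct.pack(">II", summary_offset, (doc_len << 8) | num_keys) + doc_keys


def unpack_doc_info(record):
    """Inverse of pack_doc_info: returns (offset, doc length, list of keys)."""
    summary_offset, len_and_count = struct.unpack(">II", record[:8])
    doc_len, num_keys = len_and_count >> 8, len_and_count & 0xFF
    keys = record[8:8 + num_keys * DOC_KEY_LEN]
    key_list = [keys[i:i + DOC_KEY_LEN] for i in range(0, len(keys), DOC_KEY_LEN)]
    return summary_offset, doc_len, key_list
```

Under these assumptions, a document with two keys occupies 24 bytes and a link with three keys occupies 32 bytes.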

$docids_len

$docids_len : integer

Length of $doc_infos as a string

Type

integer

$word_docs

$word_docs : string

This string is non-empty when shard is loaded and in its packed state.

It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.

Type

string

$word_docs_len

$word_docs_len : integer

Length of $word_docs as a string

Type

integer

$words

$words : array

Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.

Type

array

$words_len

$words_len : integer

Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode

Type

integer

$prefixes

$prefixes : array

An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.

Type

array
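
The idea behind the $prefixes index can be illustrated with a short Python sketch: given word ids in sorted order, record where each two-byte prefix first appears. The function name build_prefix_index and the use of a dictionary of list indexes are assumptions for illustration; the actual on-disk encoding of the prefix index is not described here.

```python
def build_prefix_index(sorted_word_ids):
    """Map each two-byte prefix to the index of the first word id that has it.

    sorted_word_ids is assumed to be a list of bytes objects in
    lexicographic order, standing in for the words dictionary.
    """
    prefixes = {}
    for index, word_id in enumerate(sorted_word_ids):
        # setdefault keeps only the first index seen for each prefix
        prefixes.setdefault(word_id[:2], index)
    return prefixes
```

A lookup for a word id can then begin its search at the index stored for that id's first two bytes rather than at the start of the dictionary.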

$prefixes_len

$prefixes_len : integer

Length of the prefix index into the dictionary of the shard

Type

integer

$generation

$generation : integer

This is supposed to hold the number of earlier shards, prior to the current shard.

Type

integer

$num_docs_per_generation

$num_docs_per_generation : integer

This is supposed to hold the number of documents that a given shard can hold.

Type

integer

$num_docs

$num_docs : integer

Number of documents (not links) stored in this shard

Type

integer

$num_docs_word

$num_docs_word : array

Keeps track of the number of documents a word is in

Type

array

$num_link_docs

$num_link_docs : integer

Number of links (not documents) stored in this shard

Type

integer

$len_all_docs

$len_all_docs : integer

Number of words stored in total in all documents in this shard

Type

integer

$len_all_link_docs

$len_all_link_docs : integer

Number of words stored in total in all links in this shard

Type

integer

$fh

$fh : resource

File handle for a shard if we are going to use it in read mode and not completely load it.

Type

resource

$blocks

$blocks : array

A cached array of disk blocks for an index shard that has not been completely loaded into memory.

Type

array

$read_only_from_disk

$read_only_from_disk : boolean

Flag used to determine whether this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be readable and writable.

Type

boolean

$word_docs_packed

$word_docs_packed : boolean

Keeps track of the packed/unpacked state of the word_docs list

Type

boolean

$file_len

$file_len : integer

Keeps track of the length of the shard as a file

Type

integer

$last_flattened_words_count

$last_flattened_words_count : 

Number of document inserts since the last time word data was flattened to the word_postings string.

Type

$word_postings

$word_postings : string

Used to hold word_id, posting_len, posting triples as a memory efficient string

Type

string

Methods

__construct()

__construct(string  $fname, integer  $generation, integer  $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, boolean  $read_only_from_disk = false) 

Makes an index shard with the given file name and generation offset

Parameters

string $fname

filename to store the index shard with

integer $generation

when returning documents from the shard pretend there are this many earlier documents

integer $num_docs_per_generation

the number of documents that a given shard can hold.

boolean $read_only_from_disk

used to determine whether this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be readable and writable.

load()

load(string  $fname, string  &$data = null) : object

Load an IndexShard from a file or string

Parameters

string $fname

the name of the file to load the IndexShard from

string &$data

stringified shard data to load shard from. If null then the data is loaded from the $fname if possible

Returns

object —

the IndexShard loaded

save()

save(boolean  $to_string = false, boolean  $with_logging = false) : string

Save the IndexShard to its filename

Parameters

boolean $to_string

whether output should be written to a string rather than the default file location

boolean $with_logging

whether log messages should be written as the shard save progresses

Returns

string —

serialized shard if output was to string else empty string

checkSave()

checkSave() 

Add one to the unsaved_operations count. If this goes above the save_frequency then save the PersistentStructure to secondary storage
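
The counting behavior described for checkSave can be sketched in Python as below. The class SaveCounter is a hypothetical stand-in, not part of Yioop. The strict "greater than" comparison follows the wording "goes above", and resetting the counter after a save follows from $unsaved_operations being the number of operations since the last save. The -1 rule comes from the description of $save_frequency.

```python
class SaveCounter:
    """Hypothetical stand-in for the save bookkeeping described above."""

    def __init__(self, save_frequency):
        self.save_frequency = save_frequency
        self.unsaved_operations = 0
        self.saves = 0  # counts calls to save(), for demonstration only

    def save(self):
        # A real shard would serialize itself to its file here.
        self.saves += 1

    def check_save(self):
        self.unsaved_operations += 1
        if self.save_frequency == -1:
            return  # -1 means check_save never triggers a save
        if self.unsaved_operations > self.save_frequency:
            self.save()
            self.unsaved_operations = 0
```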

addDocumentWords()

addDocumentWords(string  $doc_keys, integer  $summary_offset, array  $word_lists, array  $meta_ids = array(), array  $materialized_metas = array(), boolean  $is_doc = false, mixed  $rank = false) : boolean

Add a new document to the index shard with the given summary offset.

Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.

Parameters

string $doc_keys

a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.

integer $summary_offset

offset into the word archive in which the document's data is stored

array $word_lists

(word => array of word positions in doc)

array $meta_ids

meta words to be associated with the document. An example meta word would be filetype:pdf for a PDF document.

array $materialized_metas

boolean $is_doc

flag used to indicate if what is being stored is a document or a link to a document

mixed $rank

either false if not used, or a 4 bit estimate of the rank of this document item

Returns

boolean —

success or failure of performing the add

getWordInfo()

getWordInfo(string  $word_id, boolean  $raw = false, integer  $shift, string  $mask = "") : array

Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.

Parameters

string $word_id

id of the word one wants to look up

boolean $raw

whether the id is our version of base64 encoded or not

integer $shift

how many low order bits to drop from $word_id's when checking for a match

string $mask

if $hash is for a word, after the 9th byte what meta word mask should be applied to the 20 byte hash

Returns

array —

first offset, last offset, count, exact matching id (recall that a match can ignore the low order shift bits)

getWordString()

getWordString(boolean  $is_disk, integer  $start, integer  $location, integer  $word_item_len) 

Return a word record (word key + posting lookup data) from the shard's word dictionary

Parameters

boolean $is_disk

whether the shard is on disk or in memory

integer $start

offset to start of the dictionary

integer $location

index of record to extract from dictionary

integer $word_item_len

length of a word + data record

getPostingsSlice()

getPostingsSlice(integer  $start_offset, integer  &$next_offset, integer  $last_offset, integer  $len) : array

Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its link-list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many doc's have been returned. Since $next_offset is passed by reference the value of $next_offset will point to the next record in the list (if it exists) after the function is called.

Parameters

integer $start_offset

start offset of the current posting list for the query term; used in calculating BM25F

integer &$next_offset

where to start in word docs

integer $last_offset

offset at which to stop

integer $len

number of documents desired

Returns

array —

desired list of doc's and their info

numDocsOrLinks()

numDocsOrLinks(integer  $start_offset, integer  $last_offset, float  $avg_posting_len = 4) : integer

An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.

Parameters

integer $start_offset

starting location in posting list

integer $last_offset

ending location in posting list

float $avg_posting_len

number of bytes in an average posting

Returns

integer —

number of docs or links

makeItem()

makeItem(string  $posting, integer  $num_doc_or_links, integer  $occurs) : array

Return (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.

Returns the doc_id of the document

Parameters

string $posting

a posting entry from some word's posting list

integer $num_doc_or_links

number of documents or links doc appears in

integer $occurs

number of occurrences of the current word in the document. If nonzero, this overrides the number of occurrences in various parts of a document that would be determined by its position list. Typically, would only override for meta words.

Returns

array —

($doc_id, posting_stats_array) for posting

weightedCount()

weightedCount(array  $position_list, boolean  $is_doc) : array

Used to sum over the occurrences in a position list, counting with weight based on term location in the document

Parameters

array $position_list

positions of term in item

boolean $is_doc

whether the item is a document or a link

Returns

array —

associative array of document_part => weighted count of occurrences of the term in that part

computeProximity()

computeProximity(array  $position_list, boolean  $is_doc) : integer

Returns a proximity score for a single term based on its location in doc.

Parameters

array $position_list

locations of term within item

boolean $is_doc

whether the item is a document or not

Returns

integer —

a score for proximity

docStats()

docStats(array  &$item, integer  $occurrences, integer  $doc_len, integer  $num_doc_or_links, float  $average_doc_len, integer  $num_docs, integer  $total_docs_or_links, float  $type_weight) 

Computes BM25F relevance and a score for the supplied item based on the supplied parameters.

Parameters

array &$item

doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed

integer $occurrences

number of occurrences of the term in the item

integer $doc_len

number of words in doc item represents

integer $num_doc_or_links

number of links or docs containing the term

float $average_doc_len

average length of items in corpus

integer $num_docs

either number of links or number of docs depending if item represents a link or a doc.

integer $total_docs_or_links

number of docs or links in corpus

float $type_weight

BM25F weight for this component (doc or link) of score
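
For orientation, the Python sketch below computes a textbook BM25 term score for a single component and scales it by a type weight, which is the general shape of a BM25F contribution. It is not Yioop's formula: the constants k1 and b, the smoothed idf variant, and the function name bm25_component_score are all assumptions of this example, and the parameter list is simplified relative to docStats.

```python
import math


def bm25_component_score(occurrences, doc_len, num_items_with_term,
                         average_doc_len, total_items, type_weight=1.0,
                         k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one item, scaled by type_weight.

    k1 and b are conventional defaults, not values taken from Yioop.
    """
    # Smoothed inverse document frequency; adding 1 keeps it non-negative.
    idf = math.log(
        (total_items - num_items_with_term + 0.5)
        / (num_items_with_term + 0.5) + 1.0)
    # Length normalization: longer-than-average items are penalized.
    length_norm = 1.0 - b + b * doc_len / average_doc_len
    term_frequency = (occurrences * (k1 + 1.0)
                      / (occurrences + k1 * length_norm))
    return type_weight * idf * term_frequency
```

In this sketch the score grows with the number of occurrences, shrinks as more items contain the term, and is zero when the term does not occur.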

getPostingAtOffset()

getPostingAtOffset(integer  $current, integer  &$posting_start, integer  &$posting_end) : string

Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.

Parameters

integer $current

an index into the word_docs string; corresponds to a start search location of $current * self::POSTING_LEN

integer &$posting_start

after function call will be index of start of nearest posting to current

integer &$posting_end

after function call will be index of end of nearest posting to current

Returns

string —

the substring of word_docs corresponding to the posting

getDocIndexOfPostingAtOffset()

getDocIndexOfPostingAtOffset(integer  $current) : integer

Returns the document index of the posting at offset $current in word_docs

Parameters

integer $current

an offset into the posting lists (word_docs)

Returns

integer —

the doc index of the pointed to posting

nextPostingOffsetDocOffset()

nextPostingOffsetDocOffset(integer  $start_offset, integer  $end_offset, integer  $doc_offset) : array

Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the jump distance until a larger value is found, then binary search).

Parameters

integer $start_offset

first posting to consider

integer $end_offset

last posting to consider before giving up

integer $doc_offset

document offset we want to be greater than or equal to

Returns

array —

(int offset to next posting, doc_offset for this post)
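
The galloping search mentioned in the description can be sketched in Python over a plain sorted list of doc offsets. This is a generic version of the technique, not the shard's code: the real method works on packed postings in the word_docs string, and the name gallop_search and its return convention (an index, or None on failure) are assumptions of this example.

```python
from bisect import bisect_left


def gallop_search(doc_offsets, start, end, target):
    """Index of the first entry in doc_offsets[start..end] that is >= target.

    doc_offsets must be sorted ascending. Returns None if no such entry
    exists in the inclusive range.
    """
    if start > end:
        return None
    if doc_offsets[start] >= target:
        return start
    low = start  # invariant: doc_offsets[low] < target
    step = 1
    high = low + step
    # Gallop: double the jump each time the probed entry is still too small.
    while high <= end and doc_offsets[high] < target:
        low = high
        step *= 2
        high = low + step
    high = min(high, end)
    # Any answer now lies in (low, high]; finish with a binary search.
    index = bisect_left(doc_offsets, target, low + 1, high + 1)
    return index if index <= high else None
```

Galloping is useful when the target is usually close to the starting position, since the number of probes grows with the logarithm of the distance to the answer rather than with the length of the whole list.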

gallopPostingOffsetDocOffset()

gallopPostingOffsetDocOffset(integer  &$current, integer  $doc_index, integer  $end) : integer

Performs a galloping search (double forward jump distance each failure step) forward in a posting list from position $current forward until either $end is reached or a posting with document index bigger than $doc_index is found

Parameters

integer &$current

current posting offset into posting list

integer $doc_index

document index we want to be bigger than or equal to

integer $end

last index of posting list

Returns

integer —

document index bigger than or equal to $doc_index. If such a posting is found, $current points at it; the search failed if $current > $end.

docOffsetFromPostingOffset()

docOffsetFromPostingOffset(integer  $offset) : integer

Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.

Parameters

integer $offset

byte/char offset into the word_docs string

Returns

integer —

a document byte/char offset into the doc_infos string

getPostingsSliceById()

getPostingsSliceById(string  $word_id, integer  $len) : array

Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)

Parameters

string $word_id

key to look up documents for

integer $len

number of documents

Returns

array —

desired list of doc's and their info

appendIndexShard()

appendIndexShard(object  $index_shard) 

Adds the contents of the supplied $index_shard to the current index shard

Parameters

object $index_shard

the shard to append to the current shard

mergeWordPostingsToString()

mergeWordPostingsToString(boolean  $replace = false) 

Used to flatten the words associative array to a more memory efficient word_postings string.

$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.

$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ... where the wordids are in lexicographic order. This is more memory efficient, as the former relies on the more wasteful PHP implementation of associative arrays.

mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier times mergeWordPostingsToString was called, in which case the behavior is controlled by $replace.

Parameters

boolean $replace

whether to overwrite existing word_id postings (true) or to append (false)
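
The flattened wordid/len/postings layout can be illustrated with a small Python sketch. It shows only the flattening of a fresh dictionary and the reverse walk over the resulting string; merging with previously flattened data (the $replace behavior) is left out. The 8 byte word id length, the 4 byte big-endian length field, and the function names are assumptions of this example.

```python
import struct

KEY_LEN = 8  # hypothetical fixed length of a word id in this sketch


def flatten_words(words):
    """Turn {word_id: postings} into one bytes string, word ids in lex order."""
    parts = []
    for word_id in sorted(words):
        postings = words[word_id]
        # Layout per entry: word id, 4-byte length (assumed), then postings.
        parts.append(word_id + struct.pack(">I", len(postings)) + postings)
    return b"".join(parts)


def iter_word_postings(flat):
    """Walk the flattened string, yielding (word_id, postings) pairs."""
    pos = 0
    while pos < len(flat):
        word_id = flat[pos:pos + KEY_LEN]
        (length,) = struct.unpack_from(">I", flat, pos + KEY_LEN)
        start = pos + KEY_LEN + 4
        yield word_id, flat[start:start + length]
        pos = start + length
```

A single string avoids the per-entry overhead of an associative array, at the cost of needing a sequential walk or a separate index to find a given word id.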

changeDocumentOffsets()

changeDocumentOffsets(array  $docid_offsets) 

Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).

Parameters

array $docid_offsets

a set of doc_id associated with a new_doc_offset.

prepareWordsAndPrefixes()

prepareWordsAndPrefixes(boolean  $with_logging = false) 

Computes the prefix string index for the current words array.

This index gives offsets of the first occurrences of the lead two chars of a word_id in the words array. This method assumes that the word data is already in $this->word_postings.

Parameters

boolean $with_logging

whether log messages should be written as the computation progresses

packWords()

packWords(resource  $fh = null, boolean  $with_logging = false) 

Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into a string held by word_postings. packWords separates words from postings.

After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets refer to integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.

Parameters

resource $fh

a file handle to write the dictionary to, if desired

boolean $with_logging

whether to write progress log messages every 30 seconds

outputPostingLists()

outputPostingLists(resource  $fh = null, boolean  $with_logging = false) 

Converts the word_postings string into a word_docs string or, if a file handle is provided, writes the word_docs sequence of postings to that file handle.

Parameters

resource $fh

a filehandle to write to

boolean $with_logging

whether to log progress

unpackWordDocs()

unpackWordDocs() 

Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.

This method is memory expensive as it briefly has essentially two copies of what's in word_docs.
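
A minimal Python sketch of the splitting step follows. It assumes each word maps to a (start, end) pair of byte offsets with an exclusive end; `split_word_docs` and the sample data are hypothetical. Because every slice copies its bytes, the slices together duplicate the contents of word_docs, which is the memory cost noted above.

```python
def split_word_docs(word_docs, word_offsets):
    """Return a dict mapping each word to its own slice of word_docs.

    word_offsets maps word_id -> (start, end), with end exclusive (assumed).
    """
    return {
        word_id: word_docs[start:end]
        for word_id, (start, end) in word_offsets.items()
    }


# Hypothetical data: two posting lists stored back to back
word_docs_string = b"AAAABBBBCCCC"
offsets = {b"w1": (0, 8), b"w2": (8, 12)}
posting_lists = split_word_docs(word_docs_string, offsets)
```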

getWordDocsSubstring()

getWordDocsSubstring(  $offset,   $len) : string

Reads $len bytes from disk, starting at $offset in the word_docs string

Parameters

$offset

byte offset to begin getting data out of disk-based word_docs

$len

number of bytes to get

Returns

string —

desired string

getWordDocsWord()

getWordDocsWord(integer  $offset) 

Reads a 32-bit word as an unsigned int from the given offset in the word_docs string in the shard

Parameters

integer $offset

a byte offset into the word_docs string

getDocInfoSubstring()

getDocInfoSubstring(  $offset,   $len) : string

Reads $len bytes from disk, starting at $offset in the doc_infos string

Parameters

$offset

byte offset to begin getting data out of disk-based doc_infos

$len

number of bytes to get

Returns

string —

desired string

getShardSubstring()

getShardSubstring(integer  $offset, integer  $len, boolean  $cache = true) : string

Reads $len bytes from disk, beginning at $offset in the current IndexShard

Parameters

integer $offset

byte offset to start reading from

integer $len

number of bytes to read

boolean $cache

whether to cache disk blocks read from disk

Returns

string —

data from that location in the shard

getShardWord()

getShardWord(integer  $offset) : integer

Reads a 32-bit word as an unsigned int from the given offset in the shard

Parameters

integer $offset

a byte offset into the shard

Returns

integer —

desired word or false
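
Reading a 32-bit unsigned word at a byte offset can be sketched in Python as shown here. The helper name `read_uint32` is hypothetical, and big-endian byte order is an assumption rather than something this page states. Returning False on an out-of-range offset mirrors the "desired word or false" contract.

```python
import struct


def read_uint32(data, offset):
    """Return the unsigned 32-bit integer stored at `offset`, or False.

    Byte order is assumed to be big-endian for this illustration.
    """
    if offset < 0 or offset + 4 > len(data):
        return False  # not enough bytes left to form a full word
    return struct.unpack_from(">I", data, offset)[0]


# Two words packed back to back
sample = struct.pack(">II", 1, 4294967295)
second_word = read_uint32(sample, 4)
out_of_range = read_uint32(sample, 6)
```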

readBlockShardAtOffset()

readBlockShardAtOffset(integer  $bytes, boolean  $cache = true) : string

Reads SHARD_BLOCK_SIZE bytes from the current IndexShard's file, beginning at byte offset $bytes. The result is returned by reference.

Parameters

integer $bytes

byte offset to start reading from

boolean $cache

whether to cache disk blocks that have been read to RAM

Returns

string —

data from the IndexShard file
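
The interplay between block reads and substring reads can be sketched in Python as below. Everything here is an assumption made for illustration: the class `BlockReader`, the 16-byte `BLOCK_SIZE` standing in for SHARD_BLOCK_SIZE, the rounding of offsets down to a block boundary, and the dictionary used as a cache. It is not Yioop's implementation.

```python
import io

BLOCK_SIZE = 16  # hypothetical stand-in for SHARD_BLOCK_SIZE


class BlockReader:
    """Serve byte ranges from a file by reading fixed-size blocks."""

    def __init__(self, fh, cache=True):
        self.fh = fh
        self.cache = cache
        self.blocks = {}      # block start offset -> block bytes
        self.disk_reads = 0   # counts actual reads, for demonstration

    def read_block(self, offset):
        """Return the block containing byte `offset` (aligned down)."""
        start = offset - offset % BLOCK_SIZE
        if start in self.blocks:
            return self.blocks[start]
        self.fh.seek(start)
        block = self.fh.read(BLOCK_SIZE)
        self.disk_reads += 1
        if self.cache:
            self.blocks[start] = block
        return block

    def substring(self, offset, length):
        """Assemble `length` bytes starting at `offset` from blocks."""
        out = b""
        pos, end = offset, offset + length
        while pos < end:
            inner = pos % BLOCK_SIZE
            chunk = self.read_block(pos)[inner:inner + (end - pos)]
            if not chunk:
                break  # ran past the end of the file
            out += chunk
            pos += len(chunk)
        return out


file_bytes = bytes(range(64))
reader = BlockReader(io.BytesIO(file_bytes))
first = reader.substring(10, 20)   # spans two blocks
second = reader.substring(12, 4)   # served from the cached block
```

With caching on, the second request touches no new blocks, which is the benefit a $cache flag of this kind is meant to provide.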

getShardHeader()

getShardHeader() : boolean

If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)

Returns

boolean —

whether the header could be read in

packDoclenNum()

packDoclenNum(integer  $doc_len, integer  $num_keys) : string

Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)

Parameters

integer $doc_len

number of words in the document

integer $num_keys

number of keys that are used to make up its doc_id

Returns

string —

packed int string representing these two values

unpackDoclenNum()

unpackDoclenNum(integer  $doc_info) : array

Used to extract from a 32-bit unsigned int a pair which represents the length of a document together with the number of keys in its doc_id

Parameters

integer $doc_info

integer to unpack

Returns

array —

pair (number of words in the document, number of keys that are used to make up its doc_id)
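
One way to fit both values into a single 32-bit integer is to give the key count the low bits and the document length the remaining high bits. The Python sketch below assumes an 8-bit key-count field and big-endian packing. Neither detail is given on this page, and the function names are hypothetical.

```python
import struct

NUM_KEYS_BITS = 8  # assumed width of the key-count field


def pack_doclen_num(doc_len, num_keys):
    """Pack document length and key count into a 4-byte string."""
    value = (doc_len << NUM_KEYS_BITS) | num_keys
    return struct.pack(">I", value)


def unpack_doclen_num(doc_info):
    """Split a 32-bit integer back into (doc_len, num_keys)."""
    num_keys = doc_info & ((1 << NUM_KEYS_BITS) - 1)
    doc_len = doc_info >> NUM_KEYS_BITS
    return doc_len, num_keys


packed = pack_doclen_num(1200, 3)
as_int = struct.unpack(">I", packed)[0]
round_trip = unpack_doclen_num(as_int)
```

Under this layout the document length is limited to 24 bits and the key count to 8 bits.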

getWordInfoFromString()

getWordInfoFromString(string  $str, boolean  $include_generation = false) : array

Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.

Parameters

string $str
boolean $include_generation

Returns

array —

of these three or four ints

headerToShardFields()

headerToShardFields(string  $header, object  $shard) 

Splits a header string into a shard's field variables

Parameters

string $header

a string with packed shard header data

object $shard

IndexShard to put data into
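
Splitting a packed header into named fields can be sketched in Python as follows. The field names are borrowed from the class's length properties, but their order, their number, and the 32-bit big-endian encoding are assumptions for illustration only. `SimpleNamespace` stands in for an IndexShard object.

```python
import struct
from types import SimpleNamespace

# Assumed field order; the real header layout may differ
HEADER_FIELDS = (
    "prefixes_len", "words_len", "word_docs_len", "docids_len", "generation",
)


def header_to_shard_fields(header, shard):
    """Unpack consecutive 32-bit integers from header onto shard attributes."""
    values = struct.unpack(">%dI" % len(HEADER_FIELDS), header)
    for name, value in zip(HEADER_FIELDS, values):
        setattr(shard, name, value)


shard = SimpleNamespace()
header = struct.pack(">5I", 10, 20, 30, 40, 2)
header_to_shard_fields(header, shard)
```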

makeWords()

makeWords(string&  $value, integer  $key, object  $shard) 

Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.

Parameters

string& $value

the word_key . word_info string

integer $key

index in the array (unused)

object $shard

IndexShard whose word table the entry is added to
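
The callback's splitting step can be sketched in Python as below. The 8-byte key length is a made-up value for illustration, as the real key length is not given here, and a plain dict stands in for the shard's words table.

```python
WORD_KEY_LEN = 8  # hypothetical key length for this sketch


def make_words(value, words):
    """Split a word_key + word_info string and store it as words[key] = info."""
    word_key = value[:WORD_KEY_LEN]
    word_info = value[WORD_KEY_LEN:]
    words[word_key] = word_info


# Records as they might appear after cutting a packed words string into pieces
records = [b"keyAAAAA" + b"info-1", b"keyBBBBB" + b"info-2"]
words_table = {}
for record in records:
    make_words(record, words_table)
```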
