Class: IndexShard
Source Location: /lib/index_shard.php
PersistentStructure
|
--IndexShard
Data structure used to store one generation's worth of the word document index (inverted index).
Class Details
Class Variables
$blocks =
[line 205]
A cached array of disk blocks for an index shard that has not been completely loaded into memory.
$docids_len =
[line 104]
Length of $doc_infos as a string
$doc_infos =
[line 99]
Stores document ids and ids of links to documents, together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves. In the case of a document, the first doc key string is a hash of the url and the second is a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".
$fh =
[line 198]
File handle for a shard if we are going to use it in read mode and not completely load it.
$file_len =
[line 227]
Keeps track of the length of the shard as a file
$generation =
[line 162]
This is supposed to hold the number of earlier shards, prior to the current shard.
$last_flattened_words_count =
[line 233]
Number of document inserts since the last time word data was flattened to the word_postings string.
$len_all_docs =
[line 185]
Number of words stored in total in all documents in this shard
$len_all_link_docs =
[line 190]
Number of words stored in total in all links in this shard
$num_docs =
[line 175]
Number of documents (not links) stored in this shard
$num_docs_per_generation =
[line 169]
This is supposed to hold the number of documents that a given shard can hold.
$num_link_docs =
[line 180]
Number of links (not documents) stored in this shard
$prefixes =
[line 148]
An array of offsets into the words dictionary. Each entry gives the index of the first occurrence of a word_id that begins with a given two-byte prefix.
$prefixes_len =
[line 155]
Length of the prefix index into the dictionary of the shard
$read_only_from_disk =
[line 213]
Flag used to determine if this shard is going to be largely kept on disk and used in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be readable and writable.
$words =
[line 132]
Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and the length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory-efficient way of storing data in PHP.
$words_len =
[line 140]
Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode
$word_docs =
[line 114]
This string is non-empty when the shard is loaded and in its packed state. It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.
$word_docs_len =
[line 119]
Length of $word_docs as a string
$word_docs_packed =
[line 220]
Keeps track of the packed/unpacked state of the word_docs list
$word_postings =
[line 240]
Used to hold word_id, posting_len, posting triples as a memory efficient string
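The $doc_infos record layout described above (4 byte offset, 3 byte document length, 1 byte key count, then the 8 byte doc keys) can be illustrated with a minimal sketch. The function names are hypothetical and are not part of IndexShard, and big-endian byte order is an assumption; the class may store these fields differently.

```php
<?php
// Hypothetical helpers illustrating the $doc_infos record layout.
// Assumption: multi-byte integers are stored big-endian.
function packDocInfoRecordSketch(int $summary_offset, int $doc_len, array $doc_keys): string
{
    // 4 byte summary offset
    $record = pack("N", $summary_offset);
    // 3 byte document length: drop the high byte of a 32 bit big-endian int
    $record .= substr(pack("N", $doc_len), 1, 3);
    // 1 byte count of 8 byte doc key strings (2 for a doc, 3 for a link)
    $record .= chr(count($doc_keys));
    foreach ($doc_keys as $key) {
        // force each key to exactly 8 bytes
        $record .= str_pad(substr($key, 0, 8), 8, "\0");
    }
    return $record;
}

function unpackDocInfoRecordSketch(string $doc_infos, int $pos): array
{
    $summary_offset = unpack("N", substr($doc_infos, $pos, 4))[1];
    // restore the dropped high byte before decoding the 3 byte length
    $doc_len = unpack("N", "\0" . substr($doc_infos, $pos + 4, 3))[1];
    $num_keys = ord($doc_infos[$pos + 7]);
    $doc_keys = str_split(substr($doc_infos, $pos + 8, 8 * $num_keys), 8);
    return [
        "summary_offset" => $summary_offset,
        "doc_len" => $doc_len,
        "doc_keys" => $doc_keys,
        "record_len" => 8 + 8 * $num_keys, // bytes consumed by this record
    ];
}
```

Because the key count is stored in the record, a reader can step through $doc_infos one record at a time by advancing its position by record_len.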
Class Methods
static method docStats [line 767]
static void docStats(array &$item, int $occurrences, int $doc_len, $num_doc_or_links, float $average_doc_len, int $num_docs, int $total_docs_or_links, float $type_weight, int $num_doc_or_link)
Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
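As background for the relevance computation, the following is a minimal sketch of the textbook Okapi BM25 term score, which BM25F extends with per-field (here presumably per-type) weights. It is not the body of docStats: the function name, its parameter list, the non-negative idf variant, and the defaults k1 = 1.2 and b = 0.75 are all assumptions chosen for illustration.

```php
<?php
// Textbook BM25 score for one term in one document (illustrative only).
// Assumptions: $k1 and $b use conventional default values, and the idf
// uses the "+1 inside the log" variant so that it is never negative.
function bm25ScoreSketch(
    int $occurrences,        // times the term occurs in the document
    int $doc_len,            // number of words in the document
    float $average_doc_len,  // average number of words per document
    int $num_docs_with_term, // documents containing the term
    int $total_docs,         // documents in the collection
    float $k1 = 1.2,
    float $b = 0.75
): float {
    // rarer terms get a larger inverse document frequency
    $idf = log(1 + ($total_docs - $num_docs_with_term + 0.5)
        / ($num_docs_with_term + 0.5));
    // length normalization: long documents are penalized
    $norm = $k1 * (1 - $b + $b * $doc_len / $average_doc_len);
    return $idf * ($occurrences * ($k1 + 1)) / ($occurrences + $norm);
}
```

In this form the score grows with the number of occurrences but saturates, and for a fixed number of occurrences a shorter document scores higher than a longer one.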
static method getWordInfoFromString [line 1596]
static array getWordInfoFromString(string $str, [bool $include_generation = false])
Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
static method headerToShardFields [line 1663]
static void headerToShardFields(string $header, object shard $shard)
Splits a header string into a shard's field variables.
static method load [line 1616]
static object load(string $fname, [string &$data = NULL])
Load an IndexShard from a file or string
Overrides PersistentStructure::load() (Load a PersistentStructure from a file)
static method makeWords [line 1689]
static void makeWords(string &$value, int $key, object $shard)
Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.
static method numDocsOrLinks [line 582]
static int numDocsOrLinks(int $start_offset, int $last_offset)
An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
static method packDoclenNum [line 1566]
static string packDoclenNum(int $doc_len, int $num_keys)
Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
static method unpackDoclenNum [line 1580]
static array unpackDoclenNum(int $doc_info)
Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
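The pairing of packDoclenNum and unpackDoclenNum can be pictured with the sketch below. The function names are hypothetical stand-ins, and the bit layout (document length in the upper 24 bits, key count in the low 8 bits, big-endian on disk) is an assumption chosen to match the "3 bytes for the document length, followed by 1 byte" description of $doc_infos; the actual methods may differ.

```php
<?php
// Hypothetical stand-ins for packDoclenNum / unpackDoclenNum.
// Assumed layout: upper 24 bits = document length, low 8 bits = key count.
function packDoclenNumSketch(int $doc_len, int $num_keys): string
{
    $combined = (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF);
    return pack("N", $combined); // 4 byte big-endian string
}

function unpackDoclenNumSketch(int $doc_info): array
{
    $doc_len = ($doc_info >> 8) & 0xFFFFFF;
    $num_keys = $doc_info & 0xFF;
    return [$doc_len, $num_keys];
}
```

Note the asymmetry taken from the signatures above: packing yields a 4 byte string, whereas unpacking expects the value already decoded to an integer, for example with unpack("N", $packed)[1].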
method addDocumentWords [line 368]
bool addDocumentWords(string $doc_keys, int $summary_offset, array $word_lists, [array $meta_ids = array()], [bool $is_doc = false], [mixed $rank = false])
Add a new document to the index shard with the given summary offset. Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.
method appendIndexShard [line 946]
void appendIndexShard(object $index_shard)
Adds the contents of the supplied $index_shard to the current index shard
constructor __construct [line 322]
IndexShard __construct(string $fname, [int $generation = 0], [ $num_docs_per_generation = NUM_DOCS_PER_GENERATION], [bool $read_only_from_disk = false])
Makes an index shard with the given file name and generation offset
Overrides PersistentStructure::__construct() (Sets up the file name and save frequency for the PersistentStructure, initializes the operation count)
method changeDocumentOffsets [line 1115]
void changeDocumentOffsets(array $docid_offsets)
Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, that documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).
method computeProximity [line 746]
int computeProximity(array $position_list, bool $is_doc)
Returns a proximity score for a single term based on its location in doc.
method docOffsetFromPostingOffset [line 913]
int docOffsetFromPostingOffset(int $offset)
Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
method getDocIndexOfPostingAtOffset [line 831]
int getDocIndexOfPostingAtOffset(int $current)
Returns the document index of the posting at offset $current in word_docs
method getDocInfoSubstring [line 1444]
getDocInfoSubstring($offset, $len)
From disk, gets $len many bytes starting from $offset in the doc_infos string.
method getPostingAtOffset [line 800]
string getPostingAtOffset(int $current, int &$posting_start, int &$posting_end)
Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-reference variables $posting_start and $posting_end so that they are the indexes of the start and end of the posting.
method getPostingsSlice [line 536]
array getPostingsSlice(int $start_offset, int &$next_offset, int $last_offset, int $len)
Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
method getPostingsSliceById [line 927]
array getPostingsSliceById(string $word_id, int $len)
Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
method getShardHeader [line 1547]
If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)
method getShardSubstring [line 1462]
string getShardSubstring(int $offset, int $len, [bool $cache = true])
Gets from disk $len many bytes of data beginning at $offset in the current IndexShard
method getShardWord [line 1494]
void getShardWord(int $offset)
Reads a 32 bit word as an unsigned int from the given offset in the shard
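Reading a 32 bit word at a byte offset can be sketched with PHP's unpack function. The helper name is hypothetical, it reads from an in-memory string rather than from the shard file, and big-endian byte order (format code "N") is an assumption.

```php
<?php
// Hypothetical helper: decode the 4 bytes at $offset as an unsigned 32 bit int.
// Assumptions: big-endian storage, and at least 4 bytes available at $offset.
function readUint32Sketch(string $data, int $offset): int
{
    $parts = unpack("N", substr($data, $offset, 4));
    return $parts[1]; // unpack() numbers unnamed elements starting at 1
}
```

On a 64-bit PHP build the result is always non-negative; on 32-bit builds values above 2^31 - 1 would not fit in a native integer.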
method getWordDocsSubstring [line 1413]
getWordDocsSubstring($offset, $len)
From disk, gets $len many bytes starting from $offset in the word_docs string.
method getWordDocsWord [line 1427]
void getWordDocsWord(int $offset)
Reads a 32 bit word as an unsigned int from the given offset in the word_docs string in the shard
method getWordInfo [line 453]
array getWordInfo(string $word_id, [bool $raw = false])
Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
method makeItem [line 603]
array makeItem(string $posting, int $num_doc_or_links, [int $occurs = 0])
Returns (docid, item), where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
method mergeWordPostingsToString [line 1009]
void mergeWordPostingsToString([bool $replace = false])
Used to flatten the words associative array to a more memory efficient word_postings string.
method nextPostingOffsetDocOffset [line 860]
int nextPostingOffsetDocOffset(int $start_offset, int $end_offset, int $doc_offset)
Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until a larger doc_offset is reached, then binary search).
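The galloping search idea can be sketched on a plain sorted array. This is an illustration of the general technique rather than the method's actual code: the function name is hypothetical, and it searches an array of doc offsets instead of variable-length posting records inside the word_docs string.

```php
<?php
// Galloping (exponential) search sketch on a sorted array.
// Returns the smallest index i in [$start, $end] with $doc_offsets[$i] >= $target,
// or -1 if every entry in the range is smaller than $target.
function gallopingSearchSketch(array $doc_offsets, int $start, int $end, int $target): int
{
    if ($start > $end) {
        return -1;
    }
    $low = $start;
    $high = $start;
    $step = 1;
    // Gallop: grow the step geometrically until we pass the target or hit $end.
    while ($high < $end && $doc_offsets[$high] < $target) {
        $low = $high + 1; // everything up to $high is known to be too small
        $high = min($end, $high + $step);
        $step *= 2;
    }
    if ($doc_offsets[$high] < $target) {
        return -1; // reached $end without finding a large enough entry
    }
    // Binary search for the first qualifying index inside [$low, $high].
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if ($doc_offsets[$mid] < $target) {
            $low = $mid + 1;
        } else {
            $high = $mid;
        }
    }
    return $low;
}
```

The galloping phase makes the cost depend on how far the answer lies from $start rather than on the length of the whole list, which suits repeated forward skips through a long posting list.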
method outputPostingLists [line 1329]
void outputPostingLists([resource $fh = NULL])
Used to convert the word_postings string into a word_docs string or, if a file handle is provided, to write out the word_docs sequence of postings to the provided file handle.
method packWords [line 1278]
void packWords([resource $fh = NULL], bool $to_string)
Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into a string held by word_postings. packWords separates words from postings. After it has been applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets refer to integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.
method prepareWordsAndPrefixes [line 1218]
void prepareWordsAndPrefixes()
Computes the prefix string index for the current words array. This index gives offsets of the first occurrences of the lead two chars of a word_id in the words array. This method assumes that the word data is already in word_postings.
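The idea of a two-character prefix index can be sketched as follows. The function name is hypothetical, and the sketch returns a PHP array keyed by prefix with array indexes as values, whereas the shard keeps its prefix index as a packed string of offsets; it also assumes the word ids are already sorted.

```php
<?php
// Hypothetical sketch: map each two-character prefix to the index of the
// first word_id that starts with it. Assumes $sorted_word_ids is sorted.
function buildPrefixIndexSketch(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $index => $word_id) {
        $prefix = substr($word_id, 0, 2);
        // only the first occurrence of each prefix is recorded
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $index;
        }
    }
    return $prefixes;
}
```

A lookup for a word_id can then begin its search at the recorded position for that word's prefix instead of scanning or bisecting the whole dictionary.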
method readBlockShardAtOffset [line 1514]
&string readBlockShardAtOffset(int $bytes, [bool $cache = true])
Reads SHARD_BLOCK_SIZE many bytes from the current IndexShard's file beginning at byte offset $bytes
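Block-wise reading with an in-memory cache, as suggested by $blocks, getShardSubstring, and readBlockShardAtOffset, might look roughly like the sketch below. The class, its members, and the 4096 byte block size are hypothetical; the real value of SHARD_BLOCK_SIZE and the shard's caching policy are not given here.

```php
<?php
// Hypothetical stand-in for SHARD_BLOCK_SIZE; the real value is not stated.
const SKETCH_BLOCK_SIZE = 4096;

class BlockReaderSketch
{
    private $fh;           // open file handle for the data
    private $blocks = [];  // cache: block start offset => block contents
    public $disk_reads = 0;

    public function __construct($fh)
    {
        $this->fh = $fh;
    }

    // Reads one block starting at the block-aligned offset $bytes.
    public function readBlockAtOffset(int $bytes, bool $cache = true): string
    {
        if (isset($this->blocks[$bytes])) {
            return $this->blocks[$bytes];
        }
        fseek($this->fh, $bytes, SEEK_SET);
        $block = fread($this->fh, SKETCH_BLOCK_SIZE);
        if ($block === false) {
            $block = "";
        }
        $this->disk_reads++;
        if ($cache) {
            $this->blocks[$bytes] = $block;
        }
        return $block;
    }

    // Gets $len bytes starting at $offset by stitching together whole blocks.
    public function getSubstring(int $offset, int $len, bool $cache = true): string
    {
        $block_offset = $offset - ($offset % SKETCH_BLOCK_SIZE);
        $data = "";
        while ($block_offset < $offset + $len) {
            $data .= $this->readBlockAtOffset($block_offset, $cache);
            $block_offset += SKETCH_BLOCK_SIZE;
        }
        return substr($data, $offset % SKETCH_BLOCK_SIZE, $len);
    }
}
```

Reading aligned blocks means a request that straddles a block boundary costs two reads the first time and none afterwards, at the price of holding the cached blocks in memory.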
method save [line 1158]
string save([bool $to_string = false], [bool $with_logging = false])
Save the IndexShard to its filename
Overrides PersistentStructure::save() (Save the PersistentStructure to its filename)
method unpackWordDocs [line 1380]
Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array. This method is memory expensive as it briefly has essentially two copies of what's in word_docs.
method weightedCount [line 719]
array weightedCount(array $position_list, bool $is_doc)
Used to sum over the occurrences in a position list, counting with a weight based on term location in the document