seek_quarry
[ class tree: seek_quarry ] [ index: seek_quarry ] [ all elements ]

Class: IndexShard

Source Location: /lib/index_shard.php

Class Overview

PersistentStructure
   |
   --IndexShard

Data structure used to store one generation worth of the word document index (inverted index).


Author(s):

  • Chris Pollett

Implements interfaces:

Variables

Constants

Methods


Inherited Constants

Inherited Variables

Inherited Methods

Class: PersistentStructure

PersistentStructure::__construct()
Sets up the file name and save frequency for the PersistentStructure, initializes the oepration count
PersistentStructure::checkSave()
Add one to the unsaved_operations count. If this goes above the
PersistentStructure::load()
Load a PersistentStructure from a file
PersistentStructure::save()
Save the PersistentStructure to its filename

Class Details

[line 81]
Data structure used to store one generation worth of the word document index (inverted index).

This data structure consists of three main components a word entries, word_doc entries, and document entries.

Word entries are described in the documentation for the words field.

Word-doc entries are described in the documentation for the word_docs field

Document entries are described in the documentation for the doc_infos field

IndexShards also have two access modes a $read_only_from_disk mode and a loaded in memory mode. Loaded in memory mode is mainly for writing new data to the shard. When in memory, data in the shard can also be in one of two states packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is an unpacked state it methods for adding data can be used.

Serialized on disk, a shard has a header with document statistics followed by the a prefix index into the words component, followed by the word component itself, then the word-docs component, and finally the document component.




Tags:

author:  Chris Pollett


[ Top ]


Class Variables

$blocks =

[line 205]

An cached array of disk blocks for an index shard that has not been completely loaded into memory.


Type:   array


[ Top ]

$docids_len =

[line 104]

Length of $doc_infos as a string


Type:   int


[ Top ]

$doc_infos =

[line 99]

Stores document id's and links to documents id's together with

summary offset information, and number of words in the doc/link The format for a record is 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves. In the case of a document the first doc key string has a hash of the url, the second a hash a tag stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".



Type:   string


[ Top ]

$fh =

[line 198]

File handle for a shard if we are going to use it in read mode and not completely load it.


Type:   resource


[ Top ]

$file_len =

[line 227]

Keeps track of the length of the shard as a file


Type:   int


[ Top ]

$generation =

[line 162]

This is supposed to hold the number of earlier shards, prior to the current shard.


Type:   int


[ Top ]

$last_flattened_words_count =

[line 233]

Number of document inserts since the last time word data was flattened to the word_postings string.


Type:   mixed


[ Top ]

$len_all_docs =

[line 185]

Number of words stored in total in all documents in this shard


Type:   int


[ Top ]

$len_all_link_docs =

[line 190]

Number of words stored in total in all links in this shard


Type:   int


[ Top ]

$num_docs =

[line 175]

Number of documents (not links) stored in this shard


Type:   int


[ Top ]

$num_docs_per_generation =

[line 169]

This is supposed to hold the number of documents that a given shard can hold.


Type:   int


[ Top ]

$num_link_docs =

[line 180]

Number of links (not documents) stored in this shard


Type:   int


[ Top ]

$prefixes =

[line 148]

An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.


Type:   array


[ Top ]

$prefixes_len =

[line 155]

Length of the prefix index into the dictionary of the shard


Type:   int


[ Top ]

$read_only_from_disk =

[line 213]

Flag used to determined if this shard is going to be largely kept on disk and to be in read only mode. Otherwise, shard will assume to be completely held in memory and be read/writable.


Type:   bool


[ Top ]

$words =

[line 132]

Stores the array of word entries for this shard

In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state each entry is a string of all the posting items for that word Periodically data in this words array is flattened to the word_postings string which is a more memory efficient was of storing data in PHP



Type:   array


[ Top ]

$words_len =

[line 140]

Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode


Type:   int


[ Top ]

$word_docs =

[line 114]

This string is non-empty when shard is loaded and in its packed state.

It consists of a sequence of posting records. Each posting consists of a offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.



Type:   string


[ Top ]

$word_docs_len =

[line 119]

Length of $word_docs as a string


Type:   int


[ Top ]

$word_docs_packed =

[line 220]

Keeps track of the packed/unpacked state of the word_docs list


Type:   bool


[ Top ]

$word_postings =

[line 240]

Used to hold word_id, posting_len, posting triples as a memory efficient

string



Type:   string


[ Top ]



Class Methods


static method docStats [line 767]

static void docStats( array &$item, int $occurrences, int $doc_len, $num_doc_or_links, float $average_doc_len, int $num_docs, int $total_docs_or_links, float $type_weight, int $num_doc_or_link)

Computes BM25F relevance and a score for the supplied item based on the supplied parameters.



Parameters:

array   &$item   doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed
int   $occurrences   - number of occurences of the term in the item
int   $doc_len   number of words in doc item represents
int   $num_doc_or_link   number of links or docs containing the term
float   $average_doc_len   average length of items in corpus
int   $num_docs   either number of links or number of docs depending if item represents a link or a doc.
int   $total_docs_or_links   number of docs or links in corpus
float   $type_weight   BM25F weight for this component (doc or link) of score
   $num_doc_or_links  

[ Top ]

static method getWordInfoFromString [line 1596]

static array getWordInfoFromString( string $str, [bool $include_generation = false])

Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.



Tags:

return:  of these three or four int's


Parameters:

string   $str  
bool   $include_generation  

[ Top ]

static method headerToShardFields [line 1663]

static void headerToShardFields( string $header, object shard $shard)

Split a header string into a shards field variable



Parameters:

string   $header   a string with packed shard header data
object shard   $shard   IndexShard to put data into

[ Top ]

static method load [line 1616]

static object the load( string $fname, [string &$data = NULL])

Load an IndexShard from a file or string



Tags:

return:  IndexShard loaded


Overrides PersistentStructure::load() (Load a PersistentStructure from a file)

Parameters:

string   $fname   the name of the file to the IndexShard from/to
string   &$data   stringified shard data to load shard from. If NULL then the data is loaded from the $fname if possible

[ Top ]

static method makeWords [line 1689]

static void makeWords( string &$value, int $key, object $shard)

Callback function for load method. splits a word_key . word_info string into an entry in the passed shard $shard->words[word_key] = $word_info.



Parameters:

int   $key   index in array - we don't use
object   $shard   IndexShard to add the entry to word table for
string   &$value   &value the word_key . word_info string

[ Top ]

static method numDocsOrLinks [line 582]

static int numDocsOrLinks( int $start_offset, int $last_offset)

An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.



Tags:

return:  number of docs or links


Parameters:

int   $start_offset   starting location in posting list
int   $last_offset   ending location in posting list

[ Top ]

static method packDoclenNum [line 1566]

static string packDoclenNum( int $doc_len, int $num_keys)

Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)



Tags:

return:  packed int string representing these two values


Parameters:

int   $doc_len   number of words in the document
int   $num_keys   number of keys that are used to make up its doc_id

[ Top ]

static method unpackDoclenNum [line 1580]

static array unpackDoclenNum( int $doc_info)

Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id



Tags:

return:  pair (number of words in the document, number of keys that are used to make up its doc_id)


Parameters:

int   $doc_info   integer to unpack

[ Top ]

method addDocumentWords [line 368]

bool addDocumentWords( string $doc_keys, int $summary_offset, array $word_lists, [array $meta_ids = array()], [bool $is_doc = false], [mixed $rank = false])

Add a new document to the index shard with the given summary offset.

Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.




Tags:

return:  success or failure of performing the add


Parameters:

string   $doc_keys   a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.
int   $summary_offset   its offset into the word archive the document's data is stored in
array   $word_lists   (word => array of word positions in doc)
array   $meta_ids   meta words to be associated with the document an example meta word would be filetype:pdf for a PDF document.
bool   $is_doc   flag used to indicate if what is being sored is a document or a link to a document
mixed   $rank   either false if not used, or a 4 bit estimate of the rank of this document item

[ Top ]

method appendIndexShard [line 946]

void appendIndexShard( object $index_shard)

Adds the contents of the supplied $index_shard to the current index shard



Parameters:

object   $index_shard   the shard to append to the current shard

[ Top ]

constructor __construct [line 322]

IndexShard __construct( string $fname, [int $generation = 0], [ $num_docs_per_generation = NUM_DOCS_PER_GENERATION], [bool $read_only_from_disk = false])

Makes an index shard with the given file name and generation offset



Overrides PersistentStructure::__construct() (Sets up the file name and save frequency for the PersistentStructure, initializes the oepration count)

Parameters:

string   $fname   filename to store the index shard with
int   $generation   when returning documents from the shard pretend there ar ethis many earlier documents
bool   $read_only_from_disk   used to determined if this shard is going to be largely kept on disk and to be in read only mode. Otherwise, shard will assume to be completely held in memory and be read/writable.
   $num_docs_per_generation  

[ Top ]

method changeDocumentOffsets [line 1115]

void changeDocumentOffsets( array $docid_offsets)

Changes the summary offsets associated with a set of doc_ids to new

values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server however where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when shard unpacked (we check and unpack to be on the safe side).




Parameters:

array   $docid_offsets   a set of doc_id associated with a new_doc_offset.

[ Top ]

method computeProximity [line 746]

int computeProximity( array $position_list, bool $is_doc)

Returns a proximity score for a single term based on its location in doc.



Tags:

return:  a score for proximity


Parameters:

array   $position_list   locations of term within item
bool   $is_doc   whether the item is a document or not

[ Top ]

method docOffsetFromPostingOffset [line 913]

int docOffsetFromPostingOffset( int $offset)

Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.



Tags:

return:  a document byte/char offset into the doc_infos string


Parameters:

int   $offset   byte/char offset into the word_docs string

[ Top ]

method getDocIndexOfPostingAtOffset [line 831]

int getDocIndexOfPostingAtOffset( int $current)

Returns the document index of the posting at offset $current in

word_docs




Tags:

return:  the doc index of the pointed to posting


Parameters:

int   $current   an offset into the posting lists (word_docs)

[ Top ]

method getDocInfoSubstring [line 1444]

desired getDocInfoSubstring( $offset $offset, $len $len)

From disk gets $len many bytes starting from $offset in the doc_infos strings



Tags:

return:  string


Parameters:

$offset   $offset   byte offset to begin getting data out of disk-based doc_infos
$len   $len   number of bytes to get

[ Top ]

method getPostingAtOffset [line 800]

string getPostingAtOffset( int $current, int &$posting_start, int &$posting_end)

Gets the posting closest to index $current in the word_docs string

modifies the passed-by-ref variables $posting_start and $posting_end so they are the index of the the start and end of the posting




Tags:

return:  the substring of word_docs corresponding to the posting


Parameters:

int   $current   an index into the word_docs strings corresponds to a start search loc of $current * self::POSTING_LEN
int   &$posting_start   after function call will be index of start of nearest posting to current
int   &$posting_end   after function call will be index of end of nearest posting to current

[ Top ]

method getPostingsSlice [line 536]

array getPostingsSlice( int $start_offset, int &$next_offset, int $last_offset, int $len)

Returns documents using the word_docs string (either as stored

on disk or completely read in) of records starting at the given offset and using its link-list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many doc's have been returned. Since $next_offset is passed by reference the value of $next_offset will point to the next record in the list (if it exists) after the function is called.




Tags:

return:  desired list of doc's and their info


Parameters:

int   $start_offset   of the current posting list for query term used in calculating BM25F.
int   &$next_offset   where to start in word docs
int   $last_offset   offset at which to stop by
int   $len   number of documents desired

[ Top ]

method getPostingsSliceById [line 927]

array getPostingsSliceById( string $word_id, int $len)

Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)



Tags:

return:  desired list of doc's and their info


Parameters:

string   $word_id   key to look up documents for
int   $len   number of documents desired back (from start of word linked list).

[ Top ]

method getShardHeader [line 1547]

void getShardHeader( )

If not already loaded, reads in from disk the fixed-length'd field

variables of this IndexShard ($this->words_len, etc)




[ Top ]

method getShardSubstring [line 1462]

string getShardSubstring( int $offset, int $len, [bool $cache = true])

Gets from Disk Data $len many bytes beginning at $offset from the current IndexShard



Tags:

return:  data fromthat location in the shard


Parameters:

int   $offset   byte offset to start reading from
int   $len   number of bytes to read
bool   $cache   whether to cache disk blocks read from disk

[ Top ]

method getShardWord [line 1494]

void getShardWord( int $offset)

Reads 32 bit word as an unsigned int from the offset given in the shard



Parameters:

int   $offset   a byte offset into the shard

[ Top ]

method getWordDocsSubstring [line 1413]

desired getWordDocsSubstring( $offset $offset, $len $len)

From disk gets $len many bytes starting from $offset in the word_docs strings



Tags:

return:  string


Parameters:

$offset   $offset   byte offset to begin getting data out of disk-based word_docs
$len   $len   number of bytes to get

[ Top ]

method getWordDocsWord [line 1427]

void getWordDocsWord( int $offset)

Reads 32 bit word as an unsigned int from the offset given in the

word_docs string in the sahrd




Parameters:

int   $offset   a byte offset into the word_docs string

[ Top ]

method getWordInfo [line 453]

array getWordInfo( string $word_id, [bool $raw = false])

Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.



Tags:

return:  first offset, last offset, count


Parameters:

string   $word_id   id of the word one wants to look up
bool   $raw   whether the id is our version of base64 encoded or not

[ Top ]

method makeItem [line 603]

array makeItem( string $posting, int $num_doc_or_links, [int $occurs = 0])

Return (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document give by the supplied posting, based on the the posting lists num docs with word, and the number of occurrences of the word in the doc.

Returns the doc_id of the document




Tags:

return:  ($doc_id, posting_stats_array) for posting


Parameters:

string   $posting   a posting entry from some words posting list
int   $num_doc_or_links   number of documents or links doc appears in
int   $occurs   number of occurrences of the current word in the document. If nonzero, this overrides the number of occurrences in various parts of a document that would be determined by its position list. Typically, would only override for meta words.

[ Top ]

method mergeWordPostingsToString [line 1009]

void mergeWordPostingsToString( [bool $replace = false])

Used to flatten the words associative array to a more memory efficient word_postings string.



Parameters:

bool   $replace   whether to overwrite existing word_id postings (true) or to append (false)

[ Top ]

method nextPostingOffsetDocOffset [line 860]

int nextPostingOffsetDocOffset( int $start_offset, int $end_offset, int $doc_offset)

Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset This is implemented using a galloping search (double offset till get larger than binary search).



Tags:

return:  offset to next posting


Parameters:

int   $start_offset   first posting to consider
int   $end_offset   last posting before give up
int   $doc_offset   document offset we want to be greater than or equal to

[ Top ]

method outputPostingLists [line 1329]

void outputPostingLists( [resource $fh = NULL])

Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.



Parameters:

resource   $fh   a filehandle to write to

[ Top ]

method packWords [line 1278]

void packWords( [resource $fh = NULL], bool $to_string)

Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges them these to a string help by word_postings. packWords separates words from postings.

After being applied words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets refer to integers offsets into a string $this->word_docs Finally, if a file handle is given its write the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.




Parameters:

resource   $fh   a file handle to write the dictionary to, if desired
bool   $to_string   whether to return a string containing the packed data

[ Top ]

method prepareWordsAndPrefixes [line 1218]

void prepareWordsAndPrefixes( )

Computes the prefix string index for the current words array.

This index gives offsets of the first occurrences of the lead two char's of a word_id in the words array. This method assumes that the word data is already in >word_postings




[ Top ]

method readBlockShardAtOffset [line 1514]

&string readBlockShardAtOffset( int $bytes, [bool $cache = true])

Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes



Tags:

return:  data fromIndexShard file


Parameters:

int   $bytes   byte offset to start reading from
bool   $cache   whether to cache disk blocks that have been read to RAM

[ Top ]

method save [line 1158]

string save( [bool $to_string = false], [bool $with_logging = false])

Save the IndexShard to its filename



Tags:

return:  serialized shard if output was to string else empty string


Overrides PersistentStructure::save() (Save the PersistentStructure to its filename)

Parameters:

bool   $to_string   whether output should be written to a string rather than the default file location
bool   $with_logging   whether log messages should be written as the shard save progresses

[ Top ]

method unpackWordDocs [line 1380]

void unpackWordDocs( )

Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.

This method is memory expensive as it briefly has essentially two copies of what's in word_docs.




[ Top ]

method weightedCount [line 719]

array weightedCount( array $position_list, bool $is_doc)

Used to sum over the occurences in a position list counting with weight based on term location in the document



Tags:

return:  asscoiative array of document_part => weight count of occurrences of term in


Parameters:

array   $position_list   positions of term in item
bool   $is_doc   whether the item is a document or a link

[ Top ]


Class Constants

BLANK =  "\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF"

[line 299]

Represents an empty prefix item


[ Top ]

DOC_KEY_LEN =  8

[line 289]

Length of a key in a DOC ID.


[ Top ]

FLATTEN_FREQUENCY =  10000

[line 247]

Fraction of NUM_DOCS_PER_GENERATION document inserts before data

from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)



[ Top ]

HALF_BLANK =  "\xFF\xFF\xFF\xFF"

[line 304]

Flag used to indicate that a word item should not be packed or unpacked


[ Top ]

HEADER_LENGTH =  40

[line 274]

Header Length of an IndexShard (sum of its non-variable length fields)


[ Top ]

LINK_FLAG =   0x800000

[line 258]

Used to keep track of whether a record in document infos is for a

document or for a link



[ Top ]

POSTING_LEN =  4

[line 294]

Length of one posting ( a doc offset occurrence pair) in a posting list


[ Top ]

SHARD_BLOCK_POWER =  12

[line 264]

Shard block size is 1<< this power


[ Top ]

SHARD_BLOCK_SIZE =  4096

[line 269]

Size in bytes of one block in IndexShard


[ Top ]

STORE_FLAG =  "\x80"

[line 309]

Represents an empty prefix item


[ Top ]

WORD_ITEM_LEN =  20

[line 279]

Length of a Word entry in bytes in the shard


[ Top ]

WORD_KEY_LEN =  8

[line 284]

Length of a word entry's key in bytes


[ Top ]

WORD_POSTING_COPY_LEN =  2000000

[line 252]

Bytes of tmp string allowed during flattenings


[ Top ]



Documentation generated by phpDocumentor 1.4.3