seek_quarry

Class: GroupIterator

Source Location: /lib/index_bundle_iterators/group_iterator.php

Class Overview

IndexBundleIterator
   |
   --GroupIterator

This iterator is used to group together documents or document parts which share the same url.


Author(s):

  • Chris Pollett


Inherited Methods

Class: IndexBundleIterator

IndexBundleIterator::advance()
Forwards the iterator one group of docs
IndexBundleIterator::advanceSeenDocs()
Updates the seen_docs count during an advance() call
IndexBundleIterator::computeRelevance()
Computes a relevancy score for a posting offset with respect to this iterator and generation
IndexBundleIterator::currentDocsWithWord()
Gets the current block of doc ids and scores associated with this iterator's word
IndexBundleIterator::currentGenDocOffsetWithWord()
Gets the doc_offset and generation for the next document that would be returned by this iterator
IndexBundleIterator::findDocsWithWord()
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
IndexBundleIterator::genDocOffsetCmp()
Compares two arrays each containing a (generation, offset) pair.
IndexBundleIterator::getCurrentDocsForKeys()
Gets the summaries associated with the provided keys
IndexBundleIterator::nextDocsWithWord()
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must be of docs after this doc_index
IndexBundleIterator::reset()
Returns the iterator to the first document block that it could iterate over
IndexBundleIterator::setResultsPerBlock()
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

Class Details

[line 55]
This iterator is used to group together documents or document parts which share the same url. For instance, a link document item and the document that it links to will both be stored in the IndexArchiveBundle by the QueueServer. This iterator would combine both these items into a single document result with a sum of their scores, and a summary, if returned, containing text from both sources. The iterator's purpose is vaguely analogous to an SQL GROUP BY clause.
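The GROUP BY analogy can be illustrated with a minimal sketch. It is not the class's actual implementation: the function name sketchGroupByUrl and the array keys 'url', 'score', and 'text' are assumptions made for this example only.

```php
<?php
/**
 * Illustrative sketch only: combines items that share a url into one
 * result whose score is the sum of the item scores and whose text is
 * drawn from every item. Field names are hypothetical.
 */
function sketchGroupByUrl(array $items)
{
    $groups = [];
    foreach ($items as $item) {
        $url = $item['url'];
        if (!isset($groups[$url])) {
            $groups[$url] = ['url' => $url, 'score' => 0, 'text' => []];
        }
        $groups[$url]['score'] += $item['score'];
        $groups[$url]['text'][] = $item['text'];
    }
    // Flatten the collected text fragments into one summary string
    foreach ($groups as $url => $group) {
        $groups[$url]['text'] = implode(' ', $group['text']);
    }
    return array_values($groups);
}

$results = sketchGroupByUrl([
    ['url' => 'http://a.example/', 'score' => 2, 'text' => 'page text'],
    ['url' => 'http://a.example/', 'score' => 3, 'text' => 'link text'],
    ['url' => 'http://b.example/', 'score' => 1, 'text' => 'other page'],
]);
```

Here the page and the link item for http://a.example/ collapse into a single result, much as rows sharing a key collapse under GROUP BY with SUM.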




Tags:

author:  Chris Pollett
see:  IndexArchiveBundle




Class Variables

$count_block =

[line 74]

The number of documents in the current block after filtering by restricted words



Type:   int



$count_block_unfiltered =

[line 68]

The number of documents in the current block before filtering by restricted words



Type:   int



$current_block_hashes =

[line 81]

Hashes of document web pages seen in results returned from the most recent call to findDocsWithWord



Type:   array



$current_machine =

[line 118]

Id of queue_server this group_iterator lives on


Type:   int



$domain_factors =

[line 107]

Used to keep track of and to weight pages based on the number of other pages from the same domain



Type:   array



$grouped_hashes =

[line 100]

Hashes of document web pages used to keep track of groups seen so far



Type:   array



$grouped_keys =

[line 93]

Hashed url keys used to keep track of groups seen so far


Type:   array



$index_bundle_iterator =

[line 61]

The iterator we are using to get documents from


Type:   object



$network_flag =

[line 113]

Whether the iterator is being used for a network query


Type:   bool



$seen_docs_unfiltered =

[line 87]

The number of iterated docs before the restriction test


Type:   int





Class Methods


constructor __construct [line 144]

GroupIterator __construct( object $index_bundle_iterator, [int $num_iterators = 1], [int $current_machine = 0], [bool $network_flag = false])

Creates a group iterator with the given parameters.



Parameters:

object   $index_bundle_iterator   iterator to use as a source of documents to iterate over
int   $num_iterators   number of word iterators appearing in sub-iterators -- if larger, then the default grouping number is reduced
int   $current_machine   if this iterator is being used in a multi-queue_server setting, then this is the id of the current queue_server
bool   $network_flag   whether the iterator is being used for a network query


method advance [line 558]

void advance( [array $gen_doc_offset = NULL])

Forwards the iterator one group of docs



Overrides IndexBundleIterator::advance() (Forwards the iterator one group of docs)

Parameters:

array   $gen_doc_offset   a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the next block must all have doc_offsets larger than or equal to this value
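The constraint on $gen_doc_offset amounts to a lexicographic comparison of (generation, doc_offset) pairs. The sketch below illustrates that ordering; the function names are hypothetical, and the assumption that index 0 holds the generation and index 1 the doc_offset is made for this example rather than taken from the source.

```php
<?php
/**
 * Illustrative sketch only: lexicographic comparison of two
 * (generation, doc_offset) pairs. Returns -1, 0, or 1.
 * Assumes index 0 is the generation and index 1 the doc_offset.
 */
function sketchGenDocOffsetCmp(array $a, array $b)
{
    if ($a[0] != $b[0]) {
        return ($a[0] < $b[0]) ? -1 : 1;
    }
    if ($a[1] != $b[1]) {
        return ($a[1] < $b[1]) ? -1 : 1;
    }
    return 0;
}

/**
 * True if a candidate document position is allowed to appear in the
 * next block given the lower bound passed to advance().
 */
function sketchSatisfiesBound(array $candidate, array $bound)
{
    return sketchGenDocOffsetCmp($candidate, $bound) >= 0;
}

$later_generation = sketchSatisfiesBound([2, 0], [1, 500]);
$same_gen_smaller_offset = sketchSatisfiesBound([1, 499], [1, 500]);
$exact_match = sketchSatisfiesBound([1, 500], [1, 500]);
```

A document in a later generation passes regardless of its offset, while one in the same generation passes only if its offset is at least the bound.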


method aggregateScores [line 503]

void aggregateScores( $hash_url, array &$pre_hash_page)

For a collection of pages each with the same url, computes the page with the min score, max score, as well as the sum of the score, sum of the ranks, sum of the relevance score, and count. Stores this information in the first element of the array of pages.



Parameters:

array   &$pre_hash_page   pages to compute scores for
   $hash_url  
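The aggregation step can be pictured with the following sketch. It assumes each page is an array with hypothetical 'score', 'rank', and 'relevance' keys and writes the statistics under hypothetical key names; the real method's field names and its use of $hash_url are not shown in this documentation.

```php
<?php
/**
 * Illustrative sketch only: computes min, max, and sums over a group of
 * pages sharing a url and stores them in the first page of the group.
 * All array keys used here are assumptions for the example.
 */
function sketchAggregateScores(array &$pages)
{
    if (count($pages) == 0) {
        return;
    }
    $stats = [
        'min_score' => $pages[0]['score'],
        'max_score' => $pages[0]['score'],
        'sum_score' => 0,
        'sum_rank' => 0,
        'sum_relevance' => 0,
        'count' => 0,
    ];
    foreach ($pages as $page) {
        $stats['min_score'] = min($stats['min_score'], $page['score']);
        $stats['max_score'] = max($stats['max_score'], $page['score']);
        $stats['sum_score'] += $page['score'];
        $stats['sum_rank'] += $page['rank'];
        $stats['sum_relevance'] += $page['relevance'];
        $stats['count']++;
    }
    // The first element of the group carries the aggregate values
    $pages[0] = array_merge($pages[0], $stats);
}

$group = [
    ['score' => 4, 'rank' => 3, 'relevance' => 1],
    ['score' => 6, 'rank' => 5, 'relevance' => 1],
];
sketchAggregateScores($group);
```

Passing the group by reference mirrors the documented signature, in which the result is written back into the supplied array rather than returned.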


method computeOutPages [line 445]

array computeOutPages( array &$pre_out_pages)

For a collection of grouped pages generates a grouped summary for each group and returns an array of out pages consisting of single summarized documents for each group. These single summarized documents have aggregated scores.



Tags:

return:  array of single summarized documents


Parameters:

array   &$pre_out_pages   array of groups of pages for which out pages are to be generated.


method computeRelevance [line 182]

float computeRelevance( int $generation, int $posting_offset)

Computes a relevancy score for a posting offset with respect to this iterator and generation




Tags:

return:  a relevancy score based on BM25F.


Overrides IndexBundleIterator::computeRelevance() (Computes a relevancy score for a posting offset with respect to this iterator and generation)

Parameters:

int   $generation   the generation the posting offset is for
int   $posting_offset   an offset into word_docs to compute the relevance of


method currentGenDocOffsetWithWord [line 597]

mixed currentGenDocOffsetWithWord( )

Gets the doc_offset and generation for the next document that would be returned by this iterator



Tags:

return:  an array with the desired document offset and generation; -1 on fail


Overrides IndexBundleIterator::currentGenDocOffsetWithWord() (Gets the doc_offset and generation for the next document that would be returned by this iterator)


method findDocsWithWord [line 194]

mixed findDocsWithWord( )

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached



Tags:

return:  doc ids and score if there are docs left, -1 otherwise


Overrides IndexBundleIterator::findDocsWithWord() (Hook function used by currentDocsWithWord to return the current block of docs if it is not cached)


method getPagesToGroup [line 232]

array getPagesToGroup( )

Gets a sample of a few hundred pages on which to do grouping by URL



Tags:

return:  array of pages in the form document key --> meta data arrays



method groupByHashAndAggregate [line 314]

void groupByHashAndAggregate( array &$pre_out_pages)

For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their page contents.

For each group of groups with the same hash summary, this function then selects the subgroup with the highest aggregate score as the representative of that group. The function then modifies the supplied argument array to make it an array of group representatives.




Parameters:

array   &$pre_out_pages   documents previously grouped by hash of url
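The selection of one representative per content hash can be sketched as below. The keys 'content_hash' and 'sum_score' on the first page of each group are assumptions for this example, as is the function name; the actual field names are not given in this documentation.

```php
<?php
/**
 * Illustrative sketch only: among url groups whose first page has the
 * same content hash, keeps the group with the highest aggregate score.
 * Modifies the supplied array in place. Key names are hypothetical.
 */
function sketchKeepBestPerContentHash(array &$groups)
{
    $best = []; // content hash => url key of best group seen so far
    foreach ($groups as $url_key => $group) {
        $hash = $group[0]['content_hash'];
        if (!isset($best[$hash]) || $group[0]['sum_score'] >
            $groups[$best[$hash]][0]['sum_score']) {
            $best[$hash] = $url_key;
        }
    }
    // Keep only the winning groups, preserving their original order
    $groups = array_intersect_key($groups, array_flip($best));
}

$groups = [
    'u1' => [['content_hash' => 'h1', 'sum_score' => 3]],
    'u2' => [['content_hash' => 'h1', 'sum_score' => 7]],
    'u3' => [['content_hash' => 'h2', 'sum_score' => 1]],
];
sketchKeepBestPerContentHash($groups);
```

Groups 'u1' and 'u2' hold pages with identical content, so only the higher scoring 'u2' survives alongside 'u3'.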


method groupByHashUrl [line 269]

array groupByHashUrl( array &$pages)

Groups documents, as well as mini-pages based on links to documents, by url to produce an array of arrays of documents with the same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.




Tags:

return:  pages after grouping


Parameters:

array   &$pages   pages to group
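The role of grouped_keys can be illustrated with a small sketch. The class name SketchUrlGrouper, the 'hash_url' key, and the separate remember() step (standing in for what happens across an advance() call) are assumptions for this example, not the class's actual code.

```php
<?php
/**
 * Illustrative sketch only: groups pages by a hypothetical 'hash_url'
 * key while skipping any key already returned in an earlier block.
 */
class SketchUrlGrouper
{
    /** Keys of groups returned so far (mirrors the idea of grouped_keys) */
    public $grouped_keys = [];

    public function group(array $pages)
    {
        $groups = [];
        foreach ($pages as $page) {
            $key = $page['hash_url'];
            if (isset($this->grouped_keys[$key])) {
                continue; // already returned in a previous block
            }
            $groups[$key][] = $page;
        }
        return $groups;
    }

    /** Stand-in for the bookkeeping done when the iterator advances */
    public function remember(array $groups)
    {
        foreach (array_keys($groups) as $key) {
            $this->grouped_keys[$key] = true;
        }
    }
}

$grouper = new SketchUrlGrouper();
$first = $grouper->group([
    ['hash_url' => 'ha', 'title' => 'page a'],
    ['hash_url' => 'ha', 'title' => 'link to a'],
    ['hash_url' => 'hb', 'title' => 'page b'],
]);
$grouper->remember($first);
$second = $grouper->group([
    ['hash_url' => 'ha', 'title' => 'another link to a'],
    ['hash_url' => 'hc', 'title' => 'page c'],
]);
```

In the second block the extra link to 'ha' is dropped because that url was already returned, so only 'hc' forms a group.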


method lookupDoc [line 380]

array lookupDoc( string $doc_key, $index_name, [bool $is_location = false], [int $depth = 3])

Looks up the doc for a link doc_key, so that its summary info can be retrieved



Tags:

return:  array consisting of info about the doc


Parameters:

string   $doc_key   key of the doc to look up
bool   $is_location   whether the lookup is being done because the doc had a refresh
int   $depth   max recursion depth to carry the lookup out to if location redirects need to be followed
   $index_name  
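The purpose of $depth can be shown with a sketch of a depth-limited redirect lookup. The in-memory $store array, its 'location' and 'summary' keys, the function name, and the false return value on a miss are all assumptions for this example; the real method reads from an index identified by $index_name.

```php
<?php
/**
 * Illustrative sketch only: looks up a doc in a hypothetical in-memory
 * store and follows 'location' redirects at most $depth times, so a
 * redirect loop cannot recurse forever.
 */
function sketchLookupDoc(array $store, $doc_key, $depth = 3)
{
    if (!isset($store[$doc_key])) {
        return false; // assumed failure value for this sketch
    }
    $doc = $store[$doc_key];
    if (isset($doc['location']) && $depth > 0) {
        $target = sketchLookupDoc($store, $doc['location'], $depth - 1);
        if ($target !== false) {
            return $target;
        }
    }
    return $doc;
}

$store = [
    'a' => ['summary' => 'redirect stub', 'location' => 'b'],
    'b' => ['summary' => 'second stub', 'location' => 'c'],
    'c' => ['summary' => 'final page'],
];
$found = sketchLookupDoc($store, 'a');

// A redirect loop still terminates because $depth runs out
$loop = [
    'x' => ['summary' => 'x', 'location' => 'y'],
    'y' => ['summary' => 'y', 'location' => 'x'],
];
$looped = sketchLookupDoc($loop, 'x');
```

Bounding the recursion is what keeps a pair of pages that redirect to each other from stalling the lookup.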


method reset [line 164]

void reset( )

Returns the iterator to the first document block that it could iterate over




Overrides IndexBundleIterator::reset() (Returns the iterator to the first document block that it could iterate over)



Class Constants

MIN_DESCRIPTION_LENGTH =  10

[line 130]

The minimum length of a description before we stop appending additional link doc summaries
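One way to read this constant is as a stopping threshold when building a grouped summary. The sketch below assumes that reading; the function name, the global constant, and the use of a space as the separator are hypothetical choices for the example.

```php
<?php
// Mirrors the documented value of MIN_DESCRIPTION_LENGTH for this sketch
define('SKETCH_MIN_DESCRIPTION_LENGTH', 10);

/**
 * Illustrative sketch only: appends link doc summaries to a description
 * until the description reaches the minimum length.
 */
function sketchBuildDescription($description, array $link_summaries)
{
    foreach ($link_summaries as $summary) {
        if (strlen($description) >= SKETCH_MIN_DESCRIPTION_LENGTH) {
            break; // long enough, stop appending
        }
        $description = trim($description . ' ' . $summary);
    }
    return $description;
}

$short = sketchBuildDescription('Home', ['About us', 'Contact page']);
$long = sketchBuildDescription('A long enough description', ['ignored']);
```

A short description picks up link text until it crosses the threshold, whereas a description that is already long enough is returned unchanged.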




MIN_FIND_RESULTS_PER_BLOCK =  MIN_RESULTS_TO_GROUP

[line 124]

The minimum number of pages to group from a block; this trumps $this->index_bundle_iterator->results_per_block






Documentation generated by phpDocumentor 1.4.3