\seekquarry\yioop\library\index_bundle_iterators\GroupIterator

This iterator is used to group together documents or document parts which share the same url. For instance, a link document item and the document that it links to will both be stored in the IndexArchiveBundle by the QueueServer. This iterator combines both of these items into a single document result whose score is the sum of their scores and whose summary, if returned, contains text from both sources. The iterator's purpose is loosely analogous to a SQL GROUP BY clause.
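To make the GROUP BY analogy concrete, here is a minimal sketch in Python using an in-memory SQLite table. It is only an illustration of the idea, not how Yioop (which is written in PHP) implements grouping; the table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

def group_by_url(items):
    """Sum scores and join text for rows sharing a url, via SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per document or link item.
    conn.execute("CREATE TABLE items (url TEXT, score REAL, text TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", items)
    rows = conn.execute(
        "SELECT url, SUM(score), GROUP_CONCAT(text, ' ') "
        "FROM items GROUP BY url ORDER BY url"
    ).fetchall()
    conn.close()
    return rows

# Hypothetical data: a crawled page plus a link item that points at it.
items = [
    ("https://example.com/a", 2.0, "page text"),
    ("https://example.com/a", 0.5, "anchor text of a link"),
    ("https://example.com/b", 1.0, "another page"),
]
grouped = group_by_url(items)
```

Each row of `grouped` holds a url, the summed score of everything stored under that url, and the concatenated text of the grouped items.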

Summary

Methods: reset(), advance(), currentGenDocOffsetWithWord(), findDocsWithWord(), plan(), genDocOffsetCmp(), getDirection(), currentDocsWithWord(), getCurrentDocsForKeys(), nextDocsWithWord(), advanceSeenDocs(), setResultsPerBlock(), __construct(), getPagesToGroup(), groupByHashUrl(), groupByHashAndAggregate(), computeOutPages(), aggregateScores()

Properties: $num_docs, $seen_docs, $count_block, $pages, $current_block_fresh, $results_per_block, $index_bundle_iterator, $count_block_unfiltered, $current_block_hashes, $seen_docs_unfiltered, $grouped_keys, $grouped_hashes, $domain_factors, $current_machine

Constants: RESULTS_PER_BLOCK

Protected: no methods or properties found

Private: no methods or properties found

Constants

RESULTS_PER_BLOCK

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)

Properties

$num_docs

$num_docs : integer

Estimate of the number of documents that this iterator can return

Type

integer

$seen_docs

$seen_docs : integer

The number of documents already iterated over

Type

integer

$count_block

$count_block : integer

The number of documents in the current block after filtering by restricted words

Type

integer

$pages

$pages : array

Cache of what currentDocsWithWord returns

Type

array

$current_block_fresh

$current_block_fresh : boolean

Says whether the value in $this->count_block is up to date

Type

boolean

$results_per_block

$results_per_block : integer

Number of documents returned for each block (at most)

Type

integer

$index_bundle_iterator

$index_bundle_iterator : string

The iterator we are using to get documents from

Type

string

$count_block_unfiltered

$count_block_unfiltered : integer

The number of documents in the current block before filtering by restricted words

Type

integer

$current_block_hashes

$current_block_hashes : array

Hashes of document web pages seen in results returned from the most recent call to findDocsWithWord

Type

array

$seen_docs_unfiltered

$seen_docs_unfiltered : integer

The number of iterated docs before the restriction test

Type

integer

$grouped_keys

$grouped_keys : array

Hashed url keys used to keep track of groups seen so far

Type

array

$grouped_hashes

$grouped_hashes : array

Hashes of document web pages used to keep track of groups seen so far

Type

array

$domain_factors

$domain_factors : array

Used to keep track and to weight pages based on the number of other pages from the same domain

Type

array

$current_machine

$current_machine : integer

Id of queue_server this group_iterator lives on

Type

integer

Methods

reset()

reset() 

Returns the iterator to the first document block that it could iterate over

advance()

advance(array  $gen_doc_offset = null) 

Advances the iterator by one group of docs

Parameters

array $gen_doc_offset

a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, its docs must all have doc_offsets larger than or equal to this value

currentGenDocOffsetWithWord()

currentGenDocOffsetWithWord() : mixed

Gets the doc_offset and generation for the next document that would be returned by this iterator

Returns

mixed —

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

findDocsWithWord() : mixed

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

Returns

mixed —

doc ids and score if there are docs left, -1 otherwise

plan()

plan() : string

Returns a string representation of a plan by which the current iterator finds its results

Returns

string —

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

genDocOffsetCmp()

genDocOffsetCmp(array  $gen_doc1, array  $gen_doc2, integer  $direction = self::ASCENDING) : integer

Compares two arrays each containing a (generation, offset) pair.

Parameters

array $gen_doc1

first ordered pair

array $gen_doc2

second ordered pair

integer $direction

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Returns

integer —

-1, 0, or 1 depending on which is bigger
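As an illustration of what such a comparison might look like, here is a minimal Python sketch. It assumes pairs are compared lexicographically (generation first, then offset) and that a descending search simply reverses the sign; neither detail is confirmed by this page, and the constant values below are hypothetical stand-ins for self::ASCENDING and self::DESCENDING.

```python
ASCENDING = 1    # assumed stand-in for self::ASCENDING
DESCENDING = -1  # assumed stand-in for self::DESCENDING

def gen_doc_offset_cmp(gen_doc1, gen_doc2, direction=ASCENDING):
    """Return -1, 0, or 1 when comparing two (generation, offset) pairs."""
    a, b = tuple(gen_doc1), tuple(gen_doc2)
    # Tuples compare lexicographically: generation first, then offset.
    result = (a > b) - (a < b)
    # Assumption: a descending search reverses the ordering.
    return result if direction == ASCENDING else -result
```

For example, under these assumptions `gen_doc_offset_cmp((1, 5), (1, 7))` treats the first pair as smaller because the generations tie and the offset 5 precedes 7.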

getDirection()

getDirection() : integer

Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.

Returns

integer —

direction traversing underlying archive bundle

currentDocsWithWord()

currentDocsWithWord() : mixed

Gets the current block of doc ids and scores associated with this iterator's word

Returns

mixed —

doc ids and score if there are docs left, -1 otherwise

getCurrentDocsForKeys()

getCurrentDocsForKeys(array  $keys = null) : array

Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator

Parameters

array $keys

keys to try to find in the current block of returned results

Returns

array —

doc summaries that match provided keys

nextDocsWithWord()

nextDocsWithWord(  $doc_offset = null) : array

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc offset is given, the next block must consist of docs at or after that offset

Parameters

$doc_offset

if set the next block must all have $doc_offsets equal to or larger than this value

Returns

array —

doc summaries matching the $this->restrict_phrases
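The block-at-a-time access pattern described by currentDocsWithWord(), advance(), and nextDocsWithWord() can be sketched as follows. This is a toy Python stand-in, not the Yioop class: the class name, method names, and the choice to signal exhaustion with -1 from every method are assumptions made for the illustration.

```python
class ToyBlockIterator:
    """Toy stand-in for a block iterator; hypothetical, not Yioop code."""

    def __init__(self, docs, results_per_block=2):
        self.docs = docs
        self.results_per_block = results_per_block
        self.seen_docs = 0  # number of documents already iterated over

    def current_docs(self):
        """Return the current block, or -1 if no docs are left."""
        end = self.seen_docs + self.results_per_block
        block = self.docs[self.seen_docs:end]
        return block if block else -1

    def advance(self):
        """Move past the current block."""
        end = self.seen_docs + self.results_per_block
        self.seen_docs = min(end, len(self.docs))

    def next_docs(self):
        """Return the current block, then advance to the next one."""
        block = self.current_docs()
        self.advance()
        return block

def drain(iterator):
    """Collect every block until the iterator reports -1."""
    blocks = []
    while True:
        block = iterator.next_docs()
        if block == -1:
            break
        blocks.append(block)
    return blocks
```

With three documents and a block size of two, `drain` would collect one full block followed by one partial block.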

advanceSeenDocs()

advanceSeenDocs() 

Updates the seen_docs count during an advance() call

setResultsPerBlock()

setResultsPerBlock(integer  $num) 

Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

Parameters

integer $num

the maximum number of results that can be returned by a block

__construct()

__construct(object  $index_bundle_iterator, integer  $num_iterators = 1, integer  $current_machine) 

Creates a group iterator with the given parameters.

Parameters

object $index_bundle_iterator

to use as a source of documents to iterate over

integer $num_iterators

number of word iterators appearing in sub-iterators -- if larger than reduce the default grouping number

integer $current_machine

if this iterator is being used in a multi-queue_server setting, then this is the id of the current queue_server

getPagesToGroup()

getPagesToGroup() : array

Gets a sample of a few hundred pages on which to do grouping by URL

Returns

array —

array of pages, each entry mapping a document key --> metadata array

groupByHashUrl()

groupByHashUrl(array  &$pages) : array

Groups documents as well as mini-pages based on links to documents by url to produce an array of arrays of documents with same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.

Parameters

array &$pages

pages to group

Returns

array —

$pre_out_pages pages after grouping
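A minimal Python sketch of grouping by url hash while skipping groups returned earlier is shown below. It is an illustration only: the `hash_url` field name and the use of a set for the remembered keys are assumptions, and the real method is PHP operating on Yioop's own page arrays.

```python
def group_by_hash_url(pages, grouped_keys):
    """Group pages by url hash, ignoring keys that were already returned.

    pages: list of dicts with a hypothetical "hash_url" field.
    grouped_keys: set of url hashes remembered from earlier blocks.
    """
    groups = {}
    for page in pages:
        key = page["hash_url"]
        if key in grouped_keys:
            continue  # this url's group was already returned earlier
        groups.setdefault(key, []).append(page)
    return groups

# Hypothetical first block: two items share the url hash "h1".
first = group_by_hash_url(
    [{"hash_url": "h1", "score": 2.0},
     {"hash_url": "h1", "score": 0.5},
     {"hash_url": "h2", "score": 1.0}],
    grouped_keys=set(),
)
# The caller remembers what it returned, then processes a later block.
seen = set(first)
second = group_by_hash_url(
    [{"hash_url": "h1", "score": 0.1},
     {"hash_url": "h3", "score": 4.0}],
    grouped_keys=seen,
)
```

In the second call the late-arriving item for "h1" is dropped, because its group was already handed out in the first block.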

groupByHashAndAggregate()

groupByHashAndAggregate(array  &$pre_out_pages) 

For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their page contents.

For each group of groups with the same hash summary, this function then selects the subgroup with the highest aggregate score as the representative of that group. The function then modifies the supplied argument array to make it an array of group representatives.

Parameters

array &$pre_out_pages

documents previously grouped by hash of url
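The selection of one representative per content hash can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the field names `hash` and `score` are hypothetical, the aggregate score is taken to be a plain sum, and the content hash of a group is read from its first page.

```python
def keep_best_group_per_content_hash(pre_out_pages):
    """Keep, for each content hash, only the url group with the top score.

    pre_out_pages: dict mapping url hash -> list of page dicts, each with
    hypothetical "hash" (content hash) and "score" fields. Modified in place.
    """
    best = {}  # content hash -> (url hash, aggregate score)
    for hash_url, group in pre_out_pages.items():
        content_hash = group[0]["hash"]
        score = sum(page["score"] for page in group)
        if content_hash not in best or score > best[content_hash][1]:
            best[content_hash] = (hash_url, score)
    keep = {hash_url for hash_url, _ in best.values()}
    for hash_url in list(pre_out_pages):
        if hash_url not in keep:
            del pre_out_pages[hash_url]

# Hypothetical input: u1 and u2 serve identical content (hash "c1").
groups = {
    "u1": [{"hash": "c1", "score": 1.0}, {"hash": "c1", "score": 1.0}],
    "u2": [{"hash": "c1", "score": 3.0}],
    "u3": [{"hash": "c2", "score": 0.5}],
}
keep_best_group_per_content_hash(groups)
```

After the call, `groups` keeps "u2" as the representative for content "c1" (aggregate 3.0 beats 2.0) and "u3" for content "c2".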

computeOutPages()

computeOutPages(array  &$pre_out_pages) : array

For a collection of grouped pages generates a grouped summary for each group and returns an array of out pages consisting of single summarized documents for each group. These single summarized documents have aggregated scores.

Parameters

array &$pre_out_pages

array of groups of pages for which out pages are to be generated.

Returns

array —

$out_pages array of single summarized documents
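As a rough Python illustration of collapsing each group into one summarized document, consider the sketch below. The field names, the use of a plain sum for the aggregated score, and the space-joined text are all assumptions; the actual summary construction in Yioop may differ.

```python
def compute_out_pages(pre_out_pages):
    """Turn each group of pages into a single summarized document.

    pre_out_pages: dict mapping url hash -> list of page dicts with
    hypothetical "score" and optional "text" fields.
    """
    out_pages = []
    for hash_url, group in pre_out_pages.items():
        out_pages.append({
            "hash_url": hash_url,
            # Assumption: the aggregated score is the sum over the group.
            "score": sum(page["score"] for page in group),
            # Assumption: the summary text is joined from every source.
            "text": " ".join(p["text"] for p in group if p.get("text")),
        })
    return out_pages

# Hypothetical group: a page and a link item for the same url.
out = compute_out_pages({
    "h1": [{"score": 2.0, "text": "page text"},
           {"score": 0.5, "text": "anchor text"}],
    "h2": [{"score": 1.0}],
})
```

Each entry of `out` stands for one group, so the number of out pages equals the number of groups passed in.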

aggregateScores()

aggregateScores(string  $hash_url, array  &$pre_hash_page) 

For a collection of pages each with the same url, computes the pages with the min and max scores, as well as the sum of the scores, aggregates of the rank, proximity, and relevance scores, and a count.

Stores this information in the first element of the array of pages. This process is described in detail at: https://www.seekquarry.com/?c=main&p=ranking#search

Parameters

string $hash_url

the crawlHash of the url of the page we are scoring which will be compared with that of the host to see if the current page has the url of a hostname.

array &$pre_hash_page

pages to compute scores for
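The bookkeeping described above (min, max, sum, and count stored on the first page of the group) can be sketched in Python as follows. This is a simplified illustration: the key names are hypothetical, only the overall score is aggregated, and the rank, proximity, relevance, and hostname handling covered on the linked ranking page are omitted.

```python
def aggregate_scores(pages):
    """Store aggregate statistics for a same-url group on its first page.

    pages: non-empty list of dicts with a hypothetical "score" field.
    The first dict is modified in place and also returned.
    """
    scores = [page["score"] for page in pages]
    first = pages[0]
    # Hypothetical key names; Yioop's actual field names may differ.
    first["min_score"] = min(scores)
    first["max_score"] = max(scores)
    first["sum_score"] = sum(scores)
    first["count"] = len(scores)
    return first

# Hypothetical group of three items stored under one url.
group = [{"score": 1.0}, {"score": 3.0}, {"score": 2.0}]
summary = aggregate_scores(group)
```

Because the statistics are written into `group[0]`, later steps only need to look at the first element of each group to read the aggregate.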