RESULTS_PER_BLOCK
Default maximum number of documents returned for each block.
This iterator groups together documents or document parts that share the same url. For instance, a link document item and the document that it links to will both be stored in the IndexArchiveBundle by the QueueServer. This iterator combines both items into a single document result whose score is the sum of their scores and whose summary, if returned, contains text from both sources. The iterator's purpose is loosely analogous to a SQL GROUP BY clause.
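The grouping described above can be sketched in a few lines of PHP. This is a minimal illustration, not the class's implementation: the function name `groupItemsByUrl` and the field names `url`, `score`, and `text` are hypothetical stand-ins for whatever keys the real documents carry.

```php
<?php
// Hypothetical field names ('url', 'score', 'text'); not Yioop's actual keys.
function groupItemsByUrl(array $items): array
{
    $grouped = [];
    foreach ($items as $item) {
        $url = $item['url'];
        if (!isset($grouped[$url])) {
            $grouped[$url] = ['url' => $url, 'score' => 0, 'text' => []];
        }
        // Scores of items sharing a url are summed.
        $grouped[$url]['score'] += $item['score'];
        $grouped[$url]['text'][] = $item['text'];
    }
    // Join the collected text fragments into one summary per url.
    foreach ($grouped as $url => $group) {
        $grouped[$url]['text'] = implode(' ', $group['text']);
    }
    return array_values($grouped);
}

$items = [
    ['url' => 'https://example.com/a', 'score' => 2, 'text' => 'page text'],
    ['url' => 'https://example.com/a', 'score' => 3, 'text' => 'link text'],
    ['url' => 'https://example.com/b', 'score' => 1, 'text' => 'other page'],
];
$results = groupItemsByUrl($items);
```

Here the page and the link pointing to it collapse into one result, much as `GROUP BY url` with `SUM(score)` would do in SQL.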
genDocOffsetCmp(array $gen_doc1, array $gen_doc2, integer $direction = self::ASCENDING) : integer
Compares two arrays each containing a (generation, offset) pair.
array | $gen_doc1 | first ordered pair |
array | $gen_doc2 | second ordered pair |
integer | $direction | whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search |
-1, 0, or 1 depending on which pair is bigger
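A comparison of this kind might look like the sketch below. The constant values, the function name `compareGenOffset`, and the sign convention (negative when the first pair is smaller in an ascending search, flipped for descending) are assumptions; the documentation only states that the result is -1, 0, or 1.

```php
<?php
// Assumed constant values; the real class constants may differ.
const ASCENDING = 1;
const DESCENDING = -1;

function compareGenOffset(array $a, array $b, int $direction = ASCENDING): int
{
    // Compare generations first; fall back to offsets on a tie.
    $cmp = ($a[0] <=> $b[0]) ?: ($a[1] <=> $b[1]);
    // Assumption: a descending search simply flips the sign.
    return $direction === ASCENDING ? $cmp : -$cmp;
}

$earlier = [1, 500]; // (generation, offset)
$later = [2, 0];
$order = compareGenOffset($earlier, $later);
```

Comparing lexicographically means a pair in a lower generation always sorts before a pair in a higher one, regardless of offset.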
getCurrentDocsForKeys(array $keys = null) : array
Gets the summaries associated with the given keys, provided the keys can be found in the current block of docs returned by this iterator.
array | $keys | keys to try to find in the current block of returned results |
doc summaries that match provided keys
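One way to picture this lookup is as a key intersection against the current block. The sketch below is illustrative only: `docsForKeys` is a hypothetical name, and returning the whole block when `$keys` is null is an assumption, since the documentation does not say what the null default does.

```php
<?php
function docsForKeys(array $current_block, ?array $keys = null): array
{
    if ($keys === null) {
        // Assumption: no keys means "return everything in the block".
        return $current_block;
    }
    // Keep only entries whose key was requested; unknown keys are ignored.
    return array_intersect_key($current_block, array_flip($keys));
}

$block = [
    'k1' => ['title' => 'first'],
    'k2' => ['title' => 'second'],
    'k3' => ['title' => 'third'],
];
$found = docsForKeys($block, ['k3', 'k1', 'missing']);
```

The result keeps the order of the block rather than the order of the requested keys.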
nextDocsWithWord( $doc_offset = null) : array
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc offset is given, the next block must consist of docs at or after this offset.
$doc_offset | if set, the docs in the next block must all have doc offsets equal to or larger than this value |
doc summaries matching the $this->restrict_phrases
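The "return the current block, then advance" pattern can be sketched with a small cursor over sorted offsets. Everything here is hypothetical (`BlockCursor`, `nextBlock`, plain integer offsets), and the sketch assumes the optional offset is applied before the block is read, which the documentation leaves open.

```php
<?php
class BlockCursor
{
    private array $offsets;
    private int $block_size;
    private int $pos = 0;

    public function __construct(array $sorted_offsets, int $block_size)
    {
        $this->offsets = array_values($sorted_offsets);
        $this->block_size = $block_size;
    }

    public function nextBlock(?int $doc_offset = null): array
    {
        if ($doc_offset !== null) {
            // Skip docs whose offset is below the requested minimum.
            while ($this->pos < count($this->offsets)
                && $this->offsets[$this->pos] < $doc_offset) {
                $this->pos++;
            }
        }
        $block = array_slice($this->offsets, $this->pos, $this->block_size);
        // Advance past what was just returned.
        $this->pos += count($block);
        return $block;
    }
}

$cursor = new BlockCursor([1, 3, 5, 7, 9, 11], 2);
$first = $cursor->nextBlock();
$second = $cursor->nextBlock(8);
$third = $cursor->nextBlock();
```

An empty array signals that the cursor is exhausted.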
__construct(object $index_bundle_iterator, integer $num_iterators = 1, integer $current_machine)
Creates a group iterator with the given parameters.
object | $index_bundle_iterator | to use as a source of documents to iterate over |
integer | $num_iterators | number of word iterators appearing in sub-iterators -- if larger than reduce the default grouping number |
integer | $current_machine | if this iterator is being used in a multi-queue_server setting, then this is the id of the current queue_server |
groupByHashUrl(array &$pages) : array
Groups documents as well as mini-pages based on links to documents by url to produce an array of arrays of documents with same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.
array | &$pages | pages to group |
$pre_out_pages pages after grouping
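The idea of remembering keys across calls can be sketched as follows. The function name `groupByUrlKey`, the `hash_url` field, and passing the remembered keys by reference are all assumptions made for the illustration; in the class itself that state lives in grouped_keys.

```php
<?php
function groupByUrlKey(array $pages, array &$grouped_keys): array
{
    $groups = [];
    foreach ($pages as $page) {
        $key = $page['hash_url']; // hypothetical field name
        if (isset($grouped_keys[$key])) {
            continue; // this url was already returned in an earlier block
        }
        $groups[$key][] = $page;
    }
    // Remember every url emitted in this call for later calls.
    foreach (array_keys($groups) as $key) {
        $grouped_keys[$key] = true;
    }
    return $groups;
}

$seen = [];
$block_one = groupByUrlKey([
    ['hash_url' => 'h1', 'type' => 'doc'],
    ['hash_url' => 'h1', 'type' => 'link'],
    ['hash_url' => 'h2', 'type' => 'doc'],
], $seen);
$block_two = groupByUrlKey([
    ['hash_url' => 'h1', 'type' => 'link'],
    ['hash_url' => 'h3', 'type' => 'doc'],
], $seen);
```

In the second call the late-arriving link for `h1` is dropped, so a url never appears in two different result blocks.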
groupByHashAndAggregate(array &$pre_out_pages)
For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their page contents.
For each group of groups with the same hash summary, this function then selects the subgroup with the highest aggregate score as the representative of that group. The function then modifies the supplied argument array to make it an array of group representatives.
array | &$pre_out_pages | documents previously grouped by hash of url |
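Selecting one representative per content hash could be sketched like this. The names `pickRepresentatives`, `content_hash`, and `score` are hypothetical, the content hash is assumed to be readable from the first page of each url group, and the sketch returns a new array rather than modifying its argument in place as the documented method does.

```php
<?php
function pickRepresentatives(array $url_groups): array
{
    $best = [];
    foreach ($url_groups as $url_key => $group) {
        // Assumption: the first page of a group carries its content hash.
        $hash = $group[0]['content_hash'];
        $score = array_sum(array_column($group, 'score'));
        if (!isset($best[$hash]) || $score > $best[$hash]['score']) {
            $best[$hash] = [
                'score' => $score,
                'url_key' => $url_key,
                'group' => $group,
            ];
        }
    }
    // Re-key the winners by url so the shape matches the input.
    $out = [];
    foreach ($best as $entry) {
        $out[$entry['url_key']] = $entry['group'];
    }
    return $out;
}

$url_groups = [
    'u1' => [
        ['content_hash' => 'c1', 'score' => 2],
        ['content_hash' => 'c1', 'score' => 1],
    ],
    'u2' => [['content_hash' => 'c1', 'score' => 5]],
    'u3' => [['content_hash' => 'c2', 'score' => 1]],
];
$representatives = pickRepresentatives($url_groups);
```

Groups `u1` and `u2` have identical content, so only the higher-scoring `u2` survives alongside `u3`.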
computeOutPages(array &$pre_out_pages) : array
For a collection of grouped pages, generates a grouped summary for each group and returns an array of out pages consisting of a single summarized document for each group. These single summarized documents have aggregated scores.
array | &$pre_out_pages | array of groups of pages for which out pages are to be generated. |
$out_pages array of single summarized documents
aggregateScores(string $hash_url, array &$pre_hash_page)
For a collection of pages each with the same url, computes the page with the min score and the max score, as well as the sum of the scores, the aggregates of the rank, proximity, and relevance scores, and a count.
Stores this information in the first element of the array of pages. This process is described in detail at: https://www.seekquarry.com/?c=main&p=ranking#search
string | $hash_url | the crawlHash of the url of the page we are scoring which will be compared with that of the host to see if the current page has the url of a hostname. |
array | &$pre_hash_page | pages to compute scores for |
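The bookkeeping part of this step (min, max, sum, and count written into the first page) might be sketched as below. It does not attempt the ranking formula described at the linked page; `aggregateIntoFirst` and the field names are hypothetical, and the pages are assumed to form a list indexed from 0.

```php
<?php
function aggregateIntoFirst(array &$pages): void
{
    if (empty($pages)) {
        return;
    }
    $scores = array_column($pages, 'score'); // hypothetical field name
    // Store the statistics on the first page of the group.
    $pages[0]['min_score'] = min($scores);
    $pages[0]['max_score'] = max($scores);
    $pages[0]['sum_score'] = array_sum($scores);
    $pages[0]['count'] = count($pages);
}

$pages = [
    ['score' => 4],
    ['score' => 1],
    ['score' => 7],
];
aggregateIntoFirst($pages);
```

Writing the totals into the first element lets later stages treat that element as the summary of the whole group without a separate return value.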