\seekquarry\yioop\library\IndexArchiveBundle

Encapsulates a set of web page summaries and an inverted word-index of terms from these summaries, which allows one to search for summaries containing a particular word.

The basic file structures for an IndexArchiveBundle are:

  1. A WebArchiveBundle for web page summaries.
  2. An IndexDictionary containing all the words stored in the bundle. Each word entry in the dictionary contains starting and ending offsets for documents containing that word for some particular IndexShard generation.
  3. A set of index shard generations. These generations have names index0, index1,... A shard has word entries, word doc entries and document entries. For more information see the index shard documentation.
  4. The file generations.txt keeps track of the current generation. A given generation can hold NUM_WORDS_PER_GENERATION words amongst all its partitions, after which the next generation begins.

Constants

NO_LOAD_SIZE

Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard

FORCE_ADVANCE_SIZE

Threshold index shard size beyond which we force the generation to advance

Properties

$dir_name

$dir_name : string

Folder name to use for this IndexArchiveBundle

Type

string

$description

$description : string

A short text name for this IndexArchiveBundle

Type

string

$num_partitions_summaries

$num_partitions_summaries : integer

Number of partitions in the summaries WebArchiveBundle

Type

integer

$generation_info

$generation_info : array

Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).

Type

array

$num_docs_per_generation

$num_docs_per_generation : integer

Number of docs before a new generation is started

Type

integer

$summaries

$summaries : object

WebArchiveBundle for web page summaries

Type

object

$dictionary

$dictionary : object

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)

Type

object
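
The entry layout described for $dictionary can be pictured with a small model. This is only an illustrative sketch in Python, not Yioop's PHP code or on-disk format: the names PostingInfo and DictionaryEntry, and the choice of a generation number plus start and end offsets as the "posting list info", are assumptions based on the description of the IndexDictionary above.

```python
from typing import NamedTuple


class PostingInfo(NamedTuple):
    """Hypothetical stand-in for one shard's posting list info."""
    generation: int    # which index shard generation holds the postings
    start_offset: int  # assumed: starting offset of the word's documents
    end_offset: int    # assumed: ending offset of the word's documents


class DictionaryEntry(NamedTuple):
    """Hypothetical model of one dictionary entry."""
    word: str
    postings: tuple  # one PostingInfo per shard containing the word

    @property
    def num_shards(self) -> int:
        # Number of shards that contain the word.
        return len(self.postings)

    def flatten(self) -> tuple:
        # Mirrors the documented form:
        # (word, num_shards with word, info for 0th shard, info for 1st shard, ...)
        return (self.word, self.num_shards, *self.postings)


entry = DictionaryEntry(
    "archive",
    (PostingInfo(0, 128, 256), PostingInfo(3, 0, 64)),
)
print(entry.flatten())
```

Keeping one posting record per shard means a lookup for a word only needs to open the shards listed in its entry.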

$current_shard

$current_shard : object

Index Shard for current generation inverted word index

Type

object

$version

$version : integer

What version of index archive bundle this is

Type

integer

Methods

__construct()

__construct(string  $dir_name, boolean  $read_only_archive = true, string  $description = null, integer  $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION) 

Makes or initializes an IndexArchiveBundle with the provided parameters

Parameters

string $dir_name

folder name to store this bundle

boolean $read_only_archive

whether to open archive only for reading or reading and writing

string $description

a text name/serialized info about this IndexArchiveBundle

integer $num_docs_per_generation

the number of pages to be stored in a single shard

addPages()

addPages(integer  $generation, string  $offset_field, array  &$pages, integer  $visited_urls_count) 

Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

Parameters

integer $generation

field used to select partition

string $offset_field

field used to record offsets after storing

array &$pages

data to store

integer $visited_urls_count

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

addIndexData()

addIndexData(object  $index_shard) 

Adds the provided mini inverted index data to the IndexArchiveBundle. Expects initGenerationToAdd to have been called beforehand, so that the generation is correct.

Parameters

object $index_shard

a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle

initGenerationToAdd()

initGenerationToAdd(integer  $add_num_docs, object  $callback = null, boolean  $blocking = false) : integer

Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.

If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.

Parameters

integer $add_num_docs

number of docs in the shard about to be added

object $callback

object with join function to be called if process is taking too long

boolean $blocking

whether an ongoing merge tiers operation is occurring; if so, do nothing and return -1

Returns

integer —

the active generation after the check and possible change has been performed
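
The decision described for initGenerationToAdd can be sketched as a toy model. This is a Python illustration under stated assumptions, not the library's implementation: the state keys "active" and "num_docs", the exact comparison against the per-generation limit, and the literal reading of $blocking as "return -1 and do nothing" are taken from the descriptions above or assumed for illustration. Saving the old shard and merging its dictionary are omitted.

```python
def init_generation_to_add(state, add_num_docs, docs_per_generation,
                           blocking=False):
    """Toy model: return the active generation after a possible advance.

    state is a hypothetical dict with keys "active" (generation index)
    and "num_docs" (documents already in the active generation).
    """
    if blocking:
        # Documented behavior: do nothing and return -1.
        return -1
    if state["num_docs"] + add_num_docs > docs_per_generation:
        # Assumed rule: the incoming shard would overflow the active
        # generation, so start a new, empty one.
        state["active"] += 1
        state["num_docs"] = 0
    return state["active"]


crawl_state = {"active": 0, "num_docs": 90}
generation = init_generation_to_add(crawl_state, 20, 100)
crawl_state["num_docs"] += 20  # the caller then adds the shard's documents
print(generation, crawl_state)
```

Returning the generation lets the caller store the matching summaries in the same numbered partition, which is why addIndexData expects this check to have run first.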

addAdvanceGeneration()

addAdvanceGeneration(object  $callback = null) 

Starts a new generation: the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, as well as when resuming a crawl, rather than loading the periodically saved index of a shard that is too large.

Parameters

object $callback

object with join function to be called if process is taking too long

addCurrentShardDictionary()

addCurrentShardDictionary(object  $callback = null) 

Adds the words from this shard to the dictionary

Parameters

object $callback

object with join function to be called if process is taking too long

getActiveShard()

getActiveShard() : object

Sets the current shard to be the active shard (the active shard is what we call the last, highest-indexed shard in the bundle), then returns a reference to this shard.

Returns

object —

last shard in the bundle

getCurrentShard()

getCurrentShard(boolean  $force_read = false) : object

Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.

Parameters

boolean $force_read

whether to force a read without the advance-generation and merge-dictionary side effects

Returns

object —

the index shard currently being used
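
The lazy reading mentioned for getCurrentShard follows a common pattern: defer the expensive file read until the data is first needed, then reuse it. The Python sketch below only illustrates that pattern. The class LazyShard and its loader callback are hypothetical and do not correspond to Yioop's IndexShard class.

```python
class LazyShard:
    """Hypothetical wrapper that reads a shard's data on first use."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the actual read
        self._data = None      # nothing is loaded at construction time
        self.loads = 0         # counts how often the loader actually ran

    def get(self):
        # Load once, then serve the cached data on later calls.
        if self._data is None:
            self._data = self._loader()
            self.loads += 1
        return self._data


shard = LazyShard(lambda: {"word_key": [1, 2, 3]})
shard.get()
shard.get()
print(shard.loads)
```

With this shape, code that only needs the shard's identity never pays for reading its file.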

setCurrentShard()

setCurrentShard(  $i,   $disk_based = false) 

Sets the current shard to be the $i th shard in the index bundle.

Parameters

$i

which shard to set the current shard to be

$disk_based

whether to read the whole shard in before using it or leave it on disk except for the pages needed

getPage()

getPage(integer  $offset, integer  $generation = -1) : array

Gets the page out of the summaries WebArchiveBundle with the given offset and generation

Parameters

integer $offset

byte offset in partition of desired page

integer $generation

which generation's WebArchive to look in; defaults to the same number as the current shard

Returns

array —

desired page

forceSave()

forceSave() 

Forces the current shard to be saved

stopIndexingBundle()

stopIndexingBundle() 

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

countWordKeys()

countWordKeys(array  $word_keys) : array

Computes the number of occurrences of each of the supplied list of word_keys

Parameters

array $word_keys

keys to compute counts for

Returns

array —

associative array of key => count values.
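
The shape of the countWordKeys result can be illustrated with a toy model. The Python sketch below assumes, purely for illustration, that per-shard counts are available as plain mappings and that the total for a key is their sum. How Yioop actually derives the counts is not stated here.

```python
def count_word_keys(word_keys, shard_counts):
    """Toy model: return a key => count mapping for the requested keys.

    shard_counts is a hypothetical list of per-shard {key: count} dicts.
    """
    totals = {key: 0 for key in word_keys}
    for counts in shard_counts:
        for key in word_keys:
            # A shard that lacks the key contributes nothing.
            totals[key] += counts.get(key, 0)
    return totals


print(count_word_keys(["a", "b"], [{"a": 2}, {"a": 1, "b": 4}]))
```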

getArchiveInfo()

getArchiveInfo(string  $dir_name) : array

Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.

Parameters

string $dir_name

path to a directory containing a summaries WebArchiveBundle

Returns

array —

summary of the given archive

setArchiveInfo()

setArchiveInfo(string  $dir_name, array  $info) 

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).

Parameters

string $dir_name

folder with archive bundle

array $info

struct with above fields
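
As an illustration of the $info struct, the following Python sketch builds a mapping with the four fields named above and checks that none is missing. The values are invented, JSON stands in for whatever serialization the library uses for DESCRIPTION, and the helper missing_fields is hypothetical.

```python
import json

# Field names taken from the description of the archive info struct.
INFO_FIELDS = (
    "DESCRIPTION",
    "COUNT",
    "VISITED_URLS_COUNT",
    "NUM_DOCS_PER_PARTITION",
)


def missing_fields(info):
    """Return the documented field names absent from info."""
    return [field for field in INFO_FIELDS if field not in info]


# Invented example values, for illustration only.
info = {
    "DESCRIPTION": json.dumps({"seed_sites": ["https://example.com/"]}),
    "COUNT": 1500,
    "VISITED_URLS_COUNT": 1200,
    "NUM_DOCS_PER_PARTITION": 50000,
}
print(missing_fields(info))
```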

getParamModifiedTime()

getParamModifiedTime(string  $dir_name) 

Returns the last time the archive info of the bundle was modified.

Parameters

string $dir_name

folder with archive bundle