\seekquarry\yioop\library\FeedArchiveBundle

Subclass of IndexArchiveBundle that adds bloom filters, making it easy to check whether a news feed item is already in the bundle before adding it.

The basic file structures for an IndexArchiveBundle are:

  1. A WebArchiveBundle for web page summaries.
  2. An IndexDictionary containing all the words stored in the bundle. Each word entry in the dictionary contains starting and ending offsets for documents containing that word for some particular IndexShard generation.
  3. A set of index shard generations. These generations have names index0, index1, ... A shard has word entries, word doc entries, and document entries. For more information see the index shard documentation.
  4. The file generations.txt keeps track of the current generation. A given generation can hold NUM_WORDS_PER_GENERATION words across all its partitions, after which the next generation begins.

Summary

Methods: __construct(), addPages(), addIndexData(), initGenerationToAdd(), addAdvanceGeneration(), addCurrentShardDictionary(), getActiveShard(), getCurrentShard(), setCurrentShard(), getPage(), forceSave(), stopIndexingBundle(), countWordKeys(), getArchiveInfo(), setArchiveInfo(), getParamModifiedTime(), addPagesAndSeenKeys(), addFilters(), contains()

Properties: $dir_name, $description, $num_partitions_summaries, $generation_info, $num_docs_per_generation, $summaries, $dictionary, $current_shard, $version, $filter_a, $filter_b

Constants: NO_LOAD_SIZE, FORCE_ADVANCE_SIZE

No protected or private methods or properties found.

Constants

NO_LOAD_SIZE

NO_LOAD_SIZE

Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard

FORCE_ADVANCE_SIZE

FORCE_ADVANCE_SIZE

Threshold index shard size beyond which we force the generation to advance

Properties

$dir_name

$dir_name : string

Folder name to use for this IndexArchiveBundle

Type

string

$description

$description : string

A short text name for this IndexArchiveBundle

Type

string

$num_partitions_summaries

$num_partitions_summaries : integer

Number of partitions in the summaries WebArchiveBundle

Type

integer

$generation_info

$generation_info : array

Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).

Type

array

$num_docs_per_generation

$num_docs_per_generation : integer

Number of docs before a new generation is started

Type

integer

$summaries

$summaries : object

WebArchiveBundle for web page summaries

Type

object

$dictionary

$dictionary : object

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)

Type

object

$current_shard

$current_shard : object

Index Shard for current generation inverted word index

Type

object

$version

$version : integer

What version of index archive bundle this is

Type

integer

$filter_a

$filter_a : \seekquarry\yioop\library\BloomFilterFile

Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is the filter used for checking whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as to filter_a. When filter_a reaches size URL_FILTER_SIZE, filter_a is deleted, filter_b is renamed to filter_a, and the process is repeated.

Type

\seekquarry\yioop\library\BloomFilterFile

$filter_b

$filter_b : \seekquarry\yioop\library\BloomFilterFile

Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.

@see $filter_a

Type

\seekquarry\yioop\library\BloomFilterFile

Methods

__construct()

__construct(string  $dir_name, boolean  $read_only_archive = true, string  $description = null, integer  $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION) 

Makes or initializes a FeedArchiveBundle with the provided parameters

Parameters

string $dir_name

folder name to store this bundle

boolean $read_only_archive

whether to open archive only for reading or reading and writing

string $description

a text name/serialized info about this IndexArchiveBundle

integer $num_docs_per_generation

the number of pages to be stored in a single shard

addPages()

addPages(integer  $generation, string  $offset_field, array&  $pages, integer  $visited_urls_count) 

Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

Parameters

integer $generation

field used to select partition

string $offset_field

field used to record offsets after storing

array& $pages

data to store

integer $visited_urls_count

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

addIndexData()

addIndexData(object  $index_shard) 

Adds the provided mini inverted index data to the IndexArchiveBundle. Expects initGenerationToAdd to have been called beforehand, so that the generation is correct.

Parameters

object $index_shard

a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle

initGenerationToAdd()

initGenerationToAdd(integer  $add_num_docs, object  $callback = null, boolean  $blocking = false) : integer

Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.

If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.

Parameters

integer $add_num_docs

number of docs in the shard about to be added

object $callback

object with join function to be called if process is taking too long

boolean $blocking

whether there is an ongoing merge tiers operation; if so, do nothing and return -1

Returns

integer —

the active generation after the check and possible change has been performed
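
The size check described above can be sketched as follows. This is a minimal model, not the bundle's actual code: it assumes the test is simply whether the active shard's document count plus $add_num_docs exceeds $num_docs_per_generation. The function name sketchInitGenerationToAdd and the $docs_in_active parameter are hypothetical, the save, dictionary copy, and log-merge steps are only marked by a comment, and $callback and $blocking are left out.

```php
<?php
/**
 * Toy model of the generation check (hypothetical helper, not Yioop code).
 * Returns the index of the generation the incoming shard should go to.
 */
function sketchInitGenerationToAdd(array &$generation_info,
    int $docs_in_active, int $add_num_docs,
    int $num_docs_per_generation): int
{
    // Assumed rule: advance when the incoming docs would overflow the
    // per-generation limit.
    if ($docs_in_active + $add_num_docs > $num_docs_per_generation) {
        // A real bundle would save the old shard, copy its dictionary,
        // and possibly log-merge here; this sketch only bumps the index.
        $generation_info['ACTIVE']++;
    }
    return $generation_info['ACTIVE'];
}

$generation_info = ['ACTIVE' => 0];
// 10 docs present, 5 incoming, limit 20: stays in generation 0
$first = sketchInitGenerationToAdd($generation_info, 10, 5, 20);
// 18 docs present, 5 incoming, limit 20: advances to generation 1
$second = sketchInitGenerationToAdd($generation_info, 18, 5, 20);
```

The caller passes the returned generation on when storing pages, which is why addIndexData() expects this check to have run first.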

addAdvanceGeneration()

addAdvanceGeneration(object  $callback = null) 

Starts a new generation: the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, and also when resuming a crawl, in place of loading the periodically saved index of a shard that is too large.

Parameters

object $callback

object with join function to be called if process is taking too long

addCurrentShardDictionary()

addCurrentShardDictionary(object  $callback = null) 

Adds the words from this shard to the dictionary

Parameters

object $callback

object with join function to be called if process is taking too long

getActiveShard()

getActiveShard() : object

Sets the current shard to be the active shard (the active shard is what we call the last, i.e. highest indexed, shard in the bundle), then returns a reference to this shard.

Returns

object —

last shard in the bundle

getCurrentShard()

getCurrentShard(boolean  $force_read = false) : object

Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.

Parameters

boolean $force_read

whether to force a read without the advance-generation and merge-dictionary side effects

Returns

object —

the index shard currently being used

setCurrentShard()

setCurrentShard(  $i,   $disk_based = false) 

Sets the current shard to be the $i th shard in the index bundle.

Parameters

$i

which shard to set the current shard to be

$disk_based

whether to read the whole shard in before using it, or leave it on disk except for the pages needed

getPage()

getPage(integer  $offset, integer  $generation = -1) : array

Gets the page out of the summaries WebArchiveBundle with the given offset and generation

Parameters

integer $offset

byte offset in partition of desired page

integer $generation

which generation WebArchive to look up in; defaults to the same number as the current shard

Returns

array —

desired page

forceSave()

forceSave() 

Forces the current shard to be saved

stopIndexingBundle()

stopIndexingBundle() 

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

countWordKeys()

countWordKeys(array  $word_keys) : array

Computes the number of occurrences of each of the supplied list of word_keys

Parameters

array $word_keys

keys to compute counts for

Returns

array —

associative array of key => count values.

getArchiveInfo()

getArchiveInfo(string  $dir_name) : array

Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.

Parameters

string $dir_name

path to a directory containing a summaries WebArchiveBundle

Returns

array —

summary of the given archive

setArchiveInfo()

setArchiveInfo(string  $dir_name, array  $info) 

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).

Parameters

string $dir_name

folder with archive bundle

array $info

struct with above fields
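
As an illustration of the $info struct, the snippet below builds an array with the four fields named above. All values are made up, and the keys inside $crawl_params (seed_sites, timestamp) are hypothetical; only the top-level field names come from the description. It uses PHP's built-in serialize() and unserialize() to show DESCRIPTION round-tripping.

```php
<?php
// Hypothetical crawl parameters; the real set of keys is not shown here.
$crawl_params = [
    'seed_sites' => ['https://www.example.com/feed'],
    'timestamp' => 1700000000,
];

// Archive info struct with the fields listed in the description.
$info = [
    'DESCRIPTION' => serialize($crawl_params),
    'COUNT' => 1500,                   // urls seen + pages stored
    'VISITED_URLS_COUNT' => 1200,      // pages seen while crawling
    'NUM_DOCS_PER_PARTITION' => 50000,
];

// Reading the description back recovers the original parameters.
$restored = unserialize($info['DESCRIPTION']);
```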

getParamModifiedTime()

getParamModifiedTime(string  $dir_name) 

Returns the last time the archive info of the bundle was modified.

Parameters

string $dir_name

folder with archive bundle

addPagesAndSeenKeys()

addPagesAndSeenKeys(integer  $generation, string  $offset_field, string  $key_field, array&  $pages, integer  $visited_urls_count) 

Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

Parameters

integer $generation

field used to select partition

string $offset_field

field used to record offsets after storing

string $key_field

field used to store the unique identifier for each page item.

array& $pages

data to store

integer $visited_urls_count

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

addFilters()

addFilters(string  $key) 

Adds the key (often a GUID) of a feed item to the bloom filter pair associated with this archive. The key is always added to filter a; if filter a is more than half full, the key is added to filter b as well. When filter a is full, it is deleted and filter b is renamed to filter a. The process then continues, with a new filter b created once the new filter a becomes half full.

Parameters

string $key

unique identifier of a feed item
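
The rotation of the filter pair can be modeled with a short sketch. This is not the class's implementation: plain PHP arrays stand in for the BloomFilterFile objects (so there are no false positives and nothing is written to disk), the class name FilterPairSketch is hypothetical, and the constructor argument $filter_size plays the role of URL_FILTER_SIZE.

```php
<?php
/**
 * Toy model of the two-filter rotation (hypothetical class). Arrays used
 * as sets stand in for the BloomFilterFile objects.
 */
class FilterPairSketch
{
    private array $filter_a = [];
    private array $filter_b = [];
    private int $filter_size; // stand-in for URL_FILTER_SIZE

    public function __construct(int $filter_size)
    {
        $this->filter_size = $filter_size;
    }

    public function addFilters(string $key): void
    {
        if (count($this->filter_a) >= $this->filter_size) {
            // filter a is full: discard it and promote filter b
            $this->filter_a = $this->filter_b;
            $this->filter_b = [];
        }
        $this->filter_a[$key] = true;
        if (count($this->filter_a) > $this->filter_size / 2) {
            // filter a is more than half full: mirror the key into b
            $this->filter_b[$key] = true;
        }
    }

    public function contains(string $key): bool
    {
        // only filter a is consulted for membership
        return isset($this->filter_a[$key]);
    }
}

$pair = new FilterPairSketch(4);
foreach (['k1', 'k2', 'k3', 'k4'] as $key) {
    $pair->addFilters($key);
}
$before = $pair->contains('k1'); // k1 is still in filter a
$pair->addFilters('k5');         // filter a was full, so it rotates
$after = $pair->contains('k1');  // k1 was only in the discarded filter
```

The effect of the rotation is that the oldest keys are eventually forgotten while recently added keys survive, because they were mirrored into filter b before it was promoted.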

contains()

contains(string  $key) : boolean

Whether the active filter for this feed contains the feed item with the supplied key

Parameters

string $key

the feed item id to check for in the archive

Returns

boolean —

true if it is in the archive, false otherwise
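
A typical use of contains() is to drop already-stored items from a batch before adding it. The sketch below shows that pattern under stated assumptions: selectUnseenItems is a hypothetical helper, the 'guid' field name is illustrative, and a closure over a plain array stands in for a call to the bundle's contains() method.

```php
<?php
/**
 * Returns the items whose $key_field value is not reported as seen
 * (hypothetical helper). $contains is any callable taking a key and
 * returning a boolean; with a real bundle it could wrap contains().
 */
function selectUnseenItems(array $items, string $key_field,
    callable $contains): array
{
    $unseen = [];
    foreach ($items as $item) {
        $key = $item[$key_field] ?? null;
        // Items without a key cannot be checked, so this sketch skips them.
        if ($key === null || $contains($key)) {
            continue;
        }
        $unseen[] = $item;
    }
    return $unseen;
}

// Stand-in for the archive's filter: guid-1 has been stored already.
$seen = ['guid-1' => true];
$items = [
    ['guid' => 'guid-1', 'title' => 'Old story'],
    ['guid' => 'guid-2', 'title' => 'New story'],
];
$fresh = selectUnseenItems($items, 'guid', fn($key) => isset($seen[$key]));
```

Because a bloom filter can report false positives, a real check of this kind may occasionally skip an item that was never stored; it will not report a stored item as missing while that item's key is still in the active filter.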