\seekquarry\yioop\executables\ArcTool

Command line program that allows one to examine the content of the WebArchiveBundles and IndexArchiveBundles of Yioop crawls.

To see all of the available commands, run it from the command line with a syntax like:

php ArcTool.php

Summary

Public methods:
__construct()
start()
outputArchiveList()
outputInfo()
outputDictInfo()
outputShardInfo()
outputCountIndexArchive()
outputPostingInfo()
getArchiveName()
reindexIndexArchive()
outputInfoFeedArchiveBundle()
outputInfoIndexArchiveBundle()
outputInfoDoubleIndexBundle()
outputInfoWebArchiveBundle()
inject()
outputShowPages()
rebuildIndexArchive()
instantiateIterator()
getArchiveKind()
badFormatMessageAndExit()
usageMessageAndExit()

Public properties: none
Public constants: MAX_BUFFER_DOCS

Protected methods: none
Protected properties: none
Protected constants: N/A

Private methods: none
Private properties: none
Private constants: N/A

Constants

MAX_BUFFER_DOCS

MAX_BUFFER_DOCS

The maximum number of documents the ArcTool list function will read into memory in one go.

Methods

__construct()

__construct() 

Initializes the ArcTool. For now, it does nothing.

start()

start() 

Runs the ArcTool on the supplied command line arguments

outputArchiveList()

outputArchiveList() 

Lists the Web or IndexArchives in the crawl directory

outputInfo()

outputInfo(string  $archive_path) 

Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, a DoubleIndexBundle, or a non-Yioop archive.

Then outputs to stdout header information about the bundle by calling the appropriate sub-function.

Parameters

string $archive_path

The path of a directory that holds WebArchiveBundle, IndexArchiveBundle, or non-Yioop archive data

outputDictInfo()

outputDictInfo(string  $archive_path, string  $word, integer  $start_generation, integer  $num_generations) 

Prints the IndexDictionary records for a word in an IndexArchiveBundle

Parameters

string $archive_path

the path of a directory that holds an IndexArchiveBundle

string $word

the word to look up dictionary records for

integer $start_generation
integer $num_generations

outputShardInfo()

outputShardInfo(string  $archive_path, integer  $generation) 

Prints information about the number of words and frequencies of words within the $generation'th index shard in the bundle

Parameters

string $archive_path

the path of a directory that holds an IndexArchiveBundle

integer $generation

which index shard to use

outputCountIndexArchive()

outputCountIndexArchive(string  $archive_path, boolean  $set_count = false) 

Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count

Parameters

string $archive_path

path of archive to count

boolean $set_count

flag that controls whether, after computing the count, to write it back into the archive
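
The per-shard and overall counting described above can be pictured with a minimal sketch. This is not ArcTool's code: the helper name summarizeShardCounts, the array keys 'docs' and 'links', and the sample numbers are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): given per-shard counts,
 * prints one line per shard followed by overall totals, and returns
 * the totals. The keys 'docs' and 'links' are assumed names.
 */
function summarizeShardCounts(array $shard_counts): array
{
    $totals = ['docs' => 0, 'links' => 0];
    foreach ($shard_counts as $generation => $counts) {
        echo "Shard $generation: {$counts['docs']} docs, " .
            "{$counts['links']} links\n";
        $totals['docs'] += $counts['docs'];
        $totals['links'] += $counts['links'];
    }
    echo "Total: {$totals['docs']} docs, {$totals['links']} links\n";
    return $totals;
}

// Made-up sample counts for two shards.
$totals = summarizeShardCounts([
    ['docs' => 120, 'links' => 900],
    ['docs' => 80, 'links' => 610],
]);
```

When $set_count is true, the real method would additionally write the computed count back into the archive; that step is left out of the sketch because the storage format is not described here.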

outputPostingInfo()

outputPostingInfo(string  $archive_path, integer  $generation, integer  $offset, integer  $num = 1) 

Prints information about $num many postings beginning at the provided $generation and $offset

Parameters

string $archive_path

the path of a directory that holds an IndexArchiveBundle

integer $generation

which index shard to use

integer $offset

offset into posting lists for that shard

integer $num

how many postings to print info for

getArchiveName()

getArchiveName(string  $archive_path) : string

Given a complete path to an archive returns its filename

Parameters

string $archive_path

a path to a yioop or non-yioop archive

Returns

string —

its filename
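
As a rough illustration of the path-to-filename step, the sketch below uses PHP's built-in basename(). The function name archiveNameFromPath and the sample path are hypothetical, and the real method may compute the name differently.

```php
<?php
/**
 * Hypothetical stand-in for getArchiveName(): returns the last
 * component of a path to an archive.
 */
function archiveNameFromPath(string $archive_path): string
{
    // basename() ignores a trailing "/" on the path
    return basename($archive_path);
}

// Made-up path used only for demonstration.
echo archiveNameFromPath("/some/crawl/dir/IndexData1700000000/"), "\n";
```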

reindexIndexArchive()

reindexIndexArchive(string  $path, integer  $max_tier = -1, mixed  $start_shard) 

Used to recompute the dictionary of an index archive -- either from scratch using the index shard data or just using the current dictionary but merging the tiers into one tier

Parameters

string $path

file path to dictionary of an IndexArchiveBundle

integer $max_tier

tier up to which the dictionary tiers should be merged (typically a value greater than the max_tier of the dictionary)

mixed $start_shard

which shard to start from. If 'continue', then keeps going from where the last attempt at a rebuild was.

outputInfoFeedArchiveBundle()

outputInfoFeedArchiveBundle(array  $info, string  $archive_path, string  $alternate_description = "", boolean  $only_storage_info = false, boolean  $only_crawl_params = false) 

Outputs to stdout header information for a FeedArchiveBundle bundle.

Parameters

array $info

header info that has already been read from the description.txt file

string $archive_path

file path of the folder containing the bundle

string $alternate_description

used as the text for description rather than what's given in $info

boolean $only_storage_info

output only info about storage statistics; don't output info about crawl parameters

boolean $only_crawl_params

output only info about crawl parameters not storage statistics

outputInfoIndexArchiveBundle()

outputInfoIndexArchiveBundle(array  $info, string  $archive_path, string  $alternate_description = "", boolean  $only_storage_info = false, boolean  $only_crawl_params = false) 

Outputs to stdout header information for an IndexArchiveBundle bundle.

Parameters

array $info

header info that has already been read from the description.txt file

string $archive_path

file path of the folder containing the bundle

string $alternate_description

used as the text for description rather than what's given in $info

boolean $only_storage_info

output only info about storage statistics; don't output info about crawl parameters

boolean $only_crawl_params

output only info about crawl parameters not storage statistics

outputInfoDoubleIndexBundle()

outputInfoDoubleIndexBundle(array  $info, string  $archive_path) 

Outputs to stdout header information for a DoubleIndexBundle bundle.

Parameters

array $info

header info that has already been read from the description.txt file

string $archive_path

file path of the folder containing the bundle

outputInfoWebArchiveBundle()

outputInfoWebArchiveBundle(array  $info, string  $archive_path) 

Outputs to stdout header information for a WebArchiveBundle bundle.

Parameters

array $info

header info that has already been read from the description.txt file

string $archive_path

file path of the folder containing the bundle

inject()

inject(string  $timestamp, string  $url_file_name) 

Adds a list of urls as an upcoming schedule for a given queue bundle.

Can be used to make a closed schedule startable

Parameters

string $timestamp

for a queue bundle to add urls to

string $url_file_name

name of a file consisting of urls to inject into the given crawl

outputShowPages()

outputShowPages(string  $archive_path, integer  $start, integer  $num) 

Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.

Parameters

string $archive_path

path to bundle to list documents for

integer $start

first document to list

integer $num

number of documents to list
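
One way to picture how $start, $num, and MAX_BUFFER_DOCS interact is as a windowed read, where a long listing is fetched in bounded batches. The following is a minimal sketch under stated assumptions, not ArcTool's actual logic: the helper batchRanges is hypothetical, and the buffer size of 200 is a made-up value.

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): splits a request for $num
 * documents beginning at $start into batches of at most
 * $max_buffer_docs. Returns a list of [batch_start, batch_size] pairs.
 */
function batchRanges(int $start, int $num, int $max_buffer_docs): array
{
    if ($max_buffer_docs < 1) {
        throw new InvalidArgumentException("buffer size must be >= 1");
    }
    $ranges = [];
    $end = $start + $num;
    for ($pos = $start; $pos < $end; $pos += $max_buffer_docs) {
        $ranges[] = [$pos, min($max_buffer_docs, $end - $pos)];
    }
    return $ranges;
}

// 200 is an assumed buffer size, not the real MAX_BUFFER_DOCS value.
foreach (batchRanges(0, 450, 200) as [$batch_start, $batch_size]) {
    echo "read $batch_size docs starting at $batch_start\n";
}
```

Bounding each batch keeps memory use roughly constant no matter how large $num is.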

rebuildIndexArchive()

rebuildIndexArchive(string  $archive_path, mixed  $start_generation) 

Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.

Then a reindex is done.

Parameters

string $archive_path

file path to an IndexArchiveBundle

mixed $start_generation

which web archive generation to start rebuild from. If 'continue', then keeps going from where the last attempt at a rebuild was.

instantiateIterator()

instantiateIterator(string  $archive_path, string  $iterator_type) 

Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.

Parameters

string $archive_path

path to non-yioop archive

string $iterator_type

name of archive_bundle_iterator used to iterate over archive.
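
The temporary savepoint folder mentioned above could be created along the following lines. This is only a sketch: the helper makeSavepointFolder and the "temp_" prefix are assumptions, and the demo uses the system temp directory rather than the current directory so that it can run anywhere.

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): creates a uniquely named
 * scratch folder under $parent_dir for an iterator's savepoints.
 */
function makeSavepointFolder(string $parent_dir, string $prefix = "temp_"): string
{
    $folder = rtrim($parent_dir, "/") . "/" . $prefix . uniqid();
    if (!mkdir($folder, 0777, true) && !is_dir($folder)) {
        throw new RuntimeException("Could not create $folder");
    }
    return $folder;
}

// The description above uses the current directory; the system temp
// directory is substituted here only to keep the demo self-contained.
$folder = makeSavepointFolder(sys_get_temp_dir());
echo "savepoints would go in $folder\n";
rmdir($folder); // the sketch cleans up after itself
```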

getArchiveKind()

getArchiveKind(string  $archive_path) : string

Given a folder name, determines the kind of bundle (if any) it holds.

It does this based on the expected location of the description.txt file, or arc_description.ini (in the case of a non-yioop archive)

Parameters

string $archive_path

the path to archive folder

Returns

string —

the archive bundle type, either: WebArchiveBundle or IndexArchiveBundle
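
Detection based on where the description file sits might look roughly like the sketch below. The function guessArchiveKind is hypothetical, and the specific file locations it checks (and the label returned for a non-yioop archive) are assumptions made for illustration rather than a statement of what getArchiveKind() does.

```php
<?php
/**
 * Hypothetical sketch (not ArcTool's code): guesses a bundle kind from
 * the location of its description file. The locations checked here
 * are assumed for illustration.
 */
function guessArchiveKind(string $archive_path)
{
    if (file_exists("$archive_path/description.txt")) {
        return "WebArchiveBundle";
    }
    if (file_exists("$archive_path/summaries/description.txt")) {
        return "IndexArchiveBundle";
    }
    if (file_exists("$archive_path/arc_description.ini")) {
        return "non-yioop archive";
    }
    return false; // no recognizable description file found
}

// Build a throwaway folder to try the function on.
$demo = sys_get_temp_dir() . "/arc_kind_demo_" . uniqid();
mkdir("$demo/summaries", 0777, true);
touch("$demo/summaries/description.txt");
echo guessArchiveKind($demo), "\n";
```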

badFormatMessageAndExit()

badFormatMessageAndExit(string  $archive_name, string  $allowed_archives = "web or index") 

Outputs the "hey, this isn't a known bundle" message and then exit()'s.

Parameters

string $archive_name

name or path to what was supposed to be an archive

string $allowed_archives

a string list of archive types that $archive_name could belong to

usageMessageAndExit()

usageMessageAndExit() 

Outputs the "how to use this tool" message and then exit()'s.