MAX_BUFFER_DOCS
The maximum number of documents the ArcTool list function will read into memory in one go.
Command line program that allows one to examine the content of the WebArchiveBundles and IndexArchiveBundles of Yioop crawls.
To see all of the available commands, run it from the command line with a syntax like:
php ArcTool.php
outputInfo(string $archive_path)
Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, a DoubleIndexBundle, or a non-Yioop archive.
Then outputs to stdout header information about the bundle by calling the appropriate sub-function.
string | $archive_path | The path of a directory that holds WebArchiveBundle, IndexArchiveBundle, or non-Yioop archive data |
outputDictInfo(string $archive_path, string $word, integer $start_generation, integer $num_generations)
Prints the IndexDictionary records for a word in an IndexArchiveBundle
string | $archive_path | the path of a directory that holds an IndexArchiveBundle |
string | $word | the word to look up the dictionary record for |
integer | $start_generation | |
integer | $num_generations | |
outputShardInfo(string $archive_path, integer $generation)
Prints information about the number of words and frequencies of words within the $generation'th index shard in the bundle
string | $archive_path | the path of a directory that holds an IndexArchiveBundle |
integer | $generation | which index shard to use |
outputCountIndexArchive(string $archive_path, boolean $set_count = false)
Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count
string | $archive_path | path of archive to count |
boolean | $set_count | flag that controls whether the count is written back into the archive after it has been computed |
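The per-shard and overall counting described above can be sketched as follows. This is a minimal illustration, not Yioop's implementation: the helper name `totalShardCounts` and the shape of its input array are hypothetical, and reading the counts out of an actual IndexArchiveBundle is not shown.

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): prints per-shard counts and
 * returns the overall totals.
 *
 * @param array $shard_counts assumed shape:
 *      generation => ['docs' => int, 'links' => int]
 * @return array ['docs' => int, 'links' => int] summed over all shards
 */
function totalShardCounts(array $shard_counts): array
{
    $total = ['docs' => 0, 'links' => 0];
    foreach ($shard_counts as $generation => $counts) {
        // Report this shard, then fold it into the running totals
        echo "Shard $generation: {$counts['docs']} docs, " .
            "{$counts['links']} links\n";
        $total['docs'] += $counts['docs'];
        $total['links'] += $counts['links'];
    }
    echo "Total: {$total['docs']} docs, {$total['links']} links\n";
    return $total;
}
```

In this sketch, a caller that wanted the behaviour of $set_count = true would take the returned totals and store them in the archive's metadata; how Yioop does that is not covered here.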
outputPostingInfo(string $archive_path, integer $generation, integer $offset, integer $num = 1)
Prints information about $num many postings beginning at the provided $generation and $offset
string | $archive_path | the path of a directory that holds an IndexArchiveBundle |
integer | $generation | which index shard to use |
integer | $offset | offset into posting lists for that shard |
integer | $num | how many postings to print info for |
reindexIndexArchive(string $path, integer $max_tier = -1, mixed $start_shard)
Used to recompute the dictionary of an index archive -- either from scratch using the index shard data or just using the current dictionary but merging the tiers into one tier
string | $path | file path to dictionary of an IndexArchiveBundle |
integer | $max_tier | tier up to which the dictionary tiers should be merged (typically a value greater than the max_tier of the dictionary) |
mixed | $start_shard | which shard to start from. If 'continue', then keeps going from where the last attempt at a rebuild left off. |
outputInfoFeedArchiveBundle(array $info, string $archive_path, string $alternate_description = "", boolean $only_storage_info = false, boolean $only_crawl_params = false)
Outputs to stdout header information for a FeedArchiveBundle bundle.
array | $info | header info that has already been read from the description.txt file |
string | $archive_path | file path of the folder containing the bundle |
string | $alternate_description | used as the text for description rather than what's given in $info |
boolean | $only_storage_info | output only info about storage statistics; don't output info about crawl parameters |
boolean | $only_crawl_params | output only info about crawl parameters, not storage statistics |
outputInfoIndexArchiveBundle(array $info, string $archive_path, string $alternate_description = "", boolean $only_storage_info = false, boolean $only_crawl_params = false)
Outputs to stdout header information for an IndexArchiveBundle bundle.
array | $info | header info that has already been read from the description.txt file |
string | $archive_path | file path of the folder containing the bundle |
string | $alternate_description | used as the text for description rather than what's given in $info |
boolean | $only_storage_info | output only info about storage statistics; don't output info about crawl parameters |
boolean | $only_crawl_params | output only info about crawl parameters, not storage statistics |
outputInfoDoubleIndexBundle(array $info, string $archive_path)
Outputs to stdout header information for a DoubleIndexBundle bundle.
array | $info | header info that has already been read from the description.txt file |
string | $archive_path | file path of the folder containing the bundle |
outputInfoWebArchiveBundle(array $info, string $archive_path)
Outputs to stdout header information for a WebArchiveBundle bundle.
array | $info | header info that has already been read from the description.txt file |
string | $archive_path | file path of the folder containing the bundle |
inject(string $timestamp, string $url_file_name)
Adds a list of urls as an upcoming schedule for a given queue bundle.
Can be used to make a closed schedule startable.
string | $timestamp | timestamp of the queue bundle to add urls to |
string | $url_file_name | name of a file consisting of urls to inject into the given crawl |
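As a rough illustration of the input side of inject, the sketch below reads a url file into an array. The documentation above does not specify the file format, so the one-url-per-line layout is an assumption, and `readUrlList` is a hypothetical helper rather than an ArcTool method. Writing the urls into a queue bundle's schedule is not shown.

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): loads urls from a text file,
 * ASSUMING one url per line. Blank lines and surrounding whitespace are
 * dropped.
 *
 * @param string $url_file_name file to read
 * @return array list of url strings in file order
 */
function readUrlList(string $url_file_name): array
{
    if (!is_readable($url_file_name)) {
        throw new RuntimeException("Cannot read $url_file_name");
    }
    $lines = file($url_file_name,
        FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $urls = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== "") {
            $urls[] = $line;
        }
    }
    return $urls;
}
```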
outputShowPages(string $archive_path, integer $start, integer $num)
Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.
string | $archive_path | path to bundle to list documents for |
integer | $start | first document to list |
integer | $num | number of documents to list |
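Listing $num documents while holding at most MAX_BUFFER_DOCS of them in memory suggests reading in batches. The sketch below shows one way to plan such batches; it assumes the list function works this way, and both the helper name `planBatches` and the batch size used in the usage note are made up for illustration (the actual value of MAX_BUFFER_DOCS is not given above).

```php
<?php
/**
 * Hypothetical helper (not part of ArcTool): splits a request for $num
 * documents starting at $start into batches of at most $max_buffer_docs.
 *
 * @param int $start index of first document to list
 * @param int $num total number of documents to list
 * @param int $max_buffer_docs largest batch to hold in memory at once
 * @return array list of [offset, count] pairs covering the request
 */
function planBatches(int $start, int $num, int $max_buffer_docs): array
{
    if ($max_buffer_docs < 1) {
        throw new InvalidArgumentException("Batch size must be positive");
    }
    $batches = [];
    $offset = $start;
    $remaining = $num;
    while ($remaining > 0) {
        // Never ask for more than the buffer limit in a single read
        $count = min($remaining, $max_buffer_docs);
        $batches[] = [$offset, $count];
        $offset += $count;
        $remaining -= $count;
    }
    return $batches;
}
```

For example, with an illustrative limit of 100, planBatches(10, 250, 100) yields three reads: 100 documents from offset 10, 100 from offset 110, and 50 from offset 210.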
rebuildIndexArchive(string $archive_path, mixed $start_generation)
Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.
Then a reindex is done.
string | $archive_path | file path to an IndexArchiveBundle |
mixed | $start_generation | which web archive generation to start the rebuild from. If 'continue', then keeps going from where the last attempt at a rebuild left off. |
instantiateIterator(string $archive_path, string $iterator_type)
Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.
string | $archive_path | path to non-yioop archive |
string | $iterator_type | name of archive_bundle_iterator used to iterate over archive. |
getArchiveKind(string $archive_path) : string
Given a folder name, determines the kind of bundle (if any) it holds.
It does this based on the expected location of the description.txt file, or arc_description.ini (in the case of a non-yioop archive)
string | $archive_path | the path to archive folder |
the archive bundle type, either: WebArchiveBundle or IndexArchiveBundle
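Detection by marker file, as described above, might look like the following sketch. It is not Yioop's implementation: the function name `guessArchiveKind`, the specific relative paths mapped to each bundle type, and the label returned for a non-yioop archive are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical sketch (not ArcTool::getArchiveKind): guesses the kind of
 * archive in a folder from which marker file is present.
 *
 * @param string $archive_path the path to the archive folder
 * @return string|null a label for the archive kind, or null if no marker
 *      file was found
 */
function guessArchiveKind(string $archive_path): ?string
{
    // ASSUMED marker locations; the real layout may differ
    $markers = [
        "description.txt" => "WebArchiveBundle",
        "summaries/description.txt" => "IndexArchiveBundle",
        "arc_description.ini" => "non-yioop archive",
    ];
    foreach ($markers as $relative_path => $kind) {
        if (file_exists("$archive_path/$relative_path")) {
            return $kind;
        }
    }
    return null;
}
```

A caller could treat a null result as the cue to print an error and exit, which is the role badFormatMessageAndExit plays in ArcTool.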
badFormatMessageAndExit(string $archive_name, string $allowed_archives = "web or index")
Outputs the "hey, this isn't a known bundle" message and then calls exit().
string | $archive_name | name or path to what was supposed to be an archive |
string | $allowed_archives | a string list of archive types that $archive_name could belong to |