\seekquarry\yioop\library\archive_bundle_iterators\DatabaseBundleIterator

Used to iterate through the records that result from an SQL query to a database

Summary

Methods: saveCheckpoint(), restoreCheckpoint(), seekPage(), weight(), nextPages(), reset(), __construct(), saveCheckPoint(), restoreCheckPoint()

Properties: $iterate_timestamp, $result_timestamp, $end_of_iterator, $result_dir, $iterate_dir, $sql, $column_separator, $field_value_separator, $encoding, $db, $limit

Constants: No constants found

Protected: No protected methods found. No protected properties found.

Private: No private methods found. No private properties found.

Properties

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Timestamp of the archive that is being used to store results in

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the iterator still has more documents

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

$iterate_dir

$iterate_dir : string

The path to the directory containing the archive partitions to be iterated over.

Type

string

$sql

$sql : string

SQL query whose records we are indexing

Type

string

$column_separator

$column_separator : string

DB Records are imported as a text string where column_separator is used to delimit the end of a column

Type

string

$field_value_separator

$field_value_separator : string

For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column

Type

string
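
As a rough illustration of how $column_separator and $field_value_separator combine, the sketch below turns one result row into the text form described above. The helper name rowToText and the separator values "\n" and ": " are assumptions made for the example; they are not taken from Yioop.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): converts one DB row into a
 * text string in which each column is written as
 *   name_of_column field_value_separator value_of_column
 * and column_separator marks the end of each column.
 */
function rowToText(array $row, string $column_separator,
    string $field_value_separator): string
{
    $text = "";
    foreach ($row as $name => $value) {
        // name, then the field/value separator, then the value,
        // then the separator that ends the column
        $text .= $name . $field_value_separator . $value .
            $column_separator;
    }
    return $text;
}

// separator values assumed only for this illustration
$row = ["ID" => 7, "TITLE" => "Hello"];
$text = rowToText($row, "\n", ": ");
echo $text;
```

With these assumed separators each column ends up on its own line, which keeps the imported record readable as plain text.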

$encoding

$encoding : string

What character encoding is used for the DB records

Type

string

$db

$db : resource

Handle for the database connection that the query is run against

Type

resource

$limit

$limit : integer

Result row of the query that the iterator has processed up to

Type

integer

Methods

saveCheckpoint()

saveCheckpoint(array  $info = array()) 

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

Parameters

array $info

any extra info a subclass wants to save

restoreCheckpoint()

restoreCheckpoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

Returns

array —

the data serialized when saveCheckpoint was called
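
The save and restore cycle described above might look roughly like the following sketch. The function names saveStatus and restoreStatus, the array keys, and the use of serialize() are assumptions for illustration; only the file name iterate_status.txt and the result dir come from the description.

```php
<?php
/**
 * Hypothetical stand-ins for saveCheckpoint()/restoreCheckpoint().
 * The array keys and the serialize() format are assumptions.
 */
function saveStatus(string $result_dir, int $limit,
    bool $end_of_iterator, array $info = []): void
{
    // extra info supplied by a subclass is kept; iterator state is added
    $info['limit'] = $limit;
    $info['end_of_iterator'] = $end_of_iterator;
    file_put_contents($result_dir . "/iterate_status.txt",
        serialize($info));
}

function restoreStatus(string $result_dir): array
{
    $default = ['limit' => 0, 'end_of_iterator' => false];
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        // no checkpoint yet: start from the first row
        return $default;
    }
    $info = unserialize(file_get_contents($file));
    // fall back to the start if the file could not be decoded
    return is_array($info) ? $info : $default;
}

// demo in a throwaway directory
$dir = sys_get_temp_dir() . "/db_iter_demo_" . uniqid();
mkdir($dir);
$fresh = restoreStatus($dir);
saveStatus($dir, 200, false, ['note' => 'after second batch']);
$state = restoreStatus($dir);
```

A new iterator instance that reads $state back can continue from row 200 instead of reprocessing the earlier rows, which is the point of calling the save method after every batch.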

seekPage()

seekPage(  $limit) 

Advances the iterator to the $limit page, with as little additional processing as possible

Parameters

$limit

page to advance to

weight()

weight(  $site) : boolean

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

$site

an associative array containing info about a web page

Returns

boolean —

false; we assume arc files were crawled according to OPIC, so we use the default doc_depth to estimate page importance

nextPages()

nextPages(integer  $num, boolean  $no_process = false) : array

Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.

Parameters

integer $num

number of docs to get

boolean $no_process

do not do any processing on page data

Returns

array —

associative arrays for at most $num pages
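
To make the batching behaviour concrete, here is a minimal sketch in which an in-memory array stands in for the query results. The class name RowBatcher is hypothetical, and the code is not Yioop's implementation; it only mirrors the documented roles of $limit, $end_of_iterator, seekPage(), nextPages(), and reset().

```php
<?php
/**
 * Hypothetical miniature of the batching described for nextPages();
 * an array replaces the database so the example is self-contained.
 */
class RowBatcher
{
    public $limit = 0;               // result row processed up to
    public $end_of_iterator = false; // true once the rows run out
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    public function nextPages(int $num): array
    {
        // a real iterator would instead fetch $num rows starting at
        // offset $limit from the database (assumption)
        $batch = array_slice($this->rows, $this->limit, $num);
        $this->limit += count($batch);
        if (count($batch) < $num) {
            // fewer rows than requested: the end was reached
            $this->end_of_iterator = true;
        }
        return $batch;
    }

    public function seekPage(int $limit): void
    {
        // jump straight to the row without reading earlier ones
        $this->limit = $limit;
        $this->end_of_iterator = ($limit >= count($this->rows));
    }

    public function reset(): void
    {
        $this->limit = 0;
        $this->end_of_iterator = false;
    }
}

$it = new RowBatcher(["r0", "r1", "r2", "r3", "r4"]);
$first = $it->nextPages(2);
$it->seekPage(4);
$last = $it->nextPages(2);
```

Here the second call asks for two rows but only one remains, so it returns a shorter batch and marks the iterator as finished.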

reset()

reset() 

Resets the iterator to the start of the archive bundle

__construct()

__construct(string  $iterate_timestamp, string  $iterate_dir, string  $result_timestamp, string  $result_dir) 

Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of an SQL query to a database, so that the results might be indexed by Yioop.

Parameters

string $iterate_timestamp

timestamp of the arc archive bundle whose pages are to be iterated over

string $iterate_dir

folder of files to iterate over

string $result_timestamp

timestamp of the arc archive bundle results are being stored in

string $result_dir

where to write last position checkpoints to

saveCheckPoint()

saveCheckPoint(array  $info = array()) 

Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.

Parameters

array $info

any extra info a subclass wants to save

restoreCheckPoint()

restoreCheckPoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.

Returns

array —

the data serialized when saveCheckpoint was called