NUM_TIMES_INTERVAL
For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) at which to check for pages with that size or download time
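One way to read this constant is as the number of buckets in a distribution, each bucket being one interval wide. The sketch below illustrates that reading only. The constant values and the helper bucketLowerBounds() are hypothetical and not taken from the source.

```php
<?php
// Hypothetical values chosen for illustration; the real constants may differ.
const DOWNLOAD_SIZE_INTERVAL = 5000; // assumed width of one size bucket
const NUM_TIMES_INTERVAL = 50;       // assumed number of buckets to check

/**
 * Hypothetical helper: returns the lower bound of each bucket that a
 * size or time distribution would be checked at.
 */
function bucketLowerBounds(int $interval, int $num_times): array
{
    $bounds = [];
    for ($i = 0; $i < $num_times; $i++) {
        $bounds[] = $i * $interval;
    }
    return $bounds;
}

$size_buckets = bucketLowerBounds(DOWNLOAD_SIZE_INTERVAL, NUM_TIMES_INTERVAL);
```

Under these assumed values the last bucket would start at 49 times the interval.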
A media job used to periodically calculate summary statistics about group, thread, page, and query impressions.
Subclasses should implement whichever of the methods init(), checkPrerequisites(), nondistributedTasks(), prepareTasks(), finishTasks(), getTasks(), doTasks(), and putTasks() they use. MediaUpdating can be configured to run in either distributed or name-server-only mode. In the former mode, prepareTasks() and finishTasks() run on the name server, getTasks() and putTasks() run in the name server's web app, and doTasks() runs on any MediaUpdater client. In the latter mode, only nondistributedTasks() is called, and only by the MediaUpdater on the name server.
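The split between the two modes can be sketched as follows. MediaJobStandIn is a stand-in base class so the example runs on its own, and CleanupJob and runOnNameServer() are hypothetical. None of them is the real MediaJob or MediaUpdater code. The sketch only shows which methods the name server's updater would reach in each mode.

```php
<?php
// Stand-in for the real base class so this sketch runs on its own.
class MediaJobStandIn
{
    public function init() {}
    public function checkPrerequisites() { return true; }
    public function nondistributedTasks() {}
    public function prepareTasks() {}
    public function finishTasks() {}
}

// Hypothetical job that records which of its methods were called.
class CleanupJob extends MediaJobStandIn
{
    public $log = [];
    public function nondistributedTasks()
    {
        $this->log[] = "nondistributedTasks";
    }
    public function prepareTasks()
    {
        $this->log[] = "prepareTasks";
    }
    public function finishTasks()
    {
        $this->log[] = "finishTasks";
    }
}

/**
 * Hypothetical driver: calls the methods that the text says run on the
 * name server's updater in the given mode.
 */
function runOnNameServer(MediaJobStandIn $job, bool $distributed): void
{
    if (!$job->checkPrerequisites()) {
        return;
    }
    if ($distributed) {
        $job->prepareTasks();
        $job->finishTasks();
    } else {
        $job->nondistributedTasks();
    }
}

$job = new CleanupJob();
runOnNameServer($job, false); // name-server-only mode
```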
__construct(object $media_updater = null, object $controller = null)
Instantiates the MediaJob with a reference to the object that instantiated it
object | $media_updater | a reference to the media updater that instantiated this object (if being run in MediaUpdater) |
object | $controller | a reference to the controller that instantiated this object (if being run in the web app) |
getTasks(integer $machine_id, array $data = null) : array
Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
integer | $machine_id | id of client requesting data |
array | $data | any additional info about data being requested |
work for the client to process
putTasks(integer $machine_id, mixed $data) : array
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The web app's JobController then calls this method to receive the data on the name server.
integer | $machine_id | id of client that is sending data to name server |
mixed | $data | results of computation done by client |
any response information to send back to the client
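The getTasks() and putTasks() exchange can be sketched as a round trip between a name server copy of a job and a client copy. WordLengthJob, its data, the doTasks() signature, and the response format are all hypothetical. In a real deployment the calls would pass through the name server's web app rather than being made directly.

```php
<?php
// Hypothetical job; method names follow the text, bodies are illustrative.
class WordLengthJob
{
    public $pending = ["alpha", "be", "gamma"]; // work held on the name server
    public $results = [];

    // Name server web app: hand the requesting client the pending work.
    public function getTasks($machine_id, $data = null)
    {
        $tasks = $this->pending;
        $this->pending = [];
        return $tasks;
    }

    // Client: process the tasks it was given (signature assumed).
    public function doTasks($tasks)
    {
        return array_map('strlen', $tasks);
    }

    // Name server web app: store the client's results and acknowledge them.
    public function putTasks($machine_id, $data)
    {
        $this->results[$machine_id] = $data;
        return ["status" => "ok"]; // hypothetical response format
    }
}

$name_server_job = new WordLengthJob();
$client_job = new WordLengthJob();
$machine_id = 1;

$tasks = $name_server_job->getTasks($machine_id);
$output = $client_job->doTasks($tasks);
$response = $name_server_job->putTasks($machine_id, $output);
```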
execNameServer(string $command, string $args = null) : array
Executes a method on the name server's JobController.
It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.
string | $command | the method to invoke on the name server |
string | $args | additional arguments to be passed to the name server |
data returned by the name server.
computeCrawlStatistics()
Runs the queries necessary to determine the HTTP code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc. for a web crawl for which statistics have been requested but not yet computed.
If these queries take too long it saves partial results and returns.
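The "save partial results and return" behaviour can be sketched as a loop with a time budget that skips statistics already computed on an earlier run. runWithBudget(), the "finished" key, and the sample queries are hypothetical. How the real method stores and resumes partial results is not stated in the source.

```php
<?php
/**
 * Hypothetical helper: runs each named query unless its result is already
 * in $partial, and stops early once $max_seconds has elapsed.
 */
function runWithBudget(array $queries, array $partial, float $max_seconds): array
{
    $start = microtime(true);
    foreach ($queries as $name => $query) {
        if (array_key_exists($name, $partial)) {
            continue; // already computed on an earlier run
        }
        if (microtime(true) - $start > $max_seconds) {
            $partial["finished"] = false; // caller should resume later
            return $partial;
        }
        $partial[$name] = $query();
    }
    $partial["finished"] = true;
    return $partial;
}

// Illustrative stand-ins for the real statistics queries.
$queries = [
    "num_hosts" => function () { return 3; },
    "filetype" => function () { return ["html" => 2, "pdf" => 1]; },
];
$stats = runWithBudget($queries, [], 60.0);
```

Passing the returned array back in on the next call lets a later run continue from where the earlier one stopped.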
countQuery(string $query, string $index_timestamp, array $machine_urls) : integer
Runs the provided $query against a web crawl (potentially distributed across queue servers) and returns the number of results that the query would return.
string | $query | query to run and count the results of |
string | $index_timestamp | timestamp of index to compute query count for |
array | $machine_urls | queue servers on which the count is to be computed |
number of results that would be returned by the given query