\seekquarry\yioop\library\media_jobs\AnalyticsJob

A media job used to periodically calculate summary statistics about group, thread, page, and query impressions.

Subclasses should implement whichever of the following methods they use: init(), checkPrerequisites(), nondistributedTasks(), prepareTasks(), finishTasks(), getTasks(), doTasks(), and putTasks(). Media updating can be configured to run in either distributed or name-server-only mode. In distributed mode, prepareTasks() and finishTasks() run on the name server, getTasks() and putTasks() run in the name server's web app, and doTasks() runs on any MediaUpdater clients. In name-server-only mode, only nondistributedTasks() is called, and only by the MediaUpdater on the name server.
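The two modes described above can be pictured with a small, language-neutral sketch. It is written in Python rather than PHP and is not Yioop's code: the class MediaJobSketch, the function run_job, and the is_name_server flag are hypothetical names introduced for illustration. It also assumes that checkPrerequisites() is consulted before anything else runs, and it collapses the web-app round trips for getTasks() and putTasks() into direct calls.

```python
class MediaJobSketch:
    """Hypothetical stand-in for a media job; method names mirror the callbacks."""

    def __init__(self, name_server_only):
        self.name_server_only = name_server_only
        self.calls = []  # records which callbacks ran, in order

    def check_prerequisites(self):
        return True

    def nondistributed_tasks(self):
        self.calls.append("nondistributedTasks")

    def prepare_tasks(self):
        self.calls.append("prepareTasks")

    def finish_tasks(self):
        self.calls.append("finishTasks")

    def get_tasks(self):
        self.calls.append("getTasks")
        return []

    def do_tasks(self, tasks):
        self.calls.append("doTasks")
        return None

    def put_tasks(self, data):
        self.calls.append("putTasks")


def run_job(job, is_name_server):
    """Dispatch callbacks according to the configured mode (assumed ordering)."""
    if not job.check_prerequisites():
        return
    if job.name_server_only:
        # Name-server-only mode: a single callback, and only on the name server.
        if is_name_server:
            job.nondistributed_tasks()
        return
    if is_name_server:
        # Distributed mode, name server side.
        job.prepare_tasks()
        job.finish_tasks()
    else:
        # Distributed mode, client side: fetch work, process it, send it back.
        tasks = job.get_tasks()
        result = job.do_tasks(tasks)
        job.put_tasks(result)
```

In this sketch a client in distributed mode records getTasks, doTasks, putTasks in that order, while a name-server-only job records just nondistributedTasks.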

Summary

Methods
__construct()
init()
run()
checkPrerequisites()
nondistributedTasks()
prepareTasks()
finishTasks()
doTasks()
getTasks()
putTasks()
execNameServer()
getJobName()
getCurrentMachine()
computeCrawlStatistics()
countQuery()

Properties
$controller
$media_updater
$name_server_does_client_tasks
$name_server_does_client_tasks_only
$tasks
$update_time
$impression_model
$phrase_model
$machine_model

Constants
NUM_TIMES_INTERVAL
STATISTIC_REFRESH_RATE

No protected or private methods or properties found.

Constants

NUM_TIMES_INTERVAL

NUM_TIMES_INTERVAL

For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) to check for pages with that size or download time

STATISTIC_REFRESH_RATE

STATISTIC_REFRESH_RATE

While computing the statistics page, the number of seconds until a page refresh occurs and the progress so far is saved

Properties

$controller

$controller : object

If the MediaJob was instantiated in the web app, the controller that instantiated it

Type

object

$media_updater

$media_updater : object

If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater

Type

object

$name_server_does_client_tasks

$name_server_does_client_tasks : boolean

Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks

Type

boolean

$name_server_does_client_tasks_only

$name_server_does_client_tasks_only : boolean

Whether this MediaJob performs name server only tasks

Type

boolean

$tasks

$tasks : array

The tasks for this MediaJob most recently received from the name server

Type

array

$update_time

$update_time : integer

Time, in the current epoch, when the analytics were last updated

Type

integer

$impression_model

$impression_model : object

Used to get statistics from the DBMS about wiki and thread views

Type

object

$phrase_model

$phrase_model : object

Used to get crawl statistics about the number of various HTTP response requests seen during a crawl

Type

object

$machine_model

$machine_model : object

Used to determine which queue servers are available and which might have information about a crawl

Type

object

Methods

__construct()

__construct(object  $media_updater = null, object  $controller = null) 

Instantiates the MediaJob with a reference to the object that instantiated it

Parameters

object $media_updater

a reference to the media updater that instantiated this object (if being run in MediaUpdater)

object $controller

a reference to the controller that instantiated this object (if being run in the web app)

init()

init() 

Initializes the time of last analytics update

run()

run() 

Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.

checkPrerequisites()

checkPrerequisites() : boolean

Only update if it has been more than an hour since the last update

Returns

boolean —

whether it has been more than an hour since the last update
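As a rough illustration of this check, the following Python sketch compares a stored last-update time against the current time. The function name, the ONE_HOUR constant, and the injectable now parameter are assumptions made for the example and are not taken from Yioop's source.

```python
import time

ONE_HOUR = 3600  # seconds; assumed threshold, matching "more than an hour"


def check_prerequisites(update_time, now=None):
    """Return True only if more than an hour has passed since update_time."""
    if now is None:
        now = time.time()
    return now - update_time > ONE_HOUR
```

Passing now explicitly makes the check easy to exercise without waiting: with update_time 0, a now of 3600 does not qualify, while 3601 does.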

nondistributedTasks()

nondistributedTasks() 

For now, the analytics update is only done on the name server, as Yioop currently only supports one DBMS at a time.

prepareTasks()

prepareTasks() 

This method is called on the name server to prepare data for any MediaUpdater clients.

finishTasks()

finishTasks() 

This method is called on the name server to finish processing any data returned by MediaUpdater clients.

doTasks()

doTasks(array  $tasks) : mixed

Calls ImpressionModel to actually calculate various impression totals since the last update

Parameters

array $tasks

array of info that came from getTasks (in this case, nothing)

Returns

mixed —

the result of carrying out that processing

getTasks()

getTasks(integer  $machine_id, array  $data = null) : array

Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.

Parameters

integer $machine_id

id of client requesting data

array $data

any additional info about data being requested

Returns

array —

work for the client to process

putTasks()

putTasks(integer  $machine_id, mixed  $data) : array

After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The web app's JobController then calls this method to receive the data on the name server.

Parameters

integer $machine_id

id of client that is sending data to name server

mixed $data

results of computation done by client

Returns

array —

any response information to send back to the client

execNameServer()

execNameServer(string  $command, string  $args = null) : array

Executes a method on the name server's JobController.

It will typically execute either getTasks or putTasks for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.

Parameters

string $command

the method to invoke on the name server

string $args

additional arguments to be passed to the name server

Returns

array —

data returned by the name server.

getJobName()

getJobName() : string

Gets the class name (less the namespace and the word Job) of the current MediaJob

Returns

string —

name of the current job
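A minimal Python sketch of this name derivation follows. It assumes that "Job" is removed only as a trailing suffix and that namespace parts are separated by backslashes. The function name get_job_name is hypothetical.

```python
def get_job_name(class_name):
    """Strip the namespace and a trailing 'Job' from a fully qualified class name."""
    # Keep only the part after the last namespace separator.
    short = class_name.rsplit("\\", 1)[-1]
    # Assumed: the word "Job" is dropped only when it ends the class name.
    if short.endswith("Job"):
        short = short[: -len("Job")]
    return short
```

Under these assumptions, the fully qualified name of this class would yield Analytics.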

getCurrentMachine()

getCurrentMachine() : string

Returns a hash of the url of the current machine based on the value saved to current_machine_info.txt by a machine statuses request

Returns

string —

hash of current machine url

computeCrawlStatistics()

computeCrawlStatistics() 

Runs the queries necessary to determine the HTTP code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc. for a web crawl for which statistics have been requested but not yet computed.

If these queries take too long it saves partial results and returns.
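The save-partial-results behavior can be sketched as a time-bounded loop. The Python below is an illustration only, not the Yioop implementation: compute_with_checkpoints, run_query, save, and the injectable clock are hypothetical names, and the refresh_rate parameter merely plays the role that a constant such as STATISTIC_REFRESH_RATE might play.

```python
import time


def compute_with_checkpoints(pending, run_query, save, refresh_rate,
                             clock=time.monotonic):
    """Run queries in order; stop and save partial results if time runs out.

    Returns True when every pending statistic was computed, False otherwise.
    """
    start = clock()
    results = {}
    remaining = list(pending)
    while remaining:
        if clock() - start > refresh_rate:
            # Too slow: persist what has been computed and what is left to do.
            save(results, remaining)
            return False
        name = remaining.pop(0)
        results[name] = run_query(name)
    save(results, remaining)
    return True
```

A caller could invoke the function again later with the saved remaining list to pick up where the previous pass stopped.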

countQuery()

countQuery(string  $query, string  $index_timestamp, array  $machine_urls) : integer

Performs the provided $query against a web crawl (potentially distributed across queue servers). Returns the count of the number of results that would be returned by that query.

Parameters

string $query

the query to run and whose results are counted

string $index_timestamp

timestamp of index to compute query count for

array $machine_urls

queue servers on which the count is to be computed

Returns

integer —

number of results that would be returned by the given query
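One way to picture a count over a crawl spread across queue servers is to add up per-server counts. The Python sketch below does exactly that and should be read as a conceptual model, not as a description of how Yioop computes the total: it assumes each queue server holds a disjoint part of the index, and fetch_count is a hypothetical callback standing in for whatever request reaches a single server.

```python
def count_query(query, index_timestamp, machine_urls, fetch_count):
    """Sum the per-server result counts for one query over one index.

    fetch_count(url, query, index_timestamp) is a hypothetical callback that
    returns how many results a single queue server reports.
    """
    return sum(
        fetch_count(url, query, index_timestamp) for url in machine_urls
    )
```

With an empty list of machine URLs the sketch returns 0, since there is nothing to sum.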