\seekquarry\yioop\library\media_jobs\FeedsUpdateJob

A media job to download and index feeds from various search sources (RSS, HTML scraper, etc.). The idea is that this job runs once an hour to get the latest news, movies, and weather from those sources.

Subclasses should implement the methods they use among init(), checkPrerequisites(), nondistributedTasks(), prepareTasks(), finishTasks(), getTasks(), doTasks(), and putTasks(). MediaUpdating can be configured to run in either distributed or name server only mode. In the former mode, prepareTasks() and finishTasks() run on the name server, getTasks() and putTasks() run in the name server's web app, and doTasks() runs on any MediaUpdater clients. In the latter mode, only the method nondistributedTasks() is called, and only by the MediaUpdater on the name server.
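The division of work between the two modes can be sketched as follows. This is a minimal, single-process illustration and not Yioop's MediaJob or MediaUpdater code: SketchJob and runOneCycle() are hypothetical names, and the order in which the distributed callbacks are invoked here is an assumption chosen only to make the roles visible.

```php
<?php
// Hypothetical stand-in for a media job; each callback just records its name.
class SketchJob
{
    public $log = [];
    public function checkPrerequisites() { return true; }
    public function nondistributedTasks() { $this->log[] = "nondistributedTasks"; }
    public function prepareTasks() { $this->log[] = "prepareTasks"; }
    public function finishTasks() { $this->log[] = "finishTasks"; }
    public function getTasks($machine_id)
    {
        $this->log[] = "getTasks";
        return ["feed-a"];
    }
    public function doTasks($tasks)
    {
        $this->log[] = "doTasks";
        return count($tasks);
    }
    public function putTasks($machine_id, $data)
    {
        $this->log[] = "putTasks";
        return [];
    }
}

/**
 * Runs one update cycle in a single process so the call pattern is visible.
 * In a real deployment the name server and client steps run on different
 * machines; the ordering below is an assumption of this sketch.
 */
function runOneCycle(SketchJob $job, bool $distributed): array
{
    if (!$job->checkPrerequisites()) {
        return $job->log;
    }
    if (!$distributed) {
        $job->nondistributedTasks();     // name server only mode
        return $job->log;
    }
    $job->prepareTasks();                // on the name server
    $tasks = $job->getTasks(0);          // in the name server's web app
    $result = $job->doTasks($tasks);     // on a MediaUpdater client
    $job->putTasks(0, $result);          // in the name server's web app
    $job->finishTasks();                 // on the name server
    return $job->log;
}
```

In name server only mode the log contains a single nondistributedTasks entry; in distributed mode all five callbacks appear.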

Summary

Public methods

__construct()
init()
run()
checkPrerequisites()
nondistributedTasks()
prepareTasks()
finishTasks()
doTasks()
getTasks()
putTasks()
execNameServer()
getJobName()
getCurrentMachine()
updateFoundItemsOneGo()
getTags()
addFoundItemsShard()
addFoundItemsInvertedIndex()
updateTrendingTermCounts()
addTermCountsTrendingTable()
addFeedItemIfNew()
calculateMetas()
convertJsonDecodeToTags()
parseFeedAuxInfo()
getFeedArchive()

Public properties

$controller
$media_updater
$name_server_does_client_tasks
$name_server_does_client_tasks_only
$tasks
$update_time
$db
$index_archive
$found_items
$media_urls

Constants

MAX_FEEDS_ONE_GO
MAX_THUMBS_ONE_GO
OLD_ITEM_TIME

Protected methods

No protected methods found

Protected properties

No protected properties found

Private methods

getThumbs()

Private properties

No private properties found
Constants

MAX_FEEDS_ONE_GO

MAX_FEEDS_ONE_GO

Maximum number of feeds to download in one try

MAX_THUMBS_ONE_GO

MAX_THUMBS_ONE_GO

Maximum number of thumb_urls to download in one try

OLD_ITEM_TIME

OLD_ITEM_TIME

How long, in seconds, before a feed item expires

Properties

$controller

$controller :object

If the MediaJob was instantiated in the web app, the controller that instantiated it

Type

object

$media_updater

$media_updater :object

If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater

Type

object

$name_server_does_client_tasks

$name_server_does_client_tasks :boolean

Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks

Type

boolean

$name_server_does_client_tasks_only

$name_server_does_client_tasks_only :boolean

Whether this MediaJob performs name server only tasks

Type

boolean

$tasks

$tasks :array

The tasks for this MediaJob most recently received from the name server

Type

array

$update_time

$update_time :integer

Time, in seconds since the epoch, when the feeds were last updated

Type

integer

$db

$db :object

Datasource object used to run db queries related to feed items (for storing and updating them)

Type

object

$index_archive

$index_archive :\seekquarry\yioop\library\IndexArchiveBundle

Type

\seekquarry\yioop\library\IndexArchiveBundle

$found_items

$found_items :array

Type

array

$media_urls

$media_urls :array

Type

array

Methods

__construct()

__construct(object  $media_updater = null,object  $controller = null)

Instantiates the MediaJob with a reference to the object that instantiated it

Parameters

object	$media_updater	a reference to the media updater that instantiated this object (if being run in MediaUpdater)
object	$controller	a reference to the controller that instantiated this object (if being run in the web app)

init()

init()

Initializes the last update time to far in the past so that feeds will get updated immediately. Sets up the connection to the DB used to store feed items, and makes it so the same media job runs on both the name server and client MediaUpdaters.

run()

run()

Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.

checkPrerequisites()

checkPrerequisites(): boolean

Only update if it's been more than an hour since the last update

Returns

boolean —

whether it's been an hour since the last update
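A check of this kind can be sketched in a few lines. The following is an illustration under stated assumptions, not the method's actual body: hourHasPassed() and ONE_HOUR_SKETCH are hypothetical names, and the timestamps are passed in explicitly so the logic is easy to test.

```php
<?php
// Illustrative constant for this sketch; not a Yioop configuration name.
const ONE_HOUR_SKETCH = 3600;

/**
 * Returns true when more than an hour separates $last_update and $now.
 * Both arguments are timestamps in seconds.
 */
function hourHasPassed(int $last_update, int $now): bool
{
    return $now - $last_update > ONE_HOUR_SKETCH;
}
```

Because init() sets the last update time far in the past, a check like this passes on the first run, for example hourHasPassed(0, time()).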

nondistributedTasks()

nondistributedTasks()

Gets the media sources from the local database and uses those to run the same task as in the distributed setting

prepareTasks()

prepareTasks()

This method is called on the name server to prepare data for any MediaUpdater clients.

finishTasks()

finishTasks()

This method is called on the name server to finish processing any data returned by MediaUpdater clients.

doTasks()

doTasks(array  $tasks): mixed

For each feed source, downloads the feeds, checks which items are new, and makes an array of them. Then calls the method to add these items to the IndexArchiveBundle for feeds.

Parameters

array	$tasks	array of feed info (url to download, paths to extract etc)

Returns

mixed —

the result of carrying out that processing

getTasks()

getTasks(integer  $machine_id,array  $data = null): array

Handles the request to get the array of feed sources that hash to a particular value, i.e., those that match the index of the requesting machine's hashed url/name in the array of available machine hashes

Parameters

integer	$machine_id	id of machine making request for feeds
array	$data	not used but inherited from the base MediaJob class as a parameter (so will always be null in this case)

Returns

array —

of feed urls and paths to extract from them
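The idea of handing each machine only the feed sources that hash to its slot can be sketched as below. This is not the method's actual implementation: feedsForMachine(), the NAME field, and the use of crc32 with a modulo are all assumptions made for illustration.

```php
<?php
/**
 * Returns the feeds whose name hashes to the slot $machine_index out of
 * $num_machines slots. The hash function and the 'NAME' key are
 * assumptions of this sketch.
 */
function feedsForMachine(array $feeds, int $machine_index,
    int $num_machines): array
{
    $assigned = [];
    foreach ($feeds as $feed) {
        // Mask to 31 bits so the value is non-negative on any platform
        $slot = (crc32($feed['NAME']) & 0x7fffffff) % $num_machines;
        if ($slot === $machine_index) {
            $assigned[] = $feed;
        }
    }
    return $assigned;
}
```

With a rule like this, every feed is assigned to exactly one machine, so the per-machine lists together cover the full set of sources without overlap.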

putTasks()

putTasks(integer  $machine_id,mixed  $data): array

After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The name server web app's JobController then calls this method to receive the data on the name server.

Parameters

integer	$machine_id	id of client that is sending data to name server
mixed	$data	results of computation done by client

Returns

array —

any response information to send back to the client

execNameServer()

execNameServer(string  $command,string  $args = null): array

Executes a method on the name server's JobController.

It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.

Parameters

string	$command	the method to invoke on the name server
string	$args	additional arguments to be passed to the name server

Returns

array —

data returned by the name server.

getJobName()

getJobName(): string

Gets the class name (less the namespace and the word Job) of the current MediaJob

Returns

string —

name of the current job
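The name derivation described above can be sketched as follows. jobNameFromClass() is a hypothetical helper, and removing only a trailing "Job" is an assumption of this sketch rather than a statement about how the method is written.

```php
<?php
/**
 * Strips the namespace and a trailing "Job" from a fully qualified
 * class name.
 */
function jobNameFromClass(string $class_name): string
{
    $pos = strrpos($class_name, "\\");
    $short = ($pos === false) ? $class_name : substr($class_name, $pos + 1);
    // Remove the word Job only when it ends the class name
    return preg_replace('/Job$/', '', $short);
}
```

Under these assumptions, passing the fully qualified name of this class yields FeedsUpdate.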

getCurrentMachine()

getCurrentMachine(): string

Returns a hash of the url of the current machine based on the value saved to current_machine_info.txt by a machine statuses request

Returns

string —

hash of current machine url

updateFoundItemsOneGo()

updateFoundItemsOneGo(array  $feeds,integer  $age = \seekquarry\yioop\configs\ONE_WEEK,boolean  $test_mode = false): mixed

Downloads one batch of $feeds_one_go feed items for updateFeedItems(). For each feed source, downloads the feeds, checks which items are not in the database, and adds them. This method does not update the inverted index shard.

Parameters

array	$feeds	list of feeds to download
integer	$age	how many seconds old records should be ignored
boolean	$test_mode	if true then rather than update items in database, returns as a string the found feed items for the given feeds

Returns

mixed —

either true, or if $test_mode is true then the results as a string of downloading the feeds and extracting the feed items

getTags()

getTags(\DOMDocument  $dom,string  $query): array

Returns an array of DOMDocuments for the nodes that match an xpath query on $dom, a DOMDocument

Parameters

\DOMDocument	$dom	document to run xpath query on
string	$query	xpath query to run

Returns

array —

of DOMDocuments, one for each node matching the xpath query in the original DOMDocument
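One way to produce a DOMDocument per matching node uses PHP's DOMXPath and DOMDocument::importNode. The function below is a sketch, not the method's actual body: tagsMatching() is a hypothetical name, and it assumes the query matches element nodes (text or attribute nodes cannot be appended directly to a document).

```php
<?php
/**
 * Runs $query against $dom and copies each matching element into its own
 * DOMDocument.
 */
function tagsMatching(DOMDocument $dom, string $query): array
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    $docs = [];
    if ($nodes === false) {
        return $docs; // the XPath expression was malformed
    }
    foreach ($nodes as $node) {
        $doc = new DOMDocument();
        // importNode with true makes a deep copy owned by the new document
        $doc->appendChild($doc->importNode($node, true));
        $docs[] = $doc;
    }
    return $docs;
}
```

For a feed with two item elements, a query such as //item would give two documents, each rooted at one item.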

addFoundItemsShard()

addFoundItemsShard(integer  $age): boolean

Copies all feed items newer than $age to a new shard, then deletes the old index shard and database entries older than $age. Finally, sets the copied shard to be active. If this method is going to take more than max_execution_time/2, it returns false so that an additional job can be scheduled; otherwise it returns true.

Parameters

integer	$age	how many seconds old records should be deleted

Returns

boolean —

whether job executed to completion
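The pattern of stopping early once half of the allowed execution time has been used, and reporting that more work remains, can be sketched as below. processWithinBudget() and its parameters are hypothetical; in practice the limit could come from PHP's ini_get('max_execution_time'), which is 0 (unlimited) under the command-line interpreter.

```php
<?php
/**
 * Processes $items in order until all are done or until half of
 * $max_seconds has elapsed. Returns true if everything was processed,
 * false if the caller should schedule another run.
 */
function processWithinBudget(array $items, callable $process,
    float $max_seconds): bool
{
    $start = microtime(true);
    foreach ($items as $item) {
        if (microtime(true) - $start > $max_seconds / 2) {
            return false; // out of budget, work remains
        }
        $process($item);
    }
    return true;
}
```

Returning a boolean rather than throwing lets the scheduler treat an unfinished pass as a normal outcome and simply queue a follow-up job.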

addFoundItemsInvertedIndex()

addFoundItemsInvertedIndex(\seekquarry\yioop\library\IndexShard  $tmp_shard,array  $seen_sites,integer  $seen_url_count): boolean

Helper method for addFoundItemsShard(). Checks if the current shard is full or not and adds items to it.

Parameters

\seekquarry\yioop\library\IndexShard	$tmp_shard	a temporary shard holding all necessary information
array	$seen_sites	of the sites and their corresponding hash
integer	$seen_url_count	of how many sites have been seen before committing data to one shard

Returns

boolean —

whether job executed to completion

updateTrendingTermCounts()

updateTrendingTermCounts(array  &$term_counts,string  $source_phrase,array  $word_or_phrase_list,string  $media_category,string  $source_name,string  $lang,integer  $pubdate,string  $source_stop_regex = "")

Updates trending term counts based on the string from the current feed item.

Parameters

array	&$term_counts	lang => [term => occurrences]
string	$source_phrase	original non-stemmed phrase from feed item to adjust $term_counts with. Used to remember non-stemmed terms. We assume we have already extracted position lists from
array	$word_or_phrase_list	associative array of stemmed_word_or_phrase => positions in the feed item where it occurs
string	$media_category	of feed source the item came from. Trending counts are grouped by media category
string	$source_name	of feed source the item came from. We exclude from counts the name of the feed source
string	$lang	locale_tag for this feed item
integer	$pubdate	timestamp when string was published (used in weighting)
string	$source_stop_regex	a regex to remove terms which occur frequently for this particular source

addTermCountsTrendingTable()

addTermCountsTrendingTable(resource  $db,array  $term_counts)

Updates the hourly, daily, and weekly top term occurrences in the TRENDING_TERM table.

Removes entries older than a week

Parameters

resource	$db	handle to database with TRENDING_TERM table
array	$term_counts	for the most recent update of the feed index, it should be an array [$lang => [$term => $occurrences]] for the top NUM_TRENDING terms per language

addFeedItemIfNew()

addFeedItemIfNew(array  $item,string  $source_name,string  $lang,integer  $age,  $unique_fields): boolean

Adds $item to feed index bundle if it isn't already there

Parameters

array	$item	data from a single feed item
string	$source_name	string name of the feed $item was found on
string	$lang	locale-tag of the feed
integer	$age	how many seconds old records should be ignored
	$unique_fields

Returns

boolean —

whether an item was added

calculateMetas()

calculateMetas(string  $lang,integer  $pubdate,string  $source_name,string  $guid,string  $media_category = "news"): array

Used to calculate the meta words for RSS feed items

Parameters

string	$lang	the locale_tag of the feed item
integer	$pubdate	UNIX timestamp publication date of item
string	$source_name	the name of the feed
string	$guid	the guid of the item
string	$media_category	determines what media: metas to inject. Default is news.

Returns

array —

$meta_ids meta words found

convertJsonDecodeToTags()

convertJsonDecodeToTags(array  $json_decode): string

Converts an associative array coming from a json_decode'd string to an HTML string in which the json fields have become tags prefixed with "json". This can then be handled in the rest of the feeds updater like an HTML feed.

Parameters

array	$json_decode	associative array coming from a json_decode'd string

Returns

string —

result of converting array to an html string
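A conversion of this kind can be sketched with a short recursive function. The exact tag naming used by the method is not stated here, so everything below is an assumption: jsonArrayToTags() is a hypothetical name, string keys become tags named json followed by the key, and list entries use a generic jsonitem tag.

```php
<?php
/**
 * Recursively turns a json_decode'd associative array into a string of
 * nested tags whose names are prefixed with "json".
 */
function jsonArrayToTags(array $data): string
{
    $out = "";
    foreach ($data as $key => $value) {
        // Integer keys come from JSON lists; the tag name is an assumption
        $tag = is_int($key) ? "jsonitem"
            : "json" . preg_replace('/\W/', '', $key);
        $inner = is_array($value) ? jsonArrayToTags($value)
            : htmlspecialchars((string)$value);
        $out .= "<$tag>$inner</$tag>";
    }
    return $out;
}
```

Once the data is in tag form, the same xpath-based extraction used for HTML feeds can be pointed at paths such as //jsontitle.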

parseFeedAuxInfo()

parseFeedAuxInfo(  $feed)

Information about how to parse feeds other than RSS and Atom feeds is stored in the MEDIA_SOURCE table in the AUX_INFO column. When a feed is read from this table, this method is used to parse this column into additional fields which are easier to use for manipulating feed data. Example feed types for which this parsing is needed are html, json, and regex feeds.

In the case of an rss or atom feed this method assumes the AUX_INFO field just contains an xpath expression for finding a feed_item's image, and so just parses the AUX_INFO field into an IMAGE_XPATH field.

Parameters

$feed

getFeedArchive()

getFeedArchive()

getThumbs()

getThumbs(array  $thumb_sites)

Parameters

array	$thumb_sites