MAX_FEEDS_ONE_GO
Maximum number of feeds to download in one try
A media job to download and index feeds from various search sources (RSS, HTML scraper, etc.). The idea is that this job runs once an hour to get the latest news, movies, and weather from those sources.
Subclasses should implement the methods they use among init(), checkPrerequisites(), nondistributedTasks(), prepareTasks(), finishTasks(), getTasks(), doTasks(), and putTask(). MediaUpdating can be configured to run in either distributed or name-server-only mode. In the former mode, prepareTasks() and finishTasks() run on the name server, getTasks() and putTask() run in the name server's web app, and doTasks() runs on any MediaUpdater client. In the latter mode, only the method nondistributedTasks() is called, and only by the MediaUpdater on the name server.
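The division of work between the two modes can be pictured with a small stand-alone sketch. The class SketchJob and the function runJob() are hypothetical and do not use Yioop's MediaJob base class. The call order inside runJob() is an assumption made for illustration; the description above only states where each method runs. The sketch uses the name putTasks(), matching the method signature documented for this class.

```php
<?php
// Hypothetical stand-in for a media job; NOT Yioop's MediaJob class.
class SketchJob
{
    public array $log = [];
    public function prepareTasks(): void { $this->log[] = "prepareTasks"; }
    public function getTasks(int $machine_id, ?array $data = null): array
    {
        $this->log[] = "getTasks";
        return ["feed-for-machine-$machine_id"];
    }
    public function doTasks(array $tasks): array
    {
        $this->log[] = "doTasks";
        return array_map('strtoupper', $tasks);
    }
    public function putTasks(int $machine_id, $data): array
    {
        $this->log[] = "putTasks";
        return ["received" => count($data)];
    }
    public function finishTasks(): void { $this->log[] = "finishTasks"; }
    public function nondistributedTasks(): void
    {
        $this->log[] = "nondistributedTasks";
    }
}

// Hypothetical driver; the call order is an assumption for illustration.
function runJob(SketchJob $job, bool $distributed, array $machine_ids): array
{
    if (!$distributed) {
        $job->nondistributedTasks();      // name-server-only mode
        return $job->log;
    }
    $job->prepareTasks();                 // on the name server
    foreach ($machine_ids as $id) {
        $tasks = $job->getTasks($id);     // in the name server's web app
        $result = $job->doTasks($tasks);  // on a MediaUpdater client
        $job->putTasks($id, $result);     // in the name server's web app
    }
    $job->finishTasks();                  // on the name server
    return $job->log;
}
```

Calling runJob(new SketchJob(), false, [0]) records only the nondistributedTasks() call, while passing true records the five distributed-mode steps for each machine id.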
__construct(object $media_updater = null,object $controller = null)
Instantiates the MediaJob with a reference to the object that instantiated it
object | $media_updater | a reference to the media updater that instantiated this object (if being run in MediaUpdater) |
object | $controller | a reference to the controller that instantiated this object (if being run in the web app) |
doTasks(array $tasks): mixed
For each feed source, downloads the feeds, checks which items are new, and makes an array of them. Then calls the method to add these items to both the IndexArchiveBundle for feeds
array | $tasks | array of feed info (url to download, paths to extract etc) |
the result of carrying out that processing
getTasks(integer $machine_id,array $data = null): array
Handles the request to get the array of feed sources that hash to a particular value, i.e., those matching the index of the requesting machine's hashed URL/name in the array of hashes of available machines
integer | $machine_id | id of machine making request for feeds |
array | $data | not used but inherited from the base MediaJob class as a parameter (so will always be null in this case) |
of feed urls and paths to extract from them
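As a rough illustration of assigning feed sources to machines by hash, the sketch below buckets feed URLs with crc32() modulo the number of machines. The function feedsForMachine() and the choice of crc32() are assumptions for this example; the hash that Yioop actually uses is not stated here.

```php
<?php
// Hypothetical helper: selects the feeds whose hash bucket equals the
// requesting machine's index. crc32() is an assumed stand-in hash.
function feedsForMachine(array $feeds, int $machine_index,
    int $num_machines): array
{
    $selected = [];
    foreach ($feeds as $feed) {
        if (crc32($feed) % $num_machines === $machine_index) {
            $selected[] = $feed;
        }
    }
    return $selected;
}
```

With this scheme every feed lands in exactly one bucket, so the machines together cover all feeds without overlap.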
putTasks(integer $machine_id,mixed $data): array
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server.
integer | $machine_id | id of client that is sending data to name server |
mixed | $data | results of computation done by client |
any response information to send back to the client
execNameServer(string $command,string $args = null): array
Executes a method on the name server's JobController.
It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.
string | $command | the method to invoke on the name server |
string | $args | additional arguments to be passed to the name server |
data returned by the name server.
updateFoundItemsOneGo(array $feeds,integer $age = \seekquarry\yioop\configs\ONE_WEEK,boolean $test_mode = false): mixed
Downloads one batch of $feeds_one_go feed items for updateFeedItems (see updateFeedItems). For each feed source, downloads the feeds, checks which items are not in the database, and adds them. This method does not update the inverted index shard.
array | $feeds | list of feeds to download |
integer | $age | how many seconds old records should be ignored |
boolean | $test_mode | if true, then rather than updating items in the database, returns the found feed items for the given feeds as a string |
either true, or if $test_mode is true then the results as a string of downloading the feeds and extracting the feed items
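The batching and age cutoff described for this method can be sketched as follows. Both helpers are hypothetical: $max_one_go stands in for MAX_FEEDS_ONE_GO (whose value is not given here), and the 'pubdate' item key is an assumed field name.

```php
<?php
// Hypothetical helper: splits feeds into batches of at most $max_one_go.
// $max_one_go must be at least 1.
function feedBatches(array $feeds, int $max_one_go): array
{
    return array_chunk($feeds, $max_one_go);
}

// Hypothetical helper: keeps items published within $age seconds of $now.
// The 'pubdate' key is an assumed field name.
function recentItems(array $items, int $age, int $now): array
{
    return array_values(array_filter($items,
        fn($item) => $now - $item['pubdate'] <= $age));
}
```

Passing the current time as $now, rather than calling time() inside the helper, keeps the age filter easy to test.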
getTags(\DOMDocument $dom,string $query): array
Returns an array of DOMDocuments for the nodes that match an XPath query on $dom, a DOMDocument
\DOMDocument | $dom | document to run XPath query on |
string | $query | xpath query to run |
of DOMDocuments, one for each node matching the XPath query in the original DOMDocument
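A minimal sketch of this behavior using PHP's DOM extension is shown below. The name getTagsSketch() is hypothetical, and the sketch assumes the query matches element nodes; it is not the class's actual implementation.

```php
<?php
// Sketch: returns one new DOMDocument per node matching $query in $dom.
function getTagsSketch(DOMDocument $dom, string $query): array
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    $docs = [];
    if ($nodes === false) {
        return $docs; // malformed XPath expression
    }
    foreach ($nodes as $node) {
        $doc = new DOMDocument();
        // importNode(..., true) deep-copies the node into the new document
        $doc->appendChild($doc->importNode($node, true));
        $docs[] = $doc;
    }
    return $docs;
}
```

Each returned document can then be queried on its own, for example with getElementsByTagName(), without touching the original feed document.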
addFoundItemsShard(integer $age): boolean
Copies all feed items newer than $age to a new shard, then deletes the old index shard and database entries older than $age. Finally, sets the copied shard to be active. If this method is going to take more than max_execution_time/2, it returns false so that an additional job can be scheduled; otherwise it returns true.
integer | $age | how many seconds old records should be deleted |
whether the job executed to completion
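The "stop at half the execution budget and report false" pattern can be sketched in isolation. The function copyWithinBudget() is hypothetical: $budget stands in for max_execution_time, and the injectable $clock is an assumption that makes the sketch testable without real delays.

```php
<?php
// Hypothetical sketch: copies items until half the time budget is used.
// $budget stands in for max_execution_time; $clock returns the current time.
function copyWithinBudget(array $items, float $budget, callable $clock): array
{
    $start = $clock();
    $copied = [];
    foreach ($items as $item) {
        if ($clock() - $start > $budget / 2) {
            return [false, $copied]; // unfinished; another job is needed
        }
        $copied[] = $item;
    }
    return [true, $copied]; // everything was copied in time
}
```

In real use the clock would be something like fn() => microtime(true); a caller that receives false would schedule a follow-up run to continue the work.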
addFoundItemsInvertedIndex(\seekquarry\yioop\library\IndexShard $tmp_shard,array $seen_sites,integer $seen_url_count): boolean
Helper method for addFoundItemsShard(). Checks if the current shard is full or not and adds items to it.
\seekquarry\yioop\library\IndexShard | $tmp_shard | a temporary shard holding all necessary information |
array | $seen_sites | of the sites and their corresponding hash |
integer | $seen_url_count | of how many sites have been seen before committing data to one shard |
whether the job executed to completion
updateTrendingTermCounts(array &$term_counts,string $source_phrase,array $word_or_phrase_list,string $media_category,string $source_name,string $lang,integer $pubdate,string $source_stop_regex = "")
Updates trending term counts based on the string from the current feed item.
array | $term_counts | lang => [term => occurrences] (passed by reference) |
string | $source_phrase | original non-stemmed phrase from feed item to adjust $term_counts with. Used to remember non-stemmed terms. We assume we have already extracted position lists from |
array | $word_or_phrase_list | associative array of stemmed_word_or_phrase => positions in the feed item where it occurs |
string | $media_category | media category of the feed source the item came from. We group trending counts by media category |
string | $source_name | name of the feed source the item came from. We exclude the name of the feed source from the counts |
string | $lang | locale_tag for this feed item |
integer | $pubdate | timestamp when string was published (used in weighting) |
string | $source_stop_regex | a regex to remove terms which occur frequently for this particular source |
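To make the lang => [term => occurrences] structure concrete, here is a simplified sketch of accumulating counts from a position list. The function countTermsSketch() is hypothetical and leaves out the media category grouping, the non-stemmed phrase bookkeeping, and the publication-date weighting that the parameters above describe.

```php
<?php
// Hypothetical sketch of accumulating lang => [term => occurrences].
// Skips the source's own name and anything matching $source_stop_regex.
function countTermsSketch(array &$term_counts, array $word_or_phrase_list,
    string $source_name, string $lang, string $source_stop_regex = ""): void
{
    foreach ($word_or_phrase_list as $term => $positions) {
        $term = (string) $term; // numeric-looking keys arrive as integers
        if (strcasecmp($term, $source_name) === 0) {
            continue; // exclude the feed source's name
        }
        if ($source_stop_regex !== ""
            && preg_match($source_stop_regex, $term)) {
            continue; // exclude source-specific stop terms
        }
        $term_counts[$lang][$term] = ($term_counts[$lang][$term] ?? 0)
            + count($positions);
    }
}
```

Because $term_counts is taken by reference, repeated calls for successive feed items keep adding to the same array.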
addTermCountsTrendingTable(resource $db,array $term_counts)
Updates the hourly, daily, and weekly top term occurrences in the TRENDING_TERM table.
Removes entries older than a week.
resource | $db | handle to database with TRENDING_TERM table |
array | $term_counts | for the most recent update of the feed index, it should be an array [$lang => [$term => $occurrences]] for the top NUM_TRENDING terms per language |
addFeedItemIfNew(array $item,string $source_name,string $lang,integer $age,$unique_fields): boolean
Adds $item to feed index bundle if it isn't already there
array | $item | data from a single feed item |
string | $source_name | string name of the feed $item was found on |
string | $lang | locale-tag of the feed |
integer | $age | how many seconds old records should be ignored |
$unique_fields |
whether an item was added
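One way to picture the "add only if new" check is a lookup keyed on a hash of selected item fields. Everything in the sketch below is an assumption: $unique_fields is undocumented above, so it is treated here as a list of field names, and the in-memory $seen array stands in for the feed index bundle.

```php
<?php
// Hypothetical sketch: an item is "new" if the hash of its unique fields
// has not been seen and its pubdate is within $age seconds of $now.
// $unique_fields as a list of field names is an assumption.
function addIfNewSketch(array &$seen, array $item, int $age, int $now,
    array $unique_fields): bool
{
    if ($now - ($item['pubdate'] ?? $now) > $age) {
        return false; // too old, ignored
    }
    $key_parts = [];
    foreach ($unique_fields as $field) {
        $key_parts[] = $item[$field] ?? "";
    }
    $key = hash('sha256', implode("\n", $key_parts));
    if (isset($seen[$key])) {
        return false; // already present
    }
    $seen[$key] = $item;
    return true; // item was added
}
```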
calculateMetas(string $lang,integer $pubdate,string $source_name,string $guid,string $media_category = "news"): array
Used to calculate the meta words for RSS feed items
string | $lang | the locale_tag of the feed item |
integer | $pubdate | UNIX timestamp publication date of item |
string | $source_name | the name of the feed |
string | $guid | the guid of the item |
string | $media_category | determines what media: metas to inject. Default is news. |
$meta_ids meta words found
convertJsonDecodeToTags(array $json_decode): string
Converts an associative array coming from a json_decode'd string to an HTML string in which the JSON fields have become tags prefixed with "json". This can then be handled in the rest of the feeds updater like an HTML feed.
array | $json_decode | associative array coming from a json_decode'd string |
result of converting array to an html string
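The conversion can be sketched as a short recursive function. The name jsonArrayToTagsSketch() is hypothetical, and the exact tag naming (the prefix "json" joined directly to the field name, with numeric list indices treated the same way) is an assumption rather than the method's documented output.

```php
<?php
// Hypothetical sketch; the exact tag naming used by Yioop is an assumption.
function jsonArrayToTagsSketch(array $json_decode): string
{
    $out = "";
    foreach ($json_decode as $key => $value) {
        // keep only characters that are safe in a tag name
        $tag = "json" . preg_replace('/[^A-Za-z0-9_]/', '', (string) $key);
        $inner = is_array($value) ? jsonArrayToTagsSketch($value)
            : htmlspecialchars((string) $value);
        $out .= "<$tag>$inner</$tag>";
    }
    return $out;
}
```

For instance, decoding {"title":"A & B"} with json_decode($json, true) and passing the result in would yield a jsontitle element with the ampersand escaped.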
parseFeedAuxInfo($feed)
Information about how to parse feeds other than RSS and Atom feeds is stored in the MEDIA_SOURCE table in the AUX_INFO column. When a feed is read from this table, this method is used to parse this column into additional fields which are easier to use for manipulating feed data. Example feed types for which this parsing is needed are HTML, JSON, and regex feeds.
In the case of an rss or atom feed this method assumes the AUX_INFO field just contains an xpath expression for finding a feed_item's image, and so just parses the AUX_INFO field into an IMAGE_XPATH field.
$feed |