$plugin_instances
$plugin_instances : array
indexing_plugins which might be used with the current processor
Base abstract class common to all processors used to create crawl summary information from images
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.
__construct(array $plugins = array(), integer $max_description_len = null, integer $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
integer | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER |
handle(string $page, string $url) : array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe that appeared in a page, a tweet, etc.
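The flow described for handle() can be sketched roughly as follows. This is a minimal illustration, not Yioop's code: the classes SketchPlugin, SketchProcessor, UrlOnlyProcessor, and WordCountPlugin, the pageProcessing() method name, and the plain string keys (Yioop uses CrawlConstants fields) are all assumptions made for the example.

```php
<?php
// Hypothetical stand-ins for illustration; none of these names are Yioop's.
interface SketchPlugin
{
    /** Returns an array of subdocument summaries found in $page. */
    public function pageProcessing(string $page, string $url): array;
}

abstract class SketchProcessor
{
    /** @var SketchPlugin[] plugins that may add subdocuments */
    protected array $plugin_instances;

    public function __construct(array $plugins = [])
    {
        $this->plugin_instances = $plugins;
    }

    /** Subclasses build the main summary here. */
    abstract public function process(string $page, string $url): array;

    /** Builds the summary, then lets each plugin contribute subdocuments. */
    public function handle(string $page, string $url): array
    {
        $summary = $this->process($page, $url);
        $summary['subdocs'] = [];
        foreach ($this->plugin_instances as $plugin) {
            foreach ($plugin->pageProcessing($page, $url) as $subdoc) {
                $summary['subdocs'][] = $subdoc;
            }
        }
        return $summary;
    }
}

// A trivial processor whose description is just the URL.
final class UrlOnlyProcessor extends SketchProcessor
{
    public function process(string $page, string $url): array
    {
        return ['title' => '', 'description' => $url, 'links' => []];
    }
}

// A trivial plugin that emits one subdocument holding the word count.
final class WordCountPlugin implements SketchPlugin
{
    public function pageProcessing(string $page, string $url): array
    {
        return [['description' => 'words: ' . str_word_count($page)]];
    }
}

$processor = new UrlOnlyProcessor([new WordCountPlugin()]);
$result = $processor->handle('two words', 'https://example.com/');
```

Keeping process() abstract and putting the plugin loop in handle() means a subclass only has to know how to summarize its own mimetype.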
process(string $page, string $url) : array
Extracts summary data from the image provided in $page, together with the URL in $url from which it was downloaded
ImageProcessor class defers a proper implementation of this method to subclasses
string | $page | the image represented as a character string |
string | $url | the url where the image was downloaded from |
summary information including a thumbnail and a description (where the description is just the url)
saveTempFile(string $page, string $url, string $file_extension)
Used to save a temporary file with the data downloaded for a url while carrying out image processing
string | $page | contains data about an image that one needs to save |
string | $url | where $page data came from |
string | $file_extension | to be associated with the $page data |
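One plausible shape for such a helper is sketched below. The function name saveTempFileSketch, the use of the system temp directory, the md5-based file name, and the returned path are all assumptions for illustration; the documentation above does not say where Yioop writes the file or what the method returns.

```php
<?php
/**
 * Hypothetical sketch: writes $page to a temporary file and returns its path.
 * The directory and naming scheme are assumptions, not Yioop's behavior.
 */
function saveTempFileSketch(string $page, string $url, string $file_extension): string
{
    // Hash the URL so the file name is filesystem-safe and stable per URL.
    $path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . md5($url) . '.'
        . ltrim($file_extension, '.');
    if (file_put_contents($path, $page) === false) {
        throw new RuntimeException("Could not write temporary file $path");
    }
    return $path;
}

$path = saveTempFileSketch('fake image bytes', 'https://example.com/a.png', 'png');
$round_trip = file_get_contents($path);
unlink($path); // clean up once image processing is done
```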
addWidthHeightSummary( $summary, string $image_string) : array
Given an $image_string, determines its width and height if possible, then assigns the values to the CrawlConstants::WIDTH and CrawlConstants::HEIGHT fields of $summary
(untyped) | $summary | summary information whose width and height fields are to be set |
string | $image_string | the image represented as a character string |
summary information including a thumbnail and a description (where the description is just the url)
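A minimal sketch of how the width and height could be determined from an image string, using PHP's built-in getimagesizefromstring(). The function name addWidthHeightSketch and the constants SKETCH_WIDTH and SKETCH_HEIGHT are hypothetical stand-ins for the method and for CrawlConstants::WIDTH and CrawlConstants::HEIGHT, whose actual values are not given here.

```php
<?php
// Hypothetical keys standing in for CrawlConstants::WIDTH and ::HEIGHT.
const SKETCH_WIDTH = 'width';
const SKETCH_HEIGHT = 'height';

/**
 * Adds width and height to $summary when $image_string can be parsed
 * as an image; otherwise returns $summary unchanged.
 */
function addWidthHeightSketch(array $summary, string $image_string): array
{
    // @ silences any notice PHP may raise for unreadable data.
    $size = @getimagesizefromstring($image_string);
    if ($size !== false) {
        $summary[SKETCH_WIDTH] = $size[0];  // index 0 holds the width
        $summary[SKETCH_HEIGHT] = $size[1]; // index 1 holds the height
    }
    return $summary;
}

// A 1x1 GIF, base64-encoded so the example needs no external file.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
$with_size = addWidthHeightSketch(['description' => 'https://example.com/dot.gif'], $gif);
$without_size = addWidthHeightSketch([], 'not an image');
```

Returning the summary unchanged on failure matches the "if possible" wording above: data that cannot be parsed as an image simply yields no width or height fields.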