$plugin_instances
$plugin_instances : array
Indexing plugins that might be used with the current processor
Base class common to all processors of web page data
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so to make a custom processor for a new mimetype one can add code there.
__construct(array $plugins = array(), integer $max_description_len = null, integer $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
integer | $summarizer_option | CRAWL_CONSTANT specifying which summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER |
handle(string $page, string $url) : array
Handles the processing of data for a web page. It makes a summary of the page (via the process() method, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page. The summary also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.
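The flow described for handle() can be sketched roughly as follows. This is a simplified stand-in, not Yioop's actual code: the class names SketchPageProcessor and SketchRecipePlugin, the plugin method name pageProcessing(), the plain string array keys, and passing plugin objects directly to the constructor are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for an indexing plugin. The method name
// pageProcessing() is an assumption, not a confirmed Yioop signature.
class SketchRecipePlugin
{
    public function pageProcessing(string $page, string $url): array
    {
        // Treat each line starting with "Recipe:" as one sub-document.
        $subdocs = [];
        foreach (explode("\n", $page) as $line) {
            if (strpos($line, "Recipe:") === 0) {
                $subdocs[] = ["title" => trim(substr($line, 7)),
                    "url" => $url];
            }
        }
        return $subdocs;
    }
}

// Simplified stand-in for the base class described above.
class SketchPageProcessor
{
    public $plugin_instances = [];

    // Assumption: plugins are passed in as ready-made objects.
    public function __construct(array $plugins = [])
    {
        $this->plugin_instances = $plugins;
    }

    // Subclasses are expected to override this with real extraction logic.
    public function process(string $page, string $url): array
    {
        $lines = explode("\n", trim($page));
        return ["title" => $lines[0], "description" => $page,
            "links" => [], "content" => $page];
    }

    // Build the summary, then let every plugin contribute sub-documents.
    public function handle(string $page, string $url): array
    {
        $summary = $this->process($page, $url);
        $summary["subdocs"] = [];
        foreach ($this->plugin_instances as $plugin) {
            foreach ($plugin->pageProcessing($page, $url) as $subdoc) {
                $summary["subdocs"][] = $subdoc;
            }
        }
        return $summary;
    }
}

$processor = new SketchPageProcessor([new SketchRecipePlugin()]);
$summary = $processor->handle("Dinner ideas\nRecipe: Lentil soup",
    "https://example.com/dinner");
```

In this sketch, $summary holds the fields produced by process() plus a subdocs array with one entry for the recipe line. Keeping process() separate from handle() means a subclass only has to supply the extraction step and inherits the plugin handling.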
process(string $page, string $url) : array
Should be implemented to compute a summary based on a text string of a document. This method is called from
string | $page | string of a document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page
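As an illustration of overriding process() for a new mimetype, here is a minimal sketch of a plain-text processor. It is self-contained, so it extends a hypothetical stand-in base class (DemoProcessorBase) rather than the real PageProcessor. The class names, the string array keys, the default length of 2000, and the URL-matching regular expression are assumptions for this example only.

```php
<?php
// Stand-in base class; in Yioop the real parent would be PageProcessor.
abstract class DemoProcessorBase
{
    public $max_description_len;

    // Assumption: a default of 2000 characters, chosen for illustration.
    public function __construct(int $max_description_len = 2000)
    {
        $this->max_description_len = $max_description_len;
    }

    abstract public function process(string $page, string $url): array;
}

// Hypothetical processor for plain-text documents.
class DemoTextProcessor extends DemoProcessorBase
{
    public function process(string $page, string $url): array
    {
        // Use the first line of the document as its title.
        $lines = preg_split('/\R/', trim($page));
        $title = trim($lines[0]);

        // Collect every distinct http(s) URL found in the text.
        preg_match_all('#https?://[^\s<>"]+#', $page, $matches);
        $links = [];
        foreach (array_unique($matches[0]) as $link) {
            $links[$link] = $link;
        }

        return [
            "title" => $title,
            // Truncate the description to the configured maximum length.
            "description" => substr($page, 0, $this->max_description_len),
            "links" => $links,
            "content" => $page,
        ];
    }
}

$processor = new DemoTextProcessor(20);
$summary = $processor->process(
    "Release notes\nSee https://example.com/changelog for details.",
    "https://example.com/notes.txt");
```

Here the description is cut to 20 characters and the links array contains the single URL found in the text. A real subclass would live in WORK_DIRECTORY/app/lib/processors and would likely use Yioop's own summary keys and summarizers instead of the simple truncation shown.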