$plugin_instances
$plugin_instances : array
Indexing plugins which might be used with the current processor
Processor class used to extract information from robots.txt files
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags/binary data/etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there if one wants to make a custom processor for a new mimetype.
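Because processors in that directory are auto-detected, a custom processor only needs to extend PageProcessor and implement process(). The following is a minimal sketch, not an implementation from Yioop itself: the namespace and the summary keys (self::TITLE, self::DESCRIPTION, self::LINKS) are assumptions about the CrawlConstants-style keys used by recent Yioop versions, and the class name is invented for illustration.

    <?php
    // Hypothetical custom processor for a new mimetype; namespace and
    // summary keys are assumptions, not guaranteed by this page.
    namespace seekquarry\yioop\library\processors;

    class ExampleTypeProcessor extends PageProcessor
    {
        // Build a summary array (title, description, links) for a page
        // of the new mimetype
        public function process($page, $url)
        {
            $summary = [];
            $summary[self::TITLE] = "";
            $summary[self::DESCRIPTION] = substr(strip_tags($page), 0, 400);
            $summary[self::LINKS] = [];
            return $summary;
        }
    }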
__construct(array $plugins = array(), integer $max_description_len = null, string $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor (a construction sketch follows the parameter list below)
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
string | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER |
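A hedged construction example follows. Only the parameter order comes from the signature above; the class name RobotProcessor (the robots.txt processor this page documents), the plugin class RecipePlugin, and the length value 2000 are assumptions used for illustration.

    // Illustrative only: plugin list and description length are made up
    $plugins = [new RecipePlugin()];
    $processor = new RobotProcessor($plugins, 2000,
        RobotProcessor::BASIC_SUMMARIZER);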
handle(string $page, string $url) : array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be subclassed) as well as runs any plugins associated with the processor to create sub-documents
string | $page | string of a web document |
string | $url | location the document came from |
a summary of (title, description, links, and content) of the information in $page; also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc. An illustrative call is sketched below.
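A sketch of calling handle(), assuming the $processor object constructed above; fetching the page with file_get_contents is only one convenient way to obtain the document string.

    // Illustrative call; requires allow_url_fopen for a remote fetch
    $url = "https://www.example.com/robots.txt";
    $page = file_get_contents($url);
    $summary = $processor->handle($page, $url);
    // $summary holds the title, description, links, and content fields
    // described above; plugin output, if any, appears as subdocuments.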
process(string $page, string $url) : array
Parses the contents of a robots.txt page, extracting allowed and disallowed paths, crawl-delay, and sitemaps. A list of all user agent strings seen is also extracted (an example input is sketched after the return description below).
string | $page | text string of a document |
string | $url | location the document came from, not used by TextProcessor at this point. Some of its subclasses override this method and use url to produce complete links for relative links within a document |
a summary of (title, description, links, and content) of the information in $page
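The example below shows the kind of input process() is described as handling; the robots.txt content and URL are made up, and $processor is the object from the earlier sketch.

    // A small robots.txt whose allowed/disallowed paths, crawl-delay,
    // and sitemap the method is documented to extract
    $robots_txt = "User-agent: *\n" .
        "Disallow: /private/\n" .
        "Allow: /private/index.html\n" .
        "Crawl-delay: 10\n" .
        "Sitemap: https://www.example.com/sitemap.xml\n";
    $summary = $processor->process($robots_txt,
        "https://www.example.com/robots.txt");
    // Per the description above, $summary records the paths, the
    // crawl-delay, the sitemap link, and the user agents seen ("*" here).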
makeCanonicalRobotPath(string $path) : string
Converts a path in a robots.txt file into a standard form usable by Yioop. For robot paths, foo is treated the same as /foo. The path might contain urlencoded characters; these are all decoded except for %2F, which corresponds to a / (as per http://www.robotstxt.org/norobots-rfc.txt). Example conversions are sketched below.
string | $path | to convert |
Yioop canonical path
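Illustrative calls based on the description above; the expected results in the comments follow from that description, and whether the method is static or instance-level depends on the Yioop version.

    $processor->makeCanonicalRobotPath("foo");      // "/foo" (leading / added)
    $processor->makeCanonicalRobotPath("/a%20b");   // "/a b" (%20 decoded)
    $processor->makeCanonicalRobotPath("/a%2Fb");   // "/a%2Fb" (%2F left encoded)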