$plugin_instances
$plugin_instances : array
indexing_plugins which might be used with the current processor
Used to create crawl summary information for DOCX files
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.
__construct(array $plugins = array(), integer $max_description_len = null, string $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
string | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER |
process(string $page, string $url) : array
Used to extract the title, description and links from a docx file consisting of xml data.
string | $page | docx(zip) contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
a summary of the contents of the page
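As a rough illustration of the text-extraction step, the sketch below pulls the visible text out of WordprocessingML, the xml format a docx archive stores in word/document.xml. It assumes the xml has already been read out of the zip container, and the function name sketchDocxText is hypothetical; it is not taken from Yioop, whose process() may work differently.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): collects the text of every
 * <w:t> run in a WordprocessingML string and joins the pieces with spaces.
 */
function sketchDocxText(string $document_xml): string
{
    // <w:t> may carry attributes such as xml:space="preserve"; the pattern
    // does not match other w:t... elements like <w:tab/> or <w:tbl>
    preg_match_all('/<w:t(?:\s[^>]*)?>(.*?)<\/w:t>/s', $document_xml,
        $matches);
    // text runs are xml-escaped, so decode entities such as &amp;
    $pieces = array_map('html_entity_decode', $matches[1]);
    return trim(implode(" ", $pieces));
}
```

A title and description could then be cut from the returned string, for example by taking the first run as the title; that choice is likewise an assumption.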
calculateLang(string $sample_text = null, string $url = null) : string
Tries to determine the language of the document by looking at the $sample_text and $url provided
string | $sample_text | sample text to try to guess the language from |
string | $url | url of the web page; as a fallback, look at its country to figure out the language |
language tag for guessed language
getBetweenTags(string $string, integer $cur_pos, string $start_tag, string $end_tag) : array
Gets the text between two tags in a document starting at the current position.
string | $string | document to extract text from |
integer | $cur_pos | current position from which to start looking for text to extract |
string | $start_tag | starting tag that we want to extract after |
string | $end_tag | ending tag that we want to extract until |
pair consisting of the position in the document just after the end tag, together with the data between the two tags
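The behaviour described for getBetweenTags can be sketched with two strpos calls. This is a minimal sketch, not Yioop's implementation: the name sketchGetBetweenTags is hypothetical, and what is returned when a tag is missing (the document length and whatever text was found) is an assumption.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): returns
 * [position just after $end_tag, text between the two tags].
 */
function sketchGetBetweenTags(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    $cur_pos = max(0, $cur_pos);
    if ($cur_pos >= $len) {
        return [$len, ""]; // assumption: nothing left to scan
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""]; // assumption: no start tag means no text
    }
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // assumption: an unclosed tag yields the rest of the document
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}
```

Feeding the returned position back in as $cur_pos walks through successive tag pairs, which is presumably why the position is part of the return value.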
extractHttpHttpsUrls(string $page) : array
Tries to extract http or https links from a string of text.
Does this by a very approximate regular expression.
string | $page | text string of a document |
a set of http or https links that were extracted from the document
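A very approximate regular expression of the kind mentioned above might look like the following. The pattern, the trailing-punctuation trimming, and the name sketchExtractUrls are all assumptions made for illustration; the expression Yioop actually uses is not shown in this documentation.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): finds http/https links in plain
 * text and returns each distinct link once.
 */
function sketchExtractUrls(string $page): array
{
    // a url is taken to run until whitespace, a quote, or an angle bracket
    preg_match_all('~https?://[^\s<>"\']+~i', $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // drop punctuation that usually belongs to the sentence, not the url
        $url = rtrim($url, ".,;:!?)");
        $urls[$url] = true; // array keys de-duplicate the links
    }
    return array_keys($urls);
}
```

The approach trades accuracy for speed: it will miss links without a scheme and can clip urls that legitimately end in punctuation.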
closeDanglingTags(string& $page)
If the end of the file is reached before all open tags have been closed, this method closes those tags in the correct order.
string& | $page | a reference to an xml or html document |
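One common way to close tags "in the correct order" is to keep a stack of open tags and append closers from the top down. The sketch below follows that idea under stated assumptions: the name sketchCloseDanglingTags, the tag regex, and the short list of void html elements are hypothetical, and Yioop's method may be implemented differently.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): appends closing tags for any
 * elements still open at the end of $page, innermost first.
 */
function sketchCloseDanglingTags(string &$page): void
{
    // assumption: a few html elements that never take a closing tag
    $void_tags = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];
    preg_match_all('~<(/?)([a-zA-Z][\w:-]*)[^>]*?(/?)>~', $page, $tags,
        PREG_SET_ORDER);
    foreach ($tags as $tag) {
        [, $closing, $name, $self_closing] = $tag;
        if ($self_closing === '/' ||
            in_array(strtolower($name), $void_tags)) {
            continue; // nothing to close for <br/> and similar
        }
        if ($closing === '/') {
            // find the most recent matching open tag and pop back to it
            $pos = array_search($name, array_reverse($stack, true), true);
            if ($pos !== false) {
                $stack = array_slice($stack, 0, $pos);
            }
        } else {
            $stack[] = $name;
        }
    }
    while ($stack) {
        $page .= '</' . array_pop($stack) . '>';
    }
}
```

Because $page is passed by reference, the caller's string is modified in place and nothing is returned.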
handle(string $page, string $url) : array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page; it also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc.
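The control flow described for handle() can be sketched as follows. Plain callables stand in for the processor's process() method and for the indexing plugins; those callables, the name sketchHandle, and the 'subdocs' key are assumptions made for illustration rather than Yioop's actual interfaces.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): builds a page summary and
 * gathers sub-documents from each plugin.
 *
 * @param callable $process stand-in for process($page, $url)
 * @param array $plugins callables taking ($page, $url) and returning an
 *      array of sub-documents, or null if they found none
 */
function sketchHandle(string $page, string $url, callable $process,
    array $plugins): array
{
    $summary = $process($page, $url);
    $summary['subdocs'] = [];
    foreach ($plugins as $plugin) {
        $subdocs = $plugin($page, $url);
        if (is_array($subdocs)) {
            $summary['subdocs'] = array_merge($summary['subdocs'],
                $subdocs);
        }
    }
    return $summary;
}
```

Keeping the summary step and the plugin step separate means a subclass only needs to supply its own process() to get plugin handling for free.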
links(object $dom, string $site) : array
Returns up to MAX_LINK_PER_PAGE many links from the supplied dom object where links have been canonicalized according to the supplied $site information.
object | $dom | a document object with links on it |
string | $site | a string containing a url |
links from the $dom object