$plugin_instances
$plugin_instances : array
indexing_plugins which might be used with the current processor
Used to create crawl summary information for binary DOC files
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.
__construct(array $plugins = array(), integer $max_description_len = null, string $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
string | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER |
process(string $page, string $url) : array
Used to extract the title, description and links from a string consisting of Word Doc data (2004 or earlier).
string | $page | the web-page contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
a summary of the contents of the page
calculateLang(string $sample_text = null, string $url = null) : string
Tries to determine the language of the document by looking at the provided $sample_text and $url.
string | $sample_text | sample text from which to try to guess the language |
string | $url | URL of the web page; as a fallback, its country is used to figure out the language |
language tag for guessed language
getBetweenTags(string $string, integer $cur_pos, string $start_tag, string $end_tag) : array
Gets the text between two tags in a document starting at the current position.
string | $string | document to extract text from |
integer | $cur_pos | current position from which to look for text to extract |
string | $start_tag | starting tag that we want to extract after |
string | $end_tag | ending tag that we want to extract until |
pair consisting of where in the document we are after the end tag, together with the data between the two tags
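The behavior described for getBetweenTags can be illustrated with a minimal sketch. The function name `getBetweenTagsSketch` and its handling of missing tags are assumptions made for illustration, not Yioop's actual code.

```php
<?php
/**
 * Hypothetical stand-in for getBetweenTags(); not Yioop's implementation.
 * Returns [position just after $end_tag, text between the two tags].
 */
function getBetweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    // Guard against offsets outside the string (strpos would reject them).
    $cur_pos = max(0, min($cur_pos, $len));
    // Find the start tag at or after the current position.
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        // Assumption: with no start tag, report end of document and no text.
        return [$len, ""];
    }
    $start += strlen($start_tag);
    // Find the end tag that follows the start tag.
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: an unclosed tag runs to the end of the document.
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}

// Walk through a document by feeding the returned position back in.
$doc = "<a><b>hi</b><b>yo</b>";
[$pos, $first] = getBetweenTagsSketch($doc, 0, "<b>", "</b>");
[$pos, $second] = getBetweenTagsSketch($doc, $pos, "<b>", "</b>");
```

Returning the position together with the text is what lets a caller resume the scan where the previous call stopped.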
extractHttpHttpsUrls(string $page) : array
Tries to extract http or https links from a string of text.
Does this by a very approximate regular expression.
string | $page | text string of a document |
a set of http or https links that were extracted from the document
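A rough sketch of this kind of approximate, regex-based link extraction follows. The pattern, the trailing-punctuation trimming, and the name `extractHttpHttpsUrlsSketch` are assumptions for illustration; the expression Yioop actually uses may differ.

```php
<?php
/**
 * Hypothetical stand-in for extractHttpHttpsUrls(); not Yioop's code.
 */
function extractHttpHttpsUrlsSketch(string $page): array
{
    // Assumption: a URL is "http://" or "https://" followed by a run of
    // characters that are not whitespace, quotes, or angle brackets.
    $pattern = '@https?://[^\s"\'<>]+@i';
    preg_match_all($pattern, $page, $matches);
    // Trim trailing punctuation that usually belongs to the sentence.
    $urls = array_map(fn($u) => rtrim($u, ".,;:!?)"), $matches[0]);
    // Keep each URL once, reindexed from 0.
    return array_values(array_unique($urls));
}

$links = extractHttpHttpsUrlsSketch(
    'See http://example.com/a, then https://example.org/b?x=1. ' .
    'Again: http://example.com/a');
```

Because the match is only approximate, a caller should expect occasional false positives or truncated links in unusual text.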
closeDanglingTags(string& $page)
If the end of the file is reached before open tags have been closed, this method closes those tags in the correct order.
string& | $page | a reference to an xml or html document |
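One way to close dangling tags in the correct order is to track open tags on a stack and append the missing closing tags innermost first. The sketch below makes several assumptions that are not taken from Yioop: the function name, the short list of void elements, and the simplification that a tag cut off mid-way at the end of the file is ignored.

```php
<?php
/**
 * Hypothetical stand-in for closeDanglingTags(); not Yioop's code.
 * Appends closing tags for elements still open at the end of $page.
 */
function closeDanglingTagsSketch(string &$page): void
{
    // Assumption: these HTML void elements never need a closing tag.
    $void = ["br", "hr", "img", "input", "meta", "link"];
    // Group 1 is "/" for a closing tag, group 2 is the tag name,
    // group 3 is "/" for a self-closing tag such as <br/>.
    preg_match_all('@<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*?(/?)>@', $page,
        $matches, PREG_SET_ORDER);
    $open = [];
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if ($m[1] === "/") {
            // Drop the most recent matching open tag and anything after it.
            $keys = array_keys($open, $name, true);
            if ($keys) {
                $open = array_slice($open, 0, end($keys));
            }
        } elseif (($m[3] ?? "") !== "/" && !in_array($name, $void, true)) {
            $open[] = $name;
        }
    }
    // Close whatever is still open, innermost tag first.
    foreach (array_reverse($open) as $name) {
        $page .= "</$name>";
    }
}

$page = "<html><body><p>Hello <b>world";
closeDanglingTagsSketch($page);
// $page is now "<html><body><p>Hello <b>world</b></p></body></html>"
```

Taking `$page` by reference mirrors the documented signature, so the caller's document is repaired in place.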
handle(string $page, string $url) : array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be subclassed) and runs any plugins associated with the processor to create sub-documents.
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe that appeared in a page, a tweet, etc.
extractASCIIText(string $doc)
This is the main text extractor for Word docs. A Word Doc consists of a FIB, a Piece Table, and a DocumentStream. The last contains the text.
The piece table is supposed to be used to reconstruct the order of the text from the DocumentStream, and the FIB (file information block) is supposed to tell us where the piece table is. I am not using any of this for now. I am just brute-force looking for the text, which I know has to be at a page (256 byte) boundary. I then go until I no longer see ASCII. So the order of the extracted text might be wrong right now.
string | $doc | string data of a 2004 or earlier Word doc |
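The brute-force scan described above can be sketched as follows. This is a simplified illustration, not Yioop's code: the function name, the eight-byte minimum run, and the decision to count tab, carriage return, and newline as text are all assumptions.

```php
<?php
const PAGE_SIZE = 256; // page size named in the description above
const MIN_RUN = 8;     // assumption: require at least eight text bytes

/**
 * Hypothetical stand-in for extractASCIIText(); not Yioop's code.
 * Collects ASCII runs that begin on a 256-byte page boundary.
 */
function extractAsciiTextSketch(string $doc): string
{
    // Bytes treated as text: printable ASCII plus tab, CR and LF.
    $mask = implode("", array_map("chr", range(32, 126))) . "\t\r\n";
    $text = "";
    $len = strlen($doc);
    $pos = 0;
    while ($pos < $len) {
        // Length of the ASCII run that starts at this page boundary.
        $run = strspn($doc, $mask, $pos);
        if ($run >= MIN_RUN) {
            $text .= substr($doc, $pos, $run);
        }
        // Jump to the first page boundary past the end of the run.
        $pos = (intdiv($pos + $run, PAGE_SIZE) + 1) * PAGE_SIZE;
    }
    return $text;
}

// Three pages: binary filler, a text page, and a page with too little text.
$doc = str_repeat("\0", PAGE_SIZE)
    . str_pad("Hello from a Word doc", PAGE_SIZE, "\0")
    . str_pad("tiny", PAGE_SIZE, "\0");
$text = extractAsciiTextSketch($doc);
```

Because runs are emitted in file order rather than in piece-table order, the result can be out of sequence, which is the limitation the description already notes.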
checkPageForText(string $doc, integer $pos) : boolean
Scans the document starting at the given position, looking forward eight characters to see whether these are ASCII printable.
string | $doc | document to scan |
integer | $pos | position to start scanning |
whether the next eight characters were ASCII printable
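A minimal sketch of such an eight-character check is shown here. The function name, the choice of 0x20 through 0x7E as the printable range, and the treatment of a short window as "not text" are assumptions rather than details confirmed by the source.

```php
<?php
/**
 * Hypothetical stand-in for checkPageForText(); not Yioop's code.
 */
function checkPageForTextSketch(string $doc, int $pos): bool
{
    $window = substr($doc, $pos, 8);
    // Assumption: "ASCII printable" means bytes 0x20 through 0x7E, and a
    // window shorter than eight bytes does not count as text.
    return preg_match('/\A[\x20-\x7E]{8}\z/', $window) === 1;
}

$is_text = checkPageForTextSketch("\0\0abcdefgh", 2);
```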
checkAllZeros(string $doc, integer $pos) : boolean
Scans the document starting at the given position, looking forward eight characters to see whether these are all \0.
string | $doc | document to scan |
integer | $pos | position to start scanning |
whether the next eight characters were \0
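The all-zeros test can be expressed as a single comparison. The function name below is hypothetical, and the sketch assumes a window shorter than eight bytes should return false.

```php
<?php
/**
 * Hypothetical stand-in for checkAllZeros(); not Yioop's code.
 */
function checkAllZerosSketch(string $doc, int $pos): bool
{
    // Compare the eight-byte window against eight NUL bytes.
    return substr($doc, $pos, 8) === str_repeat("\0", 8);
}

$padding = checkAllZerosSketch("text" . str_repeat("\0", 8), 4);
```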
cleanTextBlock(string $doc, integer $pos) : string
Scans the document starting at the given position, looking forward eight characters and returning those characters which are ASCII printable.
string | $doc | document to scan |
integer | $pos | position to start scanning |
substring of ASCII printable characters
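Filtering an eight-character block down to its printable characters might look like the sketch below. As before, the function name and the 0x20 through 0x7E range are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical stand-in for cleanTextBlock(); not Yioop's code.
 */
function cleanTextBlockSketch(string $doc, int $pos): string
{
    $window = substr($doc, $pos, 8);
    // Remove every byte outside the printable range 0x20-0x7E.
    return (string) preg_replace('/[^\x20-\x7E]/', '', $window);
}

// Only the first eight bytes are examined: a b \0 c d \x01 e f
$clean = cleanTextBlockSketch("ab\0cd\x01ef\x7Fgh", 0);
```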