$plugin_instances
$plugin_instances :array
indexing_plugins which might be used with the current processor
Used to create crawl summary information for PDF files
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.
__construct(array $plugins = array(),integer $max_description_len = null,string $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
string | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER |
process(string $page,string $url): array
Used to extract the title, description and links from a string consisting of PDF data.
string | $page | a string consisting of web-page contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
summary of the contents of the page
calculateLang(string $sample_text = null,string $url = null): string
Tries to determine the language of the document by looking at the provided $sample_text and $url
string | $sample_text | sample text to try to guess the language from |
string | $url | URL of the web page; as a fallback, the country it indicates is used to figure out the language |
language tag for guessed language
getBetweenTags(string $string,integer $cur_pos,string $start_tag,string $end_tag): array
Gets the text between two tags in a document starting at the current position.
string | $string | document to extract text from |
integer | $cur_pos | current location from which to look for text to extract |
string | $start_tag | starting tag that we want to extract after |
string | $end_tag | ending tag that we want to extract until |
pair consisting of where in the document we are after the end tag, together with the data between the two tags
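The behaviour described above can be sketched as a standalone function. This is a minimal illustration, not the library's code: the name `sketchGetBetweenTags` is hypothetical, and the handling of a missing start or end tag (reporting the end of the document) is an assumption, since the documentation does not say what happens in those cases.

```php
<?php
/**
 * Hypothetical stand-in for getBetweenTags(); not Yioop's implementation.
 * Returns [position just past $end_tag, text between the two tags].
 */
function sketchGetBetweenTags(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    // Guard so strpos() is never given an offset outside the string
    if ($cur_pos < 0 || $cur_pos >= $len) {
        return [$len, ""];
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        // Assumption: with no start tag left, report the end of the document
        return [$len, ""];
    }
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: an unterminated tag runs to the end of the document
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}

// The returned position can be fed back in to walk through a document
$doc = "<a>one</a><a>two</a>";
[$pos, $text] = sketchGetBetweenTags($doc, 0, "<a>", "</a>");
[$pos2, $text2] = sketchGetBetweenTags($doc, $pos, "<a>", "</a>");
```

Returning the position together with the text lets a caller loop over repeated tag pairs without re-scanning the part of the document already consumed.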
extractHttpHttpsUrls(string $page): array
Tries to extract http or https links from a string of text.
Does this by a very approximate regular expression.
string | $page | text string of a document |
a set of http or https links that were extracted from the document
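As a rough illustration of what an approximate regular expression for this task might look like, here is a minimal sketch. The function name `sketchExtractHttpHttpsUrls`, the pattern, and the trimming of trailing punctuation are all assumptions made for the example; the pattern actually used by the library is not shown in this documentation.

```php
<?php
/**
 * Hypothetical sketch of approximate link extraction; not Yioop's code.
 * Returns a de-duplicated list of http/https URLs found in $page.
 */
function sketchExtractHttpHttpsUrls(string $page): array
{
    // Assumed pattern: the scheme, then a run of characters that are
    // not whitespace, quotes, or angle brackets
    preg_match_all('~https?://[^\s"\'<>]+~i', $page, $matches);
    // Sentence punctuation that trails a URL is unlikely to belong to it
    $urls = array_map(fn($url) => rtrim($url, ".,;:)"), $matches[0]);
    return array_values(array_unique($urls));
}

$links = sketchExtractHttpHttpsUrls(
    "See http://example.com/a, and https://example.org/b?x=1.");
```

A pattern this loose will both miss unusual URLs and accept some strings that are not valid URLs, which is the trade-off an approximate extractor makes for speed and simplicity.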
closeDanglingTags(string& $page)
If the end of the file is reached before open tags have been closed, this method closes these tags in the correct order.
string& | $page | a reference to an xml or html document |
handle(string $page,string $url): array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be overridden in subclasses) and runs any plugins associated with the processor to create sub-documents.
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe or a tweet that appeared in the page.
getEncodingTitle(string $pdf_string): array
Returns the first encoding format information found in the PDF document
string | $pdf_string | a string representing the PDF document |
[encoding, title]: which of the default (if any) PDF encoding formats is being used (MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.), as well as a title for the document if found
getText(string $pdf_string, $url,string $encoding = ""): string
Gets the text out of a PDF document
string | $pdf_string | a string representing the PDF document |
$url | the url where the page contents came from, used to canonicalize relative links |
string | $encoding | which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc. |
text extracted from the document
getNextObject(string $pdf_string,integer $cur_pos): string
Gets the data between an obj and endobj tag at the current position in a PDF document
string | $pdf_string | a string of a PDF document |
integer | $cur_pos | an integer position in that string |
the contents of the PDF object located at $cur_pos
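A minimal sketch of this lookup on a tiny hand-written PDF fragment follows. The function name `sketchGetNextObject` is hypothetical, and returning an empty string when no object is found is an assumption; the sketch follows the documented string return type.

```php
<?php
/**
 * Hypothetical sketch of reading one PDF object; not Yioop's code.
 * Returns the raw text between the next "obj" and "endobj" keywords.
 */
function sketchGetNextObject(string $pdf_string, int $cur_pos): string
{
    $len = strlen($pdf_string);
    if ($cur_pos < 0 || $cur_pos >= $len) {
        return "";
    }
    $start = strpos($pdf_string, "obj", $cur_pos);
    if ($start === false) {
        // Assumption: no further object yields an empty string
        return "";
    }
    $start += strlen("obj");
    $end = strpos($pdf_string, "endobj", $start);
    if ($end === false) {
        // Assumption: an unterminated object runs to the end of the data
        $end = $len;
    }
    return substr($pdf_string, $start, $end - $start);
}

$pdf = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    . "2 0 obj\n<< /Type /Page >>\nendobj\n";
$first = sketchGetNextObject($pdf, 0);
```

Because the search is a plain substring match, a starting position that falls inside an `endobj` keyword would match its trailing `obj`; the sketch assumes the caller starts at or before an object header.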
objectDictionaryHas(string $object_dictionary,array $type_array): bool
Checks if the PDF object's object dictionary is in a list of types
string | $object_dictionary | the object dictionary to check |
array | $type_array | the list of types to check against |
whether the object dictionary is in the list of types or not
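One plausible reading of this check is a substring test of each type name against the dictionary text. The sketch below assumes exactly that; the name `sketchObjectDictionaryHas` is hypothetical and the matching rule is an assumption, not something this documentation states.

```php
<?php
/**
 * Hypothetical sketch of a dictionary type check; not Yioop's code.
 * Assumes a type "matches" when its name occurs in the dictionary text.
 */
function sketchObjectDictionaryHas(string $object_dictionary,
    array $type_array): bool
{
    foreach ($type_array as $type) {
        if (strpos($object_dictionary, $type) !== false) {
            return true;
        }
    }
    return false;
}

$dictionary = "<< /Type /Font /Subtype /Type1 >>";
$is_font_or_image = sketchObjectDictionaryHas($dictionary,
    ["/Font", "/Image"]);
$is_page = sketchObjectDictionaryHas($dictionary, ["/Page"]);
```

A check like this is useful for skipping objects, such as fonts or images, whose streams carry no indexable text.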
parseText(string $data,string $encoding = ""): string
Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.
string | $data | source to extract character data from |
string | $encoding | which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc. |
extracted text
parseBrackets(string $data,integer $cur_pos,string $encoding = ""): array
Extracts text up to the next closing bracket
string | $data | source to extract character data from |
integer | $cur_pos | position to start in $data |
string | $encoding | which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc. |
pair consisting of the final position in $data as well as extracted text
parseParentheses(string $data,integer $cur_pos,string $encoding): array
Extracts ASCII text up to the next closing parenthesis
string | $data | source to extract character data from |
integer | $cur_pos | position to start in $data |
string | $encoding | which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc. |
pair consisting of the final position in $data as well as extracted text
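The following sketch illustrates the general idea of scanning a PDF literal string up to its closing parenthesis. It is a simplification under stated assumptions: the name `sketchParseParentheses` is hypothetical, the `$encoding` parameter is omitted, a backslash is treated as simply escaping the next character (so sequences such as `\n` or octal codes are not decoded), and the returned position is taken to be the index of the closing parenthesis.

```php
<?php
/**
 * Hypothetical sketch of reading a PDF literal string; not Yioop's code.
 * Starts just after an opening parenthesis and returns
 * [index of the closing parenthesis (or end of data), extracted text].
 */
function sketchParseParentheses(string $data, int $cur_pos): array
{
    $len = strlen($data);
    $out = "";
    while ($cur_pos < $len) {
        $char = $data[$cur_pos];
        if ($char === "\\" && $cur_pos + 1 < $len) {
            // Simplification: keep the escaped character literally
            $out .= $data[$cur_pos + 1];
            $cur_pos += 2;
            continue;
        }
        if ($char === ")") {
            break;
        }
        // Keep only printable ASCII characters
        $code = ord($char);
        if ($code >= 32 && $code < 127) {
            $out .= $char;
        }
        $cur_pos++;
    }
    return [$cur_pos, $out];
}

// Single quotes keep the backslashes, as they would appear in PDF data
$data = 'Hello \(PDF\) world) Tj';
[$pos, $text] = sketchParseParentheses($data, 0);
```

Here the escaped parentheses are kept as text, and scanning stops at the first unescaped closing parenthesis.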
convertChar(char $cur_char,string $encoding): string
Used to convert characters from one of the built-in PDF encodings to UTF-8
char | $cur_char | character to convert |
string | $encoding | which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc. |
resulting converted string for character