MAX_DOM_LEVEL
The constant represents the number of child levels at which the data is present in the content.opf file.
Used to create crawl summary information for EPUB files (those served as application/epub+zip).
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.
__construct(array $plugins = array(), integer $max_description_len = null, string $summarizer_option = self::BASIC_SUMMARIZER)
Sets up any indexing plugins associated with this page processor.
array | $plugins | an array of indexing plugins which might do further processing on the data handled by this page processor |
integer | $max_description_len | maximal length of a page summary |
string | $summarizer_option | CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER |
process(string $page, string $url) : array
Used to extract the title, description and links from a string consisting of ebook publication data.
string | $page | epub contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
a summary of the contents of the page
calculateLang(string $sample_text = null, string $url = null) : string
Tries to determine the language of the document by looking at the $sample_text and $url provided.
string | $sample_text | sample text from which to try to guess the language |
string | $url | url of the web page; as a fallback, the country in the url is used to figure out the language |
language tag for guessed language
getBetweenTags(string $string, integer $cur_pos, string $start_tag, string $end_tag) : array
Gets the text between two tags in a document starting at the current position.
string | $string | document to extract text from |
integer | $cur_pos | current position in the document from which to look for text to extract |
string | $start_tag | starting tag that we want to extract after |
string | $end_tag | ending tag that we want to extract until |
pair consisting of the position in the document just after the end tag, together with the data between the two tags
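The behaviour described for getBetweenTags() can be sketched as a standalone function. This is a minimal illustration and not Yioop's actual implementation: the name getBetweenTagsSketch is hypothetical, and what is returned when a tag is missing is an assumption, since the documentation above does not specify it.

```php
<?php
// Hypothetical standalone sketch; not the code used by Yioop.
function getBetweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    // Assumption: with nothing left to scan, return the end of the
    // document and an empty string.
    if ($cur_pos >= $len) {
        return [$len, ""];
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""];
    }
    // Move past the start tag so only the enclosed text is captured.
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: an unterminated tag yields the rest of the document.
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}
```

Because the returned position points just after the end tag, it can be passed back in as $cur_pos to walk through repeated tags one at a time.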
extractHttpHttpsUrls(string $page) : array
Tries to extract http or https links from a string of text.
Does this by a very approximate regular expression.
string | $page | text string of a document |
a set of http or https links that were extracted from the document
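A very approximate regular-expression extraction of this kind might look as follows. The pattern and the function name extractHttpHttpsUrlsSketch are assumptions made for illustration; the expression Yioop actually uses is not given in this documentation.

```php
<?php
// Hypothetical sketch; the pattern below is an assumption, not Yioop's.
function extractHttpHttpsUrlsSketch(string $page): array
{
    // Match http:// or https:// followed by characters that are unlikely
    // to be outside a URL (stop at whitespace, angle brackets, quotes).
    $pattern = '~https?://[^\s<>"\']+~i';
    preg_match_all($pattern, $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Drop trailing punctuation that usually belongs to the sentence.
        $url = rtrim($url, '.,;:!?)');
        // Use array keys to deduplicate while preserving first-seen order.
        $urls[$url] = true;
    }
    return array_keys($urls);
}
```

Being approximate, such a pattern will miss some valid URLs and may accept malformed ones, which is consistent with the caveat in the description above.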
closeDanglingTags(string &$page)
If the end of the file is reached before open tags have been closed, this method closes those tags in the correct order.
string | $page | a reference to an xml or html document |
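One way to close dangling tags in the correct order is to keep a stack of open tag names and append closing tags in reverse. The sketch below assumes this stack-based approach, which the documentation does not confirm is the one Yioop uses; the function name and the short list of void HTML elements are hypothetical, and attribute values containing ">" are not handled.

```php
<?php
// Hypothetical stack-based sketch; not the code used by Yioop.
function closeDanglingTagsSketch(string &$page): void
{
    // Assumption: a few HTML elements that never take a closing tag.
    $void = ['br', 'hr', 'img', 'input', 'link', 'meta'];
    // Groups: 1 = "/" for a closing tag, 2 = tag name,
    // 3 = "/" for a self-closing tag.
    preg_match_all('~<(/?)([a-zA-Z][\w:.-]*)[^<>]*?(/?)>~', $page,
        $matches, PREG_SET_ORDER);
    $open = [];
    foreach ($matches as $m) {
        $name = $m[2];
        if ($m[3] === '/' || in_array(strtolower($name), $void, true)) {
            continue;
        }
        if ($m[1] === '/') {
            // Find the most recent matching open tag and discard it
            // together with anything opened after it.
            $idx = array_search($name, array_reverse($open, true), true);
            if ($idx !== false) {
                $open = array_slice($open, 0, $idx);
            }
        } else {
            $open[] = $name;
        }
    }
    // Innermost tags are closed first.
    while ($open) {
        $page .= '</' . array_pop($open) . '>';
    }
}
```

Since $page is passed by reference, the document is modified in place and nothing is returned, matching the signature documented above.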
handle(string $page, string $url) : array
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be overridden in subclasses) and runs any plugins that are associated with the processor to create sub-documents.
string | $page | string of a web document |
string | $url | location the document came from |
a summary (title, description, links, and content) of the information in $page; it also has a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe or a tweet that appeared in the page.
xmlToObject(string $xml) : array
Used to extract the DOM tree containing information about the epub file, such as the title, author, language, and unique identifier of the book, from a string consisting of the ebook publication's content OPF file.
string | $xml | page contents |
information about the contents of the page
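To illustrate the kind of information held in an OPF file, the following sketch pulls the title, author, language, and identifier out of the Dublin Core elements using PHP's DOMDocument. This is a swapped-in technique for illustration only: the function name opfMetadataSketch is hypothetical, and xmlToObject() itself may build its result differently (for example, by walking child levels up to MAX_DOM_LEVEL).

```php
<?php
// Hypothetical sketch using DOMDocument; not Yioop's xmlToObject().
function opfMetadataSketch(string $xml): array
{
    // loadXML() rejects an empty string, so handle that case first.
    if ($xml === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Suppress parser warnings; malformed XML simply yields no data.
    if (!@$dom->loadXML($xml)) {
        return [];
    }
    // Namespace of the Dublin Core elements used in OPF metadata.
    $dc = 'http://purl.org/dc/elements/1.1/';
    $info = [];
    foreach (['title', 'creator', 'language', 'identifier'] as $field) {
        $nodes = $dom->getElementsByTagNameNS($dc, $field);
        if ($nodes->length > 0) {
            // Keep only the first occurrence of each field.
            $info[$field] = trim($nodes->item(0)->textContent);
        }
    }
    return $info;
}
```

Fields that are absent from the OPF file are simply left out of the returned array, so callers should check for a key before using it.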