\seekquarry\yioop\library\processors\EpubProcessor

Used to create crawl summary information for EPUB files (those served as application/epub+zip)

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.

Summary

Public methods: __construct(), process(), calculateLang(), getBetweenTags(), extractHttpHttpsUrls(), closeDanglingTags(), dom(), handle(), initializeIndexedFileTypes(), xmlToObject()

Public properties: $plugin_instances, $summarizer, $summarizer_option, $max_description_len, $mime_processor, $image_types, $indexed_file_types, $name, $attributes, $content, $children

Public constants: MAX_DOM_LEVEL

Protected methods: none found

Protected properties: none found

Protected constants: N/A

Private methods: none found

Private properties: none found

Private constants: N/A

Constants

MAX_DOM_LEVEL

MAX_DOM_LEVEL

The depth, in child levels, at which the data is found in the content.opf file.

Properties

$plugin_instances

$plugin_instances : array

indexing_plugins which might be used with the current processor

Type

array

$summarizer

$summarizer : object

Stores the summarizer object used by this instance of page processor to be used in generating a summary

Type

object

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$mime_processor

$mime_processor : array

Associative array of mime_type => (page processor name that can process that type).

Sub-classes add to this array with the types they handle

Type

array

$image_types

$image_types : array

Array of file types which should be considered images.

Sub-classes add to this array with the types they handle

Type

array

$indexed_file_types

$indexed_file_types : array

Array of file extensions which can be handled by the search engine; other extensions will be ignored.

Sub-classes add to this array with the types they handle

Type

array

$name

$name : string

The name of the tag element in an xml document

Type

string — name

$attributes

$attributes : string

The attribute of the tag element in an xml document

Type

string — attributes

$content

$content : string

The content of the tag element or attribute, used to extract the fields like title, creator, language of the document

Type

string — content

$children

$children : string

The child tag element of a tag element.

Type

string — children

Methods

__construct()

__construct(array  $plugins = array(), integer  $max_description_len = null, string  $summarizer_option = self::BASIC_SUMMARIZER) 

Sets up any indexing plugins associated with this page processor

Parameters

array $plugins

an array of indexing plugins which might do further processing on the data handled by this page processor

integer $max_description_len

maximal length of a page summary

string $summarizer_option

CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER

process()

process(string  $page, string  $url) : array

Used to extract the title, description and links from a string consisting of ebook publication data.

Parameters

string $page

epub contents

string $url

the url where the page contents came from, used to canonicalize relative links

Returns

array —

a summary of the contents of the page

calculateLang()

calculateLang(string  $sample_text = null, string  $url = null) : string

Tries to determine the language of the document by looking at the $sample_text and $url provided

Parameters

string $sample_text

sample text to try to guess the language from

string $url

url of the web page; as a fallback, look at its country to figure out the language

Returns

string —

language tag for guessed language

getBetweenTags()

getBetweenTags(string  $string, integer  $cur_pos, string  $start_tag, string  $end_tag) : array

Gets the text between two tags in a document starting at the current position.

Parameters

string $string

document to extract text from

integer $cur_pos

current location at which to start looking for text to extract

string $start_tag

starting tag that we want to extract after

string $end_tag

ending tag that we want to extract until

Returns

array —

pair consisting of where in the document we are after the end tag, together with the data between the two tags
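The behavior described above can be illustrated with a minimal standalone sketch. It is not Yioop's code: the function name getBetweenTagsSketch is hypothetical, and returning the document length with an empty string when a tag is missing is an assumption of this sketch.

```php
<?php
/**
 * Sketch of a getBetweenTags-style helper (hypothetical, not Yioop's code).
 * Returns [position just after the end tag, text between the two tags].
 */
function getBetweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    // Assumption: past the end of the document there is nothing to extract.
    if ($cur_pos < 0 || $cur_pos >= $len) {
        return [$len, ""];
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""];
    }
    // Move past the start tag to the first character of the wanted text.
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: with no end tag, take everything that remains.
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}
```

The returned position can be passed back in as $cur_pos to walk through repeated occurrences of the same pair of tags.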

extractHttpHttpsUrls()

extractHttpHttpsUrls(string  $page) : array

Tries to extract http or https links from a string of text.

Does this by a very approximate regular expression.

Parameters

string $page

text string of a document

Returns

array —

a set of http or https links that were extracted from the document
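As a rough illustration of what a "very approximate regular expression" for this task could look like, here is a standalone sketch. The pattern, the trimming of trailing punctuation, and the function name extractHttpHttpsUrlsSketch are all assumptions made for illustration; the expression Yioop actually uses is not given on this page.

```php
<?php
/**
 * Sketch of approximate http/https link extraction (hypothetical helper).
 * Returns the distinct links in order of first appearance.
 */
function extractHttpHttpsUrlsSketch(string $page): array
{
    // Approximate on purpose: a scheme followed by a run of characters
    // that are not whitespace, angle brackets, or quotes.
    preg_match_all('@https?://[^\s<>"\']+@i', $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Assumption: trailing punctuation belongs to the sentence, not the link.
        $url = rtrim($url, ".,;:!?)");
        $urls[$url] = true; // array keys de-duplicate the links
    }
    return array_keys($urls);
}
```

A pattern like this will miss links without an explicit scheme and may keep stray characters, which is the trade-off an approximate expression accepts.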

closeDanglingTags()

closeDanglingTags(string  &$page) 

If an end of file is reached before closed tags are seen, this method closes these tags in the correct order.

Parameters

string &$page

a reference to an xml or html document
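One way to close tags "in the correct order" is to keep a stack of open tag names and unwind it at the end of the input. The sketch below shows that idea only; it is not Yioop's implementation. The function name closeDanglingTagsSketch, the regular expression, and the choice to drop an incomplete trailing tag are assumptions of this sketch.

```php
<?php
/**
 * Sketch of closing dangling tags with a stack (hypothetical helper).
 * Modifies $page in place, as the by-reference parameter suggests.
 */
function closeDanglingTagsSketch(string &$page): void
{
    // Assumption: a tag cut off mid-way (e.g. "</dc:ti") is dropped.
    $last_lt = strrpos($page, "<");
    $last_gt = strrpos($page, ">");
    if ($last_lt !== false && ($last_gt === false || $last_gt < $last_lt)) {
        $page = substr($page, 0, $last_lt);
    }
    // Group 1: "/" for a closing tag, group 2: tag name,
    // group 3: "/" for a self-closing tag.
    preg_match_all('@<(/?)([a-zA-Z][\w:.-]*)[^<>]*?(/?)>@', $page,
        $matches, PREG_SET_ORDER);
    $open = [];
    foreach ($matches as $m) {
        if ($m[3] === "/") {
            continue; // self-closing tags never dangle
        }
        if ($m[1] === "/") {
            // Unwind the stack to the most recent matching open tag.
            $key = array_search($m[2], array_reverse($open, true), true);
            if ($key !== false) {
                $open = array_slice($open, 0, $key);
            }
        } else {
            $open[] = $m[2];
        }
    }
    // Close whatever is still open, innermost tag first.
    foreach (array_reverse($open) as $name) {
        $page .= "</" . $name . ">";
    }
}
```

This sketch does not special-case HTML void elements such as br or img written without a slash, so it suits XML-like input better than arbitrary HTML.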

dom()

dom(string  $page) : object

Return a document object based on a string containing the contents of a web page

Parameters

string $page

a web page

Returns

object —

document object

handle()

handle(string  $page, string  $url) : array

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.

Parameters

string $page

string of a web document

string $url

location the document came from

Returns

array —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.

initializeIndexedFileTypes()

initializeIndexedFileTypes() 

Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays.
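The registration pattern this implies (a processor's constructor adds the types it handles to shared static arrays) can be sketched without Yioop itself. Everything named here is hypothetical: PageProcessorStandIn is a reduced stand-in for the real base class, NotesProcessor and the "text/x-notes" mimetype are made up, and the "title" and "description" keys are placeholders rather than Yioop's actual summary keys.

```php
<?php
// Stand-in for Yioop's PageProcessor, reduced to the three static
// registries named on this page. Hypothetical: the real class does more.
abstract class PageProcessorStandIn
{
    public static $indexed_file_types = [];
    public static $image_types = [];
    public static $mime_processor = [];

    abstract public function process($page, $url);
}

// Hypothetical processor for a made-up "text/x-notes" mimetype.
class NotesProcessor extends PageProcessorStandIn
{
    public function __construct()
    {
        // Constructing the processor registers the types it handles.
        self::$indexed_file_types[] = "notes";
        self::$mime_processor["text/x-notes"] = "NotesProcessor";
    }

    public function process($page, $url)
    {
        // Assumption: first line is the title, the rest the description.
        $lines = preg_split('/\R/', trim($page));
        return [
            "title" => $lines[0],
            "description" => implode(" ", array_slice($lines, 1)),
        ];
    }
}

// Constructing one instance is enough to populate the shared arrays.
$processor = new NotesProcessor();
```

Because the arrays are static and declared only on the base class, every subclass writes into the same registries, which is what lets a single lookup map a mimetype to the processor that handles it.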

xmlToObject()

xmlToObject(string  $xml) : array

Used to extract the DOM tree containing information about the epub file, such as the title, author, language, and unique identifier of the book, from a string consisting of the content of the ebook publication's OPF file.

Parameters

string $xml

page contents

Returns

array —

information about the contents of the page
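To show the kind of data an OPF file carries, the following sketch reads the Dublin Core fields mentioned above using PHP's DOM extension. It does not reproduce xmlToObject(): it returns a flat array rather than a tree with name, attributes, content and children, and the function name opfMetadataSketch is hypothetical. It assumes the DOM extension is enabled.

```php
<?php
/**
 * Sketch of pulling title, creator, language and identifier out of an
 * OPF string (hypothetical helper, not Yioop's xmlToObject()).
 */
function opfMetadataSketch(string $xml): array
{
    if ($xml === "") {
        return [];
    }
    $dom = new DOMDocument();
    // Suppress parser warnings so malformed input yields an empty result.
    if (!@$dom->loadXML($xml)) {
        return [];
    }
    // Namespace of the Dublin Core elements (dc:title, dc:creator, ...).
    $dc = "http://purl.org/dc/elements/1.1/";
    $info = [];
    foreach (["title", "creator", "language", "identifier"] as $field) {
        $nodes = $dom->getElementsByTagNameNS($dc, $field);
        if ($nodes->length > 0) {
            // Keep only the first occurrence of each field.
            $info[$field] = trim($nodes->item(0)->textContent);
        }
    }
    return $info;
}
```

Fields that do not appear in the input are simply absent from the result, so callers should check for a key before using it.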