\seekquarry\yioop\library\processors\DocProcessor

Used to create crawl summary information for binary DOC files

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new MIME type.

Summary

Methods

__construct()
process()
calculateLang()
getBetweenTags()
extractHttpHttpsUrls()
closeDanglingTags()
dom()
handle()
initializeIndexedFileTypes()
extractASCIIText()
checkPageForText()
checkAllZeros()
cleanTextBlock()

Properties

$plugin_instances
$summarizer
$summarizer_option
$max_description_len
$mime_processor
$image_types
$indexed_file_types

Constants

No constants found

No protected or private methods or properties found

File: src/library/processors/DocProcessor.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\processors\PageProcessor > \seekquarry\yioop\library\processors\TextProcessor > \seekquarry\yioop\library\processors\DocProcessor

Properties

$plugin_instances

$plugin_instances : array

Indexing plugins which might be used with the current processor

Type

array

$summarizer

$summarizer : object

Stores the summarizer object that this page processor instance uses to generate a summary

Type

object

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$mime_processor

$mime_processor : array

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle

Type

array

$image_types

$image_types : array

Array of file types which should be considered images.

Sub-classes add to this array with the types they handle

Type

array

$indexed_file_types

$indexed_file_types : array

Array of file extensions which can be handled by the search engine. Other extensions will be ignored.

Sub-classes add to this array with the types they handle

Type

array
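The three static arrays above are filled in as processor objects are constructed. A minimal sketch of that registration pattern follows. SketchPageProcessor and SketchRtfProcessor are stand-in classes invented for illustration, not Yioop's real classes, and the "rtf" extension and "application/rtf" MIME type are only example values.

```php
<?php
// Stand-in for PageProcessor: NOT the real Yioop class, only the three
// static registries described above.
class SketchPageProcessor
{
    public static $mime_processor = [];
    public static $image_types = [];
    public static $indexed_file_types = [];
}

// Hypothetical subclass that registers the types it handles when built.
class SketchRtfProcessor extends SketchPageProcessor
{
    public function __construct()
    {
        // Inherited static properties share storage with the parent class,
        // so these writes are visible through SketchPageProcessor too.
        self::$indexed_file_types[] = "rtf";
        self::$mime_processor["application/rtf"] = "SketchRtfProcessor";
    }
}

// Constructing the processor populates the shared registries.
new SketchRtfProcessor();
var_export(SketchPageProcessor::$mime_processor);
```

Because the arrays are static and not redeclared in the subclass, every processor writes into the same registry, which matches the description of initializeIndexedFileTypes() populating them by constructing processors.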

Methods

__construct()

__construct(array  $plugins = array(), integer  $max_description_len = null, string  $summarizer_option = self::BASIC_SUMMARIZER)

Sets up any indexing plugins associated with this page processor

Parameters

array	$plugins	an array of indexing plugins which might do further processing on the data handled by this page processor
integer	$max_description_len	maximal length of a page summary
string	$summarizer_option	CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER

process()

process(string  $page, string  $url) : array

Used to extract the title, description and links from a string consisting of Word Doc data (2004 or earlier).

Parameters

string	$page	the web-page contents
string	$url	the url where the page contents came from, used to canonicalize relative links

Returns

array —

a summary of the contents of the page

calculateLang()

calculateLang(string  $sample_text = null, string  $url = null) : string

Tries to determine the language of the document by looking at the provided $sample_text and $url

Parameters

string	$sample_text	sample text to try to guess the language from
string	$url	url of the web page; as a fallback, the country in the url is used to figure out the language

Returns

string —

language tag for guessed language

getBetweenTags()

getBetweenTags(string  $string, integer  $cur_pos, string  $start_tag, string  $end_tag) : array

Gets the text between two tags in a document starting at the current position.

Parameters

string	$string	document to extract text from
integer	$cur_pos	current location from which to start looking for text to extract
string	$start_tag	starting tag that we want to extract after
string	$end_tag	ending tag that we want to extract until

Returns

array —

pair consisting of the position in the document just after the end tag, together with the data between the two tags
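A minimal sketch of the behavior described above, written as a standalone function rather than the class method. The name sketchGetBetweenTags is hypothetical, and what happens when a tag is missing (returning the document length, with either an empty string or the remaining text) is an assumption, not something this page specifies.

```php
<?php
/**
 * Hypothetical stand-in for getBetweenTags().
 * Returns [position just after $end_tag, text between the two tags].
 */
function sketchGetBetweenTags(
    string $string,
    int $cur_pos,
    string $start_tag,
    string $end_tag
): array {
    $len = strlen($string);
    // Assumption: an out-of-range position yields an empty result.
    if ($cur_pos < 0 || $cur_pos > $len) {
        return [$len, ""];
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""];
    }
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: with no end tag, take everything up to the end.
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}

// Usage: the returned position can be fed back in to continue scanning.
[$next_pos, $title] = sketchGetBetweenTags(
    "<a><title>Hi</title><b>", 0, "<title>", "</title>");
echo $next_pos . " " . $title . "\n";
```

Returning the new position alongside the text lets a caller walk through a document tag pair by tag pair without re-scanning from the start.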

extractHttpHttpsUrls()

extractHttpHttpsUrls(string  $page) : array

Tries to extract http or https links from a string of text.

Does this by a very approximate regular expression.

Parameters

string	$page	text string of a document

Returns

array —

a set of http or https links that were extracted from the document
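To illustrate what "a very approximate regular expression" can look like, here is a sketch under stated assumptions. The function name sketchExtractHttpHttpsUrls, the pattern, and the trimming of trailing punctuation are all illustrative choices, not the expression Yioop actually uses.

```php
<?php
/**
 * Hypothetical stand-in for extractHttpHttpsUrls().
 * Returns the distinct http/https links found in $page.
 */
function sketchExtractHttpHttpsUrls(string $page): array
{
    // Approximate: a scheme followed by anything that is not whitespace,
    // a quote, an angle bracket, or a parenthesis.
    preg_match_all('~https?://[^\s"\'<>()]+~i', $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Sentence punctuation at the end is probably not part of the link.
        $url = rtrim($url, ".,;:!?");
        $urls[$url] = true; // array keys de-duplicate the results
    }
    return array_keys($urls);
}

$text = "See http://example.com/a, and https://example.org/b?x=1. " .
    "Also http://example.com/a";
print_r(sketchExtractHttpHttpsUrls($text));
```

A pattern like this will both miss valid URLs and accept some invalid ones, which is the trade-off the word "approximate" signals.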

closeDanglingTags()

closeDanglingTags(string&  $page)

If an end of file is reached before closing tags are seen, this method closes the open tags in the correct order.

Parameters

string&	$page	a reference to an xml or html document
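One way to close tags "in the correct order" is to keep a stack of open tags and append the missing closers in reverse. The sketch below assumes that approach. The function name sketchCloseDanglingTags, the tag-matching pattern, and the short list of void elements are illustrative assumptions, and the real method may work differently.

```php
<?php
/**
 * Hypothetical stand-in for closeDanglingTags().
 * Appends closing tags for any elements still open at the end of $page.
 */
function sketchCloseDanglingTags(string &$page): void
{
    // Assumption: a small, non-exhaustive list of elements with no closer.
    $void = ["br", "hr", "img", "input", "meta", "link"];
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~',
        $page, $tags, PREG_SET_ORDER);
    $open = [];
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === "/" || in_array($name, $void, true)) {
            continue; // self-closing or void element: nothing to track
        }
        if ($tag[1] !== "/") {
            $open[] = $name; // opening tag: push on the stack
            continue;
        }
        // Closing tag: find the most recent matching opener, if any,
        // and drop it together with everything opened after it.
        $index = array_search($name, array_reverse($open, true), true);
        if ($index !== false) {
            $open = array_slice($open, 0, $index);
        }
    }
    // Whatever is left was never closed; close innermost first.
    foreach (array_reverse($open) as $name) {
        $page .= "</" . $name . ">";
    }
}

$page = "<html><body><p>Hello <b>world";
sketchCloseDanglingTags($page);
echo $page . "\n";
```

Taking the document by reference mirrors the string& parameter above: the page is repaired in place rather than returned.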

dom()

dom(string  $page) : object

Return a document object based on a string containing the contents of a web page

Parameters

string	$page	a web page

Returns

object —

document object

handle()

handle(string  $page, string  $url) : array

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents

Parameters

string	$page	string of a web document
string	$url	location the document came from

Returns

array —

a summary (title, description, links, and content) of the information in $page, together with a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe or a tweet that appeared in the page.

initializeIndexedFileTypes()

initializeIndexedFileTypes()

Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

extractASCIIText()

extractASCIIText(string  $doc)

This is the main Word doc text extractor. A Word doc consists of a FIB, a Piece Table, and a DocumentStream. The last contains the text.

The piece table is supposed to be used to reconstruct the order of the text from the DocumentStream, and the FIB (file information block) is supposed to tell us where the piece table is. I am not using any of this for now. I just brute-force look for the text, which I know has to be at a page (256 byte) boundary, and then read until I no longer see ASCII. So the text might currently be extracted in the wrong order.

Parameters

string	$doc	string data of a 2004 or earlier Word doc
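The brute-force scan described above can be sketched as follows. This is an illustration under assumptions, not the method's actual code: the function and constant names are hypothetical, "ASCII" is taken to mean bytes 0x20 to 0x7E plus tab, carriage return, and line feed, the eight-byte look-ahead is borrowed from checkPageForText(), and the newline added between runs is a choice made here for readability.

```php
<?php
const SKETCH_PAGE_LENGTH = 256; // page size stated in the description above
const SKETCH_WINDOW = 8;        // look-ahead length (assumption)

/** True if the SKETCH_WINDOW bytes starting at $pos all look like text. */
function sketchLooksLikeText(string $doc, int $pos): bool
{
    $pattern = '/^[\x20-\x7E\t\r\n]{' . SKETCH_WINDOW . '}$/D';
    return preg_match($pattern, (string) substr($doc, $pos, SKETCH_WINDOW)) === 1;
}

/** Hypothetical stand-in for extractASCIIText(). */
function sketchExtractAsciiText(string $doc): string
{
    // Same character set as the look-ahead pattern, as a strspn() mask.
    $printable = "\t\r\n" . implode("", array_map("chr", range(0x20, 0x7E)));
    $len = strlen($doc);
    $text = "";
    $pos = 0;
    while ($pos < $len) {
        if (!sketchLooksLikeText($doc, $pos)) {
            $pos += SKETCH_PAGE_LENGTH; // keep hunting on page boundaries
            continue;
        }
        // Text found: copy bytes until the first non-ASCII one.
        $run = strspn($doc, $printable, $pos);
        $text .= substr($doc, $pos, $run) . "\n";
        // Resume at the next page boundary at or after the end of the run.
        $pos = intdiv($pos + $run + SKETCH_PAGE_LENGTH - 1, SKETCH_PAGE_LENGTH)
            * SKETCH_PAGE_LENGTH;
    }
    return $text;
}

// Usage with synthetic data: a page of zeros, a page starting with text,
// a page of binary noise, then more text.
$doc = str_repeat("\0", 256)
    . str_pad("Hello, Word document text.", 256, "\0")
    . str_pad("\x01\x02\x03", 256, "\0")
    . "Second block of text";
echo sketchExtractAsciiText($doc);
```

Because runs are emitted in file order rather than piece-table order, a sketch like this shares the limitation noted above: text from a real document can come out in the wrong sequence.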

checkPageForText()

checkPageForText(string  $doc, integer  $pos) : boolean

Scans the document starting at the given position and looks forward eight characters to see whether these are ASCII printable.

Parameters

string	$doc	document to scan
integer	$pos	position to start scanning

Returns

boolean —

whether the next eight characters were ASCII printable
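A minimal sketch of such a check. The name sketchCheckPageForText is hypothetical, "ASCII printable" is assumed to mean bytes 0x20 to 0x7E, and returning false when fewer than eight bytes remain is also an assumption.

```php
<?php
/** Hypothetical stand-in for checkPageForText(). */
function sketchCheckPageForText(string $doc, int $pos): bool
{
    if ($pos < 0) {
        return false;
    }
    for ($i = $pos; $i < $pos + 8; $i++) {
        if (!isset($doc[$i])) {
            return false; // ran off the end of the document
        }
        $code = ord($doc[$i]);
        if ($code < 0x20 || $code > 0x7E) {
            return false; // control byte or non-ASCII byte
        }
    }
    return true;
}

var_dump(sketchCheckPageForText("abcdefgh\0", 0));
var_dump(sketchCheckPageForText("abc\0defgh", 0));
```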

checkAllZeros()

checkAllZeros(string  $doc, integer  $pos) : boolean

Scans the document starting at the given position and looks forward eight characters to see whether these are all \0.

Parameters

string	$doc	document to scan
integer	$pos	position to start scanning

Returns

boolean —

whether the next eight characters were all \0
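A minimal sketch of this check, assuming that a window shorter than eight bytes does not count as all zeros. The name sketchCheckAllZeros is hypothetical.

```php
<?php
/** Hypothetical stand-in for checkAllZeros(). */
function sketchCheckAllZeros(string $doc, int $pos): bool
{
    // Compare the eight-byte window against eight NUL bytes.
    return $pos >= 0 && substr($doc, $pos, 8) === str_repeat("\0", 8);
}

var_dump(sketchCheckAllZeros(str_repeat("\0", 8) . "text", 0));
var_dump(sketchCheckAllZeros("text" . str_repeat("\0", 8), 0));
```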

cleanTextBlock()

cleanTextBlock(string  $doc, integer  $pos) : string

Scans the document forward eight characters from the given position, returning those characters which are ASCII printable

Parameters

string	$doc	document to scan
integer	$pos	position to start scanning

Returns

string —

substring of ASCII printable characters
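A minimal sketch of this filtering step, again assuming that "ASCII printable" means bytes 0x20 to 0x7E. The name sketchCleanTextBlock is hypothetical.

```php
<?php
/** Hypothetical stand-in for cleanTextBlock(). */
function sketchCleanTextBlock(string $doc, int $pos): string
{
    // Take at most eight bytes from $pos, then drop anything unprintable.
    $block = (string) substr($doc, $pos, 8);
    return (string) preg_replace('/[^\x20-\x7E]/', '', $block);
}

echo sketchCleanTextBlock("ab\0c\x01defghij", 0) . "\n";
```

Only the eight-byte window is inspected, so a caller would advance $pos by eight and call again to clean a longer stretch.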