\seekquarry\yioop\library\processors\PdfProcessor

Used to create crawl summary information for PDF files

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.

Summary

Methods

__construct()
process()
calculateLang()
getBetweenTags()
extractHttpHttpsUrls()
closeDanglingTags()
dom()
handle()
initializeIndexedFileTypes()
getEncodingTitle()
getText()
getNextObject()
objectDictionaryHas()
getObjectDictionary()
getObjectStream()
parseText()
parseBrackets()
parseParentheses()
convertChar()

Properties

$plugin_instances
$summarizer
$summarizer_option
$max_description_len
$mime_processor
$image_types
$indexed_file_types

Constants

No constants found

No protected methods found

No protected properties found

No private methods found

No private properties found

File: src/library/processors/PdfProcessor.php
Package: Default
Class hierarchy:

\seekquarry\yioop\library\processors\PageProcessor
\seekquarry\yioop\library\processors\TextProcessor
\seekquarry\yioop\library\processors\PdfProcessor

Properties

$plugin_instances

$plugin_instances :array

indexing_plugins which might be used with the current processor

Type

array

$summarizer

$summarizer :object

Stores the summarizer object that this page processor instance uses to generate a summary

Type

object

$summarizer_option

$summarizer_option :string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$max_description_len

$max_description_len :integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$mime_processor

$mime_processor :array

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle

Type

array

$image_types

$image_types :array

Array of file types which should be considered images.

Sub-classes add to this array with the types they handle

Type

array

$indexed_file_types

$indexed_file_types :array

Array of file extensions which can be handled by the search engine; other extensions will be ignored.

Sub-classes add to this array with the types they handle

Type

array

Methods

__construct()

__construct(array  $plugins = array(),integer  $max_description_len = null,string  $summarizer_option = self::BASIC_SUMMARIZER)

Sets up any indexing plugins associated with this page processor

Parameters

array	$plugins	an array of indexing plugins which might do further processing on the data handled by this page processor
integer	$max_description_len	maximal length of a page summary
string	$summarizer_option	CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER

process()

process(string  $page,string  $url): array

Used to extract the title, description and links from a string consisting of PDF data.

Parameters

string	$page	a string consisting of web-page contents
string	$url	the url where the page contents came from, used to canonicalize relative links

Returns

array —

a summary of the contents of the page

calculateLang()

calculateLang(string  $sample_text = null,string  $url = null): string

Tries to determine the language of the document by looking at the provided $sample_text and $url

Parameters

string	$sample_text	sample text to try to guess the language from
string	$url	url of the web page; as a fallback, the country in the url is used to figure out the language

Returns

string —

language tag for guessed language

getBetweenTags()

getBetweenTags(string  $string,integer  $cur_pos,string  $start_tag,string  $end_tag): array

Gets the text between two tags in a document starting at the current position.

Parameters

string	$string	document to extract text from
integer	$cur_pos	current location from which to start looking for text to extract
string	$start_tag	starting tag that we want to extract after
string	$end_tag	ending tag that we want to extract until

Returns

array —

pair consisting of the position in the document just after the end tag, together with the data between the two tags
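
The return convention described above can be illustrated with a short, self-contained sketch. This is not Yioop's implementation: the function name betweenTagsSketch is hypothetical, and returning the end of the string with empty data when a tag is missing is an assumption made for the example.

```php
<?php
/**
 * Hypothetical stand-in for getBetweenTags(); not the Yioop source.
 * Returns [position just after $end_tag, text between the two tags].
 * Assumption: if either tag is missing, returns [strlen($string), ""].
 */
function betweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    if ($cur_pos < 0 || $cur_pos > $len) {
        return [$len, ""];
    }
    // Find the start tag at or after the current position
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""];
    }
    $start += strlen($start_tag);
    // Find the first end tag after the start tag
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        return [$len, ""];
    }
    return [$end + strlen($end_tag),
        substr($string, $start, $end - $start)];
}

// Usage: pull the body of a (toy) PDF object
[$pos, $text] = betweenTagsSketch("1 0 obj<</Length 3>>endobj", 0,
    "obj", "endobj");
```

Returning the position together with the data lets a caller loop over a document, feeding each returned position back in as the next $cur_pos.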

extractHttpHttpsUrls()

extractHttpHttpsUrls(string  $page): array

Tries to extract http or https links from a string of text.

Does this by a very approximate regular expression.

Parameters

string	$page	text string of a document

Returns

array —

a set of http or https links that were extracted from the document
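
As a rough illustration of what an approximate regular-expression extraction might look like, consider the sketch below. The pattern, the trailing-punctuation cleanup, and the function name extractUrlsSketch are all assumptions for this example; the expression Yioop actually uses may differ.

```php
<?php
/**
 * Hypothetical sketch of approximate URL extraction; the pattern is an
 * assumption, not the regular expression Yioop uses.
 */
function extractUrlsSketch(string $page): array
{
    // Match http:// or https:// followed by characters that are not
    // whitespace, quotes, angle brackets, or parentheses
    preg_match_all('~https?://[^\s"\'<>()]+~i', $page, $matches);
    // Strip trailing punctuation that usually belongs to the sentence,
    // then deduplicate while keeping first-seen order
    $urls = array_map(fn($u) => rtrim($u, ".,;:!?"), $matches[0]);
    return array_values(array_unique($urls));
}

$links = extractUrlsSketch(
    "See http://www.yioop.com/ and (https://example.org/a.pdf). " .
    "Again: http://www.yioop.com/");
```

A pattern like this trades precision for speed: it can miss URLs containing parentheses and can accept strings that are not valid URLs, which is acceptable when the text comes from loosely structured PDF content.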

closeDanglingTags()

closeDanglingTags(string  &$page)

If an end of file is reached before closing tags are seen, this method closes these tags in the correct order.

Parameters

string	&$page	a reference to an xml or html document

dom()

dom(string  $page): object

Return a document object based on a string containing the contents of a web page

Parameters

string	$page	a web page

Returns

object —

document object

handle()

handle(string  $page,string  $url): array

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins that are associated with the processor to create sub-documents.

Parameters

string	$page	string of a web document
string	$url	location the document came from

Returns

array —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.

initializeIndexedFileTypes()

initializeIndexedFileTypes()

Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

getEncodingTitle()

getEncodingTitle(string  $pdf_string): array

Returns the first encoding format information found in the PDF document

Parameters

string	$pdf_string	a string representing the PDF document

Returns

array —

[encoding, title], where encoding is which of the default (if any) PDF encoding formats is being used (MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.) and title is a title for the document, if found

getText()

getText(string  $pdf_string,  $url,string  $encoding = ""): string

Gets the text out of a PDF document

Parameters

string	$pdf_string	a string representing the PDF document
	$url	the url where the page contents came from, used to canonicalize relative links
string	$encoding	which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Returns

string —

text extracted from the document

getNextObject()

getNextObject(string  $pdf_string,integer  $cur_pos): string

Gets the content between an obj and endobj tag, starting at the current position in a PDF document

Parameters

string	$pdf_string	a string of a PDF document
integer	$cur_pos	an integer position in that string

Returns

string —

the contents of the PDF object located at $cur_pos

objectDictionaryHas()

objectDictionaryHas(string  $object_dictionary,array  $type_array): bool

Checks if the PDF object's object dictionary is in a list of types

Parameters

string	$object_dictionary	the object dictionary to check
array	$type_array	the list of types to check against

Returns

bool —

whether it is in the list or not

getObjectDictionary()

getObjectDictionary(string  $object_string): string

Gets the object dictionary portion of the current PDF object

Parameters

string	$object_string	represents the contents of a PDF object

Returns

string —

the object dictionary for the object

getObjectStream()

getObjectStream(string  $object_string): string

Gets the object stream portion of the current PDF object

Parameters

string	$object_string	represents the contents of a PDF object

Returns

string —

the object stream for the object
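
In the PDF format, an object's stream data sits between the stream and endstream keywords, with an end-of-line marker right after stream. A minimal sketch of pulling that portion out of an object string follows. The function name objectStreamSketch and the choice to return an empty string when no stream is present are assumptions, not Yioop's code, and the sketch does not decompress or decode the data.

```php
<?php
/**
 * Hypothetical sketch of extracting the raw stream portion of a PDF
 * object; not the Yioop source. Returns "" if no stream is found
 * (an assumption made for this example).
 */
function objectStreamSketch(string $object_string): string
{
    $start = strpos($object_string, "stream");
    if ($start === false) {
        return "";
    }
    $start += strlen("stream");
    // Skip the CRLF or LF that follows the stream keyword
    if (substr($object_string, $start, 2) === "\r\n") {
        $start += 2;
    } else if (substr($object_string, $start, 1) === "\n") {
        $start += 1;
    }
    $end = strpos($object_string, "endstream", $start);
    if ($end === false) {
        return "";
    }
    // Any end-of-line before endstream is left in the returned data
    return substr($object_string, $start, $end - $start);
}

$stream = objectStreamSketch("<< /Length 5 >>\nstream\nHello\nendstream");
```

Real streams are usually compressed (for example with a /FlateDecode filter named in the object dictionary), so the returned bytes generally need further decoding before text can be parsed from them.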

parseText()

parseText(string  $data,string  $encoding = ""): string

Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.

Parameters

string	$data	source to extract character data from
string	$encoding	which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Returns

string —

extracted text

parseBrackets()

parseBrackets(string  $data,integer  $cur_pos,string  $encoding = ""): array

Extracts text until the next closing bracket

Parameters

string	$data	source to extract character data from
integer	$cur_pos	position to start in $data
string	$encoding	which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Returns

array —

pair consisting of the final position in $data as well as extracted text

parseParentheses()

parseParentheses(string  $data,integer  $cur_pos,string  $encoding): array

Extracts ASCII text until the next closing parenthesis

Parameters

string	$data	source to extract character data from
integer	$cur_pos	position to start in $data
string	$encoding	which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Returns

array —

pair consisting of the final position in $data as well as extracted text
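
The scan described above can be sketched as a simple character loop. This is a simplified illustration rather than Yioop's implementation: the name parseParenthesesSketch is hypothetical, the $encoding parameter is omitted, $cur_pos is assumed to point just past the opening parenthesis, and backslash escapes are handled only by keeping the escaped character (sequences such as \n or octal codes are not decoded).

```php
<?php
/**
 * Hypothetical sketch of reading a PDF literal string up to its closing
 * parenthesis; not the Yioop source.
 * Assumption: $cur_pos points just past the opening "(".
 * Returns [position just after the closing ")", extracted ASCII text].
 */
function parseParenthesesSketch(string $data, int $cur_pos): array
{
    $len = strlen($data);
    $out = "";
    while ($cur_pos < $len) {
        $char = $data[$cur_pos];
        if ($char === ")") {
            $cur_pos++; // step past the closing parenthesis
            break;
        }
        if ($char === "\\" && $cur_pos + 1 < $len) {
            // Simplified escape handling: keep the escaped character
            $cur_pos++;
            $char = $data[$cur_pos];
        }
        // Keep printable ASCII only
        $ord = ord($char);
        if ($ord >= 32 && $ord < 127) {
            $out .= $char;
        }
        $cur_pos++;
    }
    return [$cur_pos, $out];
}

[$pos, $text] = parseParenthesesSketch('(Hello \(PDF\) world) Tj', 1);
```

As with the other parsing helpers, returning the final position alongside the text lets the caller resume scanning $data where this call stopped.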

convertChar()

convertChar(string  $cur_char,string  $encoding): string

Used to convert characters from one of the built in PDF encodings to UTF-8

Parameters

string	$cur_char	character to convert
string	$encoding	which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Returns

string —

resulting converted string for the character
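
One way to picture this kind of conversion is a per-encoding lookup table from byte values to UTF-8 strings. The sketch below is an assumption-laden illustration, not Yioop's code: the constant SKETCH_TABLES and the function convertCharSketch are hypothetical, the tables cover only a handful of bytes, and unknown bytes are mapped to an empty string by choice.

```php
<?php
/**
 * Hypothetical, deliberately tiny lookup tables (byte value => UTF-8).
 * Real encodings define many more entries; these few are illustrative.
 */
const SKETCH_TABLES = [
    "WinAnsiEncoding" => [
        0x93 => "\u{201C}", // left double quotation mark
        0x94 => "\u{201D}", // right double quotation mark
        0x96 => "\u{2013}", // en dash
        0xE9 => "\u{00E9}", // e with acute accent
    ],
    "MacRomanEncoding" => [
        0x8E => "\u{00E9}", // e with acute accent
        0xD0 => "\u{2013}", // en dash
        0xD2 => "\u{201C}", // left double quotation mark
        0xD3 => "\u{201D}", // right double quotation mark
    ],
];

/**
 * Hypothetical sketch of single-byte to UTF-8 conversion; not the
 * Yioop source. Unknown bytes become "" (an assumption).
 */
function convertCharSketch(string $cur_char, string $encoding): string
{
    $code = ord($cur_char);
    // Printable ASCII is the same in these encodings
    if ($code >= 0x20 && $code <= 0x7E) {
        return $cur_char;
    }
    return SKETCH_TABLES[$encoding][$code] ?? "";
}

$quote = convertCharSketch("\x93", "WinAnsiEncoding");
$e_acute = convertCharSketch("\x8E", "MacRomanEncoding");
```

The same byte can mean different characters in different encodings, which is why the encoding name has to travel with every call.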