\seekquarry\yioop\library\processors\ImageProcessor

Base abstract class common to all processors used to create crawl summary information from images

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.

Summary

Public methods: __construct(), handle(), process(), initializeIndexedFileTypes(), saveTempFile(), addWidthHeightSummary(), getXmpData(), createThumb()

Public properties: $plugin_instances, $summarizer, $summarizer_option, $max_description_len, $mime_processor, $image_types, $indexed_file_types

Constants: No constants found

Protected methods: No protected methods found

Protected properties: No protected properties found

Private methods: No private methods found

Private properties: No private properties found

Properties

$plugin_instances

$plugin_instances : array

indexing_plugins which might be used with the current processor

Type

array

$summarizer

$summarizer : object

Stores the summarizer object used by this instance of page processor to be used in generating a summary

Type

object

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$mime_processor

$mime_processor : array

Associative array of mime_type => (name of the page processor that can process that type). Sub-classes add to this array with the types they handle

Type

array
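
The mime_type => processor name mapping can be pictured with a small lookup. This is a minimal sketch, not Yioop's code: the array contents, the helper processorNameFor(), and the fallback name are assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for the registry described above; the entries
// and the fallback name are illustrative, not taken from Yioop.
$mime_processor = [
    "image/png" => "PngProcessor",
    "image/jpeg" => "JpgProcessor",
    "text/html" => "HtmlProcessor",
];

/**
 * Returns the processor name registered for a mime type, or $default
 * if none is registered. Parameters after ";" (such as a charset) are
 * ignored and the type is compared in lower case.
 */
function processorNameFor(array $registry, string $mime_type,
    string $default = "TextProcessor"): string
{
    $base_type = strtolower(trim(explode(";", $mime_type)[0]));
    return $registry[$base_type] ?? $default;
}

echo processorNameFor($mime_processor, "image/png"), "\n";
echo processorNameFor($mime_processor, "Text/HTML; charset=utf-8"), "\n";
echo processorNameFor($mime_processor, "application/x-unknown"), "\n";
```

Keeping the registry as a plain associative array means each sub-class only has to add its own entries for the types it handles.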

$image_types

$image_types : array

Array of file types which should be considered images.

Sub-classes add to this array with the types they handle

Type

array

$indexed_file_types

$indexed_file_types : array

Array of file extensions which can be handled by the search engine; other extensions will be ignored.

Sub-classes add to this array with the types they handle

Type

array

Methods

__construct()

__construct(array  $plugins = array(), integer  $max_description_len = null, integer  $summarizer_option = self::BASIC_SUMMARIZER) 

Sets up any indexing plugins associated with this page processor

Parameters

array $plugins

an array of indexing plugins which might do further processing on the data handled by this page processor

integer $max_description_len

maximal length of a page summary

integer $summarizer_option

CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER

handle()

handle(string  $page, string  $url) : array

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents

Parameters

string $page

string of a web document

string $url

location the document came from

Returns

array —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc.

process()

process(string  $page, string  $url) : array

Extract summary data from the image provided in $page together with the url in $url where it was downloaded from

ImageProcessor class defers a proper implementation of this method to subclasses

Parameters

string $page

the image represented as a character string

string $url

the url where the image was downloaded from

Returns

array —

summary information including a thumbnail and a description (where the description is just the url)

initializeIndexedFileTypes()

initializeIndexedFileTypes() 

Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

saveTempFile()

saveTempFile(string  $page, string  $url, string  $file_extension) 

Used to save a temporary file with the data downloaded for a url while carrying out image processing

Parameters

string $page

contains data about an image that one needs to save

string $url

where $page data came from

string $file_extension

to be associated with the $page data

addWidthHeightSummary()

addWidthHeightSummary(  $summary, string  $image_string) : array

Given an $image_string, determines, if possible, its width and height, then assigns the values to the CrawlConstants::WIDTH and CrawlConstants::HEIGHT fields of $summary

Parameters

$summary
string $image_string

the image represented as a character string

Returns

array —

summary information including a thumbnail and a description (where the description is just the url)
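
One way to read dimensions from an image held in a string is PHP's built-in getimagesizefromstring(). The sketch below is an assumption about how such a step could look, not the actual body of addWidthHeightSummary(). The function name addWidthHeight() and the keys WIDTH_FIELD and HEIGHT_FIELD are hypothetical stand-ins for CrawlConstants::WIDTH and CrawlConstants::HEIGHT.

```php
<?php
// Hypothetical field keys; Yioop uses CrawlConstants::WIDTH and
// CrawlConstants::HEIGHT, whose actual values are not shown here.
const WIDTH_FIELD = "width";
const HEIGHT_FIELD = "height";

/**
 * Adds width and height to $summary when they can be determined from
 * $image_string; otherwise returns $summary unchanged.
 */
function addWidthHeight(array $summary, string $image_string): array
{
    // getimagesizefromstring() returns false for unrecognized data;
    // @ suppresses the notice PHP may raise in that case
    $size = @getimagesizefromstring($image_string);
    if ($size !== false) {
        $summary[WIDTH_FIELD] = $size[0];
        $summary[HEIGHT_FIELD] = $size[1];
    }
    return $summary;
}

// A 1x1 PNG, base64-encoded so the example needs no external file
$png = base64_decode(
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhf" .
    "DwAChwGA60e6kgAAAABJRU5ErkJggg==");
$summary = addWidthHeight(
    ["description" => "http://example.com/a.png"], $png);
print_r($summary);
```

Leaving the summary untouched when the size cannot be read matches the "if possible" wording above: a caller still gets a usable summary for damaged or unsupported images.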

getXmpData()

getXmpData(string  $image_string) : array

Given an image, try to extract any XMP info from it.

Parameters

string $image_string

the image represented as a character string

Returns

array —

XMP data converted from XML format to an array-like format
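
XMP metadata is embedded in image files as an XML packet wrapped in an x:xmpmeta element, so a first step can be a plain string search. The sketch below only locates and returns that packet as a string; the conversion to an array is not shown. The function name extractXmpPacket() and the sample data are assumptions, not Yioop's implementation.

```php
<?php
/**
 * Returns the <x:xmpmeta>...</x:xmpmeta> packet found in $image_string,
 * or null if no complete packet is present.
 */
function extractXmpPacket(string $image_string): ?string
{
    $start = strpos($image_string, "<x:xmpmeta");
    if ($start === false) {
        return null;
    }
    $end_tag = "</x:xmpmeta>";
    $end = strpos($image_string, $end_tag, $start);
    if ($end === false) {
        return null;
    }
    // Include the closing tag itself in the returned packet
    return substr($image_string, $start, $end - $start + strlen($end_tag));
}

// Fake image data: arbitrary bytes surrounding a small XMP packet
$packet = '<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF></rdf:RDF></x:xmpmeta>';
$fake_image = "\xFF\xD8\xFF\xE1binary" . $packet . "\x00more binary";
echo extractXmpPacket($fake_image), "\n";
```

Searching the raw bytes avoids decoding the image at all, at the cost of ignoring any packet that is compressed or split across segments.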

createThumb()

createThumb(object  $image) 

Used to create a thumbnail from an image object

Parameters

object $image

image object with image