\seekquarry\yioop\library\processors\RobotProcessor

Processor class used to extract information from robots.txt files

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags/binary data/etc. that should not be indexed. Subclasses of PageProcessor stored in WORK_DIRECTORY/app/lib/processors will be detected by Yioop, so one can add code there if one wants to make a custom processor for a new mimetype.

Summary

Methods
Properties
Constants
__construct()
handle()
process()
initializeIndexedFileTypes()
makeCanonicalRobotPath()
$plugin_instances
$summarizer
$summarizer_option
$max_description_len
$mime_processor
$image_types
$indexed_file_types
No constants found

Properties

$plugin_instances

$plugin_instances : array

Indexing plugins which might be used with the current processor

Type

array

$summarizer

$summarizer : object

Stores the summarizer object used by this instance of the page processor to generate summaries

Type

object

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$mime_processor

$mime_processor : array

Associative array of mime_type => (name of the page processor that can process that type). Subclasses add the types they handle to this array.

Type

array

$image_types

$image_types : array

Array of file types which should be considered images.

Sub-classes add to this array with the types they handle

Type

array

$indexed_file_types

$indexed_file_types : array

Array of file extensions which can be handled by the search engine; other extensions will be ignored.

Sub-classes add to this array with the types they handle

Type

array

Methods

__construct()

__construct(array  $plugins = array(), integer  $max_description_len = null, string  $summarizer_option = self::BASIC_SUMMARIZER) 

Sets up any indexing plugins associated with this page processor

Parameters

array $plugins

an array of indexing plugins which might do further processing on the data handled by this page processor

integer $max_description_len

maximal length of a page summary

string $summarizer_option

CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

handle()

handle(string  $page, string  $url) : array

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be subclassed) as well as running any plugins associated with the processor to create sub-documents.

Parameters

string $page

string of a web document

string $url

location the document came from

Returns

array —

a summary of (title, description, links, and content) of the information in $page; also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc.

process()

process(string  $page, string  $url) : array

Parses the contents of a robots.txt page extracting allowed, disallowed paths, crawl-delay, and sitemaps. We also extract a list of all user agent strings seen.

Parameters

string $page

text string of a document

string $url

location the document came from, not used by TextProcessor at this point. Some of its subclasses override this method and use url to produce complete links for relative links within a document

Returns

array —

a summary of (title, description, links, and content) of the information in $page
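The extraction described above can be sketched as follows. This is a minimal Python illustration of the same idea, not the Yioop PHP implementation; the function name and the keys of the returned array are hypothetical.

```python
# Sketch of robots.txt parsing: collect allowed/disallowed paths,
# crawl-delay, sitemaps, and every user agent string seen.
# Illustration only -- not the Yioop PHP implementation.

def parse_robots_txt(page):
    summary = {"allowed": [], "disallowed": [], "crawl_delay": 0,
               "sitemaps": [], "agents_seen": []}
    for line in page.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            summary["agents_seen"].append(value)
        elif field == "allow":
            summary["allowed"].append(value)
        elif field == "disallow":
            summary["disallowed"].append(value)
        elif field == "crawl-delay":
            summary["crawl_delay"] = max(summary["crawl_delay"], int(value))
        elif field == "sitemap":
            summary["sitemaps"].append(value)  # value keeps its "://"
    return summary
```

Note that splitting on only the first colon preserves the `://` inside sitemap URLs.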

initializeIndexedFileTypes()

initializeIndexedFileTypes() 

Gets processors for the different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays.

makeCanonicalRobotPath()

makeCanonicalRobotPath(string  $path) : string

Converts a path in a robots.txt file into a standard form usable by Yioop. For robot paths, foo is treated the same as /foo. The path might contain urlencoded characters; these are all decoded except for %2F, which corresponds to a / (this is as per http://www.robotstxt.org/norobots-rfc.txt).

Parameters

string $path

to convert

Returns

string —

Yioop canonical path
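The canonicalization rule above can be sketched as follows. This is a Python illustration of the rule under stated assumptions, not the Yioop PHP implementation; the function name and the NUL-byte placeholder trick are this sketch's own choices.

```python
from urllib.parse import unquote

def make_canonical_robot_path(path):
    # A robot path "foo" is treated the same as "/foo".
    if not path.startswith("/"):
        path = "/" + path
    # Decode urlencoded characters, but keep %2F (an encoded "/")
    # distinct from a literal path separator, per the robots.txt
    # RFC draft. Assumes NUL never appears in a real path.
    placeholder = "\x00"
    path = path.replace("%2F", placeholder).replace("%2f", placeholder)
    path = unquote(path)
    return path.replace(placeholder, "%2F")
```

So `foo` becomes `/foo`, `%20` decodes to a space, and `%2F` survives undecoded so it cannot be confused with a path separator.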