\seekquarry\yioop\library\indexing_plugins\WordfilterPlugin

WordFilterPlugin is used to filter documents by terms during a crawl.

When this plugin is in use, each document summary generated by a TextProcessor or one of its subclasses during a crawl is further processed by the plugin's pageSummaryProcessing method. First, a set of applicable rules is computed based on the url the summary came from (see the documentation in the factory example for more information on how the applicable rules are determined). Then, as part of this processing, the summary's title and description are sent to the method checkFilter. There they are compared against the array of rules $this->filter_rules, which consists of a list of rules, each of which has a PRECONDITIONS and an ACTIONS field. Actions can either be directives that might appear within a ROBOTS meta tag of an HTML document (NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE) or one of the words NOPROCESS, JUSTFOLLOW, NOTCONTAIN. The preconditions are checked in the function checkFilter. Details on what constitutes a legal precondition are described in the

Summary

Public methods: __construct(), pageProcessing(), pageSummaryProcessing(), postProcessing(), getProcessors(), getAdditionalMetaWords(), checkFilter(), saveConfiguration(), loadConfiguration(), setConfiguration(), configureHandler(), loadDefaultConfiguration(), parseRules(), serializeRules(), configureView()
Public properties: $index_archive, $db, $filter_rules, $default_rules_string, $rules_string
Constants: none
Protected methods and properties: none
Private methods and properties: none

Properties

$index_archive

$index_archive : object

The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method

Type

object

$db

$db : object

Reference to a database object that might be used by models on this plugin

Type

object

$filter_rules

$filter_rules : array

An array of rules. A rule is itself an array with two fields, PRECONDITIONS and ACTIONS. ACTIONS is an array with elements from NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE, NOTCONTAIN, JUSTFOLLOW, and NOPROCESS, which are to be followed if the PRECONDITIONS for the rule are met. PRECONDITIONS is an array of pairs term => frequency. Here term is a term to check for in the document, and frequency indicates how often the term must appear for the condition to hold. An integer frequency value greater than or equal to 1 is treated as the raw count of occurrences that is required; a value between 0 and 1 is treated as the fraction of the document that must be made up of occurrences of that term. The array in $this->filter_rules is typically created by calling $this->parseRules(), which converts the string in $this->rules_string into the format described above.

Type

array
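
As an illustration of the layout described above, a $filter_rules value might look like the following sketch. The string keys 'PRECONDITIONS' and 'ACTIONS' are an assumption made for readability (the plugin may well use class constants as keys), and the terms and thresholds are invented for the example.

```php
<?php
// Hypothetical rule set. The string keys stand in for whatever keys or
// constants the plugin actually uses; terms and thresholds are made up.
$filter_rules = [
    [
        // raw count: "casino" must occur at least 3 times
        'PRECONDITIONS' => ['casino' => 3],
        'ACTIONS' => ['NOINDEX', 'NOFOLLOW'],
    ],
    [
        // fraction: "lorem" must make up at least 5% of the document
        'PRECONDITIONS' => ['lorem' => 0.05],
        'ACTIONS' => ['NOPROCESS'],
    ],
];

// Walk the rules and describe how each threshold would be interpreted.
foreach ($filter_rules as $rule) {
    foreach ($rule['PRECONDITIONS'] as $term => $frequency) {
        $kind = ($frequency >= 1) ? 'raw count' : 'fraction of document';
        echo $term . ' (' . $kind . ' ' . $frequency . ') => '
            . implode(', ', $rule['ACTIONS']) . "\n";
    }
}
```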

$default_rules_string

$default_rules_string : string

Default rule string to be used if no other rules string is present

Type

string

$rules_string

$rules_string : string

A string containing a parsable set of filter_rules to be used by the WordFilterPlugin. The format of these rules is described in the default value of this rule string below.

Type

string

Methods

__construct()

__construct() 

Sets up the default word string for the word plugin

pageProcessing()

pageProcessing(string  $page, string  $url) : array

This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.

Parameters

string $page

web-page contents

string $url

the url where the page contents came from, used to canonicalize relative links

Returns

array —

consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION

pageSummaryProcessing()

pageSummaryProcessing(array  &$summary, string  $url) 

This method adds robots metas to, or entirely removes, a summary produced by a text page processor or its subclasses, depending on whether the summary title and description satisfy various rules in $this->filter_rules.

Parameters

array &$summary

the summary data produced by the relevant page processor's handle method; modified in-place.

string $url

the url where the summary contents came from
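
The flow described above can be sketched roughly as follows. This is not the plugin's code: applyRulesSketch, the 'TITLE', 'DESCRIPTION', and 'ROBOT_METAS' keys, the precondition callback, and the choice to represent a removed summary as null are all assumptions made for the example. The url-based selection of applicable rules and the JUSTFOLLOW and NOTCONTAIN actions are left out because their exact behavior is not described here.

```php
<?php
/**
 * Hypothetical sketch: apply each rule whose preconditions hold for the
 * title and description. NOPROCESS drops the summary (modelled as null);
 * other actions are collected under an assumed 'ROBOT_METAS' key.
 */
function applyRulesSketch(?array &$summary, array $filter_rules,
    callable $check): void
{
    if ($summary === null) {
        return;
    }
    $text = ($summary['TITLE'] ?? '') . "\n" .
        ($summary['DESCRIPTION'] ?? '');
    foreach ($filter_rules as $rule) {
        if (!$check($rule['PRECONDITIONS'], $text)) {
            continue;
        }
        if (in_array('NOPROCESS', $rule['ACTIONS'], true)) {
            $summary = null; // assumed representation of "removed entirely"
            return;
        }
        $summary['ROBOT_METAS'] = array_values(array_unique(array_merge(
            $summary['ROBOT_METAS'] ?? [], $rule['ACTIONS'])));
    }
}

// Simplified stand-in for the precondition check: raw counts only,
// case-sensitive, and every term must reach its threshold.
$raw_count_check = function (array $preconditions, string $text): bool {
    foreach ($preconditions as $term => $frequency) {
        if (substr_count($text, (string) $term) < $frequency) {
            return false;
        }
    }
    return true;
};

$rules = [
    ['PRECONDITIONS' => ['casino' => 2],
     'ACTIONS' => ['NOINDEX', 'NOFOLLOW']],
];
$summary = ['TITLE' => 'Casino bonus',
    'DESCRIPTION' => 'casino casino offers'];
applyRulesSketch($summary, $rules, $raw_count_check);

$drop_rules = [
    ['PRECONDITIONS' => ['offers' => 1], 'ACTIONS' => ['NOPROCESS']],
];
$dropped = ['TITLE' => 'Casino bonus',
    'DESCRIPTION' => 'casino casino offers'];
applyRulesSketch($dropped, $drop_rules, $raw_count_check);
```

In this sketch the first summary ends up carrying both directives, while the second is dropped because a matching rule lists NOPROCESS.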

postProcessing()

postProcessing(string  $index_name) 

This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.

Parameters

string $index_name

the name/timestamp of an IndexArchiveBundle to do post processing for

getProcessors()

getProcessors() : array

Which mime type page processors this plugin should do additional processing for

Returns

array —

an array of page processors

getAdditionalMetaWords()

getAdditionalMetaWords() : array

Returns an associative array of meta word => description length pairs, one for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for that meta word.

Returns

array —

meta words => description length pairs

checkFilter()

checkFilter(string  $preconditions, string  $test_string) : boolean

Used to check if $preconditions is met by a supplied string.

Parameters

string $preconditions

the terms and their frequencies to search for

string $test_string

string to check to see whether the preconditions are met

Returns

boolean —

whether the summary should be filtered or not
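
A minimal sketch of how such a check could work, following the raw-count versus fraction convention described for $filter_rules. The function name checkPreconditionsSketch is hypothetical, and three choices are assumptions of the example and not documented behavior: $preconditions is taken as an array of term => frequency pairs (the signature above types it as a string), every pair must hold, and a fraction is measured as term occurrences divided by the number of whitespace-separated words.

```php
<?php
/**
 * Hypothetical stand-in for checkFilter(); not the plugin's code.
 * Returns true only if every term => frequency pair is satisfied.
 * Terms are assumed to be non-empty strings.
 */
function checkPreconditionsSketch(array $preconditions,
    string $test_string): bool
{
    $text = strtolower($test_string);
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $num_words = count($words);
    foreach ($preconditions as $term => $frequency) {
        $count = substr_count($text, strtolower((string) $term));
        if ($frequency >= 1) {
            // raw count of occurrences required
            if ($count < $frequency) {
                return false;
            }
        } else {
            // assumed meaning of "fraction": occurrences per word
            if ($num_words === 0 || $count / $num_words < $frequency) {
                return false;
            }
        }
    }
    return true;
}

$meets_count = checkPreconditionsSketch(['spam' => 2],
    'spam and more spam');
$meets_fraction = checkPreconditionsSketch(['spam' => 0.6],
    'spam and more spam');
```

Here "spam" occurs twice among four words, so the count threshold of 2 is met while the fraction threshold of 0.6 is not.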

saveConfiguration()

saveConfiguration() 

Saves $this->rules_string, the field containing the rules string currently in use by this plugin, to a file.

loadConfiguration()

loadConfiguration() : array

Reads plugin configuration data from data/word_filter_plugin.txt on the name server into $this->rules_string. It then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary).

Returns

array —

configuration associative array

setConfiguration()

setConfiguration(array  $configuration) 

Takes a configuration array of rules and sets them as the rules for this instance of the plugin. Typically used on a queue_server or on a fetcher. It first sets the value of $this->filter_rules; then, in case saveConfiguration() is called later, it also calls serializeRules to store the serial format in $this->rules_string.

Parameters

array $configuration

configureHandler()

configureHandler(array  &$data) 

Behaves as a "controller" for the configuration page of the plugin.

It is called by the AdminController pageOptions activity method to let the plugin handle any configuration $_REQUEST data sent by this activity with regard to the plugin. This method sees if the $_REQUEST has word filter plugin configuration data, and if so cleans and saves it. It then modifies $data so that if the plugin's configuration view is drawn it makes use of the current plugin configuration info.

Parameters

array &$data

info to be used by the admin view to draw itself.

loadDefaultConfiguration()

loadDefaultConfiguration() : array

Reads plugin configuration data from the default setting of this plugin. It then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary).

Returns

array —

configuration associative array

parseRules()

parseRules() 

Parses the rules in the string $this->rules_string into the array $this->filter_rules. $this->filter_rules is used when $this->pageSummaryProcessing(&$summary) is called.

serializeRules()

serializeRules() 

This is used to convert the array in $this->filter_rules into a string format in $this->rules_string which would be suitable for saving to disk or displaying on the configuration page.

configureView()

configureView(array  &$data) 

Used to draw the HTML configure screen for the word filter plugin.

Parameters

array &$data

contains configuration data to be used in drawing the view