$index_archive
$index_archive : object
The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
WordFilterPlugin is used to filter documents by terms during a crawl.
When this plugin is in use, each document summary that is generated by a TextProcessor or subclass during a crawl is further processed by the plugin's pageSummaryProcessing method. First, a set of applicable rules is computed based on the url the summary came from (see the documentation in the factory example for more info on how the applicable rules are determined). Then, as part of this processing, the summary's title and description are sent to the method checkFilter. There they are compared against the array of rules $this->filter_rules, which consists of a list of rules, each of which has a PRECONDITIONS and an ACTIONS field. Actions can either be directives that might appear within a ROBOTS meta tag of an HTML document (NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE) or one of the words NOPROCESS, JUSTFOLLOW, or NOTCONTAIN. The preconditions are checked in the function checkFilter. Details on what constitutes a legal precondition are described in the
$filter_rules : array
An array of rules. A rule is itself an array with two fields, PRECONDITIONS and ACTIONS. ACTIONS is an array with elements from NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE, NOTCONTAIN, JUSTFOLLOW, and NOPROCESS, which are to be followed if the PRECONDITIONS for the rule are met. PRECONDITIONS is an array of pairs term => frequency, where term is a term to check for in the document and frequency indicates how often the term must appear for the condition to hold. An integer frequency value greater than or equal to 1 is treated as the raw count of occurrences that is required; a value between 0 and 1 is treated as the fraction of the document that must be made up of occurrences of that term. The array in $this->filter_rules is typically created by calling $this->parseRules(), which converts the string in $this->rules_string into the format described above.
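As a rough illustration of the shape described above, a rules array might look like the following sketch. The terms and thresholds are invented for the example, and the helper meetsFrequency() is hypothetical (it is not a method of the plugin). It only shows one way a single term => frequency pair could be interpreted, assuming words are split on whitespace, matched case-insensitively, and that a fractional threshold is measured against the total word count.

```php
<?php
// Hypothetical rules in the PRECONDITIONS/ACTIONS shape described above.
// The terms and thresholds are invented for illustration only.
$filter_rules = [
    [
        "PRECONDITIONS" => ["casino" => 3, "poker" => 0.05],
        "ACTIONS" => ["NOINDEX", "NOFOLLOW"],
    ],
    [
        "PRECONDITIONS" => ["changelog" => 1],
        "ACTIONS" => ["JUSTFOLLOW"],
    ],
];

/**
 * Hypothetical helper (not part of the plugin): checks one
 * term => frequency pair against a piece of text.
 * Assumption: words are whitespace-separated and compared in lower case.
 */
function meetsFrequency(string $term, $frequency, string $text): bool
{
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    if ($total == 0) {
        return false;
    }
    // Number of words exactly equal to the (lower-cased) term.
    $count = count(array_keys($words, strtolower($term), true));
    if ($frequency >= 1) {
        // Values of 1 or more are treated as a required raw count.
        return $count >= $frequency;
    }
    // Values between 0 and 1 are treated as a required fraction of all words.
    return ($count / $total) >= $frequency;
}
```

How several preconditions within one rule combine is not stated in this documentation, so the sketch deliberately stops at a single pair.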
pageProcessing(string $page, string $url) : array
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
string | $page | web-page contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION
pageSummaryProcessing(array& $summary, string $url)
This method adds robots metas to, or entirely removes, a summary produced by a text page processor or its subclasses, depending on whether the summary title and description satisfy various rules in $this->filter_rules.
array& | $summary | the summary data produced by the relevant page processor's handle method; modified in-place. |
string | $url | the url where the summary contents came from |
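The in-place behavior described for pageSummaryProcessing can be sketched roughly as follows. This is not the plugin's actual code: applyFilterRules() is a hypothetical function, the string keys "TITLE", "DESCRIPTION", and "ROBOT_METAS" are assumed stand-ins for whatever constants the real summary uses, and $is_met is a caller-supplied callable standing in for checkFilter. The sketch also assumes that NOPROCESS is the action that removes the summary entirely, which this documentation does not state explicitly.

```php
<?php
/**
 * Hypothetical sketch of applying matched rules to a summary in place.
 * $is_met stands in for checkFilter(); the array keys are assumed names.
 */
function applyFilterRules(?array &$summary, array $filter_rules, callable $is_met): void
{
    // Title and description are the only parts of the summary that are tested.
    $text = ($summary["TITLE"] ?? "") . " " . ($summary["DESCRIPTION"] ?? "");
    foreach ($filter_rules as $rule) {
        if (!$is_met($rule["PRECONDITIONS"], $text)) {
            continue;
        }
        if (in_array("NOPROCESS", $rule["ACTIONS"], true)) {
            // Assumption: NOPROCESS drops the summary entirely.
            $summary = null;
            return;
        }
        // Otherwise record the rule's actions as robots metas, without duplicates.
        $existing = $summary["ROBOT_METAS"] ?? [];
        $summary["ROBOT_METAS"] = array_values(
            array_unique(array_merge($existing, $rule["ACTIONS"]))
        );
    }
}
```

Passing the precondition check in as a callable keeps the sketch independent of how preconditions are actually evaluated.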
postProcessing(string $index_name)
This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
string | $index_name | the name/timestamp of an IndexArchiveBundle to do post processing for |
getAdditionalMetaWords() : array
Returns an associative array of meta words => description length for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
meta words => description length pairs
checkFilter(string $preconditions, string $test_string) : boolean
Used to check whether $preconditions are met by a supplied string.
string | $preconditions | the terms and their frequencies to search for |
string | $test_string | string to check whether preconditions met |
whether the summary should be filtered or not
loadConfiguration() : array
Reads plugin configuration data from data/word_filter_plugin.txt on the name server into $this->rules_string. It then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary).
configuration associative array
setConfiguration(array $configuration)
Takes a configuration array of rules and sets them as the rules for this instance of the plugin. Typically used on a queue_server or on a fetcher. It first sets the value of $this->filter_rules; then, in case saveConfiguration() is called later, it also calls serializeRules to store the serial format in $this->rules_string.
array | $configuration |
configureHandler(array& $data)
Behaves as a "controller" for the configuration page of the plugin.
It is called by the AdminController pageOptions activity method to let the plugin handle any configuration $_REQUEST data sent by this activity with regard to the plugin. This method sees if the $_REQUEST has word filter plugin configuration data, and if so cleans and saves it. It then modifies $data so that if the plugin's configuration view is drawn it makes use of the current plugin configuration info.
array& | $data | info to be used by the admin view to draw itself. |