$index_archive
$index_archive : object
The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
Base indexing plugin class. An indexing plugin allows a developer to do additional processing on web pages during a crawl and then, after the crawl is over, to do post processing on the additional data that was collected. For example, during a crawl a plugin might analyze web pages and mark those that contain recipes with the meta word recipe:all; after the crawl is over, it might do post processing such as clustering the recipes found and adding additional meta words so that recipes can be retrieved by principal ingredient.
Yioop comes with two example subclasses of IndexingPlugin to illustrate how to write plugins: recipe_plugin.php and word_filter.php.
Subclasses of IndexingPlugin typically override some of the following four methods:
static getProcessors() -- returns an array of the names of the page processors with which the plugin should be used. For example, a plugin might want to alter the summary whenever an HtmlProcessor is used on a page, so this array should contain HtmlProcessor; on the other hand, the plugin might not need to alter anything when the JpgProcessor is in use, so the returned array shouldn't contain JpgProcessor.
pageProcessing($page, $url) -- which is called by a page processor when a page is being processed. It returns additional subdoc page summary info which is then handed back to the fetcher (@see pageProcessing method below for more info).
pageSummaryProcessing(&$summary) -- which is called by a page processor in a fetcher after the initial summary has been generated (by the processor itself and all plugins which are associated with the processor). This method can be used to further modify the summary.
getAdditionalMetaWords() -- which is called when meta words are extracted from a query at search time. This allows the plugin to specify its own meta words to be extracted from the query. @see getAdditionalMetaWords for more details on the return type of this method.
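As a rough illustration of how getProcessors() ties a plugin to particular page processors, here is a minimal, self-contained sketch. It does not use Yioop's real classes: IndexingPluginSketch, RecipeTaggerSketch, and the helper pluginsForProcessor() are hypothetical stand-ins, and how Yioop itself matches plugins to processors may differ.

```php
<?php
// Hypothetical, simplified stand-in for Yioop's IndexingPlugin base class.
abstract class IndexingPluginSketch
{
    /** In this sketch, a plugin is attached to no page processors by default. */
    public static function getProcessors()
    {
        return [];
    }
}

// Example plugin that only wants to run when HTML pages are processed.
class RecipeTaggerSketch extends IndexingPluginSketch
{
    public static function getProcessors()
    {
        // JpgProcessor is deliberately absent: nothing to alter for images.
        return ["HtmlProcessor"];
    }
}

/**
 * Hypothetical helper: returns the plugin class names that asked to be
 * used with the given page processor name.
 */
function pluginsForProcessor(array $plugin_classes, string $processor_name): array
{
    $selected = [];
    foreach ($plugin_classes as $class_name) {
        if (in_array($processor_name, $class_name::getProcessors(), true)) {
            $selected[] = $class_name;
        }
    }
    return $selected;
}
```

With these definitions, pluginsForProcessor(["RecipeTaggerSketch"], "HtmlProcessor") selects the plugin, while asking for "JpgProcessor" selects nothing.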
If you would like to write a plugin which can be configured on the Admin > Page Options page, then you need to write four other methods:
loadConfiguration() -- which can read plugin configuration data from persistent storage on the name server into an array or object when a crawl is started. This data is then automatically serialized and sent to queue servers as part of starting a crawl.
setConfiguration() -- which takes a configuration array or object and uses it to initialize an instance of the plugin on a queue_server or on a fetcher.
configureHandler(&$data) -- which is called by the AdminController pageOptions activity method to let the plugin handle any configuration $_REQUEST data sent by this activity with regard to the plugin, and also to let the plugin modify the $data which might be sent to the plugin's view. This method would typically be called on the name server and so can be used to save (or to call a method which saves) any configuration data extracted from the request.
configureView(&$data) -- which is called to draw the HTML configure screen used by the plugin given the information in $data. This might display a form a user would use to alter the behavior of the plugin.
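The four configuration methods above might fit together roughly as in the following sketch. Everything specific here is an assumption made for illustration: the class name ConfigurablePluginSketch, the filter_words setting and request field, and the use of a JSON file as the persistent storage. A real plugin would extend IndexingPlugin and store its data wherever suits the site.

```php
<?php
// Hypothetical stand-alone sketch; a real plugin would extend IndexingPlugin.
class ConfigurablePluginSketch
{
    /** Words this instance is configured with (assumed setting). */
    public $filter_words = [];
    /** Path of the file used as persistent storage (an assumption). */
    private $config_file;

    public function __construct(string $config_file)
    {
        $this->config_file = $config_file;
    }

    /** Reads the saved configuration into an array (name server side). */
    public function loadConfiguration()
    {
        if (!file_exists($this->config_file)) {
            return [];
        }
        $config = json_decode(file_get_contents($this->config_file), true);
        return is_array($config) ? $config : [];
    }

    /** Initializes this instance from a configuration array. */
    public function setConfiguration($configuration)
    {
        $this->filter_words = $configuration["filter_words"] ?? [];
    }

    /** Saves any submitted settings, then exposes current ones to the view. */
    public function configureHandler(&$data)
    {
        if (isset($_REQUEST["filter_words"]) &&
            is_string($_REQUEST["filter_words"])) {
            $words = preg_split('/\s+/', trim($_REQUEST["filter_words"]),
                -1, PREG_SPLIT_NO_EMPTY);
            file_put_contents($this->config_file,
                json_encode(["filter_words" => $words]));
        }
        $data["filter_words"] =
            $this->loadConfiguration()["filter_words"] ?? [];
    }

    /** Draws a small HTML form for editing the settings. */
    public function configureView(&$data)
    {
        $words = htmlspecialchars(implode(" ", $data["filter_words"] ?? []));
        echo '<form method="post">';
        echo '<label>Filter words: ';
        echo '<input type="text" name="filter_words" value="' . $words . '">';
        echo '</label><input type="submit" value="Save"></form>';
    }
}
```

In this sketch configureHandler() both persists the submitted words and refreshes $data, so configureView() always draws the saved state; loadConfiguration() and setConfiguration() are kept separate because, per the description above, they run at different times and possibly on different machines.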
Subclasses of IndexingPlugin stored in WORK_DIRECTORY/app/lib/indexing_plugins will be detected by Yioop. Keeping plugin code there makes it easier to upgrade Yioop: your site-specific code stays in the work directory, and you merely need to replace the Yioop folder when upgrading.
pageProcessing(string $page, string $url) : array
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
string | $page | web-page contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
Returns an array consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION.
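To make the return shape concrete, the following sketch turns each h2 heading and the paragraph that follows it into one subdoc array. It is only an illustration: the class name SectionSubdocPluginSketch is hypothetical, the TITLE and DESCRIPTION constants are local stand-ins whose values are made up (in Yioop they are inherited by the plugin), and the regular-expression extraction is a simplification rather than how the bundled plugins parse pages.

```php
<?php
// Hypothetical stand-alone sketch; a real plugin would extend IndexingPlugin.
class SectionSubdocPluginSketch
{
    // Stand-in constants; the values here are invented for the sketch.
    const TITLE = "title";
    const DESCRIPTION = "description";

    /**
     * Returns one subdoc array per <h2> heading that is immediately
     * followed by a <p> paragraph. $url is unused in this sketch; a real
     * plugin could use it to canonicalize relative links.
     */
    public function pageProcessing($page, $url)
    {
        $subdocs = [];
        $pattern = '/<h2[^>]*>(.*?)<\/h2>\s*<p[^>]*>(.*?)<\/p>/is';
        if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $subdocs[] = [
                    self::TITLE => trim(strip_tags($match[1])),
                    self::DESCRIPTION => trim(strip_tags($match[2])),
                ];
            }
        }
        return $subdocs;
    }
}
```

A page with two heading and paragraph pairs would yield an array of two subdocs, each carrying its own title and description; a page with none yields an empty array.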
pageSummaryProcessing(array &$summary, string $url)
Optionally modifies the page summary array produced by the PageProcessor handle method in place. This hook provides a way to easily modify the title, description, and meta words of a page. Only the PAGE, CRAWL_DELAY, ROBOT_PATHS, ROBOT_METAS, AGENT_LIST, TITLE, DESCRIPTION, META_WORDS, LANG, LINKS, and THUMB fields of the summary will be respected. If you add custom meta words, then you must define them in the getAdditionalMetaWords function for this plugin, or they will not be recognized in queries.
array | $summary | the summary data produced by the relevant page processor's handle method; passed by reference and modified in place. |
string | $url | the url where the summary contents came from |
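The in-place modification could look something like the sketch below, which appends the meta word recipe:all to summaries whose description mentions ingredients. The class name RecipeMetaPluginSketch, the detection rule, and the constant values are all assumptions made for illustration; in Yioop the DESCRIPTION and META_WORDS field names are inherited by the plugin rather than defined locally.

```php
<?php
// Hypothetical stand-alone sketch; a real plugin would extend IndexingPlugin.
class RecipeMetaPluginSketch
{
    // Stand-in constants; the values here are invented for the sketch.
    const DESCRIPTION = "description";
    const META_WORDS = "meta_words";

    /**
     * Adds the meta word recipe:all to $summary when its description
     * mentions "ingredients". The summary is changed in place, so
     * nothing is returned. $url is unused in this sketch.
     */
    public function pageSummaryProcessing(&$summary, $url)
    {
        $description = $summary[self::DESCRIPTION] ?? "";
        if (stripos($description, "ingredients") === false) {
            return;
        }
        if (!isset($summary[self::META_WORDS])) {
            $summary[self::META_WORDS] = [];
        }
        if (!in_array("recipe:all", $summary[self::META_WORDS], true)) {
            $summary[self::META_WORDS][] = "recipe:all";
        }
    }
}
```

Because $summary is taken by reference, the caller sees the added meta word without any return value. As noted above, a custom meta word such as this one would also need to be declared through getAdditionalMetaWords to be recognized in queries.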
postProcessing(string $index_name)
This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
string | $index_name | the name/timestamp of an IndexArchiveBundle to do post processing for |
getAdditionalMetaWords() : array
Returns an associative array of meta words => description length for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
meta words => description length pairs
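A minimal sketch of such a return value, and of one way the lengths might be consulted, follows. The class name RecipeWordsSketch, the meta word prefixes, the length 2000, the choice to declare the method static, and the helper snippetLengthForTerm() are all assumptions for illustration; how Yioop actually matches query terms against these keys is not described here.

```php
<?php
// Hypothetical stand-alone sketch; a real plugin would extend IndexingPlugin.
class RecipeWordsSketch
{
    /**
     * Meta words injected by this plugin, each mapped to the maximum
     * snippet length to use for it. Keys and lengths are invented.
     */
    public static function getAdditionalMetaWords()
    {
        return [
            "recipe:" => 2000,
            "ingredient:" => 2000,
        ];
    }
}

/**
 * Hypothetical helper: returns the snippet length for a query term that
 * starts with one of the plugin's meta words, else the default length.
 */
function snippetLengthForTerm(array $meta_words, string $term,
    int $default_length): int
{
    foreach ($meta_words as $prefix => $length) {
        if (strpos($term, $prefix) === 0) {
            return $length;
        }
    }
    return $default_length;
}
```

Under these assumptions, a query term such as recipe:all would be shown with the longer plugin-specified snippet, while an ordinary term falls back to the default length passed in.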