\seekquarry\yioop\library\indexing_plugins\RecipePlugin

This class handles recipe processing.

It extracts ingredients from recipe pages while crawling. After the crawl is stopped, it clusters the recipes using Kruskal's minimum spanning tree algorithm. This plugin was designed by looking at what was needed to screen scrape recipes from the following sites:

https://allrecipes.com/
http://www.geniuskitchen.com/
http://www.betterrecipes.com/
https://www.bettycrocker.com/

Summary

Methods
public: __construct(), pageProcessing(), pageSummaryProcessing(), postProcessing(), getProcessors(), getAdditionalMetaWords(), getIngredientName()
protected: none
private: extractMinimalSpanningTreeEdges(), getClusters()

Properties
public: $index_archive, $db, $basic_ingredients
protected: none
private: none

Constants
public: CLUSTER_RATIO, NUM_RECIPES_PER_SHARD
protected: N/A
private: N/A

Constants

CLUSTER_RATIO

Ratio of the number of clusters to the total number of recipes seen

NUM_RECIPES_PER_SHARD

Number of recipes to put into a shard before switching shards while clustering

Properties

$index_archive

$index_archive : object

The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method

Type

object

$db

$db : object

Reference to a database object that might be used by models on this plugin

Type

object

$basic_ingredients

$basic_ingredients : array

Ingredients that are common to many recipes and so are unlikely to be the main ingredient of a recipe

Type

array

Methods

__construct()

__construct() 

Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object

pageProcessing()

pageProcessing(string  $page, string  $url) : array

This method is called by a PageProcessor in its handle() method just after it has processed a web page. It allows an indexing plugin to do additional processing on the page, such as adding sub-documents, before the page summary is handed back to the fetcher. For the recipe plugin, the title of a sub-document will be the title of the recipe. The description will consist of the ingredients of the recipe, separated by ||.

Parameters

string $page

web-page contents

string $url

the url where the page contents came from, used to canonicalize relative links

Returns

array —

consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION
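The shape of one returned subdoc can be sketched as follows. This is a minimal illustration, not the plugin's code: the constants TITLE and DESCRIPTION are hypothetical stand-ins for self::TITLE and self::DESCRIPTION, and makeRecipeSubdoc() is an invented helper name.

```php
<?php
// Hypothetical stand-ins for the plugin's self::TITLE and
// self::DESCRIPTION keys (the real key values are not shown in this page).
const TITLE = 'title';
const DESCRIPTION = 'description';

/**
 * Builds one subdoc array: the recipe title plus a description made of
 * the ingredient lines joined with "||". Helper name is an assumption.
 */
function makeRecipeSubdoc(string $title, array $ingredients): array
{
    // Trim each ingredient line and drop lines that end up empty.
    $cleaned = array_values(array_filter(
        array_map('trim', $ingredients),
        function ($item) {
            return $item !== '';
        }
    ));
    return [TITLE => $title, DESCRIPTION => implode('||', $cleaned)];
}

// A page with a single recipe yields a sequence with one subdoc.
$subdocs = [
    makeRecipeSubdoc('Pancakes', ['1 cup flour', ' 2 eggs ', '', '1 cup milk']),
];
```

Here $subdocs[0] holds the title "Pancakes" and the description "1 cup flour||2 eggs||1 cup milk".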

pageSummaryProcessing()

pageSummaryProcessing(array  &$summary, string  $url) 

Optionally modifies the page summary array produced by the PageProcessor handle method in place. This hook provides a way to easily modify the title, description, and meta words of a page. Only the PAGE, CRAWL_DELAY, ROBOT_PATHS, ROBOT_METAS, AGENT_LIST, TITLE, DESCRIPTION, META_WORDS, LANG, LINKS, and THUMB fields of the summary will be respected. If you add custom meta words, then you must define them in the getAdditionalMetaWords function for this plugin, or they will not be recognized in queries.

Parameters

array &$summary

the summary data produced by the relevant page processor's handle method; modified in-place.

string $url

the url where the summary contents came from
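Because the summary is passed by reference, a hook of this shape returns nothing and changes the caller's array directly. The sketch below only illustrates that calling convention. The key constants, the meta word 'recipe:all', and the function name markRecipeSummary() are assumptions made for the example and are not taken from the plugin.

```php
<?php
// Hypothetical stand-ins for the summary field keys.
const DESCRIPTION = 'description';
const META_WORDS = 'meta_words';

/**
 * Adds a (hypothetical) meta word to summaries whose description looks
 * like a "||"-separated ingredient list. Modifies $summary in place.
 */
function markRecipeSummary(array &$summary, string $url): void
{
    // $url is unused in this sketch; it is kept to mirror the hook signature.
    $description = $summary[DESCRIPTION] ?? '';
    if (strpos($description, '||') !== false) {
        $summary[META_WORDS][] = 'recipe:all';
    }
}

$summary = [DESCRIPTION => '1 cup flour||2 eggs'];
markRecipeSummary($summary, 'https://example.com/pancakes');
```

After the call, $summary contains a META_WORDS entry even though the function has no return value.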

postProcessing()

postProcessing(string  $index_name) 

Implements post-processing of recipes. Recipes are extracted, ingredients are scrubbed, and recipes are clustered. The clustered recipes are added back to the index.

Parameters

string $index_name

index name of the current crawl.

getProcessors()

getProcessors() : array

Returns the MIME type page processors for which this plugin should do additional processing

Returns

array —

an array of page processors

getAdditionalMetaWords()

getAdditionalMetaWords() : array

Returns an array of additional meta words which have been added by this plugin

Returns

array —

meta words and the maximum description length of results allowed for each meta word (in this case 2000, to allow sufficiently long descriptions of whole recipes)
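A return value of this shape maps each meta word to its maximum description length. In the sketch below the meta word names 'recipe:' and 'ingredient:' are assumptions chosen for illustration; only the length 2000 comes from the description above.

```php
<?php
/**
 * Returns a map of meta word => maximum description length.
 * The meta word names are hypothetical.
 */
function additionalMetaWordsSketch(): array
{
    $max_description_length = 2000;
    return [
        'recipe:' => $max_description_length,
        'ingredient:' => $max_description_length,
    ];
}

$meta_words = additionalMetaWordsSketch();
```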

getIngredientName()

getIngredientName(string  $text) : string

Extracts the main ingredient name from the text of a single ingredient entry.

Parameters

string $text

text of the ingredient entry

Returns

string —

$name main ingredient
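One plausible way to reduce an ingredient line to a name is to strip quantities, units, and preparation notes. The sketch below is a simple heuristic under that assumption, not the plugin's actual scrubbing rules. The unit list and the function name ingredientNameSketch() are invented for the example.

```php
<?php
/**
 * Heuristic sketch: lowercases the text, removes parenthetical notes,
 * anything after a comma, numbers or fractions, and common unit words.
 */
function ingredientNameSketch(string $text): string
{
    // Hypothetical, deliberately short list of unit words.
    $units = ['cup', 'cups', 'teaspoon', 'teaspoons', 'tablespoon',
        'tablespoons', 'ounce', 'ounces', 'pound', 'pounds', 'pinch'];
    $name = strtolower($text);
    $name = preg_replace('#\([^)]*\)#', ' ', $name);           // "(optional)"
    $name = preg_replace('#,.*$#', '', $name);                 // ", sifted"
    $name = preg_replace('#[0-9]+(?:[/.][0-9]+)?#', ' ', $name); // "2", "1/2"
    $words = preg_split('#\s+#', trim($name), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_filter($words, function ($word) use ($units) {
        return !in_array($word, $units, true);
    });
    return implode(' ', $words);
}

$flour = ingredientNameSketch('2 cups all-purpose flour, sifted');
$salt = ingredientNameSketch('1/2 teaspoon salt (optional)');
```

With these inputs, $flour is "all-purpose flour" and $salt is "salt".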

extractMinimalSpanningTreeEdges()

extractMinimalSpanningTreeEdges(array  $edges) : array

Creates a tree from the input and applies Kruskal's algorithm to find a minimal spanning tree.

Parameters

array $edges

elements of form (recipe_1_title, recipe_2_title, weight)

Returns

array —

$min_edges just those edges from the original edges needed to make a minimal spanning tree
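Kruskal's algorithm on edges of this form can be sketched as below: sort the edges by weight, then keep each edge that joins two components that are not yet connected. This is a generic illustration using a union-find structure. The function name kruskalSketch() is an assumption, and the plugin's own implementation may differ in its details.

```php
<?php
/**
 * Generic Kruskal sketch. Each edge is [recipe_1_title, recipe_2_title,
 * weight]. Returns the edges of a minimal spanning tree (or forest if
 * the graph is disconnected).
 */
function kruskalSketch(array $edges): array
{
    // Consider the lightest edges first.
    usort($edges, function ($a, $b) {
        return $a[2] <=> $b[2];
    });
    $parent = [];
    // Union-find "find" with path halving; unseen nodes become own roots.
    $find = function ($node) use (&$parent) {
        if (!isset($parent[$node])) {
            $parent[$node] = $node;
        }
        while ($parent[$node] !== $node) {
            $parent[$node] = $parent[$parent[$node]];
            $node = $parent[$node];
        }
        return $node;
    };
    $min_edges = [];
    foreach ($edges as $edge) {
        $root_from = $find($edge[0]);
        $root_to = $find($edge[1]);
        if ($root_from !== $root_to) {
            // The edge joins two separate components, so it cannot
            // create a cycle: keep it and merge the components.
            $parent[$root_from] = $root_to;
            $min_edges[] = $edge;
        }
    }
    return $min_edges;
}

$min_edges = kruskalSketch([
    ['A', 'C', 3], ['A', 'B', 1], ['C', 'D', 4], ['B', 'C', 2],
]);
```

For this input the edge A-C is skipped because A and C are already connected through B, leaving the three edges A-B, B-C, and C-D.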

getClusters()

getClusters(array  $recipe_adjacency_weights, array  $distinct_ingredients) : array

Clusters the recipes from an array of recipe adjacency weights.

Parameters

array $recipe_adjacency_weights

array of triples (recipe_1_title, recipe_2_title, weight)

array $distinct_ingredients

list of possible ingredients

Returns

array —

list of clusters of recipes. This array will have total_number_of_recipes * self::CLUSTER_RATIO many clusters. Each cluster will contain an ingredient field with the most common non-basic ingredient found in the recipes of that cluster.
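A common way to get a fixed number of clusters from a minimal spanning tree is to delete the heaviest tree edges and take the connected components that remain. The sketch below shows that technique under stated assumptions: it takes spanning tree edges and a target cluster count as input, it omits the per-cluster ingredient field, the ratio 0.5 is a made-up value, and clustersFromTreeSketch() is an invented name. It is not confirmed to be how getClusters() works internally.

```php
<?php
/**
 * Splits a spanning tree into $num_clusters groups by dropping its
 * ($num_clusters - 1) heaviest edges. Edges are [title_1, title_2, weight].
 */
function clustersFromTreeSketch(array $tree_edges, int $num_clusters): array
{
    usort($tree_edges, function ($a, $b) {
        return $a[2] <=> $b[2];
    });
    $keep = max(0, count($tree_edges) - max(0, $num_clusters - 1));
    $kept_edges = array_slice($tree_edges, 0, $keep);
    // Every node gets an adjacency list, even if all its edges are dropped.
    $adjacent = [];
    foreach ($tree_edges as $edge) {
        $adjacent[$edge[0]] = $adjacent[$edge[0]] ?? [];
        $adjacent[$edge[1]] = $adjacent[$edge[1]] ?? [];
    }
    foreach ($kept_edges as $edge) {
        $adjacent[$edge[0]][] = $edge[1];
        $adjacent[$edge[1]][] = $edge[0];
    }
    // Collect connected components with an iterative depth-first search.
    $seen = [];
    $clusters = [];
    foreach (array_keys($adjacent) as $start) {
        if (isset($seen[$start])) {
            continue;
        }
        $seen[$start] = true;
        $stack = [$start];
        $cluster = [];
        while ($stack) {
            $node = array_pop($stack);
            $cluster[] = $node;
            foreach ($adjacent[$node] as $next) {
                if (!isset($seen[$next])) {
                    $seen[$next] = true;
                    $stack[] = $next;
                }
            }
        }
        sort($cluster);
        $clusters[] = $cluster;
    }
    return $clusters;
}

$hypothetical_ratio = 0.5; // stand-in value, not the real CLUSTER_RATIO
$num_clusters = (int) ceil(4 * $hypothetical_ratio); // 4 recipes
$clusters = clustersFromTreeSketch(
    [['A', 'B', 1], ['B', 'C', 2], ['C', 'D', 4]],
    $num_clusters
);
```

With four recipes and the stand-in ratio, the heaviest edge C-D is removed, giving one cluster with A, B, and C and another with D alone.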