CLUSTER_RATIO
Ratio of the number of clusters to the total number of recipes seen
This class handles recipe processing.
It extracts ingredients from recipe pages while crawling. After the crawl is stopped, it clusters the recipes using Kruskal's minimum spanning tree algorithm. This plugin was designed by looking at what was needed to screen scrape recipes from the following sites:
https://allrecipes.com/ http://www.geniuskitchen.com/ http://www.betterrecipes.com/ https://www.bettycrocker.com/
pageProcessing(string $page, string $url) : array
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page, such as adding sub-documents, before the page summary is handed back to the fetcher. For the recipe plugin, the title of each sub-document will be the title of the recipe, and the description will consist of the ingredients of the recipe, separated by ||.
string | $page | web-page contents |
string | $url | the url where the page contents came from, used to canonicalize relative links |
Returns an array consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION
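As a rough illustration of the subdoc shape described above, the following sketch builds one subdoc array whose description joins ingredients with ||. The constants TITLE and DESCRIPTION are stand-ins for self::TITLE and self::DESCRIPTION, and buildRecipeSubdoc is a hypothetical helper; neither is taken from the plugin's source.

```php
<?php
// Hypothetical stand-ins for the plugin's self::TITLE and self::DESCRIPTION keys.
const TITLE = 'title';
const DESCRIPTION = 'description';

/**
 * Builds one subdoc array for a recipe (illustrative helper, not part of
 * the plugin's API).
 */
function buildRecipeSubdoc(string $title, array $ingredients): array
{
    // Trim each ingredient and drop empty entries before joining with "||".
    $cleaned = array_filter(array_map('trim', $ingredients), 'strlen');
    return [
        TITLE => trim($title),
        DESCRIPTION => implode('||', $cleaned),
    ];
}

$subdocs = [
    buildRecipeSubdoc('Pancakes', ['2 cups flour', ' 1 egg ', '', '1 cup milk']),
];
echo $subdocs[0][DESCRIPTION], "\n";
```

A page containing several recipes would simply yield several such subdoc arrays in the returned sequence.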
pageSummaryProcessing(array &$summary, string $url)
Optionally modifies the page summary array produced by the PageProcessor handle method in place. This hook provides a way to easily modify the title, description, and meta words of a page. Only the PAGE, CRAWL_DELAY, ROBOT_PATHS, ROBOT_METAS, AGENT_LIST, TITLE, DESCRIPTION, META_WORDS, LANG, LINKS, and THUMB fields of the summary will be respected. If you add custom meta words, then you must define them in the getAdditionalMetaWords function for this plugin, or they will not be recognized in queries.
array & | $summary | the summary data produced by the relevant page processor's handle method; modified in place |
string | $url | the url where the summary contents came from |
getAdditionalMetaWords() : array
Returns an array of additional meta words which have been added by this plugin
Returns an array mapping each meta word to the maximum description length of results allowed for that meta word (in this case 2000, as we want to allow sufficient descriptions of whole recipes)
extractMinimalSpanningTreeEdges(array $edges) : array
Creates a tree from the input edges and applies Kruskal's algorithm to find a minimal spanning tree.
array | $edges | elements of form (recipe_1_title, recipe_2_title, weight) |
$min_edges, just those edges from the original edges needed to make a minimal spanning tree
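A minimal sketch of Kruskal's algorithm over edge triples of the form (recipe_1_title, recipe_2_title, weight) is given below. It uses a union-find structure to skip edges that would close a cycle. The function names findRoot and kruskalEdges are hypothetical, and the plugin's actual implementation may differ.

```php
<?php
/**
 * Finds the representative of $item in a union-find forest,
 * shortening paths as it walks (path halving).
 */
function findRoot(array &$parent, string $item): string
{
    while ($parent[$item] !== $item) {
        $parent[$item] = $parent[$parent[$item]];
        $item = $parent[$item];
    }
    return $item;
}

/**
 * Returns the subset of $edges that forms a minimal spanning tree
 * (or forest, if the graph is disconnected).
 */
function kruskalEdges(array $edges): array
{
    // Sort edges by ascending weight (third element of each triple).
    usort($edges, fn($a, $b) => $a[2] <=> $b[2]);
    // Every recipe title starts out as its own component.
    $parent = [];
    foreach ($edges as $edge) {
        $parent[$edge[0]] = $edge[0];
        $parent[$edge[1]] = $edge[1];
    }
    $min_edges = [];
    foreach ($edges as $edge) {
        $root_u = findRoot($parent, $edge[0]);
        $root_v = findRoot($parent, $edge[1]);
        // Keep the edge only if it joins two different components.
        if ($root_u !== $root_v) {
            $parent[$root_u] = $root_v;
            $min_edges[] = $edge;
        }
    }
    return $min_edges;
}

$min_edges = kruskalEdges([
    ['A', 'C', 3], ['A', 'B', 1], ['C', 'D', 4], ['B', 'C', 2],
]);
```

In this example the edge (A, C, 3) is dropped because A and C are already connected through B by the time it is considered.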
getClusters(array $recipe_adjacency_weights, array $distinct_ingredients) : array
Clusters the recipes from an array of recipe adjacency weights.
array | $recipe_adjacency_weights | array of triples (recipe_1_title, recipe_2_title, weight) |
array | $distinct_ingredients | list of possible ingredients |
Returns a list of clusters of recipes. This array will have total_number_of_recipes * self::CLUSTER_RATIO many clusters. Each cluster will contain an ingredient field with the most common non-basic ingredient found in the recipes of that cluster.
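One common way to obtain a fixed number of clusters from a minimal spanning tree is to delete its heaviest edges and take the connected components that remain. The sketch below assumes that approach; the source does not spell out how the plugin splits the tree. The function clustersFromTree and the ratio value 0.5 are hypothetical, and the per-cluster ingredient field is omitted.

```php
<?php
/**
 * Splits a spanning tree into roughly (number of recipes * $ratio) clusters
 * by removing its heaviest edges (illustrative, not the plugin's code).
 */
function clustersFromTree(array $tree_edges, float $ratio): array
{
    // Collect the distinct recipe titles appearing in the tree.
    $titles = [];
    foreach ($tree_edges as $edge) {
        $titles[$edge[0]] = true;
        $titles[$edge[1]] = true;
    }
    $titles = array_map('strval', array_keys($titles));
    $num_clusters = max(1, (int) floor(count($titles) * $ratio));
    // Sort by descending weight and drop the ($num_clusters - 1) heaviest edges.
    usort($tree_edges, fn($a, $b) => $b[2] <=> $a[2]);
    $kept = array_slice($tree_edges, $num_clusters - 1);
    // Build adjacency lists from the edges that remain.
    $adjacent = array_fill_keys($titles, []);
    foreach ($kept as $edge) {
        $adjacent[$edge[0]][] = $edge[1];
        $adjacent[$edge[1]][] = $edge[0];
    }
    // Each connected component of the remaining forest is one cluster.
    $clusters = [];
    $seen = [];
    foreach ($titles as $start) {
        if (isset($seen[$start])) {
            continue;
        }
        $seen[$start] = true;
        $stack = [$start];
        $cluster = [];
        while ($stack) {
            $node = array_pop($stack);
            $cluster[] = $node;
            foreach ($adjacent[$node] as $next) {
                if (!isset($seen[$next])) {
                    $seen[$next] = true;
                    $stack[] = $next;
                }
            }
        }
        sort($cluster);
        $clusters[] = $cluster;
    }
    return $clusters;
}

$cluster_ratio = 0.5; // hypothetical value, not the plugin's CLUSTER_RATIO
$tree = [['A', 'B', 1], ['B', 'C', 2], ['C', 'D', 4]];
$clusters = clustersFromTree($tree, $cluster_ratio);
```

With four recipes and a ratio of 0.5, two clusters are requested, so the single heaviest edge (C, D, 4) is removed and D ends up in a cluster of its own.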