\seekquarry\yioop\library\ScraperManager

Class used by HTML processors to detect whether a page matches a particular signature, such as that of a content management system, and to provide scraping mechanisms for the content of such a page.

Summary

Public methods: getScraper(), applyScraperRules(), checkSignature(), getContentByXquery(), removeContentByXquery()

No public, protected, or private properties found. No constants found. No protected or private methods found.

Methods

getScraper()

getScraper(string  $page, array  $scrapers) : array

Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.

Parameters

string $page

the html page to check

array $scrapers

an array of scrapers to check against

Returns

array —

an associative array of scraper properties if a matching scraper signature is found; otherwise, an empty array
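
As a rough illustration of the matching loop described above, the PHP sketch below returns the first scraper whose signature matches the page and an empty array otherwise. It is not Yioop's implementation: the function name sketchGetScraper, the 'SIGNATURE' and 'NAME' keys, and the choice to treat every signature as a regex are assumptions made for this example.

```php
<?php
// Hypothetical sketch: returns the first scraper whose signature matches
// $page, or [] when none does. The 'SIGNATURE' key is an assumed name.
function sketchGetScraper(string $page, array $scrapers): array
{
    foreach ($scrapers as $scraper) {
        $signature = $scraper['SIGNATURE'] ?? '';
        if ($signature === '') {
            continue; // skip entries with no usable signature
        }
        // '~' is an assumed delimiter; signatures must not contain it.
        if (@preg_match('~' . $signature . '~', $page) === 1) {
            return $scraper;
        }
    }
    return [];
}

// Example data; both entries are invented for illustration.
$scrapers = [
    ['NAME' => 'wiki', 'SIGNATURE' => 'generator" content="MediaWiki'],
    ['NAME' => 'blog', 'SIGNATURE' => 'generator" content="WordPress'],
];
$page = '<meta name="generator" content="WordPress 6.0">';
$match = sketchGetScraper($page, $scrapers);
```

Returning on the first hit means the order of the scraper list decides which scraper wins when several signatures match the same page.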

applyScraperRules()

applyScraperRules(string  $page,   $scraper) : string

Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH XPath for the main content of a web page, a newline-separated (\n) sequence of DELETE_PATHS for what should be removed from the main content as irrelevant, and a list EXTRACT_FIELDS of additional summary fields that should be extracted from the page content.

Parameters

string $page

the web page to operate on

$scraper

the scraper whose scrape rules are applied to the page

Returns

string —

the result of extracting the content matched by the first XPath (TEXT_PATH) and then deleting from it whatever the remaining XPath rules (DELETE_PATHS) match
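
The extract-then-delete sequence can be sketched with PHP's standard DOMDocument and DOMXPath classes. This is a minimal sketch under stated assumptions, not Yioop's code: the function name sketchApplyRules is hypothetical, $scraper is assumed to be an array keyed by TEXT_PATH and DELETE_PATHS, TEXT_PATH is assumed to select element nodes, and EXTRACT_FIELDS is left out because its format is not described here.

```php
<?php
// Hypothetical sketch of the TEXT_PATH and DELETE_PATHS steps only.
function sketchApplyRules(string $page, array $scraper): string
{
    if (trim($page) === '') {
        return '';
    }
    $previous = libxml_use_internal_errors(true); // tolerate messy HTML
    $source = new DOMDocument();
    $source->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: copy the nodes matching TEXT_PATH into a fresh html/body.
    $result = new DOMDocument();
    $html = $result->appendChild($result->createElement('html'));
    $body = $html->appendChild($result->createElement('body'));
    $sourceXpath = new DOMXPath($source);
    $matches = @$sourceXpath->query($scraper['TEXT_PATH'] ?? '');
    foreach ($matches ?: [] as $node) {
        $body->appendChild($result->importNode($node, true));
    }

    // Step 2: remove whatever each newline-separated delete path selects.
    $resultXpath = new DOMXPath($result);
    foreach (explode("\n", $scraper['DELETE_PATHS'] ?? '') as $path) {
        $path = trim($path);
        if ($path === '') {
            continue;
        }
        $doomed = @$resultXpath->query($path);
        foreach ($doomed ? iterator_to_array($doomed) : [] as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    return $result->saveHTML($body);
}

// Example rules; the paths are invented for illustration.
$page = '<html><body><div id="content"><h1>Title</h1><p>Text</p>'
    . '<div class="ads">Buy</div></div><div id="footer">Footer</div></body></html>';
$rules = [
    'TEXT_PATH' => '//div[@id="content"]',
    'DELETE_PATHS' => "//div[@class=\"ads\"]\n//h1",
];
$out = sketchApplyRules($page, $rules);
```

Running the delete paths against the extracted copy rather than the original page keeps them from touching anything outside the main content.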

checkSignature()

checkSignature(string  $page, string  $signature) : boolean

If $signature begins with '/', checks whether applying the XPath in $signature to $page yields a non-empty DOM node list. Otherwise, matches $signature as a regex (written without its start and end delimiters, say /) against $page and returns whether a match was found.

Parameters

string $page

a web document to check

string $signature

an XPath (when it begins with '/') or a regex without delimiters to check against

Returns

boolean —

true if the given XPath returns a non-empty DOM node list or the regex matches; false otherwise
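
The two branches described above can be sketched as follows. This is an illustration built on PHP's standard DOMXPath and preg_match, not Yioop's code: the function name sketchCheckSignature is hypothetical, and '~' is an assumed regex delimiter chosen for the example.

```php
<?php
// Hypothetical stand-in for checkSignature(); not Yioop's actual code.
function sketchCheckSignature(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        // XPath branch: parse the page and look for at least one node.
        $previous = libxml_use_internal_errors(true); // tolerate messy HTML
        $dom = new DOMDocument();
        $loaded = $dom->loadHTML($page);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$loaded) {
            return false;
        }
        $xpath = new DOMXPath($dom);
        $nodes = @$xpath->query($signature); // false on an invalid XPath
        return $nodes !== false && $nodes->length > 0;
    }
    // Regex branch: the signature has no delimiters, so add them here.
    // '~' is an assumed delimiter; the signature must not contain it.
    return @preg_match('~' . $signature . '~', $page) === 1;
}

// Example page, invented for illustration.
$html = '<html><head><meta name="generator" content="WordPress 6.0">'
    . '</head><body><p>Hi</p></body></html>';
$byXpath = sketchCheckSignature($html, '//meta[@name="generator"]');
$byRegex = sketchCheckSignature($html, 'content="WordPress');
```

Because a leading '/' selects the XPath branch, a regex signature in this scheme cannot itself start with '/'.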

getContentByXquery()

getContentByXquery(string  $page, string  $query) : \DOMDocument

Gets the contents of a document via an XPath query.

Parameters

string $page

a document to apply the xpath query against

string $query

the xpath query to run

Returns

\DOMDocument —

DOM of a simplified web page containing the nodes matching the XPath query within an HTML body tag.

removeContentByXquery()

removeContentByXquery(\DOMDocument  $dom, string  $query) 

Removes from the contents of a DOMDocument the nodes matched by an XPath query.

Parameters

\DOMDocument $dom

a document to apply the xpath query against

string $query

the xpath query to run