\seekquarry\yioop\library\indexing_plugins\AddressesPlugin

Used to extract emails, phone numbers, and addresses from a web page.

These are extracted into the EMAILS, PHONE_NUMBERS, and ADDRESSES fields of the page's summary.

Summary

Methods: __construct(), pageProcessing(), pageSummaryProcessing(), postProcessing(), getProcessors(), getAdditionalMetaWords(), parseSubdoc(), checkCandidate(), checkRegion(), checkPhoneOrEmail(), parseEmails(), parsePhones(), checkCountry(), checkZipPostalCodeWords(), checkStreet()

Properties: $index_archive, $db, $countries, $regions

Constants: No constants found

No protected or private methods or properties found.

Properties

$index_archive

$index_archive : object

The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method

Type

object

$db

$db : object

Reference to a database object that might be used by models on this plugin

Type

object

$countries

$countries : array

Associative array of world countries and their country codes. Some entries are duplicated in the country's local script.

Type

array

$regions

$regions : array

List of common regions, abbreviations, and local spellings of regions of the US, Canada, Australia, UK, as well as major cities elsewhere

Type

array

Methods

__construct()

__construct() 

Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object

pageProcessing()

pageProcessing(string  $page, string  $url) : array

This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.

Parameters

string $page

web-page contents

string $url

the url where the page contents came from, used to canonicalize relative links

Returns

array —

consisting of a sequence of subdoc arrays found on the given page.

pageSummaryProcessing()

pageSummaryProcessing(array  $summary, string  $url) 

Adjusts the document summary of a page after the page processor's process method has been called so that the subdoc's fields associated with the addresses plugin get copied as fields of the whole page summary. Then it deletes the subdoc fields.

Parameters

array $summary

summary of the current document, which this method adjusts

string $url

the url where the summary contents came from
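
The copy-then-delete step described above can be sketched as follows. This is a minimal illustration and not the Yioop implementation: the function name liftSubdocFields() and the 'SUBDOCS' array key are hypothetical, and only the field names EMAILS, PHONE_NUMBERS, and ADDRESSES come from this page.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): copies the address-related
 * fields of each sub-document into the page summary, then removes the
 * sub-documents. The 'SUBDOCS' key is an assumption made for this sketch.
 */
function liftSubdocFields(array $summary): array
{
    $fields = ['EMAILS', 'PHONE_NUMBERS', 'ADDRESSES'];
    foreach ($summary['SUBDOCS'] ?? [] as $subdoc) {
        foreach ($fields as $field) {
            if (!empty($subdoc[$field])) {
                // Append this subdoc's items to the page-level field
                $summary[$field] = array_merge(
                    $summary[$field] ?? [], $subdoc[$field]);
            }
        }
    }
    // The subdoc entries are no longer needed once their fields are copied
    unset($summary['SUBDOCS']);
    return $summary;
}

$summary = [
    'TITLE' => 'Contact us',
    'SUBDOCS' => [
        ['EMAILS' => ['a@example.org']],
        ['EMAILS' => ['b@example.org'], 'ADDRESSES' => ['1 Main Street']],
    ],
];
$adjusted = liftSubdocFields($summary);
```

After the call, $adjusted carries the merged EMAILS and ADDRESSES lists at the top level and no longer has a 'SUBDOCS' entry.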

postProcessing()

postProcessing(string  $index_name) 

This method is called by the queue_server with the name of a completed index. It allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.

Parameters

string $index_name

the name/timestamp of an IndexArchiveBundle to do post processing for

getProcessors()

getProcessors() : array

Returns the MIME type page processors for which this plugin should do additional processing

Returns

array —

an array of page processors

getAdditionalMetaWords()

getAdditionalMetaWords() : array

Returns an array of additional meta words which have been added by this plugin

Returns

array —

meta words and maximum description length of results allowed for that meta word

parseSubdoc()

parseSubdoc(string  $text) : array

Parses EMAILS, PHONE_NUMBERS and ADDRESSES from $text and returns an array with these three fields containing sub-arrays of the given items

Parameters

string $text

to use for extraction

Returns

array —

with found emails, phone numbers, and addresses

checkCandidate()

checkCandidate(array  $pre_address) : mixed

Checks if the passed sequence of lines has enough features of a postal address to call it an address. If so, returns the address as a single string

Parameters

array $pre_address

an array of potential address lines

Returns

mixed —

false if the lines do not form an address; otherwise the lines imploded together using a space
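
A feature-counting check of this kind might look like the sketch below. The function name looksLikeAddress(), the three regular expressions, and the threshold of two features are all assumptions chosen for illustration; the actual features and threshold used by checkCandidate() are not stated on this page.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): counts how many kinds of
 * address features appear somewhere in the candidate lines and accepts
 * the candidate if at least $min_features kinds are present.
 *
 * @return string|false the lines joined by spaces, or false
 */
function looksLikeAddress(array $lines, int $min_features = 2)
{
    // Illustrative feature patterns only; a real list would be much larger
    $features = [
        'street' => '/\b(street|st|avenue|ave|way|road|rd)\b/i',
        'postal' => '/\b\d{5}(-\d{4})?\b/',
        'country' => '/\b(usa|canada|australia|uk)\b/i',
    ];
    $found = 0;
    foreach ($features as $pattern) {
        foreach ($lines as $line) {
            if (preg_match($pattern, $line)) {
                $found++;
                break; // count each kind of feature at most once
            }
        }
    }
    return ($found >= $min_features) ? implode(" ", $lines) : false;
}

$hit = looksLikeAddress(["1 Infinite Way", "Springfield 12345"]);
$miss = looksLikeAddress(["Hello world", "nothing here"]);
```

Returning either a string or false mirrors the mixed return type documented above, so callers should compare against false with ===.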

checkRegion()

checkRegion(string  $line) : boolean

Used to check if a line contains a word associated with a province, state, or major city.

Parameters

string $line

from address to check

Returns

boolean —

whether it contains a region term

checkPhoneOrEmail()

checkPhoneOrEmail(string  $line) : boolean

Used to check if a line contains either an email address or a phone number

Parameters

string $line

from address to check

Returns

boolean —

whether it contains an email address or a phone number

parseEmails()

parseEmails(string  $line) : string

Searches the provided $line for substrings in the format of an email address and returns the first one found

Parameters

string $line

string to extract email from

Returns

string —

first email found on line
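
As a rough illustration of first-match email extraction, the sketch below uses a deliberately simple pattern. The function name firstEmail(), the regular expression, and the empty-string result when nothing matches are assumptions; this page does not specify the pattern parseEmails() uses or what it returns when no email is present.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): returns the first substring
 * of $line that looks like an email address, or "" if there is none.
 */
function firstEmail(string $line): string
{
    // Simplified pattern: local part, "@", domain labels, alphabetic TLD
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    if (preg_match($pattern, $line, $matches)) {
        return $matches[0]; // preg_match stops at the first match
    }
    return "";
}

$email = firstEmail("Contact: bob@example.com or amy@example.org");
```

Here $email holds "bob@example.com", since preg_match() reports only the leftmost match.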

parsePhones()

parsePhones(string  $line) : array

Checks for a phone number related keyword in the line and if found extracts digits which are presumed to be a phone number

Parameters

string $line

to check for phone numbers

Returns

array —

all phone numbers detected by this method from the $line
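
The keyword-then-digits approach could be sketched as below. The function name phonesFromLine(), the keyword list, the digit-run pattern, and the minimum of seven digits are assumptions for illustration only; the keywords and rules used by parsePhones() are not listed on this page.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): if $line contains a
 * phone-related keyword, returns the digit strings of every run in the
 * line that looks like a phone number; otherwise returns [].
 */
function phonesFromLine(string $line): array
{
    // Step 1: require a phone-related keyword somewhere in the line
    if (!preg_match('/\b(phone|tel|telephone|fax|mobile|cell)\b/i', $line)) {
        return [];
    }
    // Step 2: find runs of digits mixed with common separators
    preg_match_all('/\+?\d[\d\s().-]{5,}\d/', $line, $matches);
    $phones = [];
    foreach ($matches[0] as $candidate) {
        // Step 3: keep only the digits and discard runs that are too short
        $digits = preg_replace('/\D/', '', $candidate);
        if (strlen($digits) >= 7) {
            $phones[] = $digits;
        }
    }
    return $phones;
}

$phones = phonesFromLine("Tel: (408) 555-1234");
```

Requiring the keyword first keeps arbitrary long numbers, such as order or invoice IDs, from being reported as phone numbers.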

checkCountry()

checkCountry(string  $line) : boolean

Used to check if a line contains a word associated with a World country or country code.

Parameters

string $line

from address to check

Returns

boolean —

whether it contains a country term
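
A word-lookup check against a country table might be sketched as follows. The function name containsCountryTerm() and the two-entry sample table are hypothetical, and the sketch only handles single-word ASCII terms; the real $countries array also includes local-script entries, which this simplified version does not cover.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): reports whether any word of
 * $line equals a country name or country code from $countries, an
 * associative array of NAME => CODE. ASCII, single-word terms only.
 */
function containsCountryTerm(string $line, array $countries): bool
{
    // Split the upper-cased line into alphabetic words
    $words = preg_split('/[^A-Z]+/', strtoupper($line), -1,
        PREG_SPLIT_NO_EMPTY);
    // Both the names (keys) and the codes (values) count as terms
    $terms = array_map('strtoupper',
        array_merge(array_keys($countries), array_values($countries)));
    return count(array_intersect($words, $terms)) > 0;
}

$sample_countries = ['CANADA' => 'CA', 'FRANCE' => 'FR'];
$has_country = containsCountryTerm("Toronto, Canada", $sample_countries);
```

Matching whole words rather than substrings avoids false positives such as a country code appearing inside a longer word.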

checkZipPostalCodeWords()

checkZipPostalCodeWords(string  $line) : boolean

Used to check if a line contains a word associated with a ZIP or Postal code

Parameters

string $line

from address to check

Returns

boolean —

whether it contains such a code

checkStreet()

checkStreet(string  $line) : boolean

Used to check if a given line in an address candidate has features associated with being a street address.

Parameters

string $line

address line to check

Returns

boolean —

whether or not it contains a word identified with being a street address such as WAY, AVENUE, STREET, etc.
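
A minimal sketch of such a street-word test is shown below. The function name hasStreetWord() is hypothetical, and the word list contains only the three examples named above (WAY, AVENUE, STREET); the full list used by checkStreet() is not given on this page.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): reports whether $line
 * contains, as a whole word, one of a few street-type words.
 */
function hasStreetWord(string $line): bool
{
    // Only the examples named in the documentation; the real list is longer
    $street_words = ['WAY', 'AVENUE', 'STREET'];
    $words = preg_split('/[^A-Z]+/', strtoupper($line), -1,
        PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($words, $street_words)) > 0;
}

$is_street = hasStreetWord("123 Main Street");
```

Because the comparison is on whole words, a line such as "Subway sandwiches" does not count as containing WAY.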