Methods

isSchemeCrawlable()

isSchemeCrawlable(string  $url) : boolean

Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).

Parameters

string $url

the url to check

Returns

boolean —

true if the scheme is http, https, or gopher; false otherwise
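
As a rough illustration of the check described above, the following sketch uses PHP's built-in parse_url. The function name sketchIsSchemeCrawlable is hypothetical, and the sketch assumes that a url with no scheme at all is rejected; the actual method may treat that case differently.

```php
<?php
// Hypothetical stand-in for illustration; not the library's implementation.
function sketchIsSchemeCrawlable(string $url): bool
{
    // parse_url gives null when the scheme is absent and false when
    // the url is too malformed to parse
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme)) {
        // assumption: no scheme means not crawlable
        return false;
    }
    // schemes are case-insensitive, so compare in lower case
    return in_array(strtolower($scheme), ["http", "https", "gopher"], true);
}
```

Under these assumptions, sketchIsSchemeCrawlable("mailto:someone@example.com") is false, while an https or gopher url passes.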

simplifyUrl()

simplifyUrl(string  $url, integer  $max_len) : string

Converts a url with a scheme into one without, and removes trailing slashes from the url. If necessary, shortens the url to the desired length by replacing part of it with an ellipsis.

Parameters

string $url

the url to trim

integer $max_len

length to shorten url to, 0 = no shortening

Returns

string —

the trimmed url
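
The three steps described above (drop the scheme, drop trailing slashes, shorten with an ellipsis) might look roughly like the sketch below. The name sketchSimplifyUrl is hypothetical, and placing the ellipsis in the middle of the url is an assumption; the documentation above does not say which part gets replaced.

```php
<?php
// Hypothetical sketch; the real truncation rule is not documented above.
function sketchSimplifyUrl(string $url, int $max_len): string
{
    // drop a leading "scheme://" if there is one
    $simple = preg_replace('@^[a-z][a-z0-9+.\-]*://@i', '', $url);
    // drop trailing slashes
    $simple = rtrim($simple, '/');
    // 0 means no shortening
    if ($max_len > 0 && strlen($simple) > $max_len) {
        // assumption: keep the head and the tail, replace the middle
        // with "..."; always keep at least one character on each side
        $keep = max($max_len - 3, 2);
        $head = intdiv($keep + 1, 2);
        $tail = $keep - $head;
        $simple = substr($simple, 0, $head) . '...' .
            substr($simple, -$tail);
    }
    return $simple;
}
```

For very small values of $max_len the result of this sketch can be slightly longer than requested, since it always keeps one character on either side of the ellipsis.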

hasHostUrl()

hasHostUrl(string  $url) : boolean

Checks if the url has a host part.

Parameters

string $url

the url to check

Returns

boolean —

true if it does; false otherwise

getPort()

getPort(string  $url) : integer

Gets the port number of a url if present; otherwise returns 80

Parameters

string $url

the url to extract port number from

Returns

integer —

a port number
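
A minimal sketch of this behavior, assuming PHP's built-in parse_url is an acceptable way to find the port. The name sketchGetPort is hypothetical. The description above names 80 as the only fallback, so the sketch does not map https to 443.

```php
<?php
// Hypothetical sketch of the "port if present, otherwise 80" rule.
function sketchGetPort(string $url): int
{
    // parse_url gives an int port, or null when the url has none
    $port = parse_url($url, PHP_URL_PORT);
    return is_int($port) ? $port : 80;
}
```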

getScheme()

getScheme(string  $url) : string

Gets the scheme of a url if present; otherwise returns http

Parameters

string $url

the url to extract scheme from

Returns

string —

the scheme of the url, or http if none is present

getLang()

getLang(string  $url) : string|boolean

Attempts to guess the language tag based on url

Parameters

string $url

the url to parse

Returns

string|boolean —

the top level domain if present; false otherwise

getHost()

getHost(string  $url, boolean  $with_login_and_port = true) : string|boolean

Gets the host name portion of a url if present; otherwise returns false

Parameters

string $url

the url to parse

boolean $with_login_and_port

whether to include user, password, and port if present

Returns

string|boolean —

the host portion of the url if present; false otherwise

getBaseDomain()

getBaseDomain(string  $url) : string

Gets the domain of a url less any leading www

Parameters

string $url

the url to get the domain of

Returns

string —

the base domain as defined above

getPath()

getPath(string  $url, boolean  $with_query_string = false) : string|null

Gets the path portion of a url if present; otherwise returns null

Parameters

string $url

the url to parse

boolean $with_query_string

whether to also include the query string at the end of the path

Returns

string|null —

the path portion of the url if present; null otherwise

getHostAndPath()

getHostAndPath(string  $url, boolean  $with_login_and_port = true, boolean  $with_query_string = false) : array

Returns as a two element array the host and path of a url

Parameters

string $url

initial url to get host and path of

boolean $with_login_and_port

controls whether the host should contain login and port info

boolean $with_query_string

says whether the path should contain the query string as well

Returns

array —

host and the path as a pair

getHostSubdomains()

getHostSubdomains(string  $url) : array

Gets the subdomains of the host portion of a url. For example, http://a.b.c/d/f/ will return a.b.c, .a.b.c, b.c, .b.c, c, .c

Parameters

string $url

the url to extract prefixes from

Returns

array —

the array of url prefixes
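
The example above can be reproduced with a short loop over the host's dot-separated labels. This is only a sketch: the name sketchHostSubdomains is hypothetical, and using parse_url to find the host is an assumption about how one might do it.

```php
<?php
// Hypothetical sketch that reproduces the a.b.c example above.
function sketchHostSubdomains(string $url): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return [];
    }
    $labels = explode('.', $host);
    $out = [];
    // emit each suffix of the host with and without a leading dot,
    // then drop the leftmost label and repeat
    while ($labels) {
        $suffix = implode('.', $labels);
        $out[] = $suffix;
        $out[] = '.' . $suffix;
        array_shift($labels);
    }
    return $out;
}
```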

isPathMemberRegexPaths()

isPathMemberRegexPaths(string  $path, array  $robot_paths) : boolean

Checks if $path matches any of the robots.txt style regex paths in $robot_paths

Parameters

string $path

a path component of a url

array $robot_paths

in format of robots.txt regex paths

Returns

boolean —

whether it is a member or not
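
One plausible reading of "robots.txt style regex paths" is the common convention where * matches any run of characters, a trailing $ anchors the end of the path, and everything else is a prefix match. The sketch below follows that reading; the name sketchPathMatchesRobotPaths is hypothetical and the exact rules the real method applies are not stated above.

```php
<?php
// Hypothetical sketch; assumes "*" is a wildcard, a trailing "$" anchors
// the end of the path, and any other rule is a prefix match.
function sketchPathMatchesRobotPaths(string $path, array $robot_paths): bool
{
    foreach ($robot_paths as $robot_path) {
        if ($robot_path === '') {
            continue;
        }
        $anchored = substr($robot_path, -1) === '$';
        if ($anchored) {
            $robot_path = substr($robot_path, 0, -1);
        }
        // quote regex metacharacters, then turn the quoted "*" into ".*"
        $pattern = str_replace('\*', '.*', preg_quote($robot_path, '@'));
        $regex = '@^' . $pattern . ($anchored ? '$' : '') . '@';
        if (preg_match($regex, $path) === 1) {
            return true;
        }
    }
    return false;
}
```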

getWordsInHostUrl()

getWordsInHostUrl(string  $url) : string

Given a url, extracts the words in the host part of the url, provided the url does not have a path part more than /.

Ignores a leading www and also ignores the tld.

For example, "http://www.yahoo.com/" returns " yahoo "

Parameters

string $url

a url to extract host words from

Returns

string —

space separated words extracted.
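
A sketch that reproduces the yahoo example above. The name sketchWordsInHost is hypothetical, and the sketch assumes the tld is just the last dot-separated label of the host, which is cruder than a real tld list.

```php
<?php
// Hypothetical sketch that reproduces the " yahoo " example above.
function sketchWordsInHost(string $url): string
{
    // only handle urls whose path is empty or "/"
    $path = parse_url($url, PHP_URL_PATH);
    if (is_string($path) && $path !== '' && $path !== '/') {
        return '';
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return '';
    }
    $labels = explode('.', strtolower($host));
    if ($labels[0] === 'www') {
        array_shift($labels);
    }
    // assumption: the tld is the last label only
    array_pop($labels);
    return $labels ? ' ' . implode(' ', $labels) . ' ' : '';
}
```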

getWordsLastPathPartUrl()

getWordsLastPathPartUrl(string  $url) : string

Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "

Parameters

string $url

a url to extract last path part words from

Returns

string —

space separated words extracted.
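
The php.net example above suggests that the file extension is dropped and the rest is split on punctuation. The sketch below makes that assumption explicit; the name sketchWordsLastPathPart is hypothetical.

```php
<?php
// Hypothetical sketch that reproduces the php.net example above.
function sketchWordsLastPathPart(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // last path component, ignoring any trailing slash
    $last = basename(rtrim($path, '/'));
    // assumption: a trailing ".ext" is a file extension and is dropped
    $last = preg_replace('/\.[A-Za-z0-9]+$/', '', $last);
    // split on anything that is not a letter or digit
    $words = preg_split('/[^A-Za-z0-9]+/', $last, -1, PREG_SPLIT_NO_EMPTY);
    return $words ? ' ' . implode(' ', $words) . ' ' : '';
}
```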

getDocumentType()

getDocumentType(string  $url, string  $default = "html") : string

Given a url, makes a guess at the file type of the file it points to

Parameters

string $url

a url to figure out the file type for

string $default

default type to be returned in the case that document type cannot be determined from the url, defaults to html

Returns

string —

the guessed file type.

getDocumentFilename()

getDocumentFilename(string  $url) : string

Gets the filename portion of a url if present; otherwise returns "Some File"

Parameters

string $url

a url to parse

Returns

string —

the filename portion of this url

getQuery()

getQuery(string  $url) : string

Get the query string component of a url

Parameters

string $url

a url to get the query string out of

Returns

string —

the query string if present; null otherwise

getFragment()

getFragment(string  $url) : string

Get the url fragment string component of a url

Parameters

string $url

a url to get the url fragment string out of

Returns

string —

the url fragment string if present; null otherwise
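
The accessors above each return one component of a url. As a point of comparison, and not as a description of how this class is implemented, PHP's built-in parse_url splits a url into similar pieces. The url below is made up for illustration.

```php
<?php
// Illustrative url; the credentials, host, and path are invented.
$url = "http://user:pw@www.example.com:8080/dir/page.html?q=1#top";
$parts = parse_url($url);
// $parts is an associative array that may contain the keys
// scheme, host, port, user, pass, path, query, and fragment;
// keys for absent components are simply missing
$host = $parts['host'] ?? false;
$path = $parts['path'] ?? null;
$query = $parts['query'] ?? null;
$fragment = $parts['fragment'] ?? null;
```

The fallbacks chosen here (false for a missing host, null for a missing path, query, or fragment) mirror the return descriptions given above.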

canonicalLink()

canonicalLink(string  $link, string  $site, boolean  $no_fragment = true) : string

Given a $link that was obtained from a website $site, returns a complete URL for that link.

For example, the $link some_dir/test.html on the $site http://www.somewhere.com/bob would yield the complete url http://www.somewhere.com/bob/some_dir/test.html

Parameters

string $link

a relative or complete url

string $site

a base url

boolean $no_fragment

if false and the url has a fragment (#link_within_page), the fragment is included in the result

Returns

string —

a complete url based on these two pieces of information

checkRecursiveUrl()

checkRecursiveUrl(string  $url, integer  $repeat_threshold = 3) : boolean

Checks whether a url contains repeated subdirectory names and whether the number of repeats reaches a threshold

A pattern like bob/.../bob counts as one repetition; bob/.../alice/.../bob/.../alice counts as two (here ... should be read as an ellipsis, not a directory name). If the threshold is three and there are at least three repeated matches, this function returns true; otherwise it returns false.

Parameters

string $url

the url to check

integer $repeat_threshold

the number of repeats of a subdir name to trigger a true response

Returns

boolean —

whether a repeated subdirectory name with more matches than the threshold was found
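
One way to read the counting rule above is that every occurrence of a directory name after its first counts as one repeat. The sketch below implements that reading; the name sketchCheckRecursiveUrl is hypothetical, and it treats every path segment, including a final file name, as a directory name.

```php
<?php
// Hypothetical sketch: each occurrence of a name beyond its first
// counts as one repeat, and the total is compared to the threshold.
function sketchCheckRecursiveUrl(string $url, int $repeat_threshold = 3): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    // split into non-empty path segments
    $dirs = array_filter(explode('/', $path), 'strlen');
    $repeats = 0;
    foreach (array_count_values($dirs) as $count) {
        $repeats += $count - 1;
    }
    return $repeats >= $repeat_threshold;
}
```

With this reading, /bob/x/bob/ has one repeat and /bob/alice/bob/alice/ has two, matching the two patterns described above.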

isLocalhostUrl()

isLocalhostUrl(string  $url) : boolean

Checks if a $url is on localhost

Parameters

string $url

the url to check

Returns

boolean —

whether or not it is on localhost

urlMemberSiteArray()

urlMemberSiteArray(string  $url, array  $site_array, string  $name, boolean  $return_rule = false) : mixed

Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.

Parameters

string $url

url to check

array $site_array

sites to check against

string $name

identifier under which $site_array is stored in this function's cache

boolean $return_rule

whether to return the matching site rule, rather than true, when a match is found

Returns

mixed —

whether the url belongs to one of the sites, or the matching site rule when $return_rule is true

cleanRedundantLinks()

cleanRedundantLinks(array  $links, string  $parent_url) : array

Used to delete links from array of links $links based on whether they are the same as the site they came from (or otherwise judged irrelevant)

Parameters

array $links

pairs of the form $link => $link_info

string $parent_url

a site that the links were found on

Returns

array —

just those links which pass the relevancy test

pruneLinks()

pruneLinks(array  $links, integer  $max_links = \seekquarry\yioop\configs\MAX_LINKS_PER_PAGE) : array

Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.

To compute this, we count the number of terms by splitting on white space. We then multiply this by the ratio of the compressed length of the text divided by its uncompressed length.

Parameters

array $links

list of pairs $url=>$text

integer $max_links

maximum number of links from $links to return

Returns

array —

$out_links extracted from $links according to the description above.
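
The scoring rule described above (term count multiplied by compressed length over uncompressed length) can be sketched as follows. The name sketchPruneLinks is hypothetical, the default of 50 merely stands in for the MAX_LINKS_PER_PAGE constant, and using gzcompress from PHP's zlib extension as the compressor is an assumption.

```php
<?php
// Hypothetical sketch of the scoring rule; gzcompress and the default
// of 50 are assumptions, not taken from the library.
function sketchPruneLinks(array $links, int $max_links = 50): array
{
    if (count($links) <= $max_links) {
        return $links;
    }
    $scores = [];
    foreach ($links as $url => $text) {
        $text = (string) $text;
        // number of terms, splitting on white space
        $terms = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
        $len = strlen($text);
        // repetitive text compresses well, so its ratio is small
        $ratio = $len > 0 ? strlen(gzcompress($text)) / $len : 0;
        $scores[$url] = count($terms) * $ratio;
    }
    // highest effective term count first
    arsort($scores);
    $out_links = [];
    foreach (array_slice(array_keys($scores), 0, $max_links) as $url) {
        $out_links[$url] = $links[$url];
    }
    return $out_links;
}
```

Anchor text such as "click here" repeated several times scores lower than varied text with the same number of words, because the repetition compresses away.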

countCompanyLevelDomainsInCommonDetectFarm()

countCompanyLevelDomainsInCommonDetectFarm(string  $url, array  $links, integer  $threshold = 200) : integer

Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations.

This method also tries to determine whether $url is potentially part of a link farm, and returns -1 in either of two situations. (1) The number of distinct, not sub-locale domains with a shared company domain is high (greater than $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control, for example, a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes. (2) There seem to be lots of links ($threshold) from the current domain to a single domain that shares the same company domain. This might indicate a domain md5_hash.foo.com with lots of links to a domain www.foo.com.

Parameters

string $url

the url to compare against $links

array $links

an array of urls

integer $threshold

number above which, if either situation (1) or (2) above occurs, the site is deemed spam

Returns

integer —

the number of times $url shares the cld with a link in $links, or -1 if $url appears to be part of a link farm

getCompanyLevelDomain()

getCompanyLevelDomain(string  $url) : string

Calculates the company level domain for the given url

For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations.

Parameters

string $url

url to determine cld for

Returns

string —

the cld of $url
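
The two examples above can be reproduced with a simple heuristic: keep the last two labels of the host, or the last three when the host ends in a two-letter country code preceded by a generic label such as co. This heuristic, the short list of generic labels, and the name sketchCompanyLevelDomain are all assumptions made for illustration. The sketch also expects the url to include a scheme.

```php
<?php
// Hypothetical heuristic that reproduces the two examples above;
// the list of generic second-level labels is illustrative only.
function sketchCompanyLevelDomain(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return '';
    }
    $labels = explode('.', strtolower($host));
    $n = count($labels);
    if ($n <= 2) {
        return implode('.', $labels);
    }
    $generic = ['co', 'com', 'org', 'net', 'ac', 'edu', 'gov'];
    // e.g. "co.uk": two-letter country code after a generic label
    $is_country_style = strlen($labels[$n - 1]) === 2 &&
        in_array($labels[$n - 2], $generic, true);
    $keep = $is_country_style ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}
```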

guessMimeTypeFromFileName()

guessMimeTypeFromFileName(string  $file_name, string  $default = 'text/plain') : string

Guess mime type based on extension of the file

Parameters

string $file_name

name of the file

string $default

what mime type to return if mime type couldn't be determined

Returns

string —

$mime_type for the given file name
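
Extension-based guessing usually comes down to a lookup table with a fallback. The sketch below shows the idea with a deliberately small table; the name sketchGuessMimeType and the choice of entries are assumptions, and a real table would cover many more extensions.

```php
<?php
// Hypothetical sketch with a small illustrative table.
function sketchGuessMimeType(string $file_name,
    string $default = 'text/plain'): string
{
    $mime_types = [
        'html' => 'text/html',
        'htm' => 'text/html',
        'txt' => 'text/plain',
        'pdf' => 'application/pdf',
        'png' => 'image/png',
        'gif' => 'image/gif',
        'jpg' => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'json' => 'application/json',
    ];
    // pathinfo gives an empty string when there is no extension
    $ext = strtolower(pathinfo($file_name, PATHINFO_EXTENSION));
    return $mime_types[$ext] ?? $default;
}
```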

extractTextFromUrl()

extractTextFromUrl(string  $url) : string

Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl(), except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents.

Parameters

string $url

the url in which to find text that might say what the link is about

Returns

string —

heuristically derived text.