seek_quarry

Class: UrlParser

Source Location: /lib/url_parser.php

Class Overview


Library of functions used to manipulate urls and to extract components from them


Author(s):

  • Chris Pollett




Class Details

[line 46]




Class Methods


static method canonicalLink [line 591]

static string canonicalLink( string $link, string $site, [string $no_fragment = true])

Given a $link that was obtained from a website $site, returns a complete URL for that link.

For example, the $link some_dir/test.html on the $site http://www.somewhere.com/bob would yield the complete url http://www.somewhere.com/bob/some_dir/test.html
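
The resolution described above can be illustrated with a minimal sketch. It is not the source of canonicalLink; the function name canonicalLinkSketch is hypothetical, and the rules it applies (complete links pass through, links starting with / are resolved against the host, everything else is appended to the site path as if that path were a directory) are assumptions chosen to reproduce the example above. It does not handle .. segments or links of the form //host/path.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::canonicalLink itself.
function canonicalLinkSketch(string $link, string $site, bool $no_fragment = true): string
{
    // Optionally drop a #fragment from the link.
    if ($no_fragment) {
        $hash = strpos($link, '#');
        if ($hash !== false) {
            $link = substr($link, 0, $hash);
        }
    }
    // A link that already carries a scheme is treated as complete.
    if (preg_match('/^[a-z][a-z0-9+.-]*:\/\//i', $link) === 1) {
        return $link;
    }
    $parts = parse_url($site);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $link; // no usable base url; hand the link back unchanged
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // A link starting with / is resolved against the host root.
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }
    // Assumption: the site path is treated as a directory.
    $base_path = rtrim($parts['path'] ?? '', '/');
    return $root . $base_path . '/' . $link;
}

echo canonicalLinkSketch('some_dir/test.html', 'http://www.somewhere.com/bob'), "\n";
// prints http://www.somewhere.com/bob/some_dir/test.html
```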




Tags:

return:  a complete url based on these two pieces of information


Parameters:

string   $link   a relative or complete url
string   $site   a base url
string   $no_fragment   if false and the url has a fragment (#link_within_page), the fragment is included in the result


static method checkRecursiveUrl [line 699]

static bool checkRecursiveUrl( string $url, [int $repeat_threshold = 3])

Checks whether a url contains a repeated set of subdirectories and whether the number of repeats reaches a given threshold

A pattern like bob/.../bob counts as one repetition; bob/.../alice/.../bob/.../alice counts as two (... should be read as an ellipsis, not as a directory name). If the threshold is three and there are at least three repeated matches, this function returns true; it returns false otherwise.
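
One way to read the counting rule above is sketched below. This is an interpretation, not the library's code: the hypothetical function checkRecursiveUrlSketch counts every occurrence of a subdirectory name after its first as one repeat and compares the total against the threshold. The actual method may count differently, for instance with a regular expression.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::checkRecursiveUrl itself.
function checkRecursiveUrlSketch(string $url, int $repeat_threshold = 3): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component, so nothing can repeat
    }
    $seen = [];
    $repeats = 0;
    foreach (explode('/', $path) as $segment) {
        if ($segment === '') {
            continue; // skip empty pieces produced by leading or doubled slashes
        }
        if (isset($seen[$segment])) {
            $repeats++; // every occurrence after the first counts as one repeat
        } else {
            $seen[$segment] = true;
        }
    }
    return $repeats >= $repeat_threshold;
}

var_dump(checkRecursiveUrlSketch('http://example.com/bob/alice/bob/alice'));
// bool(false): two repeats, below the default threshold of 3
var_dump(checkRecursiveUrlSketch('http://example.com/a/b/a/b/a/b'));
// bool(true): four repeats
```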




Tags:

return:  whether a repeated subdirectory name with more matches than the threshold was found


Parameters:

string   $url   the url to check
int   $repeat_threshold   the number of repeats of a subdir name to trigger a true response


static method cleanRedundantLinks [line 825]

static array cleanRedundantLinks( array $links, string $parent_url)

Removes links from the array $links that are the same as the site they came from (or are otherwise judged irrelevant)



Tags:

return:  just those links which pass the relevancy test


Parameters:

array   $links   pairs of the form $link => $text
string   $parent_url   a site that the links were found on


static method getDocumentFilename [line 518]

static string getDocumentFilename( string $url)

Gets the filename portion of a url if present; otherwise returns "Some File"



Tags:

return:  the filename portion of this url


Parameters:

string   $url   a url to parse


static method getDocumentType [line 490]

static string getDocumentType( string $url)

Given a url, makes a guess at the file type of the file it points to



Tags:

return:  the guessed file type.


Parameters:

string   $url   a url to figure out the file type for


static method getFragment [line 561]

static string getFragment( string $url)

Get the url fragment string component of a url



Tags:

return:  the url fragment string if present; NULL otherwise


Parameters:

string   $url   a url to get the url fragment string out of


static method getHost [line 116]

static mixed getHost( string $url, [bool $with_login_and_port = true])

Get the host name portion of a url if present; if not return false



Tags:

return:  host portion of the url if present; false otherwise


Parameters:

string   $url   the url to parse
bool   $with_login_and_port   whether to include user, password, and port if present


static method getHostPaths [line 299]

static array getHostPaths( string $url)

Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url

For example, http://host.com/b/c/ would yield http://host.com/, http://host.com/b, http://host.com/b/, http://host.com/b/c, http://host.com/b/c/
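
The prefix list in the example can be reproduced with a short sketch. The function name hostPathsSketch is hypothetical and the code is only one possible way to build the list; it is not taken from the library.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::getHostPaths itself.
function hostPathsSketch(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return [];
    }
    $prefix = $parts['scheme'] . '://' . $parts['host'];
    $path = $parts['path'] ?? '/';
    $prefixes = [$prefix . '/']; // the host alone is always the first prefix
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    $last = count($segments) - 1;
    foreach ($segments as $i => $segment) {
        $prefix .= '/' . $segment;
        $prefixes[] = $prefix;
        // Add the slash-terminated form for inner directories, and for the
        // final segment only when the original path ended with a slash.
        if ($i < $last || substr($path, -1) === '/') {
            $prefixes[] = $prefix . '/';
        }
    }
    return $prefixes;
}

print_r(hostPathsSketch('http://host.com/b/c/'));
// http://host.com/, http://host.com/b, http://host.com/b/,
// http://host.com/b/c, http://host.com/b/c/
```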




Tags:

return:  the array of url prefixes


Parameters:

string   $url   the url to extract prefixes from


static method getHostSubdomains [line 336]

static array getHostSubdomains( string $url)

Gets the subdomains of the host portion of a url.

For example, http://a.b.c/d/f/ will return a.b.c, .a.b.c, b.c, .b.c, c, .c
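
A minimal sketch that produces the list from the example is given below. The name hostSubdomainsSketch is hypothetical; the code only illustrates the pattern (each suffix of the host, with and without a leading dot) and is not the library's implementation.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::getHostSubdomains itself.
function hostSubdomainsSketch(string $url): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return [];
    }
    $labels = explode('.', $host);
    $subdomains = [];
    while (count($labels) > 0) {
        $name = implode('.', $labels);
        $subdomains[] = $name;       // for example b.c
        $subdomains[] = '.' . $name; // for example .b.c
        array_shift($labels);        // drop the leftmost label and repeat
    }
    return $subdomains;
}

print_r(hostSubdomainsSketch('http://a.b.c/d/f/'));
// a.b.c, .a.b.c, b.c, .b.c, c, .c
```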




Tags:

return:  the array of subdomains


Parameters:

string   $url   the url to extract subdomains from


static method getLang [line 158]

static mixed getLang( string $url)

Attempts to guess the language tag based on url



Tags:

return:  top level domain if present; false otherwise


Parameters:

string   $url   the url to parse


static method getPath [line 267]

static string getPath( string $url, [bool $with_query_string = false])

Get the path portion of a url if present; if not return NULL



Tags:

return:  path portion of the url if present; NULL otherwise


Parameters:

string   $url   the url to parse
bool   $with_query_string   whether to also include the query string at the end of the path


static method getQuery [line 543]

static string getQuery( string $url)

Get the query string component of a url
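
For comparison, PHP's built-in parse_url can extract the same component. The sketch below uses a hypothetical function name, querySketch, and shows only the built-in behaviour; whether getQuery relies on parse_url internally is not stated here.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::getQuery itself.
function querySketch(string $url): ?string
{
    // parse_url returns null when the requested component is absent
    // and false when the url is seriously malformed.
    $query = parse_url($url, PHP_URL_QUERY);
    return is_string($query) ? $query : null;
}

var_dump(querySketch('http://example.com/a?x=1&y=2#top')); // string(7) "x=1&y=2"
var_dump(querySketch('http://example.com/a'));             // NULL
```

The same call with PHP_URL_FRAGMENT or PHP_URL_PATH returns the fragment or the path.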



Tags:

return:  the query string if present; NULL otherwise


Parameters:

string   $url   a url to get the query string out of


static method getWordsIfHostUrl [line 418]

static string getWordsIfHostUrl( string $url)

Given a url, extracts the words in the host part of the url, provided the url has no path beyond / .

Ignores a leading www and also ignores the tld.

For example, "http://www.yahoo.com/" returns " yahoo "
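
The behaviour described above can be sketched as follows. The name wordsIfHostUrlSketch is hypothetical, and returning an empty string when the url has a path or no host is an assumption; the text above does not say what the method returns in that case.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::getWordsIfHostUrl itself.
function wordsIfHostUrlSketch(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return ''; // assumption: nothing to extract without a host
    }
    $path = $parts['path'] ?? '';
    if ($path !== '' && $path !== '/') {
        return ''; // assumption: urls with a real path yield no words
    }
    $labels = explode('.', $parts['host']);
    if ($labels[0] === 'www') {
        array_shift($labels); // ignore a leading www
    }
    array_pop($labels);       // ignore the tld
    if (count($labels) === 0) {
        return '';
    }
    return ' ' . implode(' ', $labels) . ' ';
}

var_dump(wordsIfHostUrlSketch('http://www.yahoo.com/')); // string(7) " yahoo "
```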




Tags:

return:  space separated words extracted.


Parameters:

string   $url   a url to extract words from


static method getWordsLastPathPartUrl [line 454]

static string getWordsLastPathPartUrl( string $url)

Given a url, extracts the words in the last path part of the url

For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "
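
A minimal sketch that reproduces the example is shown below. The name wordsLastPathPartSketch is hypothetical, and the rules it uses (drop the final file extension, then split on anything that is not a letter or digit) are assumptions that happen to match the example; the library may tokenize differently.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::getWordsLastPathPartUrl itself.
function wordsLastPathPartSketch(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // Last path part without its final extension,
    // for example function.array-filter.php becomes function.array-filter
    $name = pathinfo(basename($path), PATHINFO_FILENAME);
    // Assumption: any run of non-alphanumeric characters separates words.
    $words = preg_split('/[^A-Za-z0-9]+/', $name, -1, PREG_SPLIT_NO_EMPTY);
    if (!is_array($words) || count($words) === 0) {
        return '';
    }
    return ' ' . implode(' ', $words) . ' ';
}

var_dump(wordsLastPathPartSketch('http://us3.php.net/manual/en/function.array-filter.php'));
// string(23) " function array filter "
```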




Tags:

return:  space separated words extracted.


Parameters:

string   $url   a url to extract words from


static method hasHostUrl [line 102]

static bool hasHostUrl( string $url)

Checks if the url has a host part.



Tags:

return:  true if it does; false otherwise


Parameters:

string   $url   the url to check


static method isLocalhostUrl [line 727]

static bool isLocalhostUrl( string $url)

Checks if a $url is on localhost



Tags:

return:  whether or not it is on localhost


Parameters:

string   $url   the url to check


static method isPathMemberRegexPaths [line 365]

static bool isPathMemberRegexPaths( string $path, array $robot_paths)

Checks if $path matches against any of the Robots.txt style regex paths in $robot_paths
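
As an illustration of robots.txt style matching, the sketch below converts each pattern into a PCRE expression. The name pathMatchesRobotPathsSketch is hypothetical, and the pattern syntax it assumes (prefix matching, * for any run of characters, a trailing $ to anchor the end) follows common robots.txt conventions; the exact syntax accepted by isPathMemberRegexPaths is not specified here.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::isPathMemberRegexPaths itself.
function pathMatchesRobotPathsSketch(string $path, array $robot_paths): bool
{
    foreach ($robot_paths as $robot_path) {
        if ($robot_path === '') {
            continue; // an empty rule matches nothing in this sketch
        }
        $anchored = substr($robot_path, -1) === '$';
        if ($anchored) {
            $robot_path = substr($robot_path, 0, -1);
        }
        // Quote every regex metacharacter, then turn the quoted * back
        // into "any run of characters".
        $regex = str_replace('\*', '.*', preg_quote($robot_path, '#'));
        $regex = '#^' . $regex . ($anchored ? '$' : '') . '#';
        if (preg_match($regex, $path) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(pathMatchesRobotPathsSketch('/private/a.html', ['/private/'])); // bool(true)
var_dump(pathMatchesRobotPathsSketch('/img/cat.gif', ['/*.gif$']));      // bool(true)
var_dump(pathMatchesRobotPathsSketch('/public/a.html', ['/private/']));  // bool(false)
```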



Tags:

return:  whether it is a member or not


Parameters:

string   $path   a path component of a url
array   $robot_paths   in format of robots.txt regex paths


static method isSchemeHttpOrHttps [line 56]

static bool isSchemeHttpOrHttps( string $url)

Checks if the url scheme is either http or https.
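
A comparable check can be written with PHP's parse_url, as in the sketch below. The name isHttpOrHttpsSketch is hypothetical, and treating a url without any scheme as false is an assumption; the method documented here may handle that case differently.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::isSchemeHttpOrHttps itself.
function isHttpOrHttpsSketch(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // parse_url keeps the scheme's original case, so normalise before comparing.
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(isHttpOrHttpsSketch('https://example.com/')); // bool(true)
var_dump(isHttpOrHttpsSketch('ftp://example.com/'));   // bool(false)
```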



Tags:

return:  returns true if it is either http or https and false otherwise


Parameters:

string   $url   the url to check


static method isVideoUrl [line 798]

static bool isVideoUrl( string &$url, array $video_prefixes)

Checks if a URL corresponds to a known playback page of a video sharing site



Tags:

return:  whether or not corresponds to video playback page of a known video site


Parameters:

string   &$url   the url to check
array   $video_prefixes   an array of prefixes of video sites


static method simplifyUrl [line 76]

static string simplifyUrl( string $url, [int $max_len = 0])

Converts a url with a scheme into one without. Also removes trailing slashes from the url. If necessary, shortens the url to the desired length by replacing part of it with an ellipsis
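
The three steps above (drop the scheme, drop trailing slashes, shorten) can be sketched as follows. The name simplifyUrlSketch is hypothetical. Where the ellipsis goes is an assumption: this sketch keeps the start and the end of the url and puts three dots in the middle, which may differ from what simplifyUrl does.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::simplifyUrl itself.
function simplifyUrlSketch(string $url, int $max_len = 0): string
{
    // Drop a leading scheme such as http:// or https://
    $simple = (string) preg_replace('/^[a-z][a-z0-9+.-]*:\/\//i', '', $url);
    $simple = rtrim($simple, '/');
    if ($max_len > 0 && strlen($simple) > $max_len) {
        // Assumption: keep the head and the tail, with ... in between.
        $keep = max($max_len - 3, 2);
        $head = intdiv($keep, 2);
        $tail = $keep - $head;
        $simple = substr($simple, 0, $head) . '...' . substr($simple, -$tail);
    }
    return $simple;
}

echo simplifyUrlSketch('http://www.example.com/'), "\n";
// prints www.example.com
echo simplifyUrlSketch('https://www.example.com/a/very/long/path/', 20), "\n";
// prints www.exam...long/path
```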



Tags:

return:  the trimmed url


Parameters:

string   $url   the url to trim
int   $max_len   length to shorten url to, 0 = no shortening


static method urlMemberSiteArray [line 759]

static mixed urlMemberSiteArray( string $url, array $site_array, [bool $return_rule = false])

Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.
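
The two rule forms can be illustrated with the sketch below. The name urlMemberSiteArraySketch is hypothetical. Matching a domain: rule by exact host equality is an assumption made for this example; the library may apply a looser comparison.

```php
<?php
// Hypothetical helper for illustration; not UrlParser::urlMemberSiteArray itself.
function urlMemberSiteArraySketch(string $url, array $site_array, bool $return_rule = false)
{
    $host = parse_url($url, PHP_URL_HOST);
    foreach ($site_array as $site) {
        if (strncmp($site, 'domain:', 7) === 0) {
            // Assumption: a domain: rule matches when the host is identical.
            $match = is_string($host) && $host === substr($site, 7);
        } else {
            // A url rule matches when it is a substring of the passed url.
            $match = $site !== '' && strpos($url, $site) !== false;
        }
        if ($match) {
            return $return_rule ? $site : true;
        }
    }
    return false;
}

var_dump(urlMemberSiteArraySketch('http://www.example.com/a', ['domain:www.example.com']));
// bool(true)
var_dump(urlMemberSiteArraySketch('http://www.example.com/a/b', ['http://www.example.com/a'], true));
// string(24) "http://www.example.com/a"
```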



Tags:

return:  whether the url belongs to one of the sites; if $return_rule is true, the matching site rule is returned instead of true


Parameters:

string   $url   url to check
array   $site_array   sites to check against
bool   $return_rule   whether to return the matching site rule, rather than true, when a match is found



Documentation generated by phpDocumentor 1.4.3