SeekQuarry/Yioop --
Open Source Pure PHP Search Engine, Crawler, and Indexer
Copyright (C) 2009 - 2020 Chris Pollett chris@pollett.org
LICENSE:
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see https://www.gnu.org/licenses/.
END LICENSE
A library of string, error reporting, log, hash, time, and conversion
functions
Replaces a pcre pattern with a replacement in $subject starting from
some offset.
Parameters
string
$pattern
a Perl compatible regular expression
string
$replacement
what to replace the pattern with
string
$subject
to search for pattern in
integer
$offset
character offset into $subject to begin searching from
Returns
string
—
result of the replacements
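One way to picture replacing from an offset is to split the subject, run preg_replace on the tail only, and glue the pieces back together. The sketch below does that; the name replaceFromOffset is hypothetical, and it assumes the offset counts bytes rather than multibyte characters. It is not the library's code.

```php
<?php
// Hypothetical helper, not Yioop's implementation: apply a PCRE
// replacement only to the part of $subject at or after $offset.
function replaceFromOffset(string $pattern, string $replacement,
    string $subject, int $offset = 0): string
{
    // Assumption: $offset counts bytes; mb_substr would be needed
    // for offsets measured in multibyte characters.
    $head = substr($subject, 0, $offset);
    $tail = substr($subject, $offset);
    return $head . preg_replace($pattern, $replacement, $tail);
}

echo replaceFromOffset('/a/', 'X', 'banana', 3), "\n"; // prints "banXnX"
```

With this approach an anchor such as ^ in the pattern matches at the offset rather than at the true start of the subject, which may or may not be what a caller wants.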
parse_ini_with_fallback()
parse_ini_with_fallback(string $file) : array
Yioop replacement for parse_ini_file($name, true) in case
parse_ini_file is on the disable_functions list. Name has underscores
to match original function. This function
checks whether parse_ini_file is disabled. If it is not disabled, it just
calls parse_ini_file; otherwise, it simulates it well enough that
configure.ini files used for string translations can be read.
Parameters
string
$file
filename of ini data to parse into an array
Returns
array
—
data parsed from file
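The check-then-fall-back idea can be sketched as follows. All function names here (iniFileIsUsable, simpleIniParse, parseIniSketch) are hypothetical, and the fallback parser only understands [sections], ; comments, and name = "value" lines; the real function may accept more or less than this.

```php
<?php
// Hypothetical names throughout; a sketch, not Yioop's parser.
function iniFileIsUsable(): bool
{
    $disabled = array_map('trim',
        explode(',', (string) ini_get('disable_functions')));
    return function_exists('parse_ini_file') &&
        !in_array('parse_ini_file', $disabled, true);
}

// Minimal fallback: [sections], ;comments, and name = "value" lines.
function simpleIniParse(string $text): array
{
    $data = [];
    $section = null;
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === ';') {
            continue;
        }
        if (preg_match('/^\[(.+)\]$/', $line, $m)) {
            $section = $m[1];
            $data[$section] = [];
        } elseif (preg_match('/^([^=]+?)\s*=\s*(?:"(.*)"|(.*))$/',
            $line, $m)) {
            // $m[2] is the quoted form, $m[3] the unquoted form
            $value = ($m[2] !== '') ? $m[2] : ($m[3] ?? '');
            if ($section === null) {
                $data[$m[1]] = $value;
            } else {
                $data[$section][$m[1]] = $value;
            }
        }
    }
    return $data;
}

function parseIniSketch(string $file): array
{
    if (iniFileIsUsable()) {
        $parsed = parse_ini_file($file, true);
        return $parsed === false ? [] : $parsed;
    }
    return simpleIniParse((string) file_get_contents($file));
}
```

Unlike parse_ini_file, this fallback returns every value as a string and does no type conversion.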
getIniAssignMatch()
getIniAssignMatch(string $matches) : mixed
Auxiliary function called from parse_ini_with_fallback to extract from
the $matches array produced by the former function's preg_match
what kind of assignment occurred in the ini file being parsed.
Parameters
string
$matches
produced by a preg_match in
parse_ini_with_fallback
Given a packed integer string, uses the top three bytes to calculate
a doc_index of a document in the shard, and uses the low order byte
to compute the number of occurrences of a word in that document.
Parameters
string
$posting
a string containing
a doc index position list pair encoded using modified9
\seekquarry\yioop\library\int&
$offset
an offset into the string where the modified9 posting
is encoded
boolean
$dedelta
if true then assumes the list is a sequence of
differences (a delta list) and undoes the difference to get
the original sequence
Returns
array
—
consisting of integer doc_index and a subarray consisting
of integer positions of word in doc.
the corresponding list of differences of adjacent
integers
deDeltaList()
deDeltaList(array $delta_list) : array
Given an array of differences of integers reconstructs the
original list. This computes the inverse of the deltaList function
Parameters
array
$delta_list
a list of nonnegative integers
Returns
array
—
a nondecreasing list of integers
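A delta list and its inverse can be sketched with a running difference and a running sum. The function names below are hypothetical, and the sketch assumes the first entry of a delta list is the first value itself (the difference from 0); the library's convention for the first entry may differ.

```php
<?php
// Assumption: the first entry of a delta list is the first value itself,
// and each later entry is the gap from the previous value.
function deltaListSketch(array $list): array
{
    $deltas = [];
    $previous = 0;
    foreach ($list as $value) {
        $deltas[] = $value - $previous;
        $previous = $value;
    }
    return $deltas;
}

// Inverse operation: a running sum rebuilds the original values.
function deDeltaListSketch(array $delta_list): array
{
    $list = [];
    $running = 0;
    foreach ($delta_list as $delta) {
        $running += $delta;
        $list[] = $running;
    }
    return $list;
}

print_r(deDeltaListSketch([3, 2, 0, 5])); // values 3, 5, 5, 10
```

Because every difference is nonnegative, the rebuilt list can never decrease, which matches the return description.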
encodeModified9()
encodeModified9(array $list) : string
Encodes a sequence of integers x, such that 1 <= x <= 2<<28-1
as a string. NOTICE x>=1.
The encoded string is a sequence of 4 byte words (packed int's).
The high order 2 bits of a given word indicate whether or not
to look at the next word. The codes are as follows:
11 start of encoded string, 10 continue four more bytes, 01 end of
encoded, and 00 indicates whole sequence encoded in one word.
After the high order 2 bits, the next most significant bits indicate
the format of the current word. There are nine possibilities:
00 - 1 28 bit number
01 - 2 14 bit numbers
10 - 3 9 bit numbers
1100 - 4 6 bit numbers
1101 - 5 5 bit numbers
1110 - 6 4 bit numbers
11110 - 7 3 bit numbers
111110 - 12 2 bit numbers
111111 - 24 1 bit numbers
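To make the nine formats concrete, the sketch below stores them as a lookup table and picks the densest format whose slots are wide enough for a given bit width. The constant and function names are hypothetical, and the sketch deliberately takes the required width as input because the text does not say how a value x maps to stored bits.

```php
<?php
// The nine formats described above: [format bits, count, bits per number],
// ordered from densest to least dense.
const MODIFIED9_LAYOUTS = [
    ['111111', 24, 1],
    ['111110', 12, 2],
    ['11110', 7, 3],
    ['1110', 6, 4],
    ['1101', 5, 5],
    ['1100', 4, 6],
    ['10', 3, 9],
    ['01', 2, 14],
    ['00', 1, 28],
];

// Hypothetical helper: densest layout whose slots hold $bits_needed bits.
function layoutForWidth(int $bits_needed): ?array
{
    foreach (MODIFIED9_LAYOUTS as [$format, $count, $width]) {
        if ($bits_needed <= $width) {
            return ['format' => $format, 'count' => $count,
                'width' => $width];
        }
    }
    return null; // wider than 28 bits: no single word can hold it
}

print_r(layoutForWidth(7)); // format 10: three 9 bit numbers
```

An encoder would presumably look at a run of upcoming numbers, find the widest one, and use a lookup like this to decide how many of them fit in the next word.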
Parses a provided string to make a DOM object. First tries to parse
using XML and if this fails uses the more robust HTML Dom parser
and manipulates the resulting DOM tree to make it correspond to the original
tags for XML that isn't HTML
Parameters
string
$to_parse
the string to parse a DOMDocument from
Returns
\seekquarry\yioop\library\DOMDocument
—
based on the provided string
toHexString()
toHexString(string $str) : string
Converts a string to a string where each char has been replaced by its
hexadecimal equivalent
Parameters
string
$str
what we want rewritten in hex
Returns
string
—
the hexified string
toIntString()
toIntString(string $str) : string
Converts a string to a string where each char has been replaced by its integer
equivalent
Parameters
string
$str
what we want rewritten as integers
Returns
string
—
the string of integers
toBinString()
toBinString(string $str) : string
Converts a string to a string where each char has been replaced by its
binary equivalent
Parameters
string
$str
what we want rewritten in binary
Returns
string
—
the binary string
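The three conversions above differ only in how each byte is rendered, so they can be sketched with one shared loop. The names mapBytes, hexSketch, intSketch, and binSketch are hypothetical, and joining the pieces with single spaces is an assumption about the output format.

```php
<?php
// Hypothetical helper: map every byte of $str through $convert and join
// the pieces with spaces. The space separator is an assumption.
function mapBytes(string $str, callable $convert): string
{
    $parts = [];
    for ($i = 0, $len = strlen($str); $i < $len; $i++) {
        $parts[] = $convert(ord($str[$i]));
    }
    return implode(' ', $parts);
}

function hexSketch(string $str): string
{
    return mapBytes($str, 'dechex');
}

function intSketch(string $str): string
{
    return mapBytes($str, 'strval');
}

function binSketch(string $str): string
{
    return mapBytes($str, 'decbin');
}

echo hexSketch("AB"), "\n"; // 41 42
echo intSketch("AB"), "\n"; // 65 66
echo binSketch("AB"), "\n"; // 1000001 1000010
```

Output like this is mainly useful for inspecting packed binary strings while debugging.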
metricToInt()
metricToInt(string $metric_num) : integer
Converts a string consisting of an integer followed by K, M, or G
into its integer equivalent. For example, 4K would become 4000,
16M would become 16000000, and 1G would become 1000000000.
Note: K, M, and G are treated as powers of 1000, not powers of 2.
Parameters
string
$metric_num
metric number to convert
Returns
integer
—
number the metric string corresponded to
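The conversion amounts to a suffix lookup and a multiplication. Here is a minimal sketch under a hypothetical name; accepting lowercase suffixes and passing through strings with no suffix are assumptions.

```php
<?php
// Hypothetical sketch of the decimal suffix conversion described above.
function metricToIntSketch(string $metric_num): int
{
    $multipliers = ['K' => 1000, 'M' => 1000000, 'G' => 1000000000];
    $metric_num = trim($metric_num);
    // Assumption: lowercase suffixes are accepted as well.
    $suffix = strtoupper(substr($metric_num, -1));
    if (isset($multipliers[$suffix])) {
        return intval(substr($metric_num, 0, -1)) * $multipliers[$suffix];
    }
    return intval($metric_num); // no suffix: plain integer
}

echo metricToIntSketch('4K'), "\n";  // 4000
echo metricToIntSketch('16M'), "\n"; // 16000000
echo metricToIntSketch('1G'), "\n";  // 1000000000
```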
intToMetric()
intToMetric(integer $num) : string
Converts a number to a string followed by nothing, K, M, G, T
depending on whether number is < 1000, < 10^6, < 10^9, or < 10^(12)
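The threshold scheme can be sketched as a walk down a table of unit sizes. The function name is hypothetical, and two details are assumptions: the quotient is truncated rather than rounded, and numbers of 10^12 or more get the T suffix. The sketch also assumes 64 bit integers.

```php
<?php
// Hypothetical sketch; truncation and the T branch are assumptions.
function intToMetricSketch(int $num): string
{
    $units = [
        [1000000000000, 'T'],
        [1000000000, 'G'],
        [1000000, 'M'],
        [1000, 'K'],
    ];
    foreach ($units as [$size, $suffix]) {
        if ($num >= $size) {
            return intdiv($num, $size) . $suffix;
        }
    }
    return (string) $num; // below 1000: no suffix
}

echo intToMetricSketch(999), "\n";      // 999
echo intToMetricSketch(4000), "\n";     // 4K
echo intToMetricSketch(16500000), "\n"; // 16M
```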
name of log file in the LOG_DIR directory. Rotated logs
will also use this as their basename followed by a number, and they are
gzipped. (Older versions of Yioop used bzip. Some distros don't have bzip
but do have gzip, and gzip was already being used elsewhere in Yioop, so
bzip was replaced to remove the dependency.)
boolean
$check_process_handler
whether or not to call the processHandler
to check how long the code has run since the last time processHandler
was called.
crawlTimeoutLog()
crawlTimeoutLog(mixed $msg)
Writes a log message $msg if more than LOG_TIMEOUT time has passed since
the last time crawlTimeoutLog was called. Useful in loops to write a message
as progress is made through the loop (not on every iteration, but,
say, every 30 seconds).
Parameters
mixed
$msg
usually a string with what is to be printed out after the
timeout period. If $msg === true then clears the timeout cache
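The throttling pattern can be sketched with a static variable that remembers when a message was last written. The function name throttledLog, the explicit $timeout and $now parameters, and the use of echo in place of a real log call are all assumptions made so the sketch is self-contained.

```php
<?php
// Hypothetical throttle: prints $msg and returns true only when at least
// $timeout seconds have passed since the last printed message.
function throttledLog($msg, int $timeout = 30, ?int $now = null): bool
{
    static $last = 0;
    if ($msg === true) {
        $last = 0; // clear the remembered time
        return false;
    }
    $now = $now ?? time();
    if ($now - $last < $timeout) {
        return false; // too soon, stay quiet
    }
    $last = $now;
    echo $msg, "\n"; // Assumption: echo stands in for the log write
    return true;
}

// Inside a long loop, only an occasional message gets through.
for ($i = 0; $i < 1000; $i++) {
    throttledLog("..processed $i items");
}
```

Passing the current time in as a parameter is only there to make the behaviour easy to exercise; a caller would normally leave it out.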
Computes an 8 byte hash of a string for use in storing documents.
An eight byte hash was chosen so that the odds of collision, even for
a few billion documents, are still reasonably low by the birthday problem.
If the raw flag is set to false then an 11 byte base64 encoding of the
8 byte hash is returned. The hash is calculated as the xor of the
two halves of the 16 byte md5 of the string. (8 bytes takes less storage
which is useful for keeping more doc info in memory)
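The hash construction described above can be sketched in a few lines. The function name is hypothetical, and the exact base64 variant is an assumption: the sketch uses a URL-safe alphabet and strips the padding, which is one way to arrive at 11 characters from 8 bytes.

```php
<?php
// Hypothetical sketch of the XOR-of-md5-halves hash described above.
function crawlHashSketch(string $string, bool $raw = false): string
{
    $md5 = md5($string, true);                       // 16 raw bytes
    $hash = substr($md5, 0, 8) ^ substr($md5, 8, 8); // XOR the two halves
    if ($raw) {
        return $hash;
    }
    // Assumption: URL-safe base64 alphabet with the '=' padding removed.
    return rtrim(strtr(base64_encode($hash), '+/', '-_'), '=');
}

echo strlen(crawlHashSketch('https://www.example.com/', true)), "\n"; // 8
echo strlen(crawlHashSketch('https://www.example.com/')), "\n";       // 11
```

Eight bytes encode to twelve base64 characters, one of which is padding, so dropping the padding leaves the 11 characters mentioned in the description.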
Used to create a 20 byte hash of a string (typically a word or phrase
with a wikipedia page). Format is 8 byte crawlHash of term (md5 of term
two halves XOR'd), followed by a \x00, followed by the first 11 characters
from the term. If there are not enough chars to make 20 bytes, then the
string is padded with \x00s to 20 bytes.
Parameters
string
$string
word to hash
boolean
$raw
whether to base64Hash the result
Returns
string
—
first 8 bytes of md5 of $string, concatenated with \x00
to indicate the hash is of a word not a phrase, concatenated with the
$meta_string padded to 11 bytes.
Used to compute all hashes for a phrase based on each possible cond_max
point. Here cond_max is the location of a substring of a phrase which is
maximal.
Given a string makes a 20 byte hash path, where the first 8 bytes are
a hash of the string before path start and the last 12 bytes are the path
given by splitting on space and separately hashing each element
according to the number of elements and the 3 bit selector below:
general format: (64 bit lead word hash, 3 bit selector, hashes of rest of
words) according to:
Selector Bits for each remaining word
001 29 32 32
010 29 16 16 16 16
011 29 16 16 8 8 8 8
100 29 16 16 8 8 4 4 4 4
101 29 16 16 8 8 4 4 2 2 2 2
110 29 16 16 8 8 4 4 2 2 1 1 1 1
If $path_start is 0 behaves like crawlHashWord(). The above encoding is
typically used to make word_ids for whole phrases, to make word id's
for single words, the format is
(64 bits for word, 1 byte null, then ignored 11 bytes ).
Parameters
string
$string
what to hash
integer
$path_start
what to use as the split between 5 byte front
hash and the rest
Used by a controller to take a table and return those rows in the
table that a given queue_server would be responsible for handling
Parameters
array
$table
an array of rows of associative arrays which
a queue_server might need to process
string
$field
column of $table whose values should be used
for partitioning
integer
$num_partition
number of queue_servers to choose between
integer
$instance
the id of the particular server we are interested
in
object
$callback
function or static method that might be
applied to input before deciding the responsible queue_server.
For example, if input was a url we might want to get the host
before deciding on the queue_server
Returns
array
—
the reduced table that the $instance queue_server is
responsible for
Used by a controller to say which queue_server should receive
a given input
Parameters
string
$input
can view as a key that might be processed by a
queue_server. For example, in some cases input might be
a url and we want to determine which queue_server should be
responsible for queuing that url
integer
$num_partition
number of queue_servers to choose between
object
$callback
function or static method that might be
applied to input before deciding the responsible queue_server.
For example, if the input was a url we might want to get the host
before deciding on the queue_server
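Both partitioning helpers above can be pictured as "hash the key, take it modulo the number of servers". The sketch below uses crc32 purely as a stand-in hash, and every function name in it is hypothetical; the library's own hash is likely different. It also assumes a 64 bit PHP build, where crc32 never returns a negative number.

```php
<?php
// Assumption: crc32 stands in for whatever hash the library really uses.
function partitionSketch(string $input, int $num_partition,
    ?callable $callback = null): int
{
    if ($callback !== null) {
        $input = $callback($input); // e.g. reduce a url to its host
    }
    return crc32($input) % $num_partition;
}

// Keep only the rows whose $field value maps to server $instance.
function partitionTableSketch(array $table, string $field,
    int $num_partition, int $instance, ?callable $callback = null): array
{
    return array_values(array_filter($table,
        function ($row) use ($field, $num_partition, $instance, $callback) {
            return partitionSketch((string) $row[$field], $num_partition,
                $callback) === $instance;
        }));
}

// Hypothetical callback: partition urls by host so a site stays together.
$host = function (string $url): string {
    return (string) parse_url($url, PHP_URL_HOST);
};
echo partitionSketch('https://www.example.com/a', 4, $host), "\n";
```

Applying the callback first means every url from the same host lands on the same queue_server, which is the motivation given in the parameter description.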
Checks that a timestamp is within the time interval given by a
start time (HH:mm) and a duration
Parameters
string
$start_time
string of the form (HH:mm)
string
$duration
string containing an int in seconds
integer
$time
a Unix timestamp.
Returns
integer
—
-1 if the time of day of $time is not within the given interval.
Otherwise, the Unix timestamp at which the interval will be over for
the same day as $time.
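The interval check can be sketched by building the start timestamp on the same calendar day as $time and comparing. The function name is hypothetical, and the sketch interprets everything in UTC, which is an assumption; it also does not try to handle intervals that began on the previous day.

```php
<?php
// Assumption: all times are interpreted in UTC. A real deployment would
// more likely use the server's configured timezone.
function checkIntervalSketch(string $start_time, string $duration,
    int $time): int
{
    [$hour, $minute] = array_map('intval', explode(':', $start_time));
    // Start of the interval on the same (UTC) day as $time
    $start = gmmktime($hour, $minute, 0, (int) gmdate('n', $time),
        (int) gmdate('j', $time), (int) gmdate('Y', $time));
    $end = $start + intval($duration);
    return ($start <= $time && $time < $end) ? $end : -1;
}

$t = gmmktime(10, 30, 0, 1, 15, 2020);             // 10:30 UTC
echo checkIntervalSketch('12:00', '3600', $t), "\n"; // -1
```

Returning the end timestamp rather than a boolean lets a caller know how long it may keep running before the window closes.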
convertPixels()
convertPixels(string $value) : integer
Converts a CSS unit string into its equivalent in pixels. This is
used by SvgProcessor.
Parameters
string
$value
a number followed by a legal CSS unit
Returns
integer
—
a number in pixels
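A conversion like this can be sketched with a table of pixels per unit at the CSS reference density of 96 pixels per inch. The function name is hypothetical, and several choices are assumptions: only absolute units are handled, bare numbers and unknown units are treated as pixels, and the result is rounded to the nearest integer.

```php
<?php
// Hypothetical sketch; handling of bare numbers and unknown units
// (em, %, and so on) as pixels is an assumption.
function convertPixelsSketch(string $value): int
{
    // Pixels per unit at the CSS reference density of 96 px per inch.
    $per_unit = ['px' => 1.0, 'pt' => 96 / 72, 'pc' => 16.0,
        'in' => 96.0, 'cm' => 96 / 2.54, 'mm' => 96 / 25.4];
    if (!preg_match('/^\s*(-?\d*\.?\d+)\s*([a-z]*)\s*$/i', $value, $m)) {
        return 0; // not a number followed by a unit
    }
    $unit = strtolower($m[2]) ?: 'px';
    return (int) round(floatval($m[1]) * ($per_unit[$unit] ?? 1.0));
}

echo convertPixelsSketch('1in'), "\n";  // 96
echo convertPixelsSketch('72pt'), "\n"; // 96
echo convertPixelsSketch('10px'), "\n"; // 10
```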
makePath()
makePath(string $path) : boolean
Creates folders along a filesystem path if they don't exist
Parameters
string
$path
a file system path
Returns
boolean
—
success or failure
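In PHP, creating every missing folder along a path is mostly a matter of passing the recursive flag to mkdir. A minimal sketch under a hypothetical name follows; the 0777 mode (still subject to the process umask) is an assumption.

```php
<?php
// Hypothetical sketch of creating all folders along a path.
function makePathSketch(string $path): bool
{
    if (is_dir($path)) {
        return true; // nothing to do
    }
    // The third argument asks mkdir to create missing parent folders too.
    // The second is_dir check covers another process creating it first.
    return @mkdir($path, 0777, true) || is_dir($path);
}

$dir = sys_get_temp_dir() . '/make_path_sketch/a/b';
var_dump(makePathSketch($dir));
```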
deleteFileOrDir()
deleteFileOrDir(string $file_or_dir)
This is a callback function used in the process of recursively deleting a
directory
Parameters
string
$file_or_dir
the filename or directory name to be deleted
setWorldPermissions()
setWorldPermissions(string $file)
This is a callback function used in the process of recursively chmoding to
777 all files in a folder
Checks if class_1 is the same as class_2 or has class_2 as a parent
Behaves like 3 param version (last param true) of PHP is_a function
that came into being with Version 5.3.9.
Parameters
mixed
$class_1
object or string class name to see if in class2
mixed
$class_2
object or string class name to see if contains class1
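Since PHP 5.3.9 the built-in is_a accepts a third argument that allows class name strings, so the behaviour described above can be sketched as a thin wrapper. The function and class names below are hypothetical.

```php
<?php
// Hypothetical wrapper around the three parameter form of is_a.
function generalIsASketch($class_1, $class_2): bool
{
    if (is_object($class_2)) {
        $class_2 = get_class($class_2); // is_a needs a class name here
    }
    return is_a($class_1, $class_2, true);
}

// Sample classes used only for illustration
class SketchBase
{
}
class SketchChild extends SketchBase
{
}

var_dump(generalIsASketch('SketchChild', 'SketchBase'));     // bool(true)
var_dump(generalIsASketch(new SketchChild(), 'SketchBase')); // bool(true)
var_dump(generalIsASketch('SketchBase', 'SketchChild'));     // bool(false)
```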
Computes a Unix-style diff of two strings. That is, it only
outputs lines which disagree between the two strings. It outputs +line
if a line occurs in the second but not first string and -line if a
line occurs in the first string but not the second.
Parameters
string
$data1
first string to compare
string
$data2
second string to compare
boolean
$html
whether to output html highlighting
Returns
string
—
representing info about where $data1 and $data2 don't match
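A line diff of this kind is commonly built on a longest common subsequence table: lines on the common subsequence are skipped, and everything else is reported with a - or + prefix. The sketch below shows that approach under a hypothetical name. It omits the $html option and does not trim common leading lines first, so it should be read as an illustration rather than the library's algorithm.

```php
<?php
// Hypothetical line diff built on a longest-common-subsequence table.
function diffSketch(string $data1, string $data2): string
{
    $a = explode("\n", $data1);
    $b = explode("\n", $data2);
    $n = count($a);
    $m = count($b);
    // $len[$i][$j] = LCS length of lines $a[$i..] and $b[$j..]
    $len = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $len[$i][$j] = ($a[$i] === $b[$j])
                ? $len[$i + 1][$j + 1] + 1
                : max($len[$i + 1][$j], $len[$i][$j + 1]);
        }
    }
    $out = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $i++;
            $j++; // common line: not reported
        } elseif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            $out[] = '-' . $a[$i++]; // only in the first string
        } else {
            $out[] = '+' . $b[$j++]; // only in the second string
        }
    }
    while ($i < $n) {
        $out[] = '-' . $a[$i++];
    }
    while ($j < $m) {
        $out[] = '+' . $b[$j++];
    }
    return implode("\n", $out);
}

echo diffSketch("a\nb\nc", "a\nx\nc"), "\n"; // prints "-b" then "+x"
```

The table costs time and memory proportional to the product of the two line counts, which is one reason to strip common leading lines before computing it.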
Extracts from a table of longest common subsequence moves (probably calculated
by computeLCS) and a starting coordinate $i, $j in that table,
a longest common subsequence
Parameters
array
$lcs_moves
a table of moves computed by computeLCS
array
$lines
from first of the two arrays computing LCS of
integer
$i
a line number in string 1
integer
$j
a line number in string 2
integer
$offset
a number to add to each line number output into $lcs.
This is useful if we have trimmed off the initially common lines from
our two strings we are trying to compute the LCS of
\seekquarry\yioop\library\array&
$lcs
an array of triples
(index_string1, index_string2, line)
the indexes indicate the line number in each string, line is the line
common to the two strings