BLOCK_SIZE
How many bytes to read in one go from a wiki file when creating a filter
Library of functions used to create and extract n word grams
ngramsContains($phrase, string $lang, string $filter_prefix = 2) : boolean
Says whether or not the phrase exists in the N word gram Bloom Filter
$phrase | phrase to check for in the n word gram filter |
string | $lang | language of the n word grams file |
string | $filter_prefix | either the word "segment", "all", or the number n of words in an n gram in the filter |
Returns true if the phrase is in the filter, false otherwise
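A minimal usage sketch, assuming (as in Yioop's library) that this method is exposed statically on a `NWordGrams` class in the `seekquarry\yioop\library` namespace and that a filter file for the given language has already been built:

```php
<?php
// Hypothetical usage sketch: check whether a phrase appears in the
// n word gram Bloom filter for English. Assumes a 2 word gram filter
// file for "en-US" already exists on disk.
use seekquarry\yioop\library\NWordGrams;

$phrase = "new york";
// The third argument selects which filter file to consult: the word
// "segment", "all", or the number n of words per gram.
if (NWordGrams::ngramsContains($phrase, "en-US", 2)) {
    echo "'$phrase' is treated as a single 2 word gram\n";
}
```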
makeNWordGramsFilterFile(string $lang, string $num_gram, integer $num_ngrams_found, integer $max_gram_len = 2)
Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.
The name of the output filter file is based on $lang and the number n, and its size is based on the input number of n word grams. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.
string | $lang | locale to be used to stem n grams. |
string | $num_gram | value of n in n-gram (how many words in sequence should constitute a gram) |
integer | $num_ngrams_found | count of n word grams in text file. |
integer | $max_gram_len | value of n for the longest n gram to be added |
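A hedged sketch of how this call might look, again assuming a static `NWordGrams` class as in Yioop; the count value below is illustrative and would normally come from an earlier text-file-generation step:

```php
<?php
// Hypothetical sketch: build a Bloom filter file of 2 word grams for
// English. Assumes the n word grams text file for "en-US" was produced
// earlier and that $num_ngrams_found is the count that step reported.
use seekquarry\yioop\library\NWordGrams;

$num_ngrams_found = 100000; // illustrative count of grams in the text file
NWordGrams::makeNWordGramsFilterFile("en-US", 2, $num_ngrams_found, 2);
```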
makeSegmentFilterFile(string $dict_file, string $lang)
Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by @see token_tool.php
string | $dict_file | file to use as a dictionary to make filter from |
string | $lang | locale tag of locale we are building the filter for |
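A usage sketch under the same assumptions (static `NWordGrams` class); the dictionary path is purely illustrative:

```php
<?php
// Hypothetical sketch: build a word segmentation filter for Chinese
// from a dictionary file. The dictionary path below is illustrative,
// not an actual path shipped with the library.
use seekquarry\yioop\library\NWordGrams;

NWordGrams::makeSegmentFilterFile("locale/zh_CN/resources/dict.txt", "zh-CN");
```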
makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale, integer $num_gram = 2, integer $ngram_type = self::PAGE_COUNT_WIKIPEDIA, integer $max_terms = -1) : integer
Generates an n word grams text file from an input Wikipedia XML file.
The input file can be bz2 compressed or uncompressed. The XML file is parsed line by line, searching for n word gram patterns; each n word gram found is added to an array. After the complete file has been parsed, duplicate n word grams are removed and the rest sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.
string | $wiki_file | compressed or uncompressed Wikipedia XML file path used to extract n word grams. This can also be a folder containing such files |
string | $lang | Language to be used to create n grams. |
string | $locale | Locale to be used to store results. |
integer | $num_gram | number of words in grams we are looking for |
integer | $ngram_type | where in Wiki Dump to extract grams from |
integer | $max_terms | maximum number of n-grams to compute and put in file |
Returns $num_ngrams_found, the count of n word grams stored in the text file.
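Putting the pieces together, an end-to-end sketch under the same assumptions (static `NWordGrams` class with a `PAGE_COUNT_WIKIPEDIA` class constant, per the signature above); the dump filename is illustrative:

```php
<?php
// Hypothetical end-to-end sketch: extract 2 word grams from a
// Wikipedia dump into a text file, then build the corresponding
// Bloom filter file from the count returned.
use seekquarry\yioop\library\NWordGrams;

$num_found = NWordGrams::makeNWordGramsTextFile(
    "enwiki-latest-all-titles.bz2", // illustrative dump file path
    "en", "en-US", 2,
    NWordGrams::PAGE_COUNT_WIKIPEDIA);
NWordGrams::makeNWordGramsFilterFile("en-US", 2, $num_found, 2);
```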