\seekquarry\yioop\library\NWordGrams

Library of functions used to create and extract n word grams

Summary

Public methods: ngramsContains(), makeNWordGramsFilterFile(), makeSegmentFilterFile(), makeNWordGramsTextFile()
Public properties: none
Constants: BLOCK_SIZE, FILTER_SUFFIX, TEXT_SUFFIX, AUX_SUFFIX, WIKI_DUMP_REDIRECT, WIKI_DUMP_TITLE, PAGE_COUNT_WIKIPEDIA, PAGE_COUNT_WIKTIONARY
Protected methods: none
Protected properties: $ngrams
Private methods: none
Private properties: none

Constants

BLOCK_SIZE

Number of bytes to read at a time from the wiki file when creating the filter.

FILTER_SUFFIX

Suffix appended to language tag to create the filter file name containing bigrams.

TEXT_SUFFIX

Suffix appended to language tag to create the text file name containing bigrams.

AUX_SUFFIX

Suffix of the auxiliary file of n-grams to add to the filter.

WIKI_DUMP_REDIRECT

WIKI_DUMP_TITLE

PAGE_COUNT_WIKIPEDIA

PAGE_COUNT_WIKTIONARY

Properties

$ngrams

$ngrams : object

Static copy of n-grams files

Type

object

Methods

ngramsContains()

ngramsContains($phrase, string $lang, string $filter_prefix = 2) : boolean

Says whether or not the phrase exists in the n word gram Bloom filter.

Parameters

$phrase

phrase to look up in the filter, for example to check whether it is a bigram

string $lang

language of bigrams file

string $filter_prefix

either the word "segment", the word "all", or the number n of words per n-gram in the filter.

Returns

boolean

true if the filter reports the phrase as present, false otherwise
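To illustrate the kind of membership test ngramsContains() performs, here is a minimal, self-contained sketch of a Bloom filter. The class name SketchBloomFilter, its default sizes, and the md5-based hashing are assumptions made for this example; they are not Yioop's filter implementation or its file format.

```php
<?php
// Illustrative only: a tiny in-memory Bloom filter, not Yioop's own class.
final class SketchBloomFilter
{
    private int $numBits;     // number of bit positions in the filter
    private int $numHashes;   // hash functions applied to each item
    private string $bits;     // packed bit array, 8 positions per byte

    public function __construct(int $numBits = 8192, int $numHashes = 3)
    {
        $this->numBits = $numBits;
        $this->numHashes = $numHashes;
        $this->bits = str_repeat("\0", intdiv($numBits + 7, 8));
    }

    // Derive one bit position per hash function by salting md5 with an index.
    private function positions(string $item): array
    {
        $positions = [];
        for ($i = 0; $i < $this->numHashes; $i++) {
            // 7 hex digits = 28 bits, so the value fits in any PHP int
            $hash = hexdec(substr(md5($i . ":" . $item), 0, 7));
            $positions[] = $hash % $this->numBits;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $pos) {
            $byte = intdiv($pos, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }

    // False means "definitely absent"; true means "probably present".
    public function contains(string $item): bool
    {
        foreach ($this->positions($item) as $pos) {
            if ((ord($this->bits[intdiv($pos, 8)]) & (1 << ($pos % 8))) === 0) {
                return false;
            }
        }
        return true;
    }
}

$sketchFilter = new SketchBloomFilter();
$sketchFilter->add("new york");
var_dump($sketchFilter->contains("new york")); // bool(true)
```

A Bloom filter never gives a false negative for an item that was added, but it can occasionally report an item that was never added, so a true result should be read as "probably an n word gram".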

makeNWordGramsFilterFile()

makeNWordGramsFilterFile(string $lang, string $num_gram, integer $num_ngrams_found, integer $max_gram_len = 2)

Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file is determined by the input $lang.

The name of the output filter file is based on $lang and the number n. Its size is based on the input number of n word grams. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.

Parameters

string $lang

locale to be used to stem n grams.

string $num_gram

value of n in n-gram (how many words in sequence should constitute a gram)

integer $num_ngrams_found

count of n word grams in text file.

integer $max_gram_len

value n of longest n gram to be added.

Returns

none
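The description above says the filter's size is based on the number of n word grams, but not how. As a rough illustration only, the textbook Bloom filter sizing formulas can turn an item count into a bit count and a hash count. The function name sketchFilterSize and the target false positive rate parameter are assumptions for this sketch; Yioop may size its filter differently.

```php
<?php
// Hypothetical helper, not part of NWordGrams: textbook Bloom filter sizing.
function sketchFilterSize(int $numItems, float $falsePositiveRate = 0.01): array
{
    if ($numItems < 1 || $falsePositiveRate <= 0.0 || $falsePositiveRate >= 1.0) {
        throw new InvalidArgumentException("need numItems >= 1 and 0 < rate < 1");
    }
    // bits: m = -n * ln(p) / (ln 2)^2
    $bits = (int) ceil(-$numItems * log($falsePositiveRate) / (M_LN2 * M_LN2));
    // hash functions: k = (m / n) * ln 2, and at least one
    $hashes = max(1, (int) round($bits / $numItems * M_LN2));
    return ["bits" => $bits, "hashes" => $hashes];
}

// About 9.6 bits per item and 7 hash functions for a 1% target rate.
print_r(sketchFilterSize(1000));
```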

makeSegmentFilterFile()

makeSegmentFilterFile(string $dict_file, string $lang)

Creates a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php.

Parameters

string $dict_file

file to use as a dictionary to make filter from

string $lang

locale tag of locale we are building the filter for
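To show how a dictionary-backed lookup can drive word segmentation, here is a minimal sketch using greedy forward longest matching. The function sketchSegment, the plain PHP array standing in for the filter, and the $maxWordLen limit are all assumptions for this example; Yioop's own segmenter may use a different strategy. The sketch works on bytes, so it assumes ASCII input.

```php
<?php
// Hypothetical helper: greedy longest-match segmentation against a word set.
// $dictionary maps each known word to any non-null value (used as a set).
function sketchSegment(string $text, array $dictionary, int $maxWordLen = 20): string
{
    $words = [];
    $pos = 0;
    $len = strlen($text);
    while ($pos < $len) {
        // Fall back to a single character if no dictionary word matches here.
        $matched = substr($text, $pos, 1);
        // Try the longest candidate first, then shrink.
        for ($l = min($maxWordLen, $len - $pos); $l >= 1; $l--) {
            $candidate = substr($text, $pos, $l);
            if (isset($dictionary[$candidate])) {
                $matched = $candidate;
                break;
            }
        }
        $words[] = $matched;
        $pos += strlen($matched);
    }
    return implode(" ", $words);
}

$sketchDictionary = array_flip(["this", "contains", "no", "spaces"]);
echo sketchSegment("thiscontainsnospaces", $sketchDictionary), "\n";
// prints "this contains no spaces"
```

Greedy matching is simple but can pick a long word that blocks a better split later in the text; a dynamic programming pass over all split points is a common alternative when that matters.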

makeNWordGramsTextFile()

makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale, integer $num_gram = 2, integer $ngram_type = self::PAGE_COUNT_WIKIPEDIA, integer $max_terms = -1) : integer

Generates an n word grams text file from an input Wikipedia XML file.

The input file can be bz2 compressed or uncompressed. The input XML file is parsed line by line and searched for the n word gram pattern. Each n word gram found is added to an array. After the complete file is parsed, duplicate n word grams are removed and the array is sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.

Parameters

string $wiki_file

path of the compressed or uncompressed Wikipedia XML file from which to extract n word grams. This can also be a folder containing such files.

string $lang

Language to be used to create n grams.

string $locale

Locale to be used to store results.

integer $num_gram

number of words in grams we are looking for

integer $ngram_type

where in the wiki dump to extract grams from (for example, self::PAGE_COUNT_WIKIPEDIA)

integer $max_terms

maximum number of n-grams to compute and put in file

Returns

integer —

$num_ngrams_found, the count of n-grams in the text file.
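The parse, deduplicate, and sort steps described for this method can be sketched as follows. This is a simplified illustration that works on an in-memory list of lines; the function name sketchExtractNGrams, the choice of <title> elements as the pattern, and the lowercasing step are assumptions for the example, not Yioop's actual extraction rules.

```php
<?php
// Hypothetical helper: collect n word grams from <title> lines of a wiki dump.
function sketchExtractNGrams(iterable $lines, int $numGram = 2): array
{
    $grams = [];
    foreach ($lines as $line) {
        // Assumed pattern: one <title>...</title> element per line.
        if (!preg_match('/<title>([^<]+)<\/title>/', $line, $match)) {
            continue;
        }
        $words = preg_split('/\s+/', strtolower(trim($match[1])), -1,
            PREG_SPLIT_NO_EMPTY);
        // Keep only titles made of exactly $numGram words.
        if (count($words) === $numGram) {
            $grams[] = implode(" ", $words);
        }
    }
    $grams = array_unique($grams);  // remove duplicates
    sort($grams);                   // sort and reindex from 0
    return $grams;
}

$sketchLines = [
    "<title>New York</title>",
    "<title>Paris</title>",
    "<title>new york</title>",
    "<title>Bloom filter</title>",
];
$sketchGrams = sketchExtractNGrams($sketchLines, 2);
echo implode("\n", $sketchGrams), "\n"; // one gram per line, as in a text file
echo count($sketchGrams), "\n";         // the count a caller would get back
```

With the sample lines above, the result is the two bigrams "bloom filter" and "new york"; "Paris" is skipped because it has one word, and the repeated title collapses to a single entry.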