BLOCK_SIZE
How many bytes to read in one go from a wiki file when creating a filter
Library of functions used to create and extract n word grams
ngramsContains($phrase, string $lang, string $filter_prefix = 2) : boolean
Says whether or not the phrase exists in the N word gram Bloom Filter
$phrase | phrase to check for in the n word gram filter |
string | $lang | language of the n word grams file |
string | $filter_prefix | either the word "segment", "all", or the number n of words in an n gram in the filter |
Returns true if the phrase is in the filter, false otherwise
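A minimal usage sketch, assuming (as in Yioop's library) that this method is exposed statically on a `NWordGrams` class in the `seekquarry\yioop\library` namespace and that a filter file for the given language has already been built:

```php
<?php
// Hypothetical usage sketch: check whether a phrase appears in the
// n word gram Bloom filter for English. Assumes a 2 word gram filter
// file for "en-US" already exists on disk.
use seekquarry\yioop\library\NWordGrams;

$phrase = "new york";
// The third argument selects which filter file to consult: the word
// "segment", "all", or the number n of words per gram.
if (NWordGrams::ngramsContains($phrase, "en-US", 2)) {
    echo "'$phrase' is treated as a single 2 word gram\n";
}
```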
makeNWordGramsFilterFile(string $lang, string $num_gram, integer $num_ngrams_found, integer $max_gram_len = 2)
Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.
The name of the output filter file is based on $lang and the number n, and its size is based on the input number of n word grams. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.
string | $lang | locale to be used to stem n grams. |
string | $num_gram | value of n in n-gram (how many words in sequence should constitute a gram) |
integer | $num_ngrams_found | count of n word grams in text file. |
integer | $max_gram_len | value of n for the longest n gram to be added |
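A hedged sketch of how this call might look, again assuming a static `NWordGrams` class as in Yioop; the count value below is illustrative and would normally come from an earlier text-file-generation step:

```php
<?php
// Hypothetical sketch: build a Bloom filter file of 2 word grams for
// English. Assumes the n word grams text file for "en-US" was produced
// earlier and that $num_ngrams_found is the count that step reported.
use seekquarry\yioop\library\NWordGrams;

$num_ngrams_found = 100000; // illustrative count of grams in the text file
NWordGrams::makeNWordGramsFilterFile("en-US", 2, $num_ngrams_found, 2);
```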
makeSegmentFilterFile(string $dict_file, string $lang)
Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by @see token_tool.php
string | $dict_file | file to use as a dictionary to make filter from |
string | $lang | locale tag of locale we are building the filter for |
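A usage sketch under the same assumptions (static `NWordGrams` class); the dictionary path is purely illustrative:

```php
<?php
// Hypothetical sketch: build a word segmentation filter for Chinese
// from a dictionary file. The dictionary path below is illustrative,
// not an actual path shipped with the library.
use seekquarry\yioop\library\NWordGrams;

NWordGrams::makeSegmentFilterFile("locale/zh_CN/resources/dict.txt", "zh-CN");
```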
makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale, integer $num_gram = 2, integer $ngram_type = self::PAGE_COUNT_WIKIPEDIA, integer $max_terms = -1) : integer
Generates an n word grams text file from an input Wikipedia XML file.
The input file can be bz2 compressed or uncompressed. The XML file is parsed line by line, searching for n word gram patterns; each n word gram found is added to an array. After the complete file has been parsed, duplicate n word grams are removed and the rest sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.
string | $wiki_file | compressed or uncompressed Wikipedia XML file path used to extract n word grams. This can also be a folder containing such files |
string | $lang | Language to be used to create n grams. |
string | $locale | Locale to be used to store results. |
integer | $num_gram | number of words in grams we are looking for |
integer | $ngram_type | where in Wiki Dump to extract grams from |
integer | $max_terms | maximum number of n-grams to compute and put in file |
Returns $num_ngrams_found, the count of n word grams stored in the text file.
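Putting the pieces together, an end-to-end sketch under the same assumptions (static `NWordGrams` class with a `PAGE_COUNT_WIKIPEDIA` class constant, per the signature above); the dump filename is illustrative:

```php
<?php
// Hypothetical end-to-end sketch: extract 2 word grams from a
// Wikipedia dump into a text file, then build the corresponding
// Bloom filter file from the count returned.
use seekquarry\yioop\library\NWordGrams;

$num_found = NWordGrams::makeNWordGramsTextFile(
    "enwiki-latest-all-titles.bz2", // illustrative dump file path
    "en", "en-US", 2,
    NWordGrams::PAGE_COUNT_WIKIPEDIA);
NWordGrams::makeNWordGramsFilterFile("en-US", 2, $num_found, 2);
```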