\seekquarry\yioop\locale\fa\resources\Tokenizer

Persian-specific tokenization code. In particular, it has a stemmer. The stemmer is a modified variant (handling prefixes slightly differently) of my attempt at porting Nick Patch's Perl port, https://metacpan.org/pod/Lingua::Stem::UniNE::FA, of the stemming algorithm by Ljiljana Dolamic and Jacques Savoy of the University of Neuchâtel. The Java version of this is at http://members.unine.ch/jacques.savoy/clef/persianStemmerUnicode.txt (beware of Java's handling of Unicode).

Here, given a word, its stem is the part of the word that is common to all of its inflected variants. For example, "tall" is common to "tall", "taller", and "tallest". A stemmer takes a word and tries to produce its stem.

Summary

Public methods: segment(), stopwordsRemover(), stem()

Public properties: $no_stem_list, $stop_words

Constants: No constants found

Protected methods: No protected methods found

Protected properties: No protected properties found

Private methods: simplifyPrefix(), removeKasra(), removeSuffix(), normalize()

Private properties: No private properties found

File: src/locale/fa/resources/Tokenizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\locale\fa\resources\Tokenizer

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words :

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter, given the input "thisisabunchofwords", would output "this is a bunch of words".

Parameters

string

$pre_segment

the string before segmentation

Returns

string —

a string with the words separated by spaces; in this case the stub does nothing
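To illustrate what a real segmenter might do with the example input above, here is a minimal sketch of greedy longest-match segmentation against a word list. The function name `greedySegment` and the `$dictionary` parameter are hypothetical and not part of this class, and the sketch works on bytes, so it only suits ASCII input such as the English example.

```php
<?php
// Hypothetical helper, not part of the Tokenizer class.
// Greedy longest-match segmentation over an ASCII string.
function greedySegment(string $pre_segment, array $dictionary): string
{
    if (empty($dictionary)) {
        return $pre_segment; // nothing to match against
    }
    $lookup = array_flip($dictionary);
    // The longest dictionary entry bounds how far ahead we look.
    $max_len = max(array_map('strlen', $dictionary));
    $length = strlen($pre_segment);
    $words = [];
    $pos = 0;
    while ($pos < $length) {
        // Fall back to a single byte if no longer entry matches.
        $match = substr($pre_segment, $pos, 1);
        for ($len = min($max_len, $length - $pos); $len > 1; $len--) {
            $candidate = substr($pre_segment, $pos, $len);
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $pos += strlen($match);
    }
    return implode(' ', $words);
}

$dictionary = ['this', 'is', 'a', 'bunch', 'of', 'words'];
echo greedySegment('thisisabunchofwords', $dictionary), "\n";
// prints "this is a bunch of words"
```

Greedy matching can pick the wrong split when one dictionary word is a prefix of another, so a production segmenter would likely need something more careful.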

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed

$data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
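The following is a rough sketch of stop word removal that accepts either a string or an array of strings, as described above. The function name `removeStopWordsSketch` and the explicit `$stop_words` argument are assumptions made for illustration; the real method presumably reads the class's `$stop_words` property, and its matching strategy is not documented on this page. This version splits on whitespace, so runs of whitespace collapse to single spaces.

```php
<?php
// Sketch only: the name and the $stop_words parameter are hypothetical.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filtering to each entry, keeping the keys.
        $cleaned = [];
        foreach ($data as $key => $text) {
            $cleaned[$key] = removeStopWordsSketch($text, $stop_words);
        }
        return $cleaned;
    }
    $lookup = array_flip($stop_words);
    // Split on Unicode-aware whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/u', (string)$data, -1,
        PREG_SPLIT_NO_EMPTY) ?: [];
    $kept = [];
    foreach ($tokens as $token) {
        if (!isset($lookup[$token])) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

// A tiny illustrative stop list ("and", "from"), not the class's list.
$stop_words = ['و', 'از'];
echo removeStopWordsSketch('کتاب و قلم', $stop_words), "\n";
```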

stem()

stem(string  $word) : string

Computes the stem of a Persian word

Parameters

string

$word

the string to stem

Returns

string —

the stem of $word
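As a hedged sketch of how such a stemmer could be organized: words on a no-stem list are returned untouched, and every other word is passed through a sequence of rewriting steps. The function name `stemSketch`, its parameters, and the order of the steps are assumptions; this page does not document the order in which stem() calls the private helpers.

```php
<?php
// Sketch only: stemSketch and its parameters are hypothetical.
function stemSketch(string $word, array $no_stem_list, array $steps): string
{
    // Words we don't want to be stemmed are returned as-is.
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    // Apply each rewriting step in turn.
    foreach ($steps as $step) {
        $word = $step($word);
    }
    return $word;
}

// Two illustrative steps; the order here is an assumption.
$steps = [
    // Leading alef with madda (U+0622) becomes plain alef (U+0627).
    function (string $w): string {
        return preg_replace('/\A\x{0622}/u', "\u{0627}", $w) ?? $w;
    },
    // Trailing kasra (U+0650) is dropped.
    function (string $w): string {
        return preg_replace('/\x{0650}\z/u', '', $w) ?? $w;
    },
];

$word = "کتاب" . "\u{0650}"; // a word ending in a kasra
echo stemSketch($word, [], $steps), "\n";
```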

simplifyPrefix()

simplifyPrefix(string  $word) : string

Simplifies prefixes beginning with آ to ا

Parameters

string

$word

word whose prefix is to be simplified

Returns

string —

result of the simplification
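A minimal sketch of this kind of rewrite, assuming it amounts to replacing a word-initial آ (U+0622, alef with madda above) with ا (U+0627, plain alef). The function name `simplifyPrefixSketch` is hypothetical, and the real method may cover more cases.

```php
<?php
// Sketch only: simplifyPrefixSketch is a hypothetical name.
function simplifyPrefixSketch(string $word): string
{
    // \A anchors at the very start; /u treats the pattern and subject
    // as UTF-8. preg_replace returns null on invalid UTF-8, so fall
    // back to the original word in that case.
    return preg_replace('/\A\x{0622}/u', "\u{0627}", $word) ?? $word;
}

echo simplifyPrefixSketch("\u{0622}\u{0628}"), "\n"; // آب becomes اب
```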

removeKasra()

removeKasra(string  $word) : string

Removes a Kasra diacritic mark if it appears at the end of a word.

Parameters

string

$word

word to remove mark from

Returns

string —

result of removal
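The idea can be sketched as follows, assuming the Kasra in question is the Arabic-script combining mark U+0650. The function name `removeKasraSketch` is hypothetical.

```php
<?php
// Sketch only: removeKasraSketch is a hypothetical name.
function removeKasraSketch(string $word): string
{
    // \z matches only at the very end of the string, so a kasra in
    // the middle of the word is left alone.
    return preg_replace('/\x{0650}\z/u', '', $word) ?? $word;
}

$with_kasra = "کتاب" . "\u{0650}";
var_dump(removeKasraSketch($with_kasra) === "کتاب");
```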

removeSuffix()

removeSuffix(string  $word) : string

Removes common Persian suffixes

Parameters

string

$word

word to remove suffixes from

Returns

string —

result of suffix removal
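This page does not list which suffixes are removed, so the following is only a sketch of the general technique: strip the first matching suffix from a caller-supplied list, provided enough characters remain for a plausible stem. The function name `removeSuffixSketch`, the `$suffixes` list, and the `$min_stem` threshold are all assumptions for illustration.

```php
<?php
// Sketch only: the name, the suffix list, and $min_stem are hypothetical.
function removeSuffixSketch(string $word, array $suffixes,
    int $min_stem = 3): string
{
    foreach ($suffixes as $suffix) {
        // Require at least $min_stem characters before the suffix.
        $pattern = '/\A(.{' . $min_stem . ',})' .
            preg_quote($suffix, '/') . '\z/us';
        if (preg_match($pattern, $word, $matches)) {
            return $matches[1];
        }
    }
    return $word;
}

// Illustrative list only; list longer suffixes first so they win.
$suffixes = ['ها', 'ان', 'ات'];
echo removeSuffixSketch('کتابها', $suffixes), "\n"; // plural of "book"
```

The minimum stem length guards against stripping a suffix-like ending from a very short word.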

normalize()

normalize(string  $word) : string

Performs additional end-of-word stripping

Parameters

string

$word

word to remove suffixes from

Returns

string —

result of suffix removal