$no_stem_list
$no_stem_list : array
Words that should not be stemmed.
Persian-specific tokenization code. In particular, it has a stemmer. The stemmer is a modified variant (it handles prefixes slightly differently) of my stab at porting Nick Patch's Perl port, https://metacpan.org/pod/Lingua::Stem::UniNE::FA, of the stemming algorithm by Ljiljana Dolamic and Jacques Savoy of the University of Neuchâtel. The Java version of this algorithm is at http://members.unine.ch/jacques.savoy/clef/persianStemmerUnicode.txt (beware of Java's handling of Unicode).
Here, given a word, its stem is the part of the word that is common to all of its inflected variants. For example, "tall" is common to "tall", "taller", and "tallest". A stemmer takes a word and tries to produce its stem.
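To make the idea of a stem concrete, here is a minimal sketch of suffix stripping on the English example, with a protected-word list playing the role of $no_stem_list. It is not the Persian algorithm used by this class: the function name toyStem, the suffix list, and the sample words are all assumptions chosen for illustration.

```php
<?php
// Hypothetical illustration only: NOT the Persian stemmer documented here.
// Words that must be returned unchanged, playing the role of $no_stem_list.
$no_stem_list = ["water"];

/**
 * Toy English suffix stripper (hypothetical helper, not part of this class).
 */
function toyStem(string $word, array $no_stem_list): string
{
    if (in_array($word, $no_stem_list, true)) {
        return $word; // protected word: skip stemming entirely
    }
    // Try the longer suffix first.
    foreach (["est", "er"] as $suffix) {
        $stem_length = strlen($word) - strlen($suffix);
        // Keep at least three characters so very short words survive.
        // strlen/substr count bytes, which is fine for these ASCII samples;
        // Persian text would need the mb_* string functions instead.
        if ($stem_length >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stem_length);
        }
    }
    return $word;
}

foreach (["tall", "taller", "tallest", "water"] as $word) {
    echo $word, " -> ", toyStem($word, $no_stem_list), "\n";
}
```

In this sketch "taller" and "tallest" both reduce to "tall", while "water" is left alone only because it is on the protected list; without that entry the toy rule would wrongly cut it to "wat".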
segment(string $pre_segment) : string
Stub function which could be used for a word segmenter.
Such a segmenter, on input "thisisabunchofwords", would output "this is a bunch of words".
string | $pre_segment | the string before segmentation |
Should return the string with words separated by spaces; in this case it does nothing.
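For comparison, a non-stub segmenter could be sketched as a greedy longest-match lookup against a word list. This is only one possible approach under stated assumptions, not what this class does: the function toySegment, its $dictionary parameter, and the sample word list are hypothetical, and the code works on bytes, so Persian input would need the mb_* string functions.

```php
<?php
/**
 * Hypothetical greedy longest-match segmenter (not part of this class).
 * At each position it takes the longest dictionary word that fits and
 * falls back to a single character when nothing matches.
 */
function toySegment(string $pre_segment, array $dictionary): string
{
    $max_length = 0;
    foreach ($dictionary as $entry) {
        $max_length = max($max_length, strlen($entry));
    }
    $lookup = array_flip($dictionary); // word => index, for fast isset() checks
    $words = [];
    $position = 0;
    $total = strlen($pre_segment);
    while ($position < $total) {
        $match = $pre_segment[$position]; // fallback: one character
        $longest = min($max_length, $total - $position);
        for ($length = $longest; $length > 1; $length--) {
            $candidate = substr($pre_segment, $position, $length);
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $position += strlen($match);
    }
    return implode(" ", $words);
}

// Assumed sample dictionary covering the example from the description.
$dictionary = ["this", "is", "a", "bunch", "of", "words"];
echo toySegment("thisisabunchofwords", $dictionary), "\n";
// prints "this is a bunch of words"
```

Greedy matching is simple but can pick the wrong split when a long word hides a better sequence of shorter ones, so a real segmenter would likely need a scoring step on top of it.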