$no_stem_list
$no_stem_list : array
Words we don't want to be stemmed
German specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters in a char gram
This class has a collection of methods for German locale specific tokenization. In particular, it has a stemmer, a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org/algorithms/german/stemmer.html Here given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
segment(string $pre_segment) : string
Stub function which could be used for a word segmenter.
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
string | $pre_segment | before segmentation |
should return string with words separated by space in this case does nothing
markRegions()
Computes locations of rv - RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. , r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel and R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.