$no_stem_list
$no_stem_list : array
Words we don't want to be stemmed
Spanish specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters in a char gram
This class has a collection of methods for Spanish locale specific tokenization. In particular, it has a stemmer, a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org Here given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
segment(string $pre_segment) : string
Stub function which could be used for a word segmenter.
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
string | $pre_segment | before segmentation |
should return string with words separated by space in this case does nothing
computeRegions()
This computes the three regions of the word rv, r1, and r2 used in the rest of the stemmer $rv is defined as follows: If the second letter is a consonant, $rv is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter. But RV is the end of the word if these positions cannot be found.
$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel