\seekquarry\yioop\locale\ru\resources\Tokenizer

This class has a collection of methods for Russian locale specific tokenization. In particular, it has a stemmer and a stop word remover (used mainly in word cloud creation). The stemmer is a modification (with bug fixes) of Dennis Kreminsky's stemmer from http://snowball.tartarus.org/otherlangs/russian_php5.txt. Given a word, its stem is the part of the word that is common to all of its inflected variants. For example, "tall" is common to "tall", "taller", and "tallest". A stemmer takes a word and tries to produce its stem.

Summary

Public
Methods: segment(), stopwordsRemover(), stem()
Properties: $no_stem_list, $stop_words
Constants: CHAR_LENGTH

Protected
Methods: No protected methods found
Properties: No protected properties found
Constants: N/A

Private
Methods: rv(), step1(), step2(), step3(), step4()
Properties: No private properties found
Constants: N/A

Constants

CHAR_LENGTH

CHAR_LENGTH

Number of bytes in one Russian Unicode character.
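To illustrate why a byte-length constant is useful, here is a minimal sketch. It assumes the text is UTF-8 encoded, where each letter of the basic Russian alphabet occupies two bytes. The constant name ASSUMED_CHAR_LENGTH and its value are illustrative assumptions, not taken from the class.

```php
<?php
// Assumption: this file is saved as UTF-8 and the mbstring extension is loaded.
// ASSUMED_CHAR_LENGTH is a hypothetical stand-in for the class constant.
const ASSUMED_CHAR_LENGTH = 2;

$letter = 'и';
echo strlen($letter), "\n";             // length in bytes
echo mb_strlen($letter, 'UTF-8'), "\n"; // length in characters

// With a fixed byte width, a purely Cyrillic word can be split into
// letters using byte-oriented functions alone.
$letters = str_split('слово', ASSUMED_CHAR_LENGTH);
echo implode(' ', $letters), "\n";
```

The fixed-width approach only holds for characters that really are two bytes long; mixed Latin and Cyrillic input would need the mb_* functions instead.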

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter, given the input "thisisabunchofwords", would output "this is a bunch of words".

Parameters

string $pre_segment

the string before segmentation

Returns

string —

should return the string with its words separated by spaces; in this case, it does nothing
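For readers wondering what a non-stub segmenter might look like, the following is a minimal sketch of greedy longest-match segmentation against a word list. The function segmentSketch() and its dictionary parameter are hypothetical and are not part of this class, which leaves its input untouched.

```php
<?php
// Hypothetical greedy longest-match segmenter; not part of the Tokenizer class.
function segmentSketch(string $text, array $dictionary): string
{
    $known = array_flip($dictionary);
    $max_len = 1;
    foreach ($dictionary as $entry) {
        $max_len = max($max_len, mb_strlen($entry, 'UTF-8'));
    }
    $words = [];
    $length = mb_strlen($text, 'UTF-8');
    $pos = 0;
    while ($pos < $length) {
        // Fall back to a single character when no dictionary word matches.
        $piece = mb_substr($text, $pos, 1, 'UTF-8');
        // Try the longest candidate first, then shorter ones.
        for ($len = min($max_len, $length - $pos); $len > 1; $len--) {
            $candidate = mb_substr($text, $pos, $len, 'UTF-8');
            if (isset($known[$candidate])) {
                $piece = $candidate;
                break;
            }
        }
        $words[] = $piece;
        $pos += mb_strlen($piece, 'UTF-8');
    }
    return implode(' ', $words);
}

echo segmentSketch('thisisabunchofwords',
    ['this', 'is', 'a', 'bunch', 'of', 'words']), "\n";
```

Greedy matching is simple but can pick a wrong split when one dictionary word is a prefix of another; it is shown here only to make the idea concrete.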

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
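As an illustration of the string-or-array behaviour described above, here is a minimal sketch of stop word removal. The function removeStopWordsSketch() and the three-word stop list in the usage lines are assumptions for the example; they are not the class's method or its $stop_words list, and the sketch matches case-sensitively.

```php
<?php
// Hypothetical helper; the real method and its stop word list may differ.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        $out = [];
        foreach ($data as $key => $item) {
            $out[$key] = removeStopWordsSketch($item, $stop_words);
        }
        return $out;
    }
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    // Match a stop word only when it is not glued to other letters or digits.
    $pattern = '/(?<![\p{L}\p{N}])(?:' . implode('|', $quoted) .
        ')(?![\p{L}\p{N}])/u';
    $without = preg_replace($pattern, ' ', $data);
    // Collapse the whitespace left behind by removed words.
    return trim(preg_replace('/\s+/u', ' ', $without));
}

$stops = ['и', 'в', 'на'];
echo removeStopWordsSketch('кошка и собака в доме', $stops), "\n";
print_r(removeStopWordsSketch(['я и ты', 'на столе'], $stops));
```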

stem()

stem(string  $word) : string

Computes the stem of a Russian word

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

rv()

rv(string  $word) : array

Compute the RV region of a word. RV is the region after the first vowel, or the end of the word if it contains no vowel.

Parameters

string $word

word to compute rv regions for

Returns

array —

pair string before rv, string after rv
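The RV computation can be sketched as follows. This is a minimal illustration that assumes the returned pair is (the part of the word up to and including the first vowel, the RV region after it) and that the vowel set is а, е, и, о, у, ы, э, ю, я. The function name rvSketch() is hypothetical.

```php
<?php
// Hypothetical sketch of splitting a word at its RV region.
// Assumption: the pair is [prefix through the first vowel, RV region].
function rvSketch(string $word): array
{
    // Lazily capture everything up to the first vowel, then the rest.
    if (preg_match('/^(.*?[аеиоуыэюя])(.*)$/su', $word, $matches)) {
        return [$matches[1], $matches[2]];
    }
    // No vowel: RV is the empty region at the end of the word.
    return [$word, ''];
}

print_r(rvSketch('слово'));
```

Splitting the word into two parts lets the later steps edit only the RV part and then glue the untouched prefix back on.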

step1()

step1(string  $word) : string

Search for a PERFECTIVE GERUND ending. If one is found, remove it, and that is then the end of step 1. Otherwise, try to remove a REFLEXIVE ending, and then search in turn for (1) an ADJECTIVAL, (2) a VERB, or (3) a NOUN ending.

As soon as one of the endings (1) to (3) is found remove it, and terminate step 1.

Parameters

string $word

word to stem

Returns

string —

$word after step
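The control flow of this step can be sketched as below. The ending lists are tiny illustrative subsets chosen for the example, not the lists the class uses, and the sketch ignores the refinements of the Snowball algorithm (such as restricting tests to the RV region and endings that require a preceding а or я). Both function names are hypothetical.

```php
<?php
// Hypothetical helper: strip the longest listed ending, report success.
function step1TryStrip(string $word, array $endings, &$stripped): string
{
    // Anchoring at the end makes the leftmost match the longest ending.
    $pattern = '/(?:' . implode('|', $endings) . ')\z/u';
    $result = preg_replace($pattern, '', $word, 1, $count);
    $stripped = $count > 0;
    return $result;
}

// Hypothetical sketch of the step 1 decision order.
function step1Sketch(string $word): string
{
    // Illustrative subsets only; the real ending classes are much larger.
    $gerund = ['ив', 'ивши', 'ившись', 'ыв', 'ывши', 'ывшись'];
    $reflexive = ['ся', 'сь'];
    $groups = [
        ['ая', 'ый', 'ой', 'ые', 'ого'],   // (1) adjectival
        ['ила', 'или', 'ил', 'ить', 'ит'], // (2) verb
        ['ами', 'ов', 'а', 'ы', 'я'],      // (3) noun
    ];
    $word = step1TryStrip($word, $gerund, $found);
    if ($found) {
        return $word; // a perfective gerund ends step 1 immediately
    }
    $word = step1TryStrip($word, $reflexive, $found);
    foreach ($groups as $endings) {
        $word = step1TryStrip($word, $endings, $found);
        if ($found) {
            break; // stop at the first class that matched
        }
    }
    return $word;
}

echo step1Sketch('учился'), "\n";
```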

step2()

step2(string  $word) : string

If the word ends with и (i), remove it.

Parameters

string $word

word to stem

Returns

string —

$word after step
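A minimal sketch of this rule, using a hypothetical function name and a UTF-8 aware regular expression:

```php
<?php
// Hypothetical sketch of step 2: drop one trailing и, if present.
function step2Sketch(string $word): string
{
    // The u modifier makes the pattern treat the input as UTF-8.
    return preg_replace('/и\z/u', '', $word);
}

echo step2Sketch('книги'), "\n";
```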

step3()

step3(string  $word) : string

Search for a DERIVATIONAL ending in R2 (i.e. the entire ending must lie in R2), and if one is found, remove it.

Parameters

string $word

word to stem

Returns

string —

$word after step
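The R2 condition can be sketched as follows. The sketch assumes the Snowball definitions (R1 is the region after the first non-vowel that follows a vowel, and R2 is the same rule applied inside R1) and assumes the derivational endings are ость and ост. The function names are hypothetical.

```php
<?php
// Hypothetical helper: region after the first non-vowel following a vowel.
function step3RegionAfter(string $text): string
{
    if (preg_match('/[аеиоуыэюя][^аеиоуыэюя](.*)$/su', $text, $matches)) {
        return $matches[1];
    }
    return '';
}

// Hypothetical sketch of step 3: remove a derivational ending lying in R2.
function step3Sketch(string $word): string
{
    $r2 = step3RegionAfter(step3RegionAfter($word));
    $r2_len = mb_strlen($r2, 'UTF-8');
    foreach (['ость', 'ост'] as $ending) { // longest ending first
        $len = mb_strlen($ending, 'UTF-8');
        $tail = mb_substr($word, -$len, null, 'UTF-8');
        // R2 is a suffix of the word, so the ending lies entirely in R2
        // exactly when R2 is at least as long as the ending.
        if ($tail === $ending && $r2_len >= $len) {
            $keep = mb_strlen($word, 'UTF-8') - $len;
            return mb_substr($word, 0, $keep, 'UTF-8');
        }
    }
    return $word;
}

echo step3Sketch('активность'), "\n";
echo step3Sketch('гость'), "\n";
```

In the second call the word ends with ость, but its R2 region is empty, so nothing is removed.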

step4()

step4(string  $word) : string

(1) Undouble н (n), or (2) if the word ends with a SUPERLATIVE ending, remove it and undouble н (n), or (3) if the word ends with ь (') (soft sign), remove it.

Parameters

string $word

word to stem

Returns

string —

$word after step
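The three alternatives of this step can be sketched as below. The sketch assumes the superlative endings are ейш and ейше, as in the Snowball description of the Russian stemmer. The function name is hypothetical.

```php
<?php
// Hypothetical sketch of step 4; exactly one of the three cases applies.
function step4Sketch(string $word): string
{
    // (1) Undouble a trailing нн.
    if (preg_match('/нн\z/u', $word)) {
        return preg_replace('/н\z/u', '', $word);
    }
    // (2) Remove a superlative ending, then undouble нн if it appears.
    if (preg_match('/ейше?\z/u', $word)) {
        $word = preg_replace('/ейше?\z/u', '', $word);
        return preg_replace('/нн\z/u', 'н', $word);
    }
    // (3) Otherwise remove a trailing soft sign, if any.
    return preg_replace('/ь\z/u', '', $word);
}

echo step4Sketch('ценнейш'), "\n";
```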