\seekquarry\yioop\locale\fr_FR\resources\Tokenizer

This class has a collection of methods for French locale specific tokenization. In particular, it has a stemmer and a stop word remover (the latter mainly for use in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org and was inspired by http://snowball.tartarus.org/otherlangs/french_javascript.txt. Given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
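To make the definition of a stem concrete, here is a minimal Python sketch (not part of the PHP class) that takes the longest common prefix of a set of inflected forms. The function name shared_stem is hypothetical. A real stemmer works differently: it sees only one word and applies suffix rules, so this only illustrates the definition.

```python
import os.path


def shared_stem(variants):
    """Return the longest prefix shared by every word form in variants."""
    return os.path.commonprefix(list(variants))


print(shared_stem(["tall", "taller", "tallest"]))  # tall
```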

Summary

Public
Methods: segment(), stopwordsRemover(), stem()
Properties: $no_stem_list, $stop_words
Constants: No constants found

Protected
Methods: No protected methods found
Properties: No protected properties found
Constants: N/A

Private
Methods: computeNonVowels(), computeNonVowelRegions(), step1(), step2a(), step2b(), step3(), step4(), step5(), step6()
Properties: $vowel, $buffer, $rv, $rv_index, $r1, $r1_index, $r2, $r2_index
Constants: N/A

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

$vowel

$vowel : string

French vowels

Type

string

$buffer

$buffer : string

Storage used in computing the stem

Type

string

$rv

$rv : string

$rv is approximately the string after the first vowel in the $word we want to stem

Type

string

$rv_index

$rv_index : integer

Position within the $word being stemmed at which $rv begins

Type

integer
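The description of $rv above is approximate. The French algorithm published on the Snowball site defines RV as follows: if the word begins with two vowels, RV is the region after the third letter; otherwise it is the region after the first vowel not at the beginning of the word, or the end of the word if no such position exists. Words beginning with par, col or tap are an exception, with RV taken as the region to their right. The Python sketch below follows that published definition. The function name rv_start is hypothetical, and this class's PHP code may differ in detail.

```python
# Vowel set from the Snowball French description (assumed to match $vowel).
VOWELS = frozenset("aeiouyâàëéêèïîôûù")


def rv_start(word):
    """Return the index in word at which the RV region begins."""
    if word[:3] in ("par", "col", "tap"):
        return 3
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    # First vowel that is not the first letter of the word.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)


for w in ("aimer", "adorer", "voler", "tapis"):
    print(w, "->", w[rv_start(w):])
```

For these four words the sketch prints the regions er, rer, ler and is.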

$r1

$r1 : string

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

Type

string

$r1_index

$r1_index : integer

Position within the $word being stemmed at which $r1 begins

Type

integer

$r2

$r2 : string

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.

Type

string

$r2_index

$r2_index : integer

Position within the $word being stemmed at which $r2 begins

Type

integer

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string $pre_segment

before segmentation

Returns

string —

should return the string with words separated by spaces; in this case it does nothing
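Since segment() is only a stub here, the following Python sketch shows one way a dictionary-based segmenter could produce the output described above. It is an illustration, not this class's behavior: the function name, the vocabulary argument and the sample word list are all hypothetical.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words, or return it unchanged if impossible."""
    # best[i] holds a list of words that exactly covers text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    if best[-1] is None:
        return text
    return " ".join(best[-1])


# Hypothetical sample vocabulary, just large enough for the example.
words = {"this", "is", "a", "bunch", "of", "words"}
print(segment("thisisabunchofwords", words))  # this is a bunch of words
```

The dynamic-programming table avoids re-checking the same prefix repeatedly; a production segmenter would also need a way to rank competing segmentations.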

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
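A minimal Python sketch of the idea, assuming whitespace-separated words and a tiny hypothetical stop word sample (the real $stop_words list is not shown on this page). It accepts either a string or a list of strings, mirroring the mixed parameter; how the PHP method matches words may differ.

```python
# Hypothetical sample; stands in for the class's $stop_words list.
STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une"}


def remove_stop_words(data):
    """Remove stop words from a string, or from each string in a list."""
    if isinstance(data, list):
        return [remove_stop_words(item) for item in data]
    kept = [w for w in data.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("le chat et la souris"))      # chat souris
print(remove_stop_words(["un chien", "des oiseaux"]))  # ['chien', 'oiseaux']
```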

stem()

stem(string  $word) : string

Computes the stem of a French word

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

computeNonVowels()

computeNonVowels() 

If a vowel shouldn't be treated as a vowel, it is capitalized by this method. (Operations are done on $buffer.)
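The Snowball description of the French algorithm states the marking rule as: put u or i into upper case when it is both preceded and followed by a vowel, put y into upper case when it is preceded or followed by a vowel, and put u into upper case after q. The Python sketch below applies that rule to a whole word. The function name is hypothetical, and it is a simplified illustration rather than a port of the PHP method.

```python
# Vowel set from the Snowball French description (assumed to match $vowel).
VOWELS = frozenset("aeiouyâàëéêèïîôûù")


def mark_non_vowels(word):
    """Upper-case u, i and y where they do not act as vowels."""
    out = list(word)
    for i, ch in enumerate(word):
        # Look at the already-marked previous letter, so a marked Y
        # no longer counts as a vowel for the letter after it.
        prev = out[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in ("u", "i") and prev in VOWELS and nxt in VOWELS:
            out[i] = ch.upper()
        elif ch == "y" and (prev in VOWELS or nxt in VOWELS):
            out[i] = "Y"
        elif ch == "u" and prev == "q":
            out[i] = "U"
    return "".join(out)


for w in ("jouer", "ennuie", "yeux", "quand"):
    print(w, "->", mark_non_vowels(w))
```

This prints joUer, ennuIe, Yeux and qUand. Later steps can then treat the upper-case letters as non-vowels.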

computeNonVowelRegions()

computeNonVowelRegions() 

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.
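The two definitions share one rule, applied first to the whole word and then to $r1. A minimal Python sketch, with a hypothetical helper name and a vowel set taken from the Snowball French description:

```python
# Vowel set from the Snowball French description (assumed to match $vowel).
VOWELS = frozenset("aeiouyâàëéêèïîôûù")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from start; len(word) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


word = "fameusement"
r1_index = region_start(word)            # search the whole word
r2_index = region_start(word, r1_index)  # search again inside R1
print(word[r1_index:], word[r2_index:])  # eusement ement
```

Storing the indexes as well as the strings, as the $r1_index and $r2_index properties do, lets a later step check whether a suffix lies inside a region by comparing positions.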

step1()

step1() 

Standard suffix removal

step2a()

step2a(string  $ori_word) 

Stem verb suffixes beginning with i

Parameters

string $ori_word

original word before stemming

step2b()

step2b() 

Stem other verb suffixes

step3()

step3() 

Removes cedillas (ç becomes c) and rewrites a word-ending Y as i
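A minimal Python sketch of this step, assuming (as in the Snowball description) that only the final letter is examined. The function name is hypothetical and the PHP method may handle more cases.

```python
def replace_final_y_or_cedilla(word):
    """Turn a final marked Y into i, or a final ç into c."""
    if word.endswith("Y"):
        return word[:-1] + "i"
    if word.endswith("ç"):
        return word[:-1] + "c"
    return word


print(replace_final_y_or_cedilla("essaY"))    # essai
print(replace_final_y_or_cedilla("commenç"))  # commenc
```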

step4()

step4() 

If the word ends in an s, not preceded by a, i, o, u, è or s, delete it.
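The rule above can be sketched in Python as follows. This covers only the sentence above; the full step in the Snowball description also handles further residual suffixes, which are omitted here. The function name is hypothetical.

```python
def drop_final_s(word):
    """Delete a final s unless the letter before it is a, i, o, u, è or s."""
    if len(word) >= 2 and word.endswith("s") and word[-2] not in "aiouès":
        return word[:-1]
    return word


print(drop_final_s("chats"))   # chat
print(drop_final_s("bras"))    # bras (s is preceded by a)
print(drop_final_s("tasses"))  # tasse
```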

step5()

step5() 

Un-doubles a doubled letter at the end of the word
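In the Snowball description, this step applies to words ending in enn, onn, ett, ell or eill and deletes the last letter. A minimal Python sketch under that assumption, with a hypothetical function name:

```python
# Endings listed in the Snowball French description for this step.
DOUBLED_ENDINGS = ("eill", "enn", "onn", "ett", "ell")


def undouble_end(word):
    """Delete the last letter when the word has one of the listed endings."""
    if word.endswith(DOUBLED_ENDINGS):
        return word[:-1]
    return word


print(undouble_end("abandonn"))  # abandon
print(undouble_end("appell"))    # appel
```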

step6()

step6() 

Removes the accent near the end of the word
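The Snowball description gives the rule as: if the word ends in é or è followed by at least one non-vowel, remove the accent from the e. A minimal Python sketch of that rule, with a hypothetical function name and the vowel set from the same description:

```python
import re

# Vowel set from the Snowball French description (assumed to match $vowel).
VOWELS = "aeiouyâàëéêèïîôûù"

# An é or è followed by one or more non-vowels that run to the end of the word.
ACCENTED_END = re.compile("[éè]([^" + VOWELS + "]+)$")


def unaccent_end(word):
    """Replace a final é/è + non-vowels with e + the same non-vowels."""
    return ACCENTED_END.sub(r"e\1", word)


print(unaccent_end("cèd"))    # ced
print(unaccent_end("répét"))  # répet
print(unaccent_end("été"))    # été (no non-vowel after the last é)
```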