\seekquarry\yioop\locale\es\resources\Tokenizer

Spanish specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters in a char gram

This class has a collection of methods for Spanish locale specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org. Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

Summary

Public
Methods: segment(), stopwordsRemover(), stem()
Properties: $no_stem_list, $stop_words
Constants: No constants found

Protected
Methods: No protected methods found
Properties: No protected properties found
Constants: N/A

Private
Methods: computeRegions(), step0(), step1(), step2a(), step2b(), step3(), removeAccents()
Properties: $vowel, $buffer, $rv, $rv_index, $r1, $r1_index, $r2, $r2_index
Constants: N/A

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

$vowel

$vowel : string

Spanish vowels

Type

string

$buffer

$buffer : string

Storage used in computing the stem

Type

string

$rv

$rv : string

$rv is approximately the string after the first vowel in the $word we want to stem

Type

string

$rv_index

$rv_index : integer

Position at which $rv starts in the $word being stemmed

Type

integer

$r1

$r1 : string

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel.

Type

string

$r1_index

$r1_index : integer

Position at which $r1 starts in the $word being stemmed

Type

integer

$r2

$r2 : string

$r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel

Type

string

$r2_index

$r2_index : integer

Position at which $r2 starts in the $word being stemmed

Type

integer

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string $pre_segment

before segmentation

Returns

string —

should return a string with the words separated by spaces; in this case it does nothing

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
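
To illustrate the string-or-array contract described above, here is a minimal sketch. The function name removeStopWords, the six-word default list, and the whitespace clean-up are all assumptions made for this example; they are not Yioop's actual stop word list or implementation.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): removes whole-word,
 * case-insensitive matches of a small illustrative stop word list.
 *
 * @param mixed $data either a string or an array of strings
 * @param array $stop_words illustrative list, assumed for this example
 * @return mixed $data with the stop words removed
 */
function removeStopWords($data,
    array $stop_words = ['de', 'la', 'que', 'el', 'en', 'y'])
{
    // Quote each stop word so it is safe inside the regular expression
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $stop_words);
    // (?<!\pL) and (?!\pL) act as Unicode-aware word boundaries, so
    // "el" is not removed from inside "ella"
    $pattern = '/(?<!\pL)(?:' . implode('|', $quoted) . ')(?!\pL)/iu';
    // preg_replace accepts a string or an array subject and returns
    // the same kind of value
    $result = preg_replace($pattern, '', $data);
    // Collapse the gaps left behind by the removed words
    $clean = function ($s) {
        return trim(preg_replace('/\s+/u', ' ', $s));
    };
    return is_array($result) ? array_map($clean, $result) : $clean($result);
}
```

Called with a string, the sketch returns a string; called with an array of strings, it returns an array with each element filtered.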

stem()

stem(string  $word) : string

Computes the stem of a Spanish word

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

computeRegions()

computeRegions() 

Computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. If these positions cannot be found, $rv is the end of the word.

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel
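
The region definitions above can be sketched as a standalone function. This is only an illustration under stated assumptions: the name sketchRegions, the returned array shape, and the vowel list (taken from the Snowball description of Spanish) are hypothetical and are not the private computeRegions() method itself, which works on the class's static properties. The input is assumed to be lowercase UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): returns the start offsets,
 * counted in characters, of the RV, R1 and R2 regions of a word.
 */
function sketchRegions(string $word): array
{
    // Split into characters so accented letters count as one position
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $is_vowel = function (int $i) use ($chars): bool {
        return in_array($chars[$i],
            ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'ü'], true);
    };
    // Offset just past the first non-vowel that follows a vowel,
    // searching from $start; the end of the word if there is none
    $after_vowel_consonant = function (int $start) use ($n, $is_vowel): int {
        for ($i = $start + 1; $i < $n; $i++) {
            if (!$is_vowel($i) && $is_vowel($i - 1)) {
                return $i + 1;
            }
        }
        return $n;
    };
    $r1 = $after_vowel_consonant(0);
    $r2 = $after_vowel_consonant($r1);
    $rv = $n; // default: end of the word
    if ($n >= 2) {
        if (!$is_vowel(1)) {
            // Second letter is a consonant: after the next vowel
            for ($i = 2; $i < $n; $i++) {
                if ($is_vowel($i)) { $rv = $i + 1; break; }
            }
        } elseif ($is_vowel(0)) {
            // First two letters are vowels: after the next consonant
            for ($i = 2; $i < $n; $i++) {
                if (!$is_vowel($i)) { $rv = $i + 1; break; }
            }
        } else {
            // Consonant-vowel case: after the third letter
            $rv = min(3, $n);
        }
    }
    return ['rv' => $rv, 'r1' => $r1, 'r2' => $r2];
}
```

For trabajo this gives RV starting at offset 3 (bajo), R1 at offset 4 (ajo), and R2 at offset 6 (o).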

step0()

step0() 

Remove attached pronouns

step1()

step1() 

Standard suffix removal

step2a()

step2a() 

Stem verb suffixes beginning with y
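
As a rough illustration of this step, the sketch below follows the published Snowball description of the Spanish algorithm: find the longest y-suffix inside RV and delete it only if it is preceded by u. The function name sketchStep2a, its parameters, and the assumption that Yioop follows the Snowball rule exactly are all hypothetical; the real method takes no arguments and works on the class's static buffer. It needs PHP's mbstring extension.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code) of removing verb suffixes
 * that begin with y.
 *
 * @param string $word lowercase UTF-8 word
 * @param int $rv_index character offset at which RV starts in $word
 * @return string $word with a y-suffix removed, or unchanged
 */
function sketchStep2a(string $word, int $rv_index): string
{
    // Ordered longest first so that the longest match wins
    $suffixes = ['yeron', 'yendo', 'yamos', 'yais', 'yan', 'yen',
        'yas', 'yes', 'ya', 'ye', 'yo', 'yó'];
    $rv = mb_substr($word, $rv_index, null, 'UTF-8');
    foreach ($suffixes as $suffix) {
        $len = mb_strlen($suffix, 'UTF-8');
        if (mb_substr($rv, -$len, null, 'UTF-8') !== $suffix) {
            continue; // RV does not end with this suffix
        }
        $stem_len = mb_strlen($word, 'UTF-8') - $len;
        // The preceding u is looked up in the whole word, since it
        // need not itself lie in RV
        if ($stem_len > 0
            && mb_substr($word, $stem_len - 1, 1, 'UTF-8') === 'u') {
            return mb_substr($word, 0, $stem_len, 'UTF-8');
        }
        return $word; // longest suffix found, but not preceded by u
    }
    return $word;
}
```

For example, construyendo with RV starting at offset 3 becomes constru, while playa is left alone because its ya is not preceded by u.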

step2b()

step2b() 

Stem other verb suffixes

step3()

step3() 

Delete residual suffixes

removeAccents()

removeAccents() 

Removes acute accents from the word at the end of stemming
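
A minimal sketch of what such an un-accenting pass might look like, assuming it only maps the five acute-accented vowels to their plain forms, as in the final step of the Snowball description. The function name sketchRemoveAccents and its string-in, string-out signature are hypothetical; the real method takes no arguments.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): replaces acute-accented
 * vowels with their unaccented forms in a UTF-8 string.
 */
function sketchRemoveAccents(string $word): string
{
    // strtr with an array replaces each key wherever it occurs, which
    // is safe for these multi-byte UTF-8 keys
    return strtr($word, [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    ]);
}
```

Letters such as ñ and ü are deliberately left untouched in this sketch, since they are not acute accents.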