\seekquarry\yioop\locale\es\resources\Tokenizer

Spanish-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters make up a char gram.

This class has a collection of methods for Spanish locale-specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

Summary

Public
Methods: segment(), stopwordsRemover(), stem()
Properties: $no_stem_list, $stop_words
Constants: No constants found

Protected
Methods: No protected methods found
Properties: No protected properties found
Constants: N/A

Private
Methods: computeRegions(), step0(), step1(), step2a(), step2b(), step3(), removeAccents()
Properties: $vowel, $buffer, $rv, $rv_index, $r1, $r1_index, $r2, $r2_index
Constants: N/A

File: src/locale/es/resources/Tokenizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\locale\es\resources\Tokenizer

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words :

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter, given the input thisisabunchofwords, would output: this is a bunch of words.

Parameters

string

$pre_segment

the string before segmentation

Returns

string —

Should return the string with words separated by spaces; in this case it does nothing.
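To illustrate what a non-stub segmenter could do, here is a minimal sketch of dictionary-based word breaking using dynamic programming. The function name segmentWithDictionary() and the word list are hypothetical and are not part of this class; the sketch also works on bytes, so it assumes the dictionary entries match the input exactly.

```php
<?php
// Hypothetical helper for illustration; not part of the Tokenizer class.
function segmentWithDictionary(string $text, array $dictionary): string
{
    $known = array_flip($dictionary); // word => index, for fast lookup
    $n = strlen($text);
    // $best[$i] is a list of words covering the first $i bytes, or null
    // when no such covering has been found.
    $best = array_fill(0, $n + 1, null);
    $best[0] = [];
    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            if ($best[$start] === null) {
                continue; // the prefix before $start cannot be segmented
            }
            $piece = substr($text, $start, $end - $start);
            if (isset($known[$piece])) {
                $best[$end] = array_merge($best[$start], [$piece]);
                break;
            }
        }
    }
    // Fall back to the unchanged input when no full segmentation exists.
    return $best[$n] === null ? $text : implode(" ", $best[$n]);
}

echo segmentWithDictionary(
    "thisisabunchofwords",
    ["this", "is", "a", "bunch", "of", "words"]
), "\n";
```

Returning the input unchanged when no segmentation is found mirrors the behaviour of the stub, which leaves its argument as it is.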

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed

$data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
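The following is a minimal sketch of stop word removal that accepts either a string or an array of strings, as described above. The function name removeStopWords() and the short stop word list are assumptions made for illustration; the class itself uses its own $stop_words list, and its implementation may differ. Lowercasing here uses strtolower(), so only ASCII letters are case-folded.

```php
<?php
// Hypothetical helper for illustration; not the class's implementation.
function removeStopWords($data, array $stopWords)
{
    if (is_array($data)) {
        // Apply the same filtering to each string in the array.
        return array_map(function ($item) use ($stopWords) {
            return removeStopWords($item, $stopWords);
        }, $data);
    }
    $lookup = array_flip($stopWords);
    // Split on whitespace, then keep only words not in the stop list.
    $words = preg_split('/\s+/', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($words, function ($word) use ($lookup) {
        return !isset($lookup[strtolower($word)]);
    });
    return implode(" ", $kept);
}

$stop = ["el", "la", "de", "y"]; // illustrative list only
echo removeStopWords("el gato y la casa de la montaña", $stop), "\n";
print_r(removeStopWords(["la casa", "el perro"], $stop));
```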

stem()

stem(string  $word) : string

Computes the stem of a Spanish word

Parameters

string

$word

the string to stem

Returns

string —

the stem of $word

computeRegions()

computeRegions()

This computes the three regions of the word, $rv, $r1, and $r2, used in the rest of the stemmer. $rv is defined as follows: if the second letter is a consonant, $rv is the region after the next following vowel; if the first two letters are vowels, $rv is the region after the next consonant; otherwise (the consonant-vowel case), $rv is the region after the third letter. $rv is the end of the word if these positions cannot be found.

$r1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. $r2 is the region after the first non-vowel following a vowel in $r1, or the end of the word if there is no such non-vowel.
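The region definitions above can be sketched as a standalone function. This is an illustration under stated assumptions, not the class's code: the name spanishRegions() is hypothetical, the vowel set (a, e, i, o, u, their acute-accented forms, and ü) is assumed from the Snowball description of Spanish, and the regions are returned as strings rather than stored in $rv, $r1, $r2 and their index properties.

```php
<?php
// Hypothetical helper for illustration; not the computeRegions() code.
function spanishRegions(string $word): array
{
    // Split into UTF-8 characters so accented vowels count as one letter.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'ü'];
    $isVowel = function (int $i) use ($chars, $vowels): bool {
        return in_array($chars[$i], $vowels, true);
    };
    // First index >= $from whose vowel status equals $wantVowel, else $n.
    $find = function (int $from, bool $wantVowel) use ($isVowel, $n): int {
        for ($i = $from; $i < $n; $i++) {
            if ($isVowel($i) === $wantVowel) {
                return $i;
            }
        }
        return $n;
    };
    // Start of the region after the first non-vowel following a vowel.
    $regionAfter = function (int $from) use ($find, $n): int {
        return min($find($find($from, true), false) + 1, $n);
    };
    $rv = $n; // default: end of word
    if ($n >= 2) {
        if (!$isVowel(1)) {
            $rv = min($find(2, true) + 1, $n);   // after the next vowel
        } elseif ($isVowel(0)) {
            $rv = min($find(2, false) + 1, $n);  // after the next consonant
        } else {
            $rv = min(3, $n);                    // after the third letter
        }
    }
    $r1 = $regionAfter(0);
    $r2 = $regionAfter($r1);
    $tail = function (int $start) use ($chars): string {
        return implode('', array_slice($chars, $start));
    };
    return ['rv' => $tail($rv), 'r1' => $tail($r1), 'r2' => $tail($r2)];
}

print_r(spanishRegions("trabajo"));
```

For trabajo, the second letter r is a consonant and the next vowel is the first a, so the rv region is bajo; the r1 region is ajo and the r2 region is o.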

step0()

step0()

Remove attached pronouns

step1()

step1()

Standard suffix removal

step2a()

step2a()

Stem verb suffixes beginning with y

step2b()

step2b()

Stem other verb suffixes

step3()

step3()

Delete residual suffixes

removeAccents()

removeAccents()

Removes accents from the word at the end of stemming
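As a rough illustration of this kind of un-accenting, the sketch below maps acute-accented vowels to their plain forms. The function name removeAcuteAccents() is hypothetical, and the choice of exactly these five vowels is an assumption based on the Snowball description of Spanish; the private method operates on the class's internal buffer and may differ in detail.

```php
<?php
// Hypothetical helper for illustration; not the removeAccents() code.
function removeAcuteAccents(string $word): string
{
    // strtr() with an array replaces each key with its value in one pass
    // and works on the UTF-8 byte sequences of the accented vowels.
    return strtr($word, [
        'á' => 'a',
        'é' => 'e',
        'í' => 'i',
        'ó' => 'o',
        'ú' => 'u',
    ]);
}

echo removeAcuteAccents("canción"), "\n";
```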