\seekquarry\yioop\locale\pt\resources\Tokenizer

This class provides methods for Portuguese-locale-specific tokenization. In particular, it has a stemmer implementing the Snowball stemming algorithm presented at http://snowball.tartarus.org/algorithms/portuguese/stemmer.html

Summary

Public
Methods: stopwordsRemover(), segment(), stem()
Properties: $stop_words, $semantic_rewrites
Constants: No constants found

Protected
Methods: No protected methods found
Properties: No protected properties found
Constants: N/A

Private
Methods: step1(), step2(), step3(), step4(), step5(), findR1(), findRV(), mbStringToArray()
Properties: $buffer, $k, $r1, $r2, $rv
Constants: N/A

Properties

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

$semantic_rewrites

$semantic_rewrites : array

Phrases we would like Yioop to rewrite before performing a query.

Type

array

$buffer

$buffer : string

Storage used in computing the stem.

Type

string

$k

$k : integer

Index of the current end of the word at the current state of computing its stem

Type

integer

$r1

$r1 : string

R1 is the region in the word after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel

Type

string

$r2

$r2 : string

R2 is the region within R1 after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel

Type

string

$rv

$rv : string

If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.

Type

string

Methods

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
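As a rough illustration of the string-or-array contract described above, the following is a minimal sketch, not the class's actual implementation. The function name sketchRemoveStopWords() and the short stop-word list in the usage note are hypothetical, and splitting on whitespace is an assumption made to keep the example small.

```php
<?php
// Hypothetical sketch: removes stop words from a string or from each
// string in an array. The stop-word list is passed in explicitly here;
// the real class keeps its own list in $stop_words.
function sketchRemoveStopWords($data, array $stop_words)
{
    $remove = function (string $text) use ($stop_words): string {
        // Assumption: words are separated by whitespace.
        $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        $kept = array_filter($words, function ($w) use ($stop_words) {
            return !in_array(mb_strtolower($w, 'UTF-8'), $stop_words, true);
        });
        return implode(' ', $kept);
    };
    // Mirror the mixed input type: arrays in, arrays out.
    return is_array($data) ? array_map($remove, $data) : $remove($data);
}
```

With the illustrative list ['o', 'e', 'de'], the input "o gato e o cachorro" would come back as "gato cachorro". Note that this sketch normalizes runs of whitespace to single spaces, which the real method may not do.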

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter, on input "thisisabunchofwords", would output "this is a bunch of words".

Parameters

string $pre_segment

the string before segmentation

Returns

string —

should return the string with words separated by spaces; in this case it does nothing

stem()

stem(string  $word) : string

Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

step1()

step1(string  $word) : string

Standard suffix removal step. Searches for the longest suffix in a given set and removes it if found.

Parameters

string $word

the string to remove suffixes from

Returns

string —

the processed string
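The longest-suffix search can be sketched as follows. This is an illustration under stated assumptions rather than the class's code: the function name, the explicit $region_start parameter (the character offset where the relevant region, such as R2, begins), and the small suffix set in the usage note are all hypothetical. It follows the Snowball convention that only the longest matching suffix is considered, and that it is deleted only when it lies entirely inside the region.

```php
<?php
// Hypothetical sketch of "find the longest suffix, delete it if it is in
// the region". $region_start is the character offset at which the region
// (for example R2) begins; computing it is left to the caller.
function sketchRemoveLongestSuffix(string $word, array $suffixes, int $region_start): string
{
    // Try longer suffixes first so that "ismos" would win over "os".
    usort($suffixes, function ($a, $b) {
        return mb_strlen($b, 'UTF-8') - mb_strlen($a, 'UTF-8');
    });
    $len = mb_strlen($word, 'UTF-8');
    foreach ($suffixes as $suffix) {
        $s_len = mb_strlen($suffix, 'UTF-8');
        if ($s_len > $len) {
            continue;
        }
        if (mb_substr($word, $len - $s_len, null, 'UTF-8') === $suffix) {
            // Longest match found: delete only if it starts inside the region.
            if ($len - $s_len >= $region_start) {
                return mb_substr($word, 0, $len - $s_len, 'UTF-8');
            }
            return $word;
        }
    }
    return $word;
}
```

For "nacionalismo", whose R2 is "alismo" (offset 6), the suffix "ismo" lies inside R2, so the sketch returns "nacional". For "químicas", whose R2 is "as" (offset 6), the suffix "icas" starts before R2 and the word is left unchanged.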

step2()

step2(string  $word) : string

Verb suffix removal step. This function is called if step 1 does not change anything.

It also searches for the longest suffix in its suffix set and removes it if found.

Parameters

string $word

the string to remove suffixes from

Returns

string —

the processed string
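The relationship between the steps can be sketched as a small driver. The ordering below follows the published Snowball description (step 2 only if step 1 removed nothing; step 3 if either of them altered the word, otherwise step 4; step 5 always). Whether the class's stem() is organized exactly this way is an assumption, and sketchRunSteps() and its callable parameters are hypothetical names used only for illustration.

```php
<?php
// Hypothetical driver showing the step ordering from the Snowball
// description. Each step is passed in as a callable taking and returning
// a string, so the sketch stays independent of the real private methods.
function sketchRunSteps(
    string $word,
    callable $step1,
    callable $step2,
    callable $step3,
    callable $step4,
    callable $step5
): string {
    $after = $step1($word);
    if ($after === $word) {
        // Step 1 removed nothing, so try the verb suffixes.
        $after = $step2($word);
    }
    if ($after !== $word) {
        // Step 1 or step 2 altered the word.
        $after = $step3($after);
    } else {
        // Neither step altered the word.
        $after = $step4($after);
    }
    return $step5($after);
}
```

Here "altered" is detected by plain string comparison, which is a simplification chosen for the sketch.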

step3()

step3(string  $word) : string

Deletes the suffix i if it is in RV and preceded by c.

Parameters

string $word

the string to remove the suffix from

Returns

string —

the processed string
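A minimal sketch of this rule, assuming the caller supplies the character offset at which RV begins (the class itself tracks RV in $rv). The function name and the $rv_start parameter are hypothetical.

```php
<?php
// Hypothetical sketch of step 3: drop a final "i" when it lies in RV and
// the letter before it is "c". $rv_start is the offset where RV begins.
function sketchStep3(string $word, int $rv_start): string
{
    $len = mb_strlen($word, 'UTF-8');
    if ($len >= 2
        && mb_substr($word, $len - 1, 1, 'UTF-8') === 'i'
        && mb_substr($word, $len - 2, 1, 'UTF-8') === 'c'
        && $len - 1 >= $rv_start
    ) {
        return mb_substr($word, 0, $len - 1, 'UTF-8');
    }
    return $word;
}
```

For example, with RV starting at offset 3, "politici" becomes "politic", while "medi" is unchanged because its final i is not preceded by c.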

step4()

step4(string  $word) : string

Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, the suffix is deleted.

Parameters

string $word

the string to remove the suffix from

Returns

string —

the processed string
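The residual suffix check can be sketched as below. This is an illustrative version, not the class's code; sketchStep4() and its $rv_start parameter (the offset where RV begins) are hypothetical.

```php
<?php
// Hypothetical sketch of step 4: delete one of the residual suffixes
// os, a, i, o, á, í, ó when it lies in RV. "os" is listed first so the
// two-letter suffix is preferred over a one-letter one.
function sketchStep4(string $word, int $rv_start): string
{
    $len = mb_strlen($word, 'UTF-8');
    foreach (['os', 'a', 'i', 'o', 'á', 'í', 'ó'] as $suffix) {
        $s_len = mb_strlen($suffix, 'UTF-8');
        if ($s_len <= $len
            && mb_substr($word, $len - $s_len, null, 'UTF-8') === $suffix
        ) {
            // Delete only if the suffix starts inside RV.
            return ($len - $s_len >= $rv_start)
                ? mb_substr($word, 0, $len - $s_len, 'UTF-8')
                : $word;
        }
    }
    return $word;
}
```

With RV starting at offset 3, "gatos" becomes "gat" and "sofá" becomes "sof", while "boa" is left alone because its final a falls before RV.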

step5()

step5(string  $word) : string

Residual suffix step. If the word ends with one of [e é ê] in RV, the suffix is deleted.

Parameters

string $word

the string to remove the suffix from

Returns

string —

the processed string
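A sketch of this step follows. Beyond the [e é ê] rule stated above, it also includes two details taken from the published Snowball description rather than from this page: after the deletion, a preceding gu (or ci) loses its u (or i) if that letter is in RV, and otherwise a final ç loses its cedilla. Whether the class handles those cases identically is an assumption; sketchStep5() and $rv_start are hypothetical names.

```php
<?php
// Hypothetical sketch of step 5. $rv_start is the offset where RV begins.
function sketchStep5(string $word, int $rv_start): string
{
    $len = mb_strlen($word, 'UTF-8');
    if ($len === 0) {
        return $word;
    }
    $last = mb_substr($word, $len - 1, 1, 'UTF-8');
    if (in_array($last, ['e', 'é', 'ê'], true) && $len - 1 >= $rv_start) {
        // Delete the final e, é, or ê.
        $word = mb_substr($word, 0, $len - 1, 'UTF-8');
        $len--;
        // Snowball detail: after "gu" or "ci", also drop the u or i
        // when that letter lies in RV.
        $tail = $len >= 2 ? mb_substr($word, $len - 2, 2, 'UTF-8') : '';
        if (($tail === 'gu' || $tail === 'ci') && $len - 1 >= $rv_start) {
            $word = mb_substr($word, 0, $len - 1, 'UTF-8');
        }
    } elseif ($last === 'ç') {
        // Snowball detail: a final ç loses its cedilla.
        $word = mb_substr($word, 0, $len - 1, 'UTF-8') . 'c';
    }
    return $word;
}
```

With RV starting at offset 3, "cidade" becomes "cidad", "você" becomes "voc", and "sangue" becomes "sang".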

findR1()

findR1(string  $word) : string

Finds the R1 region in $word. R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.

Parameters

string $word

Returns

string —

$r1 region
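The R1 definition, and the R2 definition given for $r2 (the same rule applied inside R1), can be sketched as follows. This is a minimal illustration, not the class's code: the function names are hypothetical, the vowel set is the one listed in the Snowball description of the Portuguese stemmer, input is assumed to be lowercase, and the Snowball preprocessing of nasalized vowels (ã, õ) is not modeled.

```php
<?php
// Hypothetical helper: returns the character offset just past the first
// non-vowel that follows a vowel, searching from $start. Returns the word
// length (the null region) if no such position exists.
function sketchRegionStart(array $chars, int $start = 0): int
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'];
    $n = count($chars);
    for ($i = $start + 1; $i < $n; $i++) {
        if (!in_array($chars[$i], $vowels, true)
            && in_array($chars[$i - 1], $vowels, true)
        ) {
            return $i + 1;
        }
    }
    return $n;
}

// Hypothetical wrapper returning both regions as strings.
function sketchR1R2(string $word): array
{
    // Split into characters in a multibyte-safe way.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $r1 = sketchRegionStart($chars, 0);
    $r2 = sketchRegionStart($chars, $r1); // same rule, applied within R1
    return [
        'r1' => implode('', array_slice($chars, $r1)),
        'r2' => implode('', array_slice($chars, $r2)),
    ];
}
```

For "nacionalismo" this gives R1 = "ionalismo" and R2 = "alismo"; for "químicas" it gives R1 = "icas" and R2 = "as".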

findRV()

findRV(string  $word) : string

Finds the RV region in $word. If the second letter is a consonant, RV is the region after the next following vowel, or if the first two letters are vowels, RV is the region after the next consonant, and otherwise (consonant-vowel case) RV is the region after the third letter.

Parameters

string $word

Returns

string —

$rv region
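The three cases of the RV definition can be sketched as below. This is an illustration only: sketchRV() and sketchIsVowel() are hypothetical names, the vowel set is taken from the Snowball description, input is assumed to be lowercase, and RV is treated as empty when the required position cannot be found.

```php
<?php
// Hypothetical vowel test using the Snowball vowel set for Portuguese.
function sketchIsVowel(string $ch): bool
{
    return in_array(
        $ch,
        ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'],
        true
    );
}

// Hypothetical sketch returning the RV region of $word as a string.
function sketchRV(string $word): string
{
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $start = $n; // RV is empty if no suitable position is found
    if ($n >= 2) {
        if (!sketchIsVowel($chars[1])) {
            // Second letter is a consonant: after the next following vowel.
            for ($i = 2; $i < $n; $i++) {
                if (sketchIsVowel($chars[$i])) {
                    $start = $i + 1;
                    break;
                }
            }
        } elseif (sketchIsVowel($chars[0])) {
            // First two letters are vowels: after the next consonant.
            for ($i = 2; $i < $n; $i++) {
                if (!sketchIsVowel($chars[$i])) {
                    $start = $i + 1;
                    break;
                }
            }
        } else {
            // Consonant-vowel case: after the third letter.
            $start = min(3, $n);
        }
    }
    return implode('', array_slice($chars, $start));
}
```

For "químicas" (consonant-vowel case) this yields "micas"; for "oliva" (second letter a consonant) it yields "va"; for "áureo" (two leading vowels) it yields "eo".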

mbStringToArray()

mbStringToArray(string  $string) : array

Breaks up a multibyte string into its individual characters and returns them as an array.

Parameters

string $string

string of multibyte characters to break up

Returns

array —

array of multibyte characters
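One way such a split could be written is sketched below, assuming UTF-8 input and the mbstring extension. The function name sketchMbStringToArray() is hypothetical and the class's own method may be implemented differently.

```php
<?php
// Hypothetical sketch: split a UTF-8 string into an array of characters,
// one array element per character rather than per byte.
function sketchMbStringToArray(string $string): array
{
    $chars = [];
    $len = mb_strlen($string, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        $chars[] = mb_substr($string, $i, 1, 'UTF-8');
    }
    return $chars;
}
```

The point of splitting this way is that accented letters such as ç or ã occupy more than one byte in UTF-8, so byte-indexed access like $string[$i] would cut them apart.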