\seekquarry\yioop\locale\pt\resources\Tokenizer

This class provides a collection of methods for Portuguese locale-specific tokenization. In particular, it has a stemmer implementing the Snowball stemming algorithm presented at http://snowball.tartarus.org/algorithms/portuguese/stemmer.html

Summary

Public
  Methods: stopwordsRemover(), segment(), stem()
  Properties: $stop_words, $semantic_rewrites
  Constants: No constants found

Protected
  Methods: No protected methods found
  Properties: No protected properties found
  Constants: N/A

Private
  Methods: step1(), step2(), step3(), step4(), step5(), findR1(), findRV(), mbStringToArray()
  Properties: $buffer, $k, $r1, $r2, $rv
  Constants: N/A

File: src/locale/pt/resources/Tokenizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\locale\pt\resources\Tokenizer

Properties

$stop_words

$stop_words :

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

Methods

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed

$data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
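
The behavior described above can be illustrated with a minimal sketch. This is not the Yioop implementation: the function name removeStopWordsSketch, its second parameter, and the sample stop word list are assumptions made for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper, not Yioop's stopwordsRemover(): strips stop words
// from a string, or from each string in an array of strings.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same removal to every element of the array
        return array_map(function ($item) use ($stop_words) {
            return removeStopWordsSketch($item, $stop_words);
        }, $data);
    }
    // Split on whitespace, drop stop words case-insensitively, rejoin
    $tokens = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($tokens, function ($token) use ($stop_words) {
        return !in_array(mb_strtolower($token, 'UTF-8'), $stop_words, true);
    });
    return implode(' ', $kept);
}

// Illustrative list only, not the contents of $stop_words
$sample_stop_words = ['de', 'a', 'o', 'que', 'e'];
echo removeStopWordsSketch('O livro de química', $sample_stop_words), "\n";
// prints "livro química"
```

Accepting both a string and an array mirrors the mixed parameter and return types documented for stopwordsRemover().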

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string

$pre_segment

before segmentation

Returns

string —

a string with the words separated by spaces; this stub performs no segmentation

stem()

stem(string  $word) : string

Computes the stem of a Portuguese word. For example, química, químicas, químico, and químicos all have químic as a stem.

Parameters

string

$word

the string to stem

Returns

string —

the stem of $word

step1()

step1(string  $word) : string

Standard suffix removal step. It searches for the longest suffix from a given set and removes it if found.

Parameters

string

$word

the string from which to remove a suffix

Returns

string —

the processed string
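
Longest-suffix removal restricted to a region can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of step1(): the name removeLongestSuffixSketch, the $region parameter (assumed to be a precomputed trailing region of the word, such as R2), and the three-entry suffix list are all hypothetical.

```php
<?php
// Hypothetical helper, not Yioop's step1(): removes the longest suffix of
// $word found in $suffixes, but only when it lies inside $region.
// $region is assumed to be a precomputed trailing region of $word (e.g. R2).
function removeLongestSuffixSketch(string $word, string $region, array $suffixes): string
{
    // Sort longest first so the first match is the longest match
    usort($suffixes, function ($a, $b) {
        return mb_strlen($b, 'UTF-8') - mb_strlen($a, 'UTF-8');
    });
    foreach ($suffixes as $suffix) {
        $len = mb_strlen($suffix, 'UTF-8');
        if (mb_substr($word, -$len, null, 'UTF-8') !== $suffix) {
            continue;
        }
        // The region is a tail of the word, so the suffix is inside it
        // exactly when the region is at least as long as the suffix
        if (mb_strlen($region, 'UTF-8') >= $len) {
            return mb_substr($word, 0, mb_strlen($word, 'UTF-8') - $len, 'UTF-8');
        }
        return $word; // longest suffix found, but outside the region
    }
    return $word;
}

$sample_suffixes = ['eza', 'ezas', 'mente']; // illustrative subset only
echo removeLongestSuffixSketch('felizmente', 'mente', $sample_suffixes), "\n";
// prints "feliz"
```

Sorting by length before matching keeps a short suffix from shadowing a longer one that also matches.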

step2()

step2(string  $word) : string

Verb suffix removal step. This function is called if step 1 does not change the word.

It also searches for the longest suffix in its suffix set and removes it if found.

Parameters

string

$word

the string from which to remove a suffix

Returns

string —

the processed string

step3()

step3(string  $word) : string

Deletes the suffix i if it is in RV and preceded by c.

Parameters

string

$word

the string from which to remove a suffix

Returns

string —

the processed string
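
The rule above is small enough to sketch directly. The function deleteIAfterCSketch and its $rv parameter (assumed to be the precomputed RV region of the word) are hypothetical and only illustrate the rule; they are not taken from the Yioop source.

```php
<?php
// Hypothetical helper, not Yioop's step3(): deletes a final "i" when it
// is preceded by "c" and falls inside the RV region.
// $rv is assumed to be the precomputed RV region (a tail of $word).
function deleteIAfterCSketch(string $word, string $rv): string
{
    $ends_with_ci = mb_substr($word, -2, null, 'UTF-8') === 'ci';
    // RV is a tail of the word, so the final "i" is in RV when RV ends in "i"
    $i_in_rv = mb_substr($rv, -1, null, 'UTF-8') === 'i';
    if ($ends_with_ci && $i_in_rv) {
        return mb_substr($word, 0, mb_strlen($word, 'UTF-8') - 1, 'UTF-8');
    }
    return $word;
}

echo deleteIAfterCSketch('anunci', 'nci'), "\n"; // prints "anunc"
```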

step4()

step4(string  $word) : string

Residual suffix step. If the word ends with one of [os a i o á í ó] in RV, that suffix is deleted.

Parameters

string

$word

the string from which to remove a suffix

Returns

string —

the processed string

step5()

step5(string  $word) : string

Residual suffix step. If the word ends with one of [e é ê] in RV, that suffix is deleted.

Parameters

string

$word

the string from which to remove a suffix

Returns

string —

the processed string

findR1()

findR1(string  $word) : string

This method finds the R1 region in $word. R1 is the region after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.

Parameters

string

$word

Returns

string —

$r1 region
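
The R1 definition above can be sketched as follows. The function findR1Sketch is a hypothetical stand-in, not the class's private findR1(); the vowel set is the one listed in the Snowball Portuguese description, and the special handling of nasalised vowels is left out for brevity.

```php
<?php
// Hypothetical helper, not Yioop's findR1(): returns the region after the
// first non-vowel that follows a vowel, or '' if there is none.
// Assumption: nasalised vowels are not treated specially in this sketch.
function findR1Sketch(string $word): string
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'];
    // Split into UTF-8 characters so accented letters stay intact
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    for ($i = 1; $i < $count; $i++) {
        $is_non_vowel = !in_array($chars[$i], $vowels, true);
        $after_vowel = in_array($chars[$i - 1], $vowels, true);
        if ($is_non_vowel && $after_vowel) {
            return implode('', array_slice($chars, $i + 1));
        }
    }
    return ''; // null region at the end of the word
}

$r1 = findR1Sketch('felizmente');
echo $r1, "\n";               // prints "izmente"
echo findR1Sketch($r1), "\n"; // prints "mente"
```

In the Snowball description, R2 is obtained by applying the same rule to R1, which is what the second call shows.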

findRV()

findRV(string  $word) : string

This method finds the RV region in $word. If the second letter is a consonant, RV is the region after the next following vowel. If the first two letters are vowels, RV is the region after the next consonant. Otherwise (the consonant-vowel case), RV is the region after the third letter.

Parameters

string

$word

Returns

string —

$rv region
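
The three cases of the RV definition can be sketched as follows. The function findRVSketch is hypothetical and is not the class's private findRV(); it assumes the same vowel set as the Snowball Portuguese description and returns an empty region when the required position cannot be found.

```php
<?php
// Hypothetical helper, not Yioop's findRV(): computes the RV region
// according to the three cases in the description above.
function findRVSketch(string $word): string
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'á', 'é', 'í', 'ó', 'ú', 'â', 'ê', 'ô'];
    $is_vowel = function ($char) use ($vowels) {
        return in_array($char, $vowels, true);
    };
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    if ($count < 2) {
        return '';
    }
    if (!$is_vowel($chars[1])) {
        // Second letter is a consonant: region after the next vowel
        for ($i = 2; $i < $count; $i++) {
            if ($is_vowel($chars[$i])) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    if ($is_vowel($chars[0])) {
        // First two letters are vowels: region after the next consonant
        for ($i = 2; $i < $count; $i++) {
            if (!$is_vowel($chars[$i])) {
                return implode('', array_slice($chars, $i + 1));
            }
        }
        return '';
    }
    // Consonant-vowel case: region after the third letter
    return implode('', array_slice($chars, 3));
}

echo findRVSketch('trabajo'), "\n"; // prints "bajo"
echo findRVSketch('áureo'), "\n";   // prints "eo"
echo findRVSketch('macho'), "\n";   // prints "ho"
```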

mbStringToArray()

mbStringToArray(string  $string) : array

This method breaks up a multibyte string into its individual characters and returns them as an array.

Parameters

string

$string

a string of multibyte characters to break up

Returns

array —

of multibyte characters
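
One common way to split a UTF-8 string into characters in PHP is preg_split() with the u modifier. The sketch below uses that approach under the name mbStringToArraySketch, which is hypothetical; the source does not say which technique the class itself uses.

```php
<?php
// Hypothetical helper, not Yioop's mbStringToArray(): splits a UTF-8
// string into an array of characters rather than bytes.
function mbStringToArraySketch(string $string): array
{
    // The empty pattern with the u modifier matches between characters;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

$chars = mbStringToArraySketch('química');
echo count($chars), "\n";        // prints 7
echo implode(' ', $chars), "\n"; // prints "q u í m i c a"
```

Byte-oriented functions such as strlen() would count the accented letter as more than one unit, which is why the stemmer needs a character-level split.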