\seekquarry\yioop\locale\nl\resources\Tokenizer

This class has a collection of methods for Dutch locale-specific tokenization. In particular, it has a stemmer, a stop word remover, and a stub word segmenter.

Summary

Public methods: stopwordsRemover(), segment(), stem(), step3b()
Public properties: $stop_words, $no_stem_list, $removed_e_suffix
Constants: none found
Protected methods: none found
Protected properties: none found
Private methods: removeAllUmlautAndAcuteAccents(), substituteIAndY(), isVowel(), getRIndex(), step1(), step2(), step3a(), step4(), replace(), endsWith(), undouble()
Private properties: none found

Properties

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$removed_e_suffix

$removed_e_suffix : array

Boolean flag that records whether the e suffix was removed in step2()

Type

array

Methods

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
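
The behaviour described above can be sketched as follows. This is an illustrative Python sketch, not the Yioop PHP code: the function name remove_stop_words, the abbreviated STOP_WORDS list, and the whole-word, case-insensitive matching are all assumptions made for the example.

```python
import re

# Hypothetical, abbreviated stop word list; the real $stop_words is longer.
STOP_WORDS = ["de", "het", "een", "en", "van"]

# Match any stop word as a whole word, ignoring case (an assumption).
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)

def remove_stop_words(data):
    """Remove stop words from a string, or from each string in a list."""
    if isinstance(data, list):
        return [remove_stop_words(item) for item in data]
    without = _PATTERN.sub("", data)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(without.split())
```

Accepting either a string or a list mirrors the mixed $data parameter; the list case simply applies the string case to each element.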

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string $pre_segment

before segmentation

Returns

string —

a string with words separated by spaces; this stub does nothing to the input

stem()

stem(string  $word) : string

Computes the stem of a Dutch word

For example, lichamelijk, lichamelijke, lichamelijkheden and lichamen, all have licham as a stem

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

step3b()

step3b(string  $word, integer  $R2) : string

Search for the longest among the following suffixes, and perform the action indicated.

If in R2 and ends with eigend, eiging, igend or iging, remove it.

If in R2 and ends with ig preceded by an e, remove it.

If in R2 and ends with lijk, baar or bar, remove it.

Parameters

string $word

the string to stem

integer $R2

the index at which region R2 begins

Returns

string —

the string with the various endings removed if they exist

removeAllUmlautAndAcuteAccents()

removeAllUmlautAndAcuteAccents(string  $word) : string

Remove all umlaut and acute accents that need to be removed.

Parameters

string $word

the string to remove the umlauts and accents from

Returns

string —

the string with the umlauts and accents removed
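
A minimal Python sketch of this kind of accent stripping is shown below. It is not the Yioop implementation: the function name is hypothetical, and it assumes that removing the combining diaeresis and combining acute marks from every letter is acceptable, whereas the method above may restrict itself to particular vowels.

```python
import unicodedata

_DIAERESIS = "\u0308"  # combining diaeresis (umlaut)
_ACUTE = "\u0301"      # combining acute accent

def remove_umlaut_and_acute_accents(word):
    """Strip umlauts and acute accents, leaving other marks untouched."""
    # Decompose so that each accent becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if ch not in (_DIAERESIS, _ACUTE)
    )
    # Recompose whatever marks remain (for example a grave accent).
    return unicodedata.normalize("NFC", stripped)
```

Working on the decomposed form avoids having to enumerate every accented letter, at the cost of also stripping these two marks from consonants.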

substituteIAndY()

substituteIAndY(string  $word) : string

Put initial y, y after a vowel, and i between vowels into upper case.

Parameters

string $word

the string to transform

Returns

string —

the string with initial y, y after a vowel, and i between vowels converted to upper case
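
The substitution rule can be sketched in Python as follows. The function name and the vowel set (a, e, i, o, u, y and è) are assumptions for illustration; the source does not list which letters isVowel() accepts.

```python
# Assumed vowel set; the source does not enumerate it.
VOWELS = set("aeiouy\u00e8")

def substitute_i_and_y(word):
    """Upper-case initial y, y after a vowel, and i between vowels."""
    chars = list(word)
    for idx, ch in enumerate(chars):
        # Neighbours are checked against the original, unmodified word.
        prev_is_vowel = idx > 0 and word[idx - 1] in VOWELS
        next_is_vowel = idx + 1 < len(word) and word[idx + 1] in VOWELS
        if ch == "y" and (idx == 0 or prev_is_vowel):
            chars[idx] = "Y"
        elif ch == "i" and prev_is_vowel and next_is_vowel:
            chars[idx] = "I"
    return "".join(chars)
```

Marking these letters in upper case lets later steps treat them as consonants without losing the original spelling.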

isVowel()

isVowel(string  $letter) : boolean

Check that the letter is a vowel

Parameters

string $letter

the character to check

Returns

boolean —

true if it is a vowel, otherwise false

getRIndex()

getRIndex(string  $word, integer  $start) : integer

Get the R index. The R index is the first consonant that follows a vowel after the $start index.

Parameters

string $word

the string to search for the R index

integer $start

the index to start searching for the R index in the string

Returns

integer —

the R index if found, otherwise -1
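
The search can be sketched in Python as below. This is a sketch under stated assumptions, not the Yioop code: the vowel set is assumed, and the function returns the position just after the consonant (where the region begins, as in the Snowball definition of R1 and R2), whereas the method described above may return the position of the consonant itself.

```python
# Assumed vowel set; the source does not enumerate it.
VOWELS = set("aeiouy\u00e8")

def is_vowel(letter):
    """True if the single character is in the assumed vowel set."""
    return letter in VOWELS

def get_r_index(word, start=0):
    """Index just past the first consonant that follows a vowel.

    Only vowels at or after `start` are considered. Returns -1 if no
    such consonant exists.
    """
    for idx in range(start + 1, len(word)):
        if not is_vowel(word[idx]) and is_vowel(word[idx - 1]):
            return idx + 1
    return -1
```

Under these assumptions, calling the function with start set to the first result yields the second region: for lichamelijk the two calls give 3 and then 6, so the second region is elijk.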

step1()

step1(string  $word, integer  $R1) : string

Removes a valid en-ending from the word. An en-ending is valid when it is preceded by a non-vowel and not by gem.

Parameters

string $word

the string to stem

integer $R1

the index at which region R1 begins

Returns

string —

the string with any valid en-ending removed
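
A hedged Python sketch of the en-ending rule follows. The function name and vowel set are assumptions, and the sketch handles only the suffix en; whatever else step1() does (for example other suffixes or undoubling) is not covered here.

```python
# Assumed vowel set; the source does not enumerate it.
VOWELS = set("aeiouy\u00e8")

def remove_valid_en_ending(word, r1):
    """Drop a trailing 'en' that lies in R1 and is a valid en-ending."""
    if r1 < 0 or not word.endswith("en"):
        return word
    stem = word[:-2]
    # The suffix is in R1 when it starts at or after the R1 index.
    in_r1 = len(stem) >= r1
    # Valid: preceded by a non-vowel, and not preceded by "gem".
    valid = (
        bool(stem)
        and stem[-1] not in VOWELS
        and not stem.endswith("gem")
    )
    return stem if in_r1 and valid else word
```

For example, with an R1 index of 4, boeken loses its ending because k is a non-vowel and the stem does not end in gem.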

step2()

step2(string  $word) : string

Delete the suffix e if in R1 and preceded by a non-vowel, and then undouble the ending

Parameters

string $word

the string to stem

Returns

string —

the string with the suffix e deleted if it is in R1 and preceded by a non-vowel, and with the ending then undoubled
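
The rule can be sketched in Python as follows. Unlike step2(), which takes only $word, this hypothetical function receives the R1 index as an explicit argument, and it returns a flag alongside the word to play the role that $removed_e_suffix appears to play. The vowel set is also an assumption.

```python
# Assumed vowel set; the source does not enumerate it.
VOWELS = set("aeiouy\u00e8")

def undouble(word):
    """Drop one letter from a trailing kk, tt or dd."""
    return word[:-1] if word.endswith(("kk", "tt", "dd")) else word

def remove_e_suffix(word, r1):
    """Return (new_word, removed) after applying the e-suffix rule."""
    if (
        r1 >= 0
        and len(word) >= 2
        and word.endswith("e")
        and len(word) - 1 >= r1        # the e lies in R1
        and word[-2] not in VOWELS     # preceded by a non-vowel
    ):
        return undouble(word[:-1]), True
    return word, False
```

Returning the flag rather than storing it in shared state keeps the sketch self-contained; later steps that depend on whether an e was removed can read it from the result.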

step3a()

step3a(string  $word, integer  $R2) : string

Delete the letters heid if in R2 and not preceded by a c, and treat a preceding en as in step 1

Parameters

string $word

the string to stem

integer $R2

the index at which region R2 begins

Returns

string —

the string with the letters heid deleted if in R2 and not preceded by a c, and with a preceding en treated as in step 1

step4()

step4(string  $word) : string

If the word ends in CVD, where C is a non-vowel, D is a non-vowel other than I, and V is a doubled a, e, o or u, remove one of the vowels from V (for example, moom -> mom, weed -> wed).

Parameters

string $word

the string to check for the CVD combination

Returns

string —

the string with the doubled vowel shortened if it ends in the CVD combination, otherwise the original string
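
The CVD rule lends itself to a single regular expression. The Python sketch below is illustrative only: the function name is hypothetical and the set of letters treated as vowels is an assumption.

```python
import re

# C = non-vowel, V = doubled a/e/o/u, D = non-vowel other than "I".
# The vowel set (a, e, i, o, u, y, e-grave) is an assumption.
_CVD = re.compile(r"([^aeiouy\u00e8])(aa|ee|oo|uu)([^aeiouy\u00e8I])$")

def shorten_double_vowel(word):
    """Replace a trailing C-VV-D pattern with C-V-D."""
    return _CVD.sub(
        lambda m: m.group(1) + m.group(2)[0] + m.group(3),
        word,
    )
```

Words that do not end in the pattern, such as zee, pass through unchanged because the substitution only fires on a match.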

replace()

replace(string  $word, string  $regex, string  $replace, integer  $offset) : string

Replace part of a string based on a regular expression

Parameters

string $word

the string to search for regex replacement

string $regex

the regex used to find the text to replace

string $replace

the replacement string to use if the pattern is matched

integer $offset

the index at which to start looking for a match

Returns

string —

the string with the characters replaced if the regex matches, otherwise the original string
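
An offset-restricted replacement of this kind can be sketched in Python as below. The function name is hypothetical, and two behaviours are assumptions: only the first match at or after the offset is replaced, and a negative offset leaves the word untouched.

```python
import re

def replace_from(word, regex, replacement, offset):
    """Replace the first regex match found at or after `offset`."""
    if offset < 0:
        # Assumed: a negative index (no R region) means no replacement.
        return word
    match = re.compile(regex).search(word, offset)
    if match is None:
        return word
    # Splice the expanded replacement in place of the matched text.
    return word[:match.start()] + match.expand(replacement) + word[match.end():]
```

Passing a region index as the offset is what restricts suffix removal to that region: replace_from("lichamelijk", r"lijk$", "", 6) removes the suffix, while an offset of 8 does not.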

endsWith()

endsWith(string  $haystack, string  $needle, boolean  $case = true) : boolean

Checks to see if a string ends with a certain string

Parameters

string $haystack

the string to check

string $needle

the string to match at the end

boolean $case

whether the check should be case insensitive or not

Returns

boolean —

true if it ends with $needle, otherwise false
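
A short Python sketch of such a check follows. The source does not say whether a true $case means case-sensitive or case-insensitive; the sketch assumes case-sensitive and names its hypothetical parameter accordingly.

```python
def ends_with(haystack, needle, case_sensitive=True):
    """True if `haystack` ends with `needle`."""
    if not case_sensitive:
        # Compare lower-cased copies when case should be ignored.
        haystack, needle = haystack.lower(), needle.lower()
    return haystack.endswith(needle)
```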

undouble()

undouble(string  $word) : string

Undoubles the end of a string. If the string ends in kk, tt or dd, one of the doubled characters is removed.

Parameters

string $word

the string to undouble

Returns

string —

the undoubled string, otherwise the original string
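
The undoubling rule is small enough to sketch directly. This Python version is illustrative, with a hypothetical function name, and covers only the three endings listed above.

```python
def undouble(word):
    """Drop one letter from a trailing kk, tt or dd."""
    if word.endswith(("kk", "tt", "dd")):
        return word[:-1]
    return word
```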