\seekquarry\yioop\locale\de\resources\Tokenizer

German-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters make up a char gram.

This class has a collection of methods for German locale-specific tokenization. In particular, it has a stemmer and a stop word remover (for use mainly in word cloud creation). The stemmer is my stab at re-implementing the stemmer algorithm given at http://snowball.tartarus.org/algorithms/german/stemmer.html. Here, the stem of a word is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.

Summary

Public
  Methods: segment(), stopwordsRemover(), stem()
  Properties: $no_stem_list, $stop_words
  Constants: No constants found

Protected
  Methods: No protected methods found
  Properties: No protected properties found
  Constants: N/A

Private
  Methods: prelude(), markRegions(), backwardSuffix(), postlude()
  Properties: $vowel, $s_ending, $st_ending, $r1, $r1_index, $r2, $r2_index, $buffer
  Constants: N/A

File: src/locale/de/resources/Tokenizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\locale\de\resources\Tokenizer

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words :

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

Methods

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter, given the input "thisisabunchofwords", would output "this is a bunch of words".

Parameters

string

$pre_segment

the string before segmentation

Returns

string —

should return the string with its words separated by spaces; in this case the method does nothing
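To illustrate the kind of segmenter the stub could be replaced with, here is a minimal sketch of greedy longest-match segmentation. The function name segmentSketch() and the $dictionary parameter are hypothetical and not part of this class; the sketch also assumes single-byte input.

```php
<?php
/**
 * Hypothetical sketch (not part of Tokenizer): splits a run of characters
 * into dictionary words by always taking the longest entry that matches
 * at the current position.
 */
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $words = [];
    $pos = 0;
    $len = strlen($pre_segment);
    while ($pos < $len) {
        $best = '';
        foreach ($dictionary as $entry) {
            // Keep the longest dictionary entry starting at $pos.
            if (strlen($entry) > strlen($best) &&
                substr($pre_segment, $pos, strlen($entry)) === $entry) {
                $best = $entry;
            }
        }
        if ($best === '') {
            // No entry matches here: return the input unchanged.
            return $pre_segment;
        }
        $words[] = $best;
        $pos += strlen($best);
    }
    return implode(' ', $words);
}

echo segmentSketch(
    'thisisabunchofwords',
    ['this', 'is', 'a', 'bunch', 'of', 'words']
), "\n";
```

A greedy strategy can pick a long word that leaves an unsegmentable remainder, so a production segmenter would likely need backtracking or a statistical model.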

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed

$data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
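The behaviour described above can be sketched as follows. This is an illustrative stand-alone function, not the class's own code: the name removeStopWords(), the explicit $stop_words parameter, and the whitespace clean-up are assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch (not part of Tokenizer): removes whole-word
 * occurrences of the given stop words from a string or from each
 * string in an array.
 */
function removeStopWords($data, array $stop_words)
{
    // Quote each word so it is matched literally inside the pattern.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stop_words);
    // \b restricts matches to whole words; u = UTF-8, i = ignore case.
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/ui';
    $clean = function ($text) use ($pattern) {
        $text = preg_replace($pattern, '', $text);
        // Collapse the gaps left behind by the removed words.
        return trim(preg_replace('/\s+/u', ' ', $text));
    };
    return is_array($data) ? array_map($clean, $data) : $clean($data);
}

echo removeStopWords('Der Hund und die Katze', ['der', 'die', 'und']), "\n";
```

Because the pattern is anchored on word boundaries, "und" is removed as a word but the "und" inside "Hund" is left alone.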

stem()

stem(string  $word) : string

Computes the stem of a German word

Parameters

string

$word

the string to stem

Returns

string —

the stem of $word

prelude()

prelude()

Upper-cases u and y when they occur between vowels so that they are not treated as vowels for the purposes of this algorithm. Maps ß to ss.
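A minimal sketch of this step, assuming the vowel set a, e, i, o, u, y, ä, ö, ü from the Snowball description of the German stemmer. The function name preludeSketch() is hypothetical, and the class's own method works on internal state rather than taking a parameter.

```php
<?php
/**
 * Hypothetical sketch (not part of Tokenizer): maps ß to ss, then
 * upper-cases u and y wherever they sit between two vowels.
 */
function preludeSketch(string $word): string
{
    $vowels = 'aeiouyäöü'; // assumed vowel set
    $word = str_replace('ß', 'ss', $word);
    // Lookbehind and lookahead require a vowel on both sides of u or y.
    return preg_replace_callback(
        '/(?<=[' . $vowels . '])[uy](?=[' . $vowels . '])/u',
        function ($match) {
            return strtoupper($match[0]);
        },
        $word
    );
}

echo preludeSketch('bauen'), "\n";
echo preludeSketch('straße'), "\n";
```

The lookarounds inspect the original string, so in rare words with several adjacent u or y characters the result may differ from a strictly left-to-right pass.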

markRegions()

markRegions()

Computes the locations of RV, R1 and R2. RV is the region after the third letter; otherwise it is the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel.
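The R1 and R2 definitions can be sketched as below. This is an illustration only: the function names are hypothetical, the vowel set is assumed from the Snowball description, RV is left out, and indices are character offsets rather than byte offsets.

```php
<?php
/**
 * Hypothetical helper: index just past the first non-vowel that follows
 * a vowel, searching from position $from; count($chars) if none exists.
 */
function nextRegionStart(array $chars, int $from): int
{
    $vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'ä', 'ö', 'ü']; // assumed set
    $n = count($chars);
    for ($i = $from + 1; $i < $n; $i++) {
        if (in_array($chars[$i - 1], $vowels, true) &&
            !in_array($chars[$i], $vowels, true)) {
            return $i + 1;
        }
    }
    return $n;
}

/**
 * Hypothetical sketch (not part of Tokenizer): computes R1 and R2 for
 * a UTF-8 word and returns both the start indices and the region text.
 */
function markRegionsSketch(string $word): array
{
    // Split into UTF-8 characters so umlauts count as one letter.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $r1_index = nextRegionStart($chars, 0);
    // R2 applies the same rule again, starting inside R1.
    $r2_index = nextRegionStart($chars, $r1_index);
    return [
        'r1_index' => $r1_index,
        'r1' => implode('', array_slice($chars, $r1_index)),
        'r2_index' => $r2_index,
        'r2' => implode('', array_slice($chars, $r2_index)),
    ];
}

print_r(markRegionsSketch('häuser'));
```

For "häuser" the first non-vowel after a vowel is the s, so R1 is "er"; no further vowel and non-vowel pair ends before the end of the word, so R2 is empty.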

backwardSuffix()

backwardSuffix()

Used to strip suffixes off the word.
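As a rough illustration of suffix stripping, the sketch below removes the longest matching suffix only when it lies inside R1, in the spirit of the first step of the Snowball German algorithm. It is a simplification under stated assumptions: the function name and parameters are hypothetical, and the rules for a final s, for st endings, and for the later steps are omitted.

```php
<?php
/**
 * Hypothetical sketch (not part of Tokenizer): strips the longest suffix
 * from $suffixes that ends $word, provided that suffix lies within $r1.
 */
function stripSuffixInR1(string $word, string $r1, array $suffixes): string
{
    // Try longer suffixes first so that "ern" wins over "er" or "e".
    usort($suffixes, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    foreach ($suffixes as $suffix) {
        if ($suffix === '' || substr($word, -strlen($suffix)) !== $suffix) {
            continue;
        }
        // $r1 is itself a suffix of $word, so the match lies inside R1
        // exactly when it is no longer than $r1.
        if (strlen($suffix) <= strlen($r1)) {
            return substr($word, 0, strlen($word) - strlen($suffix));
        }
        // The longest match is outside R1: leave the word alone.
        return $word;
    }
    return $word;
}

$suffixes = ['em', 'ern', 'er', 'e', 'en', 'es'];
echo stripSuffixInR1('kindern', 'dern', $suffixes), "\n";
```

Passing the R1 text instead of an index keeps the comparison correct for words containing umlauts, since both lengths are measured in bytes of the same string.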

postlude()

postlude()

Converts capitalized U and Y back to lower case and removes any dots (umlauts) above vowels.
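A minimal sketch of this clean-up step, assuming the word is a UTF-8 string. The function name postludeSketch() is hypothetical; the class's own method operates on internal state.

```php
<?php
/**
 * Hypothetical sketch (not part of Tokenizer): undoes the upper-casing
 * done before stemming and replaces umlauted vowels by plain vowels.
 */
function postludeSketch(string $word): string
{
    // strtr with an array replaces each key by its value in one pass.
    return strtr($word, [
        'U' => 'u',
        'Y' => 'y',
        'ä' => 'a',
        'ö' => 'o',
        'ü' => 'u',
    ]);
}

echo postludeSketch('häus'), "\n";
```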