\seekquarry\yioop\locale\it\resources\Tokenizer

Italian-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters make up a char gram.

Summary

Public methods: segment(), stopwordsRemover(), stem()
Public properties: $no_stem_list, $stop_words
Constants: none
Protected methods: none
Protected properties: none
Private methods: checkForSuffix(), in(), r1(), r2(), rv(), getRegions(), isVowel(), maxSuffix(), acuteByGrave(), prelude(), step0(), step1(), step2(), step3a(), step3b(), postlude()
Private properties: $buffer, $r1_start, $r2_start, $rv_start, $r1_string, $r2_string, $rv_string, $max_suffix_pos, $step1_changes

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

$buffer

$buffer : string

Storage used in computing the stem

Type

string

$r1_start

$r1_start : integer

Storage used in computing the starting index of region R1

Type

integer

$r2_start

$r2_start : integer

Storage used in computing the starting index of region R2

Type

integer

$rv_start

$rv_start : integer

Storage used in computing the starting index of region RV

Type

integer

$r1_string

$r1_string : string

Storage used in computing region R1

Type

string

$r2_string

$r2_string : string

Storage used in computing region R2

Type

string

$rv_string

$rv_string : string

Storage used in computing region RV

Type

string

$max_suffix_pos

$max_suffix_pos : integer

Storage for computing the starting position for the longest suffix

Type

integer

$step1_changes

$step1_changes : boolean

Storage used in determining whether step1 removed any endings from the word

Type

boolean

Methods

segment()

segment(string  $pre_segment) : string

This method currently does nothing. For some locales it could be used to split strings of the form "thisisastring" into a string with the words separated: "this is a string"

Parameters

string $pre_segment

string to be segmented

Returns

string —

the string after segmentation (unchanged in this case)

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
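
As a rough illustration of the string-or-array behaviour described above, here is a minimal Python sketch (not the class's PHP code). The name remove_stop_words, the tiny sample list, and the case-insensitive whole-word matching are all assumptions; the real list is held in $stop_words, and this sketch also discards punctuation, which the actual method may not do.

```python
import re

# Hypothetical sample only; the real list lives in $stop_words.
SAMPLE_STOP_WORDS = {"il", "la", "di", "e", "che", "un", "per"}

def remove_stop_words(data, stop_words=SAMPLE_STOP_WORDS):
    """Accept a string or a list of strings, like the mixed $data parameter."""
    if isinstance(data, list):
        # Apply the same filtering to every element of the list.
        return [remove_stop_words(item, stop_words) for item in data]
    # Split into word tokens and keep those not in the stop list.
    words = re.findall(r"\w+", data)
    return " ".join(w for w in words if w.lower() not in stop_words)
```

Called on a list, the sketch returns a list of the same length, so the shape of the input is preserved.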

stem()

stem(string  $word) : string

Computes the stem of an Italian word. Example: guardando, guardandogli, guardandola, and guardano all stem to guard.

Parameters

string $word

is the word to be stemmed

Returns

string —

stem of $word

checkForSuffix()

checkForSuffix(  $parent_string,   $substring) : \seekquarry\yioop\locale\it\resources\$pos

Checks if a string is a suffix for another string

Parameters

$parent_string

is the string in which we wish to find the suffix

$substring

is the suffix we wish to check

Returns

\seekquarry\yioop\locale\it\resources\$pos —

as the starting position of the suffix $substring in $parent_string if it exists, else false
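
The position-or-false contract can be sketched in Python as follows. This is an illustration rather than the PHP implementation; the name check_for_suffix and the treatment of an empty suffix are assumptions.

```python
def check_for_suffix(parent_string, substring):
    """Return the index where substring starts if parent_string ends
    with it, otherwise False."""
    # Assumption: an empty suffix is treated as "not found".
    if substring and parent_string.endswith(substring):
        return len(parent_string) - len(substring)
    return False
```

Because a suffix that spans the whole string yields position 0, a caller has to test the result with an identity check (`is False` here, `=== false` in PHP) rather than plain truthiness.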

in()

in(string  $string, string  $substring) : boolean

Checks if a string occurs in another string

Parameters

string $string

is the parent string

string $substring

is the string checked to be a sub-string of $string

Returns

boolean —

if $substring is a substring of $string

r1()

r1(  $string) : \seekquarry\yioop\locale\it\resources\$r1_start

Computes the starting index for region R1

Parameters

$string

is the string for which we wish to find the index

Returns

\seekquarry\yioop\locale\it\resources\$r1_start —

as the starting index for R1 for $string
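
This page does not define R1. Assuming the class follows the Snowball Italian stemmer, R1 is the region after the first non-vowel that follows a vowel, or the empty region at the end of the word if there is no such non-vowel. A minimal Python sketch under that assumption (r1_start and VOWELS are hypothetical names):

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = set("aeiouàèìòù")

def r1_start(word):
    """Index at which region R1 begins in word."""
    for i in range(1, len(word)):
        # First non-vowel immediately preceded by a vowel.
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    # No such position: R1 is the empty region at the end.
    return len(word)
```

For "guardando" this gives index 4, so R1 would be "dando".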

r2()

r2(  $string) : \seekquarry\yioop\locale\it\resources\$r2_start

Computes the starting index for region R2

Parameters

$string

is the string for which we wish to find the index

Returns

\seekquarry\yioop\locale\it\resources\$r2_start —

as the starting index for R2 for $string
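
Under the Snowball definition (assumed here, not stated on this page), R2 is found by applying the R1 rule again, starting inside R1. The Python sketch below uses hypothetical names and is not the PHP code:

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = set("aeiouàèìòù")

def region_start(word, begin=0):
    """Index after the first non-vowel that follows a vowel,
    searching only from position begin onwards."""
    for i in range(begin + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r2_start(word):
    """R2 is the R1 rule applied within R1."""
    return region_start(word, region_start(word))
```

For "guardando", R1 starts at index 4 ("dando") and R2 at index 7 ("do").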

rv()

rv(  $string) : \seekquarry\yioop\locale\it\resources\$rv_start

Computes the starting index for region RV

Parameters

$string

is the string for which we wish to find the index

Returns

\seekquarry\yioop\locale\it\resources\$rv_start —

as the starting index for RV for $string
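
The rule for RV is not spelled out here. The Snowball Italian stemmer defines it in three cases, and the Python sketch below assumes the class follows that definition (rv_start and VOWELS are hypothetical names):

```python
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = set("aeiouàèìòù")

def rv_start(word):
    """Index at which region RV begins in word."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant followed by vowel: RV starts after the third letter.
    return 3 if n >= 3 else n
```

When no qualifying position exists, the sketch returns the word length, which makes RV the empty region at the end of the word.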

getRegions()

getRegions() 

Computes regions R1, R2, and RV as strings, stored in $r1_string, $r2_string, and $rv_string respectively

isVowel()

isVowel(  $char) : boolean

Checks if a character is a vowel or not

Parameters

$char

is the character to be checked

Returns

boolean —

if $char is a vowel

maxSuffix()

maxSuffix(  $string,   $suffixes) : \seekquarry\yioop\locale\it\resources\$max_suffix

Computes the longest suffix for a given string from a given set of suffixes

Parameters

$string

is the string for which the longest suffix is to be found

$suffixes

is an array of suffixes

Returns

\seekquarry\yioop\locale\it\resources\$max_suffix —

is the longest suffix for $string
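
Selecting the longest matching suffix from a candidate list can be sketched in Python as below. The name max_suffix and the choice to return an empty string when nothing matches are assumptions; the PHP method may signal a miss differently and also records a position in $max_suffix_pos.

```python
def max_suffix(string, suffixes):
    """Return the longest element of suffixes that string ends with."""
    best = ""
    for suffix in suffixes:
        # Keep a candidate only if it matches and beats the current best.
        if string.endswith(suffix) and len(suffix) > len(best):
            best = suffix
    return best
```

Preferring the longest match matters when candidates overlap: for "guardandogli" and the candidates "i", "li", and "gli", all three match, but only "gli" removes the whole pronoun.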

acuteByGrave()

acuteByGrave(  $string) : \seekquarry\yioop\locale\it\resources\$string

Replaces all acute accents in a string by grave accents and also handles accented characters

Parameters

$string

is the string in which the acute accents are to be replaced

Returns

\seekquarry\yioop\locale\it\resources\$string —

with changes

prelude()

prelude() 

Performs the following functions: replaces acute accents with grave accents; marks u after q, and u or i preceded and followed by a vowel, as non-vowels by converting them to upper case
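
The two prelude operations can be sketched in Python as follows. This is not the PHP code: the names are hypothetical, the input is assumed to be lower case already, and neighbouring letters are checked against the word as it stood before any marking.

```python
# Map acute-accented vowels to their grave-accented forms.
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")
# Assumed Italian vowel set, including grave-accented forms.
VOWELS = set("aeiouàèìòù")

def prelude(word):
    """Normalise accents, then upper-case u/i that act as non-vowels."""
    word = word.translate(ACUTE_TO_GRAVE)
    chars = list(word)
    last = len(word) - 1
    for i, ch in enumerate(word):
        if ch == "u" and i > 0 and word[i - 1] == "q":
            chars[i] = "U"  # u after q
        elif (ch in "ui" and 0 < i < last
              and word[i - 1] in VOWELS and word[i + 1] in VOWELS):
            chars[i] = ch.upper()  # u or i between two vowels
    return "".join(chars)
```

Upper-casing works as a marker because the vowel set contains only lower-case letters, so a marked U or I is treated as a consonant by later vowel checks.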

step0()

step0() 

Handles attached pronouns

step1()

step1() 

Handles standard suffixes

step2()

step2() 

Handles verb suffixes

step3a()

step3a() 

Deletes a final a, e, i, o, à, è, ì, or ò if it is in RV, along with a preceding i if that is also in RV
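
A Python sketch of this step follows. It is an illustration, not the PHP method: the real step3a() takes no arguments and works on $buffer, whereas this hypothetical version receives the word and the RV start index explicitly.

```python
def step3a(word, rv_start):
    """Drop a final vowel inside RV, then a preceding i inside RV."""
    # A final character lies in RV when its index is >= rv_start,
    # which is the same as len(word) > rv_start.
    if len(word) > rv_start and word[-1] in "aeioàèìò":
        word = word[:-1]
        if len(word) > rv_start and word[-1] == "i":
            word = word[:-1]
    return word
```

With an RV start of 3, "crocchio" loses first the final o and then the preceding i, leaving "crocch".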

step3b()

step3b() 

Replaces a final ch/gh by c/g if in RV
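
The replacement can be sketched in Python as below. As with the other sketches, this is hypothetical illustration code: the PHP step3b() takes no arguments, while this version is given the word and the RV start index.

```python
def step3b(word, rv_start):
    """Turn a final ch or gh into c or g when the pair lies in RV."""
    # The two-letter ending is in RV when it starts at or after rv_start.
    if word.endswith(("ch", "gh")) and len(word) - 2 >= rv_start:
        word = word[:-1]  # dropping the h leaves the c or g
    return word
```

With an RV start of 3, "crocch" becomes "crocc"; an ending that begins before RV is left untouched.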

postlude()

postlude() 

Converts U and/or I back to lowercase
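
Reversing the upper-case markers is a pair of replacements. The Python sketch below uses a hypothetical name and assumes that the only upper-case letters in the word are the U and I markers introduced by the prelude.

```python
def postlude(word):
    """Restore marker letters U and I to lower case."""
    return word.replace("U", "u").replace("I", "i")
```

For example, a marked form such as "qUale" comes back as "quale".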