\seekquarry\yioop\locale\zh_CN\resources\Tokenizer

Chinese-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.

Summary

Public methods

stopwordsRemover()
segment()
isCardinalNumber()
isOrdinalNumber()
isDate()
isPunctuation()
isNotCurrentLang()
createStochasticTermSegmenter()
destroyStochasticTermSegmenter()
getStochasticTermSegmenter()
POSGetKey()
createNER()
destroyNER()
getNER()
createPosTagger()
destoryPosTagger()
getPosTagger()
extractTripletsPhrases()
tagTokenizePartOfSpeech()
parseTypeList()
parseAdjective()
parseDeterminer()
parseNoun()
parseVerb()
parsePrepositionalPhrases()
parseNounPhrase()
parseVerbPhrase()
parseWholePhrase()
extractTripletsParseTree()
extractDeepestSpeechPartPhrase()
extractObjectParseTree()
extractPredicateParseTree()
extractSubjectParseTree()
rearrangeTripletsByType()
extractTripletByType()
questionParser()
isQuestion()
parseQuestion()
questionType()

Public properties

$stop_words
$non_char_preg
$num_dict
$dot
$num_end
$exception_list
$punctuation_preg
$question_token
$question_words
$adjective_type
$adverb_type
$conjunction_type
$determiner_type
$noun_type
$verb_type
$particle_type

Constants: none

Protected methods: none

Protected properties: none

Private methods: none

Private properties

$stochasticTermSegmenter
$namedEntityRecognizer
$posTagger

File: src/locale/zh_CN/resources/Tokenizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\locale\zh_CN\resources\Tokenizer

Tags

author	Chris Pollett

Properties

$stop_words

$stop_words : array

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

array

$non_char_preg

$non_char_preg : string

Regular expression used to determine whether none of the characters in a term belong to the current language.

Type

string

$num_dict

$num_dict : string

The dictionary of characters that can be used in Chinese numbers

Type

string

$dot

$dot : string

Dots used in Chinese numbers

Type

string

$num_end

$num_end : string

A list of characters that can be used at the end of numbers

Type

string

$exception_list

$exception_list : array

Words that match the regular expressions used by isCardinalNumber, isOrdinalNumber, and isDate but should not be treated as numbers or dates. For example, "十分" usually means "very", but those functions would otherwise interpret it as "10 minutes", so it has to be excluded.

Type

array of string
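The interaction between the number character sets and the exception list can be sketched as follows. This is only an illustration: the constants SKETCH_NUM_DICT, SKETCH_NUM_END, and SKETCH_EXCEPTIONS are hypothetical stand-ins for $num_dict, $num_end, and $exception_list, and the pattern is an assumed simplification, not the regular expression the class actually uses.

```php
<?php
// Hypothetical stand-ins for $num_dict, $num_end, and $exception_list.
const SKETCH_NUM_DICT = "0-9零一二三四五六七八九十百千万亿";
const SKETCH_NUM_END = "分秒";
const SKETCH_EXCEPTIONS = ["十分"];

/**
 * Returns true if $term looks like a number (optionally followed by one
 * ending character) and is not listed as an exception.
 */
function sketchLooksNumeric(string $term): bool
{
    // Exceptions win over the pattern: "十分" usually means "very".
    if (in_array($term, SKETCH_EXCEPTIONS, true)) {
        return false;
    }
    $pattern = '/^[' . SKETCH_NUM_DICT . ']+[' . SKETCH_NUM_END . ']?$/u';
    return preg_match($pattern, $term) === 1;
}

var_dump(sketchLooksNumeric("三十分")); // matches the pattern
var_dump(sketchLooksNumeric("十分"));   // matches the pattern but is excluded
```

Checking the exception list before the pattern keeps the regular expression simple; the alternative of encoding exceptions as negative lookaheads would be harder to maintain.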

$punctuation_preg

$punctuation_preg : string

A list of characters that can be used as Chinese punctuation

Type

string

$question_token

$question_token : string

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

Type

string

$question_words

$question_words : array

Array of words used to determine whether a sentence passed in is a question

Type

array

$adjective_type

$adjective_type : array

List of adjective-like parts of speech that might appear in the lexicon file:
Predicative adjective: VA
Other noun-modifier: JJ

Type

array

$adverb_type

$adverb_type : array

List of adverb-like parts of speech that might appear in the lexicon file

Type

array

$conjunction_type

$conjunction_type : array

List of conjunction-like parts of speech that might appear in the lexicon file:
Coordinating conjunction: CC
Subordinating conjunction: CS

Type

array

$determiner_type

$determiner_type : array

List of determiner-like parts of speech that might appear in the lexicon file:
Determiner: DT
Cardinal Number: CD
Ordinal Number: OD
Measure word: M

Type

array

$noun_type

$noun_type : array

List of noun-like parts of speech that might appear in the lexicon file:
Proper Noun: NR
Temporal Noun: NT
Other Noun: NN
Pronoun: PN

Type

array

$verb_type

$verb_type : array

List of verb-like parts of speech that might appear in the lexicon file:
Copula: VC
you3 as the main verb: VE
Other verb: VV
Short passive voice: SB
Long passive voice: LB

Type

array

$particle_type

$particle_type : array

List of particle-like parts of speech that might appear in the lexicon file. These are words with no meaning of their own that can appear anywhere.

Type

array

$stochasticTermSegmenter

$stochasticTermSegmenter : object

Stochastic Term Segmenter instance

Type

object

$namedEntityRecognizer

$namedEntityRecognizer : object

Named Entity Recognizer instance

Type

object

$posTagger

$posTagger : object

PosTagger instance

Type

object

Methods

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed	$data	either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
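A minimal sketch of how stop word removal over either input type might look. It assumes the text has already been segmented into space-separated terms, and SKETCH_STOP_WORDS is a hypothetical three-word stand-in for the much longer $stop_words list; the function name is illustrative and not part of the class.

```php
<?php
// Hypothetical stop word list; the real $stop_words array is much longer.
const SKETCH_STOP_WORDS = ["的", "了", "是"];

/**
 * Removes stop words from a space-separated string, or from each
 * string in an array of such strings.
 */
function sketchStopwordsRemover($data)
{
    if (is_array($data)) {
        // Apply the same removal to every element of the array.
        return array_map('sketchStopwordsRemover', $data);
    }
    // Split on whitespace, drop stop words, and rejoin with single spaces.
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, function ($term) {
        return !in_array($term, SKETCH_STOP_WORDS, true);
    });
    return implode(" ", $kept);
}

echo sketchStopwordsRemover("我 是 学生"), "\n";
```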

segment()

segment(string  $pre_segment, string  $method = "STS") : string

A word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string	$pre_segment	before segmentation
string	$method	indicates which method to use

Returns

string —

with words separated by space
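To make the input and output shape concrete, here is a small sketch of a dictionary-based segmenter. Note that it swaps in forward maximum matching, a much simpler technique than the stochastic term segmenter selected by the default "STS" method; the function name, the $dictionary parameter, and $max_len are all assumptions made for the illustration.

```php
<?php
/**
 * Forward maximum matching: at each position take the longest
 * dictionary word that matches; fall back to a single character.
 */
function sketchSegment(string $pre_segment, array $dictionary,
    int $max_len = 5): string
{
    // Split into Unicode characters so multi-byte text is handled correctly.
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $num_chars = count($chars);
    $words = [];
    $pos = 0;
    while ($pos < $num_chars) {
        $match_len = 1; // fallback: emit one character as its own word
        for ($len = min($max_len, $num_chars - $pos); $len > 1; $len--) {
            $candidate = implode("", array_slice($chars, $pos, $len));
            if (in_array($candidate, $dictionary, true)) {
                $match_len = $len;
                break;
            }
        }
        $words[] = implode("", array_slice($chars, $pos, $match_len));
        $pos += $match_len;
    }
    return implode(" ", $words);
}

$dictionary = ["this", "is", "a", "bunch", "of", "words"];
echo sketchSegment("thisisabunchofwords", $dictionary), "\n";
// prints: this is a bunch of words
```

Greedy matching is easy to follow but can pick the wrong split when dictionary words overlap, which is one reason a statistical segmenter is preferable in practice.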

isCardinalNumber()

isCardinalNumber(  $term)

Check if the term passed in is a Cardinal Number

Parameters

$term

isOrdinalNumber()

isOrdinalNumber(  $term)

Parameters

$term

isDate()

isDate(  $term)

Parameters

$term

isPunctuation()

isPunctuation(  $term)

Parameters

$term

isNotCurrentLang()

isNotCurrentLang(  $term) : boolean

Checks whether all the characters in the term are NOT in the current language

Parameters

	$term	the string to be checked

Returns

boolean —

true if none of the characters in $term are in the current language, false otherwise
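One way such a check could be written is sketched below. It assumes that "current language" can be approximated by the Unicode Han script and uses the PCRE property \p{Han} for that; the class itself relies on $non_char_preg, whose exact pattern is not shown in this document.

```php
<?php
/**
 * Returns true when $term contains no Han-script character at all.
 */
function sketchIsNotCurrentLang(string $term): bool
{
    // \p{Han} matches any character in the Han script (needs the /u flag).
    return preg_match('/\p{Han}/u', $term) === 0;
}

var_dump(sketchIsNotCurrentLang("hello")); // no Han characters
var_dump(sketchIsNotCurrentLang("abc中")); // contains one Han character
```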

createStochasticTermSegmenter()

createStochasticTermSegmenter(  $cache_pct = 0.06)

Parameters

$cache_pct

destroyStochasticTermSegmenter()

destroyStochasticTermSegmenter()

getStochasticTermSegmenter()

getStochasticTermSegmenter()

POSGetKey()

POSGetKey(  $term)

Parameters

$term

createNER()

createNER()

destroyNER()

destroyNER()

getNER()

getNER()

createPosTagger()

createPosTagger()

Create POSTagger instance

destoryPosTagger()

destoryPosTagger()

getPosTagger()

getPosTagger()

extractTripletsPhrases()

extractTripletsPhrases(array  $word_and_phrase_list) : array

Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

Parameters

array	$word_and_phrase_list	of statements

Returns

array —

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
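The idea of replacing one component of a triplet with a question marker can be sketched as below. The marker value "qqq", the space-joined question format, and the function name are assumptions for illustration only; the real marker is whatever $question_token holds.

```php
<?php
/**
 * Builds a map from question strings to answers by replacing each
 * component of a (subject, predicate, object) triplet in turn.
 */
function sketchQuestionList(string $subject, string $predicate,
    string $object, string $marker = "qqq"): array
{
    $triplet = [$subject, $predicate, $object];
    $question_answer_list = [];
    foreach ($triplet as $slot => $answer) {
        $question = $triplet;
        $question[$slot] = $marker; // hide one component
        $question_answer_list[implode(" ", $question)] = $answer;
    }
    return $question_answer_list;
}

print_r(sketchQuestionList("我", "喜欢", "音乐"));
```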

tagTokenizePartOfSpeech()

tagTokenizePartOfSpeech(string  $text) : array

Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.

Parameters

string	$text	string to tag and tokenize

Returns

array —

of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text
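The shape of the returned array can be illustrated with a toy lexicon lookup. This is only a sketch: it assumes the input is already segmented into space-separated terms, the $lexicon map and the default tag NN for unknown terms are hypothetical, and the real method uses the PosTagger instance rather than a lookup table.

```php
<?php
/**
 * Tags each space-separated term using a term => tag lookup table.
 */
function sketchTagTerms(string $segmented_text, array $lexicon): array
{
    $tagged = [];
    $terms = preg_split('/\s+/u', trim($segmented_text), -1,
        PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        // Unknown terms default to NN (other noun) in this sketch.
        $tagged[] = ["token" => $term, "tag" => $lexicon[$term] ?? "NN"];
    }
    return $tagged;
}

$lexicon = ["我" => "PN", "喜欢" => "VV"];
print_r(sketchTagTerms("我 喜欢 音乐", $lexicon));
```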

parseTypeList()

parseTypeList(array&  $cur_node, array  $tagged_phrase, string  $type) : string

Starting at the $cur_node in a $tagged_phrase parse tree for a sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

Parameters

array&	$cur_node	node within parse tree
array	$tagged_phrase	parse tree for phrase
string	$type	self::$noun_type, self::$verb_type, etc

Returns

string —

phrase string involving only terms of that $type
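A rough sketch of the scanning step described above: starting at a position, consume consecutive terms whose tag belongs to a group and advance the position by reference. Here $cur_node is assumed to be an integer index and $type an array of tags, and tokens are joined with spaces; these are simplifications chosen for the sketch and may differ from the actual method.

```php
<?php
/**
 * Collects consecutive tokens whose tag is in $type, starting at
 * $cur_node, and leaves $cur_node at the first non-matching term.
 */
function sketchParseTypeList(int &$cur_node, array $tagged_phrase,
    array $type): string
{
    $tokens = [];
    while (isset($tagged_phrase[$cur_node]) &&
        in_array($tagged_phrase[$cur_node]["tag"], $type, true)) {
        $tokens[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    return implode(" ", $tokens);
}

$tagged_phrase = [
    ["token" => "北京", "tag" => "NR"],
    ["token" => "大学", "tag" => "NN"],
    ["token" => "招收", "tag" => "VV"],
];
$cur_node = 0;
// Noun-like tags as listed for $noun_type in this document.
echo sketchParseTypeList($cur_node, $tagged_phrase,
    ["NR", "NT", "NN", "PN"]), "\n";
```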

parseAdjective()

parseAdjective(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed

parseDeterminer()

parseDeterminer(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

parseNoun(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseVerb()

parseVerb(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parsePrepositionalPhrases()

parsePrepositionalPhrases(array  $tagged_phrase, array  $tree, integer  $index = 1) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]
integer	$index	which term in $tagged_phrase to start to try to parse a preposition from

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, "NN_i" with value an additional noun subtree

parseNounPhrase()

parseNounPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, "NN" with value a noun tree

parseVerbPhrase()

parseVerbPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
array	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree, "NP" with value a noun phrase subtree

parseWholePhrase()

parseWholePhrase(array  $tagged_phrase,   $tree,   $tree_np_pre = array()) : array

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

Parameters

array	$tagged_phrase	an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
	$tree	that consists of ["cur_node" => current parse position in $tagged_phrase]
	$tree_np_pre	subject found from previous sub-sentence

Returns

array —

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; $tree["VP"], which contains a subtree for a verb phrase
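The recursive descent approach can be sketched with a deliberately tiny grammar: a sentence is a noun phrase followed by a verb phrase, a noun phrase is optional determiners and adjectives followed by nouns, and a verb phrase is verbs followed by an optional noun phrase. All function names below are hypothetical, leaves are plain strings rather than token subarrays, and prepositional phrases and $tree_np_pre are left out, so this shows only the general shape of the technique.

```php
<?php
// Tag groups taken from the property descriptions in this document.
const SKETCH_DETERMINERS = ["DT", "CD", "OD", "M"];
const SKETCH_ADJECTIVES = ["VA", "JJ"];
const SKETCH_NOUNS = ["NR", "NT", "NN", "PN"];
const SKETCH_VERBS = ["VC", "VE", "VV", "SB", "LB"];

/** Collects consecutive tokens whose tag is in $types, starting at $pos. */
function sketchTakeRun(array $tagged_phrase, int &$pos, array $types): array
{
    $tokens = [];
    while (isset($tagged_phrase[$pos]) &&
        in_array($tagged_phrase[$pos]["tag"], $types, true)) {
        $tokens[] = $tagged_phrase[$pos]["token"];
        $pos++;
    }
    return $tokens;
}

/** NP -> [DT] [JJ] NN; returns [] when no noun is found. */
function sketchParseNounPhrase(array $tagged_phrase, int &$pos): array
{
    $start = $pos;
    $np = [];
    $groups = ["DT" => SKETCH_DETERMINERS, "JJ" => SKETCH_ADJECTIVES,
        "NN" => SKETCH_NOUNS];
    foreach ($groups as $label => $types) {
        $tokens = sketchTakeRun($tagged_phrase, $pos, $types);
        if ($tokens) {
            $np[$label] = implode(" ", $tokens);
        }
    }
    if (!isset($np["NN"])) {
        $pos = $start; // backtrack: a noun phrase needs a noun
        return [];
    }
    return $np;
}

/** VP -> VB [NP]; returns [] when no verb is found. */
function sketchParseVerbPhrase(array $tagged_phrase, int &$pos): array
{
    $verbs = sketchTakeRun($tagged_phrase, $pos, SKETCH_VERBS);
    if (!$verbs) {
        return [];
    }
    $vp = ["VB" => implode(" ", $verbs)];
    $np = sketchParseNounPhrase($tagged_phrase, $pos);
    if ($np) {
        $vp["NP"] = $np;
    }
    return $vp;
}

/** S -> NP VP; records how far parsing got in "cur_node". */
function sketchParseWholePhrase(array $tagged_phrase): array
{
    $pos = 0;
    $tree = [];
    $np = sketchParseNounPhrase($tagged_phrase, $pos);
    if ($np) {
        $tree["NP"] = $np;
    }
    $vp = sketchParseVerbPhrase($tagged_phrase, $pos);
    if ($vp) {
        $tree["VP"] = $vp;
    }
    $tree["cur_node"] = $pos;
    return $tree;
}
```

Each grammar rule maps to one function, and each function either consumes terms and advances the position or leaves the position unchanged, which is what lets the caller try the next rule.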

extractTripletsParseTree()

extractTripletsParseTree(array  $tree) : array

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed.

Parameters

array	$tree	a parse tree for a sentence

Returns

array —

triplet array

extractDeepestSpeechPartPhrase()

extractDeepestSpeechPartPhrase(array  $tree, string  $pos) : string

Takes a phrase tree $tree and a part-of-speech $pos and returns the deepest $pos-only path in the tree.

Parameters

array	$tree	phrase to extract type from
string	$pos	the part of speech to extract

Returns

string —

the label of deepest $pos only path in $tree

extractObjectParseTree()

extractObjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

extractPredicateParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

extractSubjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

rearrangeTripletsByType()

rearrangeTripletsByType(array  $sub_pred_obj_triplets) : array

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

Parameters

array	$sub_pred_obj_triplets	in format described above

Returns

array —

$processed_triplets in format described above
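The rearrangement amounts to swapping the two levels of nesting. A minimal sketch, assuming lowercase keys "subject", "predicate", and "object" (the actual key names are not given in this document) and leaving out the QUESTION_ANSWER_LIST subfield:

```php
<?php
/**
 * Turns [part => [type => value]] into [type => [part => value]].
 */
function sketchRearrangeByType(array $sub_pred_obj_triplets): array
{
    $processed_triplets = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            // Missing entries become empty strings in this sketch.
            $processed_triplets[$type][$part] =
                $sub_pred_obj_triplets[$part][$type] ?? "";
        }
    }
    return $processed_triplets;
}

$triplets = [
    "subject" => ["CONCISE" => "那个 学生", "RAW" => "学生"],
    "predicate" => ["CONCISE" => "喜欢", "RAW" => "喜欢"],
    "object" => ["CONCISE" => "古典 音乐", "RAW" => "音乐"],
];
print_r(sketchRearrangeByType($triplets));
```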

extractTripletByType()

extractTripletByType(array  $sub_pred_obj_triplets, string  $type) : array

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

Parameters

array	$sub_pred_obj_triplets	in format described above
string	$type	either CONCISE or RAW

Returns

array —

$triplets in format described above

questionParser()

questionParser(string  $question) : array

Takes any question that starts with a WH question word and returns the triplet from the question

Parameters

string	$question	question to parse

Returns

array —

question triplet

isQuestion()

isQuestion(  $phrase) : boolean

Takes a phrase query entered by the user and returns true if it is a question and false if not

Parameters

	$phrase	any statement

Returns

boolean —

returns the question word if the statement is a question
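A simplified sketch of question detection by word lookup. The list SKETCH_QUESTION_WORDS is a hypothetical sample, not the contents of $question_words, and the sketch returns only a boolean; the actual method works on segmented terms with the help of questionType().

```php
<?php
// Hypothetical sample; the real $question_words array is not shown here.
const SKETCH_QUESTION_WORDS = ["谁", "什么", "哪里", "吗"];

/**
 * Returns true if any question word occurs in $phrase.
 */
function sketchIsQuestion(string $phrase): bool
{
    foreach (SKETCH_QUESTION_WORDS as $word) {
        // Byte-wise search is safe here because both strings are UTF-8.
        if (strpos($phrase, $word) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(sketchIsQuestion("谁发明了电话"));
var_dump(sketchIsQuestion("我喜欢音乐"));
```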

parseQuestion()

parseQuestion(string  $tagged_question, integer  $index, string  $question_word) : array

Takes a tagged question string that starts with Who and returns the question triplet from the question string

Parameters

string	$tagged_question	part-of-speech tagged question
integer	$index	current index in statement
string	$question_word	the question word that needs to be replaced

Returns

array —

parsed triplet

questionType()

questionType(  $term_array,   $type_list)

Helper function for isQuestion

Parameters

	$term_array	segmented Chinese terms
	$type_list	current trace of self::$question_words

Returns

["ques_words" => ques_words, "types" => types]