\seekquarry\yioop\library\summarizers\CentroidWeightedSummarizer

Class which may be used by TextProcessors to get a summary for a text document that may later be used for indexing. This is done by the getSummary() method. To generate a summary, a normalized term frequency vector is computed for each sentence. An average vector is then computed by summing these vectors and renormalizing the result.

The computation of this average vector is biased by weighting the vectors of earlier sentences more heavily when computing the sum of vectors. The weights come from a Zipf-like distribution. Once an average sentence is obtained, sentences are scored against it using a residual cosine similarity score. That is, the most important sentence is determined by cosine rank. Then the component of this sentence in the direction of the average sentence is deleted from the average sentence, and the next most important sentence is computed by ranking against this new average sentence vector, and so on.
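The residual ranking described above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the class's PHP code. The helper names `dot` and `residual_rank` are hypothetical, vectors are assumed to be unit-length dicts of term => weight, and the "deletion" step is read here as subtracting from the average the part already explained by the chosen sentence, which is only one plausible interpretation of the description.

```python
def dot(u, v):
    """Inner product of two sparse vectors stored as term => weight dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def residual_rank(sentence_vectors, average):
    """Rank sentences against a shrinking copy of the average vector.

    Returns a dict of sentence index => score, in the order the
    sentences were selected (most important first).
    """
    residual = dict(average)  # work on a copy, leave the input intact
    remaining = set(range(len(sentence_vectors)))
    scores = {}
    while remaining:
        # Pick the sentence closest to what is left of the average;
        # ties go to the earlier sentence.
        best = max(sorted(remaining),
                   key=lambda i: dot(sentence_vectors[i], residual))
        score = dot(sentence_vectors[best], residual)
        scores[best] = score
        remaining.remove(best)
        # Assumed reading of the deletion step: remove the chosen
        # sentence's contribution from the residual average.
        for term, weight in sentence_vectors[best].items():
            residual[term] = residual.get(term, 0.0) - score * weight
    return scores
```

With two identical sentences, the second copy scores near zero once the first has been selected, so a redundant sentence falls to the bottom of the ranking.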

Summary

Public methods

getSummary()
getPunctuatedUnpunctuatedSentences()
getSentences()
formatSentence()
formatDoc()
pageProcessing()
removeStopWords()
removePunctuation()
getTermsFromSentences()
computeTermFrequenciesPerSentence()
getTermFrequencies()
wordCloudFromSummary()
wordCloudFromTermVector()
getSummaryFromSentenceScores()
numSentencesForSummary()
getAverageSentence()
scoreSentencesVersusAverage()

Public constants

MAX_DISTINCT_TERMS
CENTROID_COMPONENTS
WORD_CLOUD_LEN

No public properties found. No protected or private methods, properties, or constants found.

File: src/library/summarizers/CentroidWeightedSummarizer.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\summarizers\Summarizer

\seekquarry\yioop\library\summarizers\CentroidWeightedSummarizer

Constants

MAX_DISTINCT_TERMS

Number of distinct terms to use in generating summary

CENTROID_COMPONENTS

Number of nonzero centroid components

WORD_CLOUD_LEN

Number of words in word cloud

Methods

getSummary()

getSummary(object  $dom, string  $page, string  $lang) : array

Generates a summary, word cloud, and summary scores based on the closeness of normalized term frequency vectors to an average term frequency vector for sentences.

Parameters

object	$dom	document object model of page to summarize
string	$page	complete raw page to generate the summary from.
string	$lang	language of the page, used to decide which stop words and which language-specific tokenizer.php to use.

Returns

array —

a triple (string summary, array word cloud, array of position => scores for positions within the summary)

getPunctuatedUnpunctuatedSentences()

getPunctuatedUnpunctuatedSentences(object  $dom, string  $content, string  $lang) : array

Breaks any content into sentences with and without punctuation

Parameters

object	$dom	a document object to extract a description from.
string	$content	complete page.
string	$lang	locale tag of the language for data being processed

Returns

array —

array [sentences_with_punctuation, sentences_with_punctuation_stripped]

getSentences()

getSentences(string  $content) : array

Breaks any content into sentences by splitting it on spaces or carriage returns

Parameters

string	$content	complete page.

Returns

array —

array of sentences from that content.

formatSentence()

formatSentence(string  $sentence) : string

Formats a sentence by removing all characters except word characters, digits and spaces

Parameters

string	$sentence	sentence to format.

Returns

string —

formatted sentences.
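As a rough illustration of this kind of cleanup, the following Python sketch strips everything except letters, digits and whitespace. It is not the class's PHP code. The function name `format_sentence` and the exact character classes are assumptions, since the documentation does not give the pattern actually used.

```python
import re


def format_sentence(sentence: str) -> str:
    """Drop every character that is not a letter, digit or whitespace."""
    # \w also matches the underscore, so it is removed explicitly.
    return re.sub(r"[^\w\s]|_", "", sentence)
```

For instance, `format_sentence("Hello, world! It's 2024.")` yields `Hello world Its 2024`.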

formatDoc()

formatDoc(string  $content) : string

Formats the document to remove carriage returns, hyphens and digits as we will not be using digits in word cloud.

The formatted document generated by this function is only used to compute the centroid.

Parameters

string	$content	formatted page.

Returns

string —

formatted document.

pageProcessing()

pageProcessing(string  $page) : string

Performs additional processing on the page, such as removing all tags from it

Parameters

string	$page	complete page.

Returns

string —

processed page.

removeStopWords()

removeStopWords(array  $sentences, object  $stop_obj) : array

Returns a new array of sentences without the stop words

Parameters

array	$sentences	the array of sentences to process
object	$stop_obj	the class that has the stopwordsRemover method

Returns

array —

a new array of sentences without the stop words

removePunctuation()

removePunctuation(array  $sentences) : array

Remove punctuation from an array of sentences

Parameters

array	$sentences	the sentences in the doc

Returns

array —

the array of sentences with the punctuation removed

getTermsFromSentences()

getTermsFromSentences(array  $sentences, string  $lang) : array

Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.

Parameters

array	$sentences	the sentences in the doc
string	$lang	locale tag for stemming

Returns

array —

an array of terms in the array of sentences

computeTermFrequenciesPerSentence()

computeTermFrequenciesPerSentence(array  $sentences, string  $lang) : array

Splits sentences into terms and returns [array of terms, array of normalized term frequencies]

Parameters

array	$sentences	the array of sentences to process
string	$lang	the current locale

Returns

array —

a pair [array of terms, array of normalized term frequencies]
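A minimal Python sketch of per-sentence normalized term frequencies follows, for illustration only; it is not the class's PHP code. It assumes whitespace tokenization with no stemming, and it assumes "normalized" means scaled to unit Euclidean length, which the documentation does not state. The name `normalized_term_frequencies` is hypothetical.

```python
import math
from collections import Counter


def normalized_term_frequencies(sentences):
    """Return (terms, vectors) for a list of sentence strings.

    terms is a sorted list of distinct terms; vectors holds one sparse
    dict per sentence mapping term index => normalized frequency.
    """
    tokenized = [s.lower().split() for s in sentences]
    terms = sorted({t for tokens in tokenized for t in tokens})
    index = {t: i for i, t in enumerate(terms)}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed normalization: divide by the Euclidean length.
        norm = math.sqrt(sum(c * c for c in counts.values()))
        if norm == 0:
            vectors.append({})  # empty sentence, nothing to normalize
        else:
            vectors.append({index[t]: c / norm for t, c in counts.items()})
    return terms, vectors
```

Storing term indices rather than term strings keeps the per-sentence vectors compact, at the cost of carrying the term list alongside them.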

getTermFrequencies()

getTermFrequencies(array  $terms, mixed  $sentence_or_sentences) : array

Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences

Parameters

array	$terms	the list of all terms in the doc
mixed	$sentence_or_sentences	either a single string sentence or an array of sentences

Returns

array —

sequence of term => frequency pairs

wordCloudFromSummary()

wordCloudFromSummary(string  $summary, string  $lang, array  $term_frequencies = null) : array

Generates an array of most important words from a string $summary.

Currently, the algorithm is based on term frequencies after stop words have been removed

Parameters

string	$summary	text to derive most important words of
string	$lang	locale tag for language of $summary
array	$term_frequencies	a supplied list of terms and frequencies for words in summary. If null then these will be computed.

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $summary

wordCloudFromTermVector()

wordCloudFromTermVector(array  $term_vector, mixed  $terms = false) : array

Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms

Parameters

array	$term_vector	if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.
mixed	$terms	if not false, then should be an array of terms, at a minimum having all the indices of $term_vector

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $term_vector
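The two input shapes described for $term_vector can be illustrated with the following Python sketch, which is not the class's PHP code. The name `word_cloud_from_term_vector` and the default `word_cloud_len=5` are assumptions, because the value of WORD_CLOUD_LEN is not given here. Unlike the documented method, the sketch sorts by weight itself rather than expecting sorted input.

```python
def word_cloud_from_term_vector(term_vector, terms=None, word_cloud_len=5):
    """Return the highest-weighted terms, most important first.

    If terms is None, term_vector maps term => weight. Otherwise it
    maps term index => weight and terms resolves each index.
    """
    ranked = sorted(term_vector.items(), key=lambda kv: kv[1], reverse=True)
    top_keys = [key for key, _ in ranked[:word_cloud_len]]
    if terms is None:
        return top_keys
    return [terms[key] for key in top_keys]
```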

getSummaryFromSentenceScores()

getSummaryFromSentenceScores(array  $sentence_scores, array  $sentences, string  $lang) : string

Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of length up to PageProcessor::$max_description_len based on the highest scored sentences concatenated in the order they appeared in the original document.

Parameters

array	$sentence_scores	an array sorted by score of sentence_index => score pairs.
array	$sentences	the array of sentences corresponding to sentence $sentence_scores indices
string	$lang	language of the page, used to decide which stop words and which language-specific tokenizer.php to use.

Returns

string —

a string that represents the summary
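The selection idea, taking the best-scored sentences until a length budget is exhausted and then emitting them in document order, might look like the following Python sketch. It is illustrative only and not the class's PHP code. The name `summary_from_sentence_scores`, the explicit `max_len` parameter (standing in for PageProcessor::$max_description_len), the character-based budget, and the single-space separator are all assumptions.

```python
def summary_from_sentence_scores(sentence_scores, sentences, max_len):
    """Build a summary from the best sentences within a length budget.

    sentence_scores maps sentence index => score and is assumed to be
    ordered from highest to lowest score.
    """
    chosen = []
    used = 0
    for index in sentence_scores:
        cost = len(sentences[index]) + 1  # +1 for the joining space
        if used + cost > max_len:
            break  # the next best sentence no longer fits
        chosen.append(index)
        used += cost
    # Restore the original document order before concatenating.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Re-sorting the chosen indices keeps the summary readable, because sentences appear in the order the author wrote them rather than in score order.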

numSentencesForSummary()

numSentencesForSummary(array  $sentence_scores, array  $sentences) : integer

Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.

Parameters

array	$sentence_scores	associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).
array	$sentences	sentences in doc in their original order

Returns

integer —

number of sentences

getAverageSentence()

getAverageSentence(array  $term_frequencies_normalized) : array

Computes an average sentence by adding the normalized term frequency vectors for each sentence, weighted by a Zipf-like distribution on sentence index, and normalizing the resulting vector

Parameters

array	$term_frequencies_normalized	the array with the terms as the keys and their normalized frequencies as the values

Returns

array —

a normalized vector of term => weights
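A minimal Python sketch of a position-weighted average is given below for illustration; it is not the class's PHP code. The exact Zipf-like weights are not stated in this documentation, so the sketch assumes a weight of 1/(i + 1) for the sentence at 0-based position i and assumes normalization to unit Euclidean length. The name `average_sentence` is hypothetical.

```python
import math


def average_sentence(sentence_vectors):
    """Weighted sum of sparse sentence vectors, renormalized.

    sentence_vectors is a list of term => weight dicts in document
    order; earlier sentences contribute more to the result.
    """
    total = {}
    for position, vector in enumerate(sentence_vectors):
        weight = 1.0 / (position + 1)  # assumed Zipf-like weight
        for term, value in vector.items():
            total[term] = total.get(term, 0.0) + weight * value
    norm = math.sqrt(sum(v * v for v in total.values()))
    if norm == 0:
        return {}  # no terms at all, nothing to normalize
    return {term: value / norm for term, value in total.items()}
```

Under these assumed weights, a term that appears only in the first sentence ends up with twice the weight of an equally frequent term that appears only in the second.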

scoreSentencesVersusAverage()

scoreSentencesVersusAverage(array  $sentence_vectors, array  $average_sentence) : array

Computes scores for each sentence => word vector in an array of sentence => word_vectors based on how it compares with an average sentence word vector. Here word vectors are normalized vectors and scores are determined by inner product.

Parameters

array	$sentence_vectors	the array with the terms as the keys and their normalized frequencies as the values
array	$average_sentence	an array of each word's average frequency value

Returns

array —

array of sentence index => score pairs