\seekquarry\yioop\library\summarizers\CentroidSummarizer

Class which may be used by TextProcessors to get a summary for a text document that may later be used for indexing. It also generates a word cloud for a document. This is done by the getSummary() method. getSummary() works by splitting the document into sentences and computing inverse sentence frequency scores for each term (strictly ISF, but called IDF here). It then computes an average document vector, called the centroid, with components (total number of occurrences of term) * (IDF score of term).

Notice that if we divided the centroid by the number of sentences (each sentence plays the role of a document), we would have components (average term frequency) * (IDF score of term). As ranking by either won't affect our results, we don't divide. We then compute the cosine similarity of each sentence vector with this centroid and choose the top sentences to make our summary. Here a sentence vector has components (term frequency in sentence) * (IDF score of term).

Summary

Public methods

getSummary()
getPunctuatedUnpunctuatedSentences()
getSentences()
formatSentence()
formatDoc()
pageProcessing()
removeStopWords()
removePunctuation()
getTermsFromSentences()
computeTermFrequenciesPerSentence()
getTermFrequencies()
wordCloudFromSummary()
wordCloudFromTermVector()
getSummaryFromSentenceScores()
numSentencesForSummary()
computeCentroidIdfFromSentences()
scoreSentencesVersusPageTerms()

Public properties

No public properties found

Constants

MAX_DISTINCT_TERMS
CENTROID_COMPONENTS
WORD_CLOUD_LEN

Protected and private members

No protected or private methods or properties found

Constants

MAX_DISTINCT_TERMS

Number of distinct terms to use in generating summary

CENTROID_COMPONENTS

Number of nonzero centroid components

WORD_CLOUD_LEN

Number of words in word cloud

Methods

getSummary()

getSummary(object  $dom, string  $page, string  $lang) : array

Generates a summary, word cloud, and sentence scoring for a provided web page. To do this, the page is split into sentences and inverse sentence frequency scores (strictly ISF, but called IDF here) are computed for each term. Then an average document vector, called the centroid, with components (total number of occurrences of term) * (IDF score of term) is found. We then compute the cosine similarity of each sentence vector with this centroid and choose the top sentences to make our summary. Here a sentence vector has components (term frequency in sentence) * (IDF score of term).

Parameters

object $dom

document object model of page to summarize

string $page

complete raw page to generate the summary from.

string $lang

language of the page, used to decide which stop words to remove by calling the tokenizer.php of the specified language.

Returns

array —

a triple (string summary, array word cloud, array of position => scores for positions within the summary)
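
The quantities described above can be written out as follows. This is only a sketch of the notation: N is the number of sentences, n_t the number of sentences containing term t, f_t the total number of occurrences of t in the page, and tf_{i,t} the frequency of t in sentence i. The logarithmic form of idf_t is an assumption (a common choice); the text above does not state the exact formula used.

```latex
\begin{align*}
\mathrm{idf}_t &= \log\frac{N}{n_t} && \text{(assumed form)} \\
c_t &= f_t \cdot \mathrm{idf}_t && \text{(centroid component)} \\
s_{i,t} &= \mathrm{tf}_{i,t} \cdot \mathrm{idf}_t && \text{(sentence vector component)} \\
\mathrm{score}(i) &= \frac{\sum_t s_{i,t}\, c_t}{\lVert s_i \rVert \,\lVert c \rVert} && \text{(cosine similarity)}
\end{align*}
```

Dividing every c_t by N rescales the centroid without changing its direction, so the cosine scores, and hence the sentence ranking, are unaffected.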

getPunctuatedUnpunctuatedSentences()

getPunctuatedUnpunctuatedSentences(object  $dom, string  $content, string  $lang) : array

Breaks any content into sentences with and without punctuation

Parameters

object $dom

a document object to extract a description from.

string $content

complete page.

string $lang

locale tag of the language for data being processed

Returns

array —

array [sentences_with_punctuation, sentences_with_punctuation_stripped]

getSentences()

getSentences(string  $content) : array

Breaks any content into sentences by splitting it on spaces or carriage returns

Parameters

string $content

complete page.

Returns

array —

array of sentences from that content.

formatSentence()

formatSentence(string  $sentence) : string

Formats the sentences to remove all characters except words, digits and spaces

Parameters

string $sentence

the sentence to format.

Returns

string —

formatted sentences.
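
As a rough illustration of this kind of cleanup, the Python sketch below strips everything except word characters, digits and whitespace. It is not the PHP implementation; the function name `format_sentence` and the exact character classes are assumptions made for the example.

```python
import re


def format_sentence(sentence: str) -> str:
    """Drop every character that is not a letter, digit or whitespace.

    Hypothetical helper: \\w also matches the underscore, so underscores
    are removed explicitly. Whitespace is left untouched.
    """
    return re.sub(r"[^\w\s]|_", "", sentence)


print(format_sentence("Hello, world! It's 2024."))
```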

formatDoc()

formatDoc(string  $content) : string

Formats the document to remove carriage returns, hyphens and digits, as digits are not used in the word cloud.

The formatted document generated by this function is only used to compute centroid.

Parameters

string $content

formatted page.

Returns

string —

formatted document.

pageProcessing()

pageProcessing(string  $page) : string

Performs additional processing on the page, such as removing all the tags from the page

Parameters

string $page

complete page.

Returns

string —

processed page.

removeStopWords()

removeStopWords(array  $sentences, object  $stop_obj) : array

Returns a new array of sentences without the stop words

Parameters

array $sentences

the array of sentences to process

object $stop_obj

the class that has the stopwordsRemover method

Returns

array —

a new array of sentences without the stop words

removePunctuation()

removePunctuation(array  $sentences) : array

Remove punctuation from an array of sentences

Parameters

array $sentences

the sentences in the doc

Returns

array —

the array of sentences with the punctuation removed

getTermsFromSentences()

getTermsFromSentences(array  $sentences, string  $lang) : array

Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.

Parameters

array $sentences

the sentences in the doc

string $lang

locale tag for stemming

Returns

array —

an array of terms in the array of sentences

computeTermFrequenciesPerSentence()

computeTermFrequenciesPerSentence(array  $sentences, string  $lang) : array

Splits sentences into terms and returns [array of terms, array of normalized term frequencies]

Parameters

array $sentences

the array of sentences to process

string $lang

the current locale

Returns

array —

an array with [array of terms, array of normalized term frequencies] pairs

getTermFrequencies()

getTermFrequencies(array  $terms, mixed  $sentence_or_sentences) : array

Calculates an array whose keys are terms and whose values are the frequencies of those terms in a supplied sentence or sentences

Parameters

array $terms

the list of all terms in the doc

mixed $sentence_or_sentences

either a single string sentence or an array of sentences

Returns

array —

sequence of term => frequency pairs
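
A minimal Python sketch of this kind of counting is shown below. It is not the PHP method: the name `term_frequencies`, whitespace tokenization, and the choice to omit terms that never occur are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Union


def term_frequencies(
    terms: List[str], sentence_or_sentences: Union[str, List[str]]
) -> Dict[str, int]:
    """Count how often each known term occurs in one or more sentences."""
    if isinstance(sentence_or_sentences, str):
        text = sentence_or_sentences
    else:
        text = " ".join(sentence_or_sentences)
    # Assumption: terms are separated by whitespace.
    counts = Counter(text.split())
    # Keep only the supplied terms, and only those that actually occur.
    return {term: counts[term] for term in terms if counts[term] > 0}


print(term_frequencies(["cat", "dog", "emu"], ["cat dog cat", "dog"]))
```

The same function accepts a single string, mirroring the mixed $sentence_or_sentences argument.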

wordCloudFromSummary()

wordCloudFromSummary(string  $summary, string  $lang, array  $term_frequencies = null) : array

Generates an array of most important words from a string $summary.

Currently, the algorithm is based on term frequencies after stop words are removed

Parameters

string $summary

text to derive most important words of

string $lang

locale tag for language of $summary

array $term_frequencies

a supplied list of terms and frequencies for words in summary. If null then these will be computed.

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $summary

wordCloudFromTermVector()

wordCloudFromTermVector(array  $term_vector, mixed  $terms = false) : array

Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms

Parameters

array $term_vector

if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.

mixed $terms

if not false, then should be an array of terms, at a minimum having all the indices of $term_vector

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $term_vector
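
The two accepted input shapes can be illustrated with the Python sketch below. It is an illustration only: `WORD_CLOUD_LEN = 5` is a made-up value, `None` stands in for the PHP default of false, and the vector is assumed to be already sorted by descending weight.

```python
from typing import Dict, List, Optional, Union

WORD_CLOUD_LEN = 5  # hypothetical value, not taken from the source


def word_cloud_from_term_vector(
    term_vector: Dict[Union[str, int], float],
    terms: Optional[List[str]] = None,
) -> List[str]:
    """Return the first WORD_CLOUD_LEN terms of a weight-sorted vector."""
    cloud: List[str] = []
    for key in term_vector:  # dicts keep insertion (here: score) order
        # With no term list the keys are the terms themselves;
        # otherwise the keys are indices into the term list.
        cloud.append(key if terms is None else terms[key])
        if len(cloud) >= WORD_CLOUD_LEN:
            break
    return cloud


print(word_cloud_from_term_vector({"search": 3.2, "index": 2.1}))
print(word_cloud_from_term_vector({2: 0.9, 0: 0.5}, ["x", "y", "z"]))
```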

getSummaryFromSentenceScores()

getSummaryFromSentenceScores(array  $sentence_scores, array  $sentences, string  $lang) : string

Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of length up to PageProcessor::$max_description_len based on the highest scored sentences concatenated in the order they appeared in the original document.

Parameters

array $sentence_scores

an array sorted by score of sentence_index => score pairs.

array $sentences

the array of sentences that the indices of $sentence_scores refer to

string $lang

language of the page, used to decide which stop words to remove by calling the tokenizer.php of the specified language.

Returns

string —

a string that represents the summary
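
The key step here is picking sentences by score but emitting them in document order. The Python sketch below shows that step in isolation. It is not the PHP method: the number of sentences to keep is passed in as a hypothetical `num_sentences` argument, whereas the method described above derives it from the description length limit.

```python
from typing import Dict, List


def summary_from_sentence_scores(
    sentence_scores: Dict[int, float],
    sentences: List[str],
    num_sentences: int,
) -> str:
    """Join the best-scoring sentences in their original document order."""
    # sentence_scores is assumed sorted from highest to lowest score.
    top_indices = list(sentence_scores)[:num_sentences]
    # Sorting the indices restores the order of appearance in the document.
    return " ".join(sentences[i] for i in sorted(top_indices))


doc = ["Alpha one.", "Beta two.", "Gamma three."]
scores = {2: 0.9, 0: 0.8, 1: 0.1}
print(summary_from_sentence_scores(scores, doc, 2))
```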

numSentencesForSummary()

numSentencesForSummary(array  $sentence_scores, array  $sentences) : integer

Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.

Parameters

array $sentence_scores

associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).

array $sentences

sentences in doc in their original order

Returns

integer —

number of sentences
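
One way to compute such a count is to walk the sentences from best to worst score and stop once the length budget would be exceeded. The Python sketch below does this under stated assumptions: `max_len` is a hypothetical parameter standing in for the description length limit, and separators between sentences are not counted.

```python
from typing import Dict, List


def num_sentences_for_summary(
    sentence_scores: Dict[int, float],
    sentences: List[str],
    max_len: int,
) -> int:
    """Count how many top-scored sentences fit within max_len characters."""
    total = 0
    count = 0
    for index in sentence_scores:  # assumed sorted, highest score first
        total += len(sentences[index])
        if total > max_len:
            break
        count += 1
    return count


doc = ["Alpha one.", "Beta two.", "Gamma three."]
scores = {2: 0.9, 0: 0.8, 1: 0.1}
print(num_sentences_for_summary(scores, doc, 25))
```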

computeCentroidIdfFromSentences()

computeCentroidIdfFromSentences(array  $terms, array  $sentences, string  $formatted_doc, string  $lang) : array

Computes a vector, over all terms in the document, with components (number of occurrences of term) * (inverse sentence frequency of term), as well as the inverse sentence frequency of each term in the document.

Parameters

array $terms

distinct terms in a document

array $sentences

sentences of a document

string $formatted_doc

original document with some punctuation removed

string $lang

locale tag for document

Returns

array —

[the (number of occurrences of term) * (inverse sentence frequency) vector, truncated to at most self::CENTROID_COMPONENTS components, array of inverse sentence frequencies for each term in the document]
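
The Python sketch below illustrates one way such a centroid and idf array could be built. It is a sketch under assumptions, not the PHP implementation: idf is taken as log(N / n_t), sentences are tokenized on whitespace, `CENTROID_COMPONENTS = 3` is a made-up value, and the $formatted_doc and $lang arguments are left out.

```python
import math
from typing import Dict, List, Tuple

CENTROID_COMPONENTS = 3  # hypothetical value, not taken from the source


def centroid_and_idf(
    terms: List[str], sentences: List[str]
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Return (centroid, idf), both keyed by index into terms."""
    tokenized = [sentence.split() for sentence in sentences]
    num_sentences = len(tokenized)
    idf: Dict[int, float] = {}
    centroid: Dict[int, float] = {}
    for index, term in enumerate(terms):
        containing = sum(1 for tokens in tokenized if term in tokens)
        if containing == 0:
            continue  # term never occurs, so it has no idf or weight
        # Assumed idf form: log of (sentence count / sentences with term).
        idf[index] = math.log(num_sentences / containing)
        occurrences = sum(tokens.count(term) for tokens in tokenized)
        weight = occurrences * idf[index]
        if weight > 0:
            centroid[index] = weight
    # Keep only the largest components, highest weight first.
    top = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:CENTROID_COMPONENTS]), idf


centroid, idf = centroid_and_idf(
    ["cat", "dog", "the"], ["the cat sat", "the dog ran", "the cat ran"]
)
print(centroid, idf)
```

With this idf form, a term that appears in every sentence gets idf 0 and drops out of the centroid, which is why "the" contributes nothing in the example.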

scoreSentencesVersusPageTerms()

scoreSentencesVersusPageTerms(array  $sentences, array  $centroid, array  $idf, array  $terms) : array

Calculates scores for an array of sentences by taking the dot product of each sentence's normalized tf-idf vector with the centroid vector.

Parameters

array $sentences

unpunctuated sentences from a source in the order they originally appeared in the source

array $centroid

an array of term_index => nt * idf scores for that term. Here nt is the number of times the term appears in the whole document, and idf is the inverse document frequency of that term amongst the sentences

array $idf

array of pairs of form term_index => inverse document frequencies of term amongst sentences

array $terms

an array of terms from the sentences that term_indexes mentioned above index into

Returns

array —

scores for each sentence
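
A minimal Python sketch of this scoring step follows. It is illustrative rather than the PHP implementation: sentences are tokenized on whitespace, and the centroid is also normalized so that the result is a cosine similarity. Normalizing the centroid scales every score by the same constant, so it does not change the ranking.

```python
import math
from typing import Dict, List


def score_sentences(
    sentences: List[str],
    centroid: Dict[int, float],
    idf: Dict[int, float],
    terms: List[str],
) -> Dict[int, float]:
    """Cosine similarity of each sentence's tf-idf vector with the centroid."""
    centroid_norm = math.sqrt(sum(w * w for w in centroid.values()))
    scores: Dict[int, float] = {}
    for position, sentence in enumerate(sentences):
        tokens = sentence.split()
        # Sentence vector: term_index => (term frequency) * idf
        vector: Dict[int, float] = {}
        for index, term in enumerate(terms):
            weight = tokens.count(term) * idf.get(index, 0.0)
            if weight:
                vector[index] = weight
        norm = math.sqrt(sum(w * w for w in vector.values()))
        if norm == 0 or centroid_norm == 0:
            scores[position] = 0.0  # no overlap with any weighted term
            continue
        dot = sum(w * centroid.get(i, 0.0) for i, w in vector.items())
        scores[position] = dot / (norm * centroid_norm)
    return scores


print(score_sentences(
    ["cat cat", "dog", "bird"],
    {0: 3.0, 1: 4.0},
    {0: 1.0, 1: 2.0},
    ["cat", "dog"],
))
```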