\seekquarry\yioop\library\summarizers\Summarizer

Base class for all summarizers. A summarizer's chief method is getSummary, which takes a text or XML document and produces a summary of that document of up to PageProcessor::$max_description_len characters. Summarizers also contain various methods to generate a word cloud from such a summary.

Summary

Public methods:
getSummary()
getPunctuatedUnpunctuatedSentences()
getSentences()
formatSentence()
formatDoc()
pageProcessing()
removeStopWords()
removePunctuation()
getTermsFromSentences()
computeTermFrequenciesPerSentence()
getTermFrequencies()
wordCloudFromSummary()
wordCloudFromTermVector()
getSummaryFromSentenceScores()
numSentencesForSummary()

Public properties: none

Constants:
MAX_DISTINCT_TERMS
CENTROID_COMPONENTS
WORD_CLOUD_LEN

Protected methods: none
Protected properties: none
Private methods: none
Private properties: none

Constants

MAX_DISTINCT_TERMS

Number of distinct terms to use in generating summary

CENTROID_COMPONENTS

Number of nonzero centroid components

WORD_CLOUD_LEN

Number of words in word cloud

Methods

getSummary()

getSummary(object  $dom, string  $page, string  $lang) : array

Compute a summary, word cloud, and scores for text ranges within the summary of a document in a given language

Parameters

object $dom

document object model used to locate items for summary

string $page

raw document from which sentences should be extracted

string $lang

locale tag for language the summary is in

Returns

array —

[$summary, $word_cloud, $summary_scores]

getPunctuatedUnpunctuatedSentences()

getPunctuatedUnpunctuatedSentences(object  $dom, string  $content, string  $lang) : array

Breaks any content into sentences with and without punctuation

Parameters

object $dom

a document object to extract a description from.

string $content

complete page.

string $lang

locale tag of the language for the data being processed

Returns

array —

array [sentences_with_punctuation, sentences_with_punctuation_stripped]

getSentences()

getSentences(string  $content) : array

Breaks any content into sentences by splitting it on spaces or carriage returns

Parameters

string $content

complete page.

Returns

array —

array of sentences from that content.

formatSentence()

formatSentence(string  $sentence) : string

Formats the sentences to remove all characters except words, digits and spaces

Parameters

string $sentence

sentence to format.

Returns

string —

formatted sentences.
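
As an illustration of the kind of filtering described here, the following is a minimal sketch, not Yioop's actual implementation. The function name formatSentenceSketch and the choice to replace removed characters with a space are assumptions.

```php
<?php
// Hypothetical helper (not part of Yioop): keep only letters, digits
// and spaces in a sentence.
function formatSentenceSketch(string $sentence): string
{
    // Replace each run of characters that is not a Unicode letter,
    // a Unicode digit, or whitespace with a single space.
    $cleaned = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $sentence);
    // Collapse the whitespace runs left behind and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $cleaned));
}

echo formatSentenceSketch("Hello, world! It's 2024."), "\n";
// prints: Hello world It s 2024
```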

formatDoc()

formatDoc(string  $content) : string

Formats the document to remove carriage returns, hyphens and digits as we will not be using digits in word cloud.

The formatted document generated by this function is only used to compute centroid.

Parameters

string $content

formatted page.

Returns

string —

formatted document.
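
A rough sketch of this kind of clean-up is shown below. It is an assumption-laden illustration rather than the library's code: the name formatDocSketch is hypothetical, and treating both "\r" and "\n" as line breaks to remove is a guess.

```php
<?php
// Hypothetical helper (not part of Yioop): strip line breaks, hyphens
// and digits before computing term statistics.
function formatDocSketch(string $content): string
{
    // Turn carriage returns, newlines and hyphens into spaces so that
    // neighbouring words do not run together.
    $content = str_replace(["\r", "\n", "-"], ' ', $content);
    // Remove every Unicode digit.
    $content = preg_replace('/\p{N}+/u', '', $content);
    // Collapse repeated whitespace and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $content));
}

echo formatDocSketch("state-of-the-art in\r\n2024"), "\n";
// prints: state of the art in
```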

pageProcessing()

pageProcessing(string  $page) : string

Performs additional processing on the page, such as removing all tags from it

Parameters

string $page

complete page.

Returns

string —

processed page.
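
Tag removal of this sort can be sketched with PHP's built-in strip_tags(). The example below is only an illustration under assumptions: the name pageProcessingSketch is hypothetical, and dropping script/style contents and decoding entities are extra steps that the description above does not mention.

```php
<?php
// Hypothetical helper (not part of Yioop): reduce an HTML page to text.
function pageProcessingSketch(string $page): string
{
    // Drop script and style elements together with their contents,
    // because strip_tags() alone would leave their text behind.
    $page = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $page);
    // Put a space before every tag so adjacent words do not fuse.
    $page = str_replace('<', ' <', $page);
    $text = strip_tags($page);
    // Decode entities such as &amp; into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo pageProcessingSketch(
    '<p>Fish &amp; chips</p><script>var x = 1;</script><p>Done.</p>'
), "\n";
// prints: Fish & chips Done.
```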

removeStopWords()

removeStopWords(array  $sentences, object  $stop_obj) : array

Returns a new array of sentences without the stop words

Parameters

array $sentences

the array of sentences to process

object $stop_obj

the class that has the stopwordsRemover method

Returns

array —

a new array of sentences without the stop words
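
The sketch below shows one way a stop-word object could be applied to each sentence. Everything in it is hypothetical: the ToyStopWords class, its four-word stop list, and the assumption that the object exposes a stopwordsRemover() method that takes and returns a string.

```php
<?php
// Hypothetical stand-in for a language tokenizer (not part of Yioop).
class ToyStopWords
{
    private array $stop = ['the', 'a', 'of', 'is'];

    // Assumed method name and signature: string in, string out.
    public function stopwordsRemover(string $text): string
    {
        $kept = [];
        foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            // strtolower() is enough for this ASCII-only toy list.
            if (!in_array(strtolower($word), $this->stop, true)) {
                $kept[] = $word;
            }
        }
        return implode(' ', $kept);
    }
}

// Apply the stop-word object to every sentence, keeping array keys.
function removeStopWordsSketch(array $sentences, object $stop_obj): array
{
    $out = [];
    foreach ($sentences as $i => $sentence) {
        $out[$i] = $stop_obj->stopwordsRemover($sentence);
    }
    return $out;
}

print_r(removeStopWordsSketch(
    ['The cat is a hunter', 'Tail of the cat'], new ToyStopWords()
));
// yields ['cat hunter', 'Tail cat']
```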

removePunctuation()

removePunctuation(array  $sentences) : array

Remove punctuation from an array of sentences

Parameters

array $sentences

the sentences in the doc

Returns

array —

the array of sentences with the punctuation removed

getTermsFromSentences()

getTermsFromSentences(array  $sentences, string  $lang) : array

Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.

Parameters

array $sentences

the sentences in the doc

string $lang

locale tag for stemming

Returns

array —

an array of terms in the array of sentences
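
Selecting the most frequent terms can be sketched as follows. This is a simplified illustration, not the library's code: the function name and the $max_terms parameter (standing in for self::MAX_DISTINCT_TERMS) are assumptions, and stemming by locale is omitted.

```php
<?php
// Hypothetical helper (not part of Yioop): most frequent terms first.
function topTermsSketch(array $sentences, int $max_terms): array
{
    $counts = [];
    foreach ($sentences as $sentence) {
        $words = preg_split('/\s+/u', strtolower($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    // Sort by count, highest first, keeping term => count association.
    arsort($counts);
    // Keep at most $max_terms terms.
    return array_slice(array_keys($counts), 0, $max_terms);
}

print_r(topTermsSketch(
    ['cat chases mouse', 'cat eats', 'mouse runs from cat'], 2
));
// yields ['cat', 'mouse']
```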

computeTermFrequenciesPerSentence()

computeTermFrequenciesPerSentence(array  $sentences, string  $lang) : array

Splits sentences into terms and returns [array of terms, array of normalized term frequencies]

Parameters

array $sentences

the array of sentences to process

string $lang

the current locale

Returns

array —

an array with [array of terms, array of normalized term frequencies] pairs
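
To make the shape of the return value concrete, here is a small sketch. It is not Yioop's implementation: the function name is hypothetical, stemming is omitted, and normalizing each sentence's counts so they sum to 1 is an assumption, since the documentation does not say which normalization is used.

```php
<?php
// Hypothetical helper (not part of Yioop): per-sentence term frequencies.
function termFrequenciesPerSentenceSketch(array $sentences): array
{
    $terms = []; // term => index in the shared term list
    $raw = [];   // sentence index => [term index => raw count]
    foreach ($sentences as $i => $sentence) {
        $raw[$i] = [];
        $words = preg_split('/\s+/u', strtolower($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($terms[$word])) {
                $terms[$word] = count($terms);
            }
            $index = $terms[$word];
            $raw[$i][$index] = ($raw[$i][$index] ?? 0) + 1;
        }
    }
    $normalized = [];
    foreach ($raw as $i => $counts) {
        // Assumed normalization: divide by the sentence's word count.
        $total = array_sum($counts);
        $normalized[$i] = [];
        foreach ($counts as $index => $count) {
            $normalized[$i][$index] = $count / $total;
        }
    }
    return [array_keys($terms), $normalized];
}

[$terms, $tf] = termFrequenciesPerSentenceSketch(['cat cat dog', 'dog']);
print_r($terms); // ['cat', 'dog']
print_r($tf);    // sentence 0: cat 2/3, dog 1/3; sentence 1: dog 1
```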

getTermFrequencies()

getTermFrequencies(array  $terms, mixed  $sentence_or_sentences) : array

Calculates an array with key terms and values their frequencies based on a supplied sentence or sentences

Parameters

array $terms

the list of all terms in the doc

mixed $sentence_or_sentences

either a single string sentence or an array of sentences

Returns

array —

sequence of term => frequency pairs
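
A minimal sketch of counting supplied terms in either a single sentence or an array of sentences is given below. The function name is hypothetical, and reporting raw counts (including 0 for absent terms) is an assumption about what "frequency" means here.

```php
<?php
// Hypothetical helper (not part of Yioop): term => count for given terms.
function termFrequenciesSketch(array $terms, $sentence_or_sentences): array
{
    // Accept either one sentence string or a list of sentences.
    $text = is_array($sentence_or_sentences)
        ? implode(' ', $sentence_or_sentences)
        : $sentence_or_sentences;
    $counts = array_count_values(
        preg_split('/\s+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY)
    );
    $frequencies = [];
    foreach ($terms as $term) {
        $frequencies[$term] = $counts[$term] ?? 0;
    }
    return $frequencies;
}

print_r(termFrequenciesSketch(
    ['cat', 'dog', 'fish'], ['cat and dog', 'cat sleeps']
));
// yields ['cat' => 2, 'dog' => 1, 'fish' => 0]
```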

wordCloudFromSummary()

wordCloudFromSummary(string  $summary, string  $lang, array  $term_frequencies = null) : array

Generates an array of most important words from a string $summary.

Currently, the algorithm is based on term frequencies after stop words have been removed

Parameters

string $summary

text to derive most important words of

string $lang

locale tag for language of $summary

array $term_frequencies

a supplied list of terms and frequencies for words in summary. If null then these will be computed.

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $summary

wordCloudFromTermVector()

wordCloudFromTermVector(array  $term_vector, mixed  $terms = false) : array

Given a sorted term vector for a document computes a word cloud of the most important self::WORD_CLOUD_LEN many terms

Parameters

array $term_vector

if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.

mixed $terms

if not false, then should be an array of terms, at a minimum having all the indices of $term_vector

Returns

array —

the top self::WORD_CLOUD_LEN most important terms in $term_vector
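
The two input shapes accepted by this method can be illustrated with a short sketch. It is not the library's code: the function name and the $cloud_len parameter (standing in for self::WORD_CLOUD_LEN) are assumptions, and the vector is assumed to arrive already sorted by descending weight.

```php
<?php
// Hypothetical helper (not part of Yioop): top terms of a sorted vector.
function wordCloudFromTermVectorSketch(array $term_vector, $terms = false,
    int $cloud_len = 2): array
{
    // Take the first $cloud_len entries, preserving integer keys.
    $top = array_slice($term_vector, 0, $cloud_len, true);
    $cloud = [];
    foreach (array_keys($top) as $key) {
        // Without $terms the keys are the terms themselves;
        // with $terms the keys are indices into that array.
        $cloud[] = ($terms === false) ? $key : $terms[$key];
    }
    return $cloud;
}

// term => weight form
print_r(wordCloudFromTermVectorSketch(
    ['search' => 0.9, 'engine' => 0.5, 'crawl' => 0.1]
));
// yields ['search', 'engine']

// term_index => weight form, with a separate list of terms
print_r(wordCloudFromTermVectorSketch(
    [2 => 0.9, 0 => 0.5, 1 => 0.1], ['crawl', 'engine', 'search']
));
// yields ['search', 'crawl']
```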

getSummaryFromSentenceScores()

getSummaryFromSentenceScores(array  $sentence_scores, array  $sentences, string  $lang) : string

Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters, built from the highest-scored sentences concatenated in the order they appeared in the original document.

Parameters

array $sentence_scores

an array sorted by score of sentence_index => score pairs.

array $sentences

the array of sentences corresponding to sentence $sentence_scores indices

string $lang

language of the page, used to decide which language's tokenizer.php to call for stop words.

Returns

string —

a string that represents the summary
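
The select-then-reorder idea described above can be sketched as follows. This is an illustration under assumptions, not Yioop's implementation: the function name is hypothetical, the length budget is passed as a $max_len parameter instead of being read from PageProcessor, length is measured in bytes with strlen(), and the $lang parameter is left out.

```php
<?php
// Hypothetical helper (not part of Yioop): build a summary from scores.
function summaryFromScoresSketch(array $sentence_scores, array $sentences,
    int $max_len): string
{
    $chosen = [];
    $length = 0;
    // Walk sentences from highest to lowest score until the budget is hit.
    foreach (array_keys($sentence_scores) as $index) {
        $cost = strlen($sentences[$index]) + 1; // +1 for the joining space
        if ($length + $cost > $max_len) {
            break;
        }
        $chosen[] = $index;
        $length += $cost;
    }
    // Restore the original document order of the chosen sentences.
    sort($chosen);
    $parts = [];
    foreach ($chosen as $index) {
        $parts[] = $sentences[$index];
    }
    return implode(' ', $parts);
}

$sentences = [
    'Yioop is a search engine.',
    'It is written in PHP.',
    'It can crawl the web.',
];
$scores = [2 => 0.9, 0 => 0.7, 1 => 0.2]; // sorted by descending score
echo summaryFromScoresSketch($scores, $sentences, 50), "\n";
// prints: Yioop is a search engine. It can crawl the web.
```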

numSentencesForSummary()

numSentencesForSummary(array  $sentence_scores, array  $sentences) : integer

Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.

Parameters

array $sentence_scores

associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).

array $sentences

sentences in doc in their original order

Returns

integer —

number of sentences
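
Counting how many top-scored sentences fit a length budget might look like the sketch below. The function name and the explicit $max_len parameter are assumptions, as is charging one extra character per sentence for a joining space.

```php
<?php
// Hypothetical helper (not part of Yioop): how many sentences fit.
function numSentencesForSummarySketch(array $sentence_scores,
    array $sentences, int $max_len): int
{
    $count = 0;
    $length = 0;
    // $sentence_scores is assumed sorted from highest to lowest score.
    foreach (array_keys($sentence_scores) as $index) {
        $length += strlen($sentences[$index]) + 1; // +1 for a joining space
        if ($length > $max_len) {
            break;
        }
        $count++;
    }
    return $count;
}

$sentences = [
    'Yioop is a search engine.',
    'It is written in PHP.',
    'It can crawl the web.',
];
$scores = [2 => 0.9, 0 => 0.7, 1 => 0.2];
echo numSentencesForSummarySketch($scores, $sentences, 50), "\n";
// prints: 2
```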