MAX_DISTINCT_TERMS
Number of distinct terms to use in generating summary
Base class for all summarizers. A summarizer's chief method is getSummary, which takes a text or XML document and produces a summary of that document of up to PageProcessor::$max_description_len characters. Summarizers also contain various methods to generate a word cloud from such a summary.
getSummary(object $dom, string $page, string $lang) : array
Compute a summary, word cloud, and scores for text ranges within the summary of a document in a given language
object | $dom | document object model used to locate items for summary |
string | $page | raw document from which sentences should be extracted |
string | $lang | locale tag for language the summary is in |
[$summary, $word_cloud, $summary_scores]
getPunctuatedUnpunctuatedSentences(object $dom, string $content, string $lang) : array
Breaks any content into sentences with and without punctuation
object | $dom | a document object to extract a description from. |
string | $content | complete page. |
string | $lang | locale tag of the language for data being processed |
array [sentences_with_punctuation, sentences_with_punctuation_stripped]
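The two-array return shape can be illustrated with a minimal sketch. The function name `sketchSplitSentences` is hypothetical, the `$dom` and `$lang` parameters are omitted, and the splitting rule (break after `.`, `!`, or `?` followed by whitespace) is an assumption rather than the library's actual logic.

```php
<?php
// Hypothetical sketch; not the library's getPunctuatedUnpunctuatedSentences.
function sketchSplitSentences(string $content): array
{
    // Assumed rule: a sentence ends at ., !, or ? followed by whitespace
    $with = preg_split('/(?<=[.!?])\s+/', trim($content), -1,
        PREG_SPLIT_NO_EMPTY);
    // Strip all punctuation characters from each sentence
    $without = array_map(
        fn($s) => trim(preg_replace('/[[:punct:]]+/', '', $s)), $with);
    return [$with, $without];
}

[$punctuated, $stripped] = sketchSplitSentences("Hello, world! How are you?");
print_r($punctuated);
print_r($stripped);
```

Both arrays share the same indices, so a score computed on a stripped sentence can be mapped back to its punctuated form.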
formatDoc(string $content) : string
Formats the document by removing carriage returns, hyphens, and digits, since digits are not used in the word cloud.
The formatted document generated by this function is only used to compute the centroid.
string | $content | page content to be formatted. |
formatted document.
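A minimal sketch of this kind of cleanup is shown here. The name `sketchFormatDoc` is hypothetical, and replacing the removed characters with a space (rather than deleting them outright) is an assumption made so that hyphenated words do not fuse together.

```php
<?php
// Hypothetical sketch; not the library's actual formatDoc implementation.
function sketchFormatDoc(string $content): string
{
    // Replace carriage returns, hyphens, and digits with a space (assumed)
    $content = preg_replace('/[\r\-0-9]+/', ' ', $content);
    // Collapse whitespace runs so later term splitting stays simple
    return trim(preg_replace('/\s+/', ' ', $content));
}

echo sketchFormatDoc("state-of-the-art\r\nin 2024"), "\n";
```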
removeStopWords(array $sentences, object $stop_obj) : array
Returns a new array of sentences without the stop words
array | $sentences | the array of sentences to process |
object | $stop_obj | the class that has the stopwordsRemover method |
a new array of sentences without the stop words
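The idea of delegating stop-word removal to a per-language object can be sketched as follows. The class `SketchStopWords`, its `remove` method, and its tiny English word list are all hypothetical stand-ins; the real stop-word object and its method signature may differ.

```php
<?php
// Hypothetical stand-in for a language-specific stop-word object.
class SketchStopWords
{
    private const STOP_WORDS = ['the', 'a', 'of', 'is'];

    public function remove(string $sentence): string
    {
        $words = preg_split('/\s+/', trim($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        // Keep only words that are not in the stop-word list
        $kept = array_filter($words, fn($w) =>
            !in_array(strtolower($w), self::STOP_WORDS, true));
        return implode(' ', $kept);
    }
}

// Applies the stop-word object to every sentence, returning a new array
function sketchRemoveStopWords(array $sentences, object $stop_obj): array
{
    return array_map(fn($s) => $stop_obj->remove($s), $sentences);
}

print_r(sketchRemoveStopWords(['The cat is a pet'], new SketchStopWords()));
```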
getTermsFromSentences(array $sentences, string $lang) : array
Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
array | $sentences | the sentences in the doc |
string | $lang | locale tag for stemming |
an array of terms in the array of sentences
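Selecting the most frequent terms might look roughly like the sketch below. The name `sketchTopTerms` and the explicit `$max_terms` parameter (standing in for self::MAX_DISTINCT_TERMS) are hypothetical, and stemming by locale is left out entirely.

```php
<?php
// Hypothetical sketch; stemming is omitted and the limit is a parameter.
function sketchTopTerms(array $sentences, int $max_terms): array
{
    $counts = [];
    foreach ($sentences as $sentence) {
        $words = preg_split('/\W+/u', strtolower($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    // Sort by descending frequency, keeping the term => count association
    arsort($counts);
    return array_slice(array_keys($counts), 0, $max_terms);
}

print_r(sketchTopTerms(['the cat sat', 'the cat ran', 'the dog'], 2));
```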
computeTermFrequenciesPerSentence(array $sentences, string $lang) : array
Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
array | $sentences | the array of sentences to process |
string | $lang | the current locale |
an array with [array of terms, array of normalized term frequencies] pairs
getTermFrequencies(array $terms, mixed $sentence_or_sentences) : array
Calculates an array with key terms and values their frequencies based on a supplied sentence or sentences
array | $terms | the list of all terms in the doc |
mixed | $sentence_or_sentences | either a single string sentence or an array of sentences |
sequence of term => frequency pairs
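The term => frequency mapping can be illustrated with a short sketch. The name `sketchTermFrequencies` is hypothetical, and using raw occurrence counts as the frequency is an assumption; the actual method may weight or normalize differently.

```php
<?php
// Hypothetical sketch; frequencies here are raw occurrence counts (assumed).
function sketchTermFrequencies(array $terms, $sentence_or_sentences): array
{
    // Accept either one sentence or an array of sentences
    $text = is_array($sentence_or_sentences)
        ? implode(' ', $sentence_or_sentences)
        : $sentence_or_sentences;
    $words = preg_split('/\W+/u', strtolower($text), -1,
        PREG_SPLIT_NO_EMPTY);
    $word_counts = array_count_values($words);
    $frequencies = [];
    foreach ($terms as $term) {
        // Terms absent from the text get a frequency of 0
        $frequencies[$term] = $word_counts[$term] ?? 0;
    }
    return $frequencies;
}

print_r(sketchTermFrequencies(['cat', 'dog'], ['The cat sat', 'cat and dog']));
```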
wordCloudFromSummary(string $summary, string $lang, array $term_frequencies = null) : array
Generates an array of most important words from a string $summary.
Currently, the algorithm is based on term frequencies after stop words have been removed.
string | $summary | text to derive most important words of |
string | $lang | locale tag for language of $summary |
array | $term_frequencies | a supplied list of terms and frequencies for words in summary. If null then these will be computed. |
the top self::WORD_CLOUD_LEN most important terms in $summary
wordCloudFromTermVector(array $term_vector, mixed $terms = false) : array
Given a sorted term vector for a document computes a word cloud of the most important self::WORD_CLOUD_LEN many terms
array | $term_vector | if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs. |
mixed | $terms | if not false, then should be an array of terms, at a minimum having all the indices of $term_vector |
the top self::WORD_CLOUD_LEN most important terms in $summary
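The two accepted input shapes can be illustrated with a small sketch. The name `sketchWordCloud` and the constant `SKETCH_WORD_CLOUD_LEN` (with its value of 3) are hypothetical, and the vector is assumed to be already sorted from highest to lowest weight.

```php
<?php
// Hypothetical stand-in for self::WORD_CLOUD_LEN; the value 3 is made up.
const SKETCH_WORD_CLOUD_LEN = 3;

// Hypothetical sketch; assumes $term_vector is sorted by descending weight.
function sketchWordCloud(array $term_vector, $terms = false): array
{
    $cloud = [];
    foreach ($term_vector as $key => $weight) {
        if (count($cloud) >= SKETCH_WORD_CLOUD_LEN) {
            break;
        }
        // Keys are the terms themselves, or indices into $terms
        $cloud[] = ($terms === false) ? $key : $terms[$key];
    }
    return $cloud;
}

print_r(sketchWordCloud(['search' => 0.9, 'engine' => 0.7]));
print_r(sketchWordCloud([2 => 0.9, 0 => 0.5], ['alpha', 'beta', 'gamma']));
```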
getSummaryFromSentenceScores(array $sentence_scores, array $sentences, string $lang) : string
Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters, built from the highest-scored sentences concatenated in the order they appeared in the original document.
array | $sentence_scores | an array sorted by score of sentence_index => score pairs. |
array | $sentences | the array of sentences corresponding to sentence $sentence_scores indices |
string | $lang | language of the page, used to decide which language's tokenizer.php (and its stop words) to call. |
a string that represents the summary
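The select-then-reorder idea can be sketched as follows. The name `sketchSummary` and the explicit `$max_len` parameter (standing in for PageProcessor::$max_description_len) are hypothetical. Counting one separator character per sentence and stopping at the first sentence that does not fit are both assumptions.

```php
<?php
// Hypothetical sketch; $sentence_scores must be sorted by descending score.
function sketchSummary(array $sentence_scores, array $sentences,
    int $max_len): string
{
    $chosen = [];
    $length = 0;
    foreach ($sentence_scores as $index => $score) {
        // Budget one extra character for the joining space (assumed)
        $needed = strlen($sentences[$index]) + 1;
        if ($length + $needed > $max_len) {
            break;
        }
        $chosen[] = $index;
        $length += $needed;
    }
    // Restore original document order before concatenating
    sort($chosen);
    return implode(' ', array_map(fn($i) => $sentences[$i], $chosen));
}

$sentences = ['Intro here.', 'Key point one.', 'Minor aside.',
    'Key point two.'];
$scores = [3 => 0.9, 1 => 0.8, 0 => 0.3, 2 => 0.1];
echo sketchSummary($scores, $sentences, 32), "\n";
```

Here the two highest-scored sentences fit within the budget, and they are emitted in document order rather than score order.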
numSentencesForSummary(array $sentence_scores, array $sentences) : integer
Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.
array | $sentence_scores | associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score). |
array | $sentences | sentences in doc in their original order |
number of sentences