MAX_DISTINCT_TERMS
Number of distinct terms to use in generating summary
Class which may be used by TextProcessors to get a summary for a text document that may later be used for indexing. This is done by the @see getSummary method. getSummary does this by splitting the document into sentences and computing inverse sentence frequency (should be ISF, but we call it IDF) scores for each term. It then computes an average document vector (we call it the centroid) with components (total number of occurrences of term) * (IDF score of term). Notice that if we divided this by the number of documents, we would have components (average term frequency) * (IDF score of term). As ranking by either won't affect our results, we don't divide. We then compute the cosine similarity of each sentence vector with this average and choose the top sentences to make our summary. Here a sentence vector has components (term frequency in sentence) * (IDF score of term).
It also generates a word cloud for a document.
getSummary(object $dom, string $page, string $lang) : array
Generates a summary, word cloud, and sentence scoring for a provided web page. To do this the page is split into sentences and inverse sentence frequency (should be ISF, but we call it IDF) scores for each term are computed. Then an average document vector (we call it the centroid) with components (total number of occurrences of term) * (IDF score of term) is found. We then compute the cosine similarity of each sentence vector with this average and choose the top sentences to make our summary. Here a sentence vector has components (term frequency in sentence) * (IDF score of term).
object | $dom | document object model of page to summarize |
string | $page | complete raw page to generate the summary from. |
string | $lang | language of the page, used to choose the stop words and the proper tokenizer.php for that language. |
a triple (string summary, array word cloud, array of position => scores for positions within the summary)
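The scoring described above can be written compactly as follows. This is a sketch under assumptions: the logarithmic form of the IDF is a common choice used here for illustration, and the source does not state the exact formula. Here $N$ is the number of sentences, $N_t$ the number of sentences containing term $t$, $n_t$ the total number of occurrences of $t$ in the document, and $\mathrm{tf}_{t,s}$ the frequency of $t$ in sentence $s$.

```latex
\mathrm{idf}_t = \log\frac{N}{N_t}, \qquad
c_t = n_t \cdot \mathrm{idf}_t, \qquad
v_{s,t} = \mathrm{tf}_{t,s} \cdot \mathrm{idf}_t, \qquad
\mathrm{score}(s) = \frac{v_s \cdot c}{\lVert v_s \rVert \, \lVert c \rVert}
```

Dividing every $c_t$ by the same constant leaves the ranking of sentences by $\mathrm{score}(s)$ unchanged, which is why the centroid need not be averaged.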
getPunctuatedUnpunctuatedSentences(object $dom, string $content, string $lang) : array
Breaks any content into sentences with and without punctuation
object | $dom | a document object to extract a description from. |
string | $content | complete page. |
string | $lang | locale tag of the language for data being processed |
array [sentences_with_punctuation, sentences_with_punctuation_stripped]
formatDoc(string $content) : string
Formats the document by removing carriage returns, hyphens, and digits, as digits are not used in the word cloud.
The formatted document generated by this function is only used to compute the centroid.
string | $content | formatted page. |
formatted document.
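A minimal sketch of this kind of cleanup, assuming regular-expression replacement. The function name `formatDocSketch`, the exact patterns, and the choice to replace hyphens with spaces and to also treat newlines as line breaks are all assumptions for illustration, not the class's actual code.

```php
<?php
/**
 * Hypothetical stand-in for formatDoc: strips carriage returns, newlines,
 * hyphens and digits, then collapses the whitespace left behind.
 */
function formatDocSketch(string $content): string
{
    // Replace line breaks and hyphens with a single space
    $content = preg_replace('/[\r\n\-]+/', ' ', $content);
    // Drop digits entirely
    $content = preg_replace('/\d+/', '', $content);
    // Collapse runs of whitespace and trim the ends
    return trim(preg_replace('/\s+/', ' ', $content));
}

echo formatDocSketch("state-of-the-art\r\nresults in 2019"), "\n";
// prints "state of the art results in"
```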
removeStopWords(array $sentences, object $stop_obj) : array
Returns a new array of sentences without the stop words
array | $sentences | the array of sentences to process |
object | $stop_obj | the class that has the stopwordsRemover method |
a new array of sentences without the stop words
getTermsFromSentences(array $sentences, string $lang) : array
Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
array | $sentences | the sentences in the doc |
string | $lang | locale tag for stemming |
an array of terms in the array of sentences
computeTermFrequenciesPerSentence(array $sentences, string $lang) : array
Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
array | $sentences | the array of sentences to process |
string | $lang | the current locale |
an array with [array of terms, array of normalized term frequencies] pairs
getTermFrequencies(array $terms, mixed $sentence_or_sentences) : array
Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
array | $terms | the list of all terms in the doc |
mixed | $sentence_or_sentences | either a single string sentence or an array of sentences |
sequence of term => frequency pairs
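The term => frequency mapping can be sketched as below. The name `termFrequenciesSketch` is hypothetical, and whitespace splitting with lower-casing and raw (unnormalized) counts are assumptions; the real method may tokenize and weight differently.

```php
<?php
/**
 * Hypothetical sketch: counts how often each term in $terms occurs in a
 * single sentence or in an array of sentences (whole-word, lower-cased).
 */
function termFrequenciesSketch(array $terms, $sentence_or_sentences): array
{
    // Accept either one string or an array of strings
    $text = is_array($sentence_or_sentences) ?
        implode(' ', $sentence_or_sentences) : $sentence_or_sentences;
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    $frequencies = [];
    foreach ($terms as $term) {
        // Terms that never occur get a frequency of 0
        $frequencies[$term] = $counts[$term] ?? 0;
    }
    return $frequencies;
}

print_r(termFrequenciesSketch(['cat', 'dog', 'fish'],
    ['the cat saw the dog', 'the dog ran']));
```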
wordCloudFromSummary(string $summary, string $lang, array $term_frequencies = null) : array
Generates an array of most important words from a string $summary.
Currently, the algorithm is based on term frequencies after stop words are removed
string | $summary | text to derive most important words of |
string | $lang | locale tag for language of $summary |
array | $term_frequencies | a supplied list of terms and frequencies for words in summary. If null then these will be computed. |
the top self::WORD_CLOUD_LEN most important terms in $summary
wordCloudFromTermVector(array $term_vector, mixed $terms = false) : array
Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms
array | $term_vector | if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs. |
mixed | $terms | if not false, then should be an array of terms, at a minimum having all the indices of $term_vector |
the top self::WORD_CLOUD_LEN most important terms in $term_vector
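The two accepted input shapes can be illustrated with a short sketch. `wordCloudSketch` and `WORD_CLOUD_LEN_SKETCH` are hypothetical names, and the input is assumed to be already sorted from highest to lowest weight.

```php
<?php
// Hypothetical stand-in for self::WORD_CLOUD_LEN
const WORD_CLOUD_LEN_SKETCH = 3;

/**
 * Hypothetical sketch: takes the first WORD_CLOUD_LEN_SKETCH entries of an
 * already sorted term vector and returns the corresponding terms.
 */
function wordCloudSketch(array $term_vector, $terms = false): array
{
    $cloud = [];
    foreach ($term_vector as $key => $weight) {
        // Keys are terms when $terms is false, otherwise indices into $terms
        $cloud[] = ($terms === false) ? $key : $terms[$key];
        if (count($cloud) >= WORD_CLOUD_LEN_SKETCH) {
            break;
        }
    }
    return $cloud;
}

// Shape 1: term => weight pairs
print_r(wordCloudSketch(
    ['search' => 4.2, 'index' => 3.1, 'crawl' => 2.0, 'page' => 0.5]));
// Shape 2: term_index => weight pairs plus a list of terms
print_r(wordCloudSketch([3 => 4.2, 2 => 3.1],
    ['page', 'crawl', 'index', 'search']));
```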
getSummaryFromSentenceScores(array $sentence_scores, array $sentences, string $lang) : string
Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of length up to PageProcessor::$max_description_len, built from the highest scored sentences concatenated in the order they appeared in the original document.
array | $sentence_scores | an array sorted by score of sentence_index => score pairs. |
array | $sentences | the array of sentences corresponding to sentence $sentence_scores indices |
string | $lang | language of the page, used to choose the stop words and the proper tokenizer.php for that language. |
a string that represents the summary
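The key step here is picking the best-scored sentences and then restoring document order. A minimal sketch, assuming the number of sentences to keep is passed in as a hypothetical parameter `$n` (the real method derives the cut-off from the maximum description length and also takes `$lang`):

```php
<?php
/**
 * Hypothetical sketch: keeps the $n best-scored sentences and joins them
 * in the order they appeared in the original document.
 */
function summaryFromScoresSketch(array $sentence_scores, array $sentences,
    int $n): string
{
    // $sentence_scores is assumed sorted from highest to lowest score
    $top = array_slice(array_keys($sentence_scores), 0, $n);
    sort($top); // back to original document order
    $picked = [];
    foreach ($top as $index) {
        $picked[] = $sentences[$index];
    }
    return implode(' ', $picked);
}

$sentences = ['Alpha is first.', 'Beta is second.', 'Gamma is third.'];
$scores = [2 => 0.9, 0 => 0.7, 1 => 0.1];
echo summaryFromScoresSketch($scores, $sentences, 2), "\n";
// prints "Alpha is first. Gamma is third."
```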
numSentencesForSummary(array $sentence_scores, array $sentences) : integer
Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.
array | $sentence_scores | associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score). |
array | $sentences | sentences in doc in their original order |
number of sentences
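One way such a count could be computed is to walk the sentences from best score down and stop once the length budget would be exceeded. In this sketch `numSentencesSketch` and the `$max_len` parameter are hypothetical, and counting one extra character per sentence for a joining space is an assumption.

```php
<?php
/**
 * Hypothetical sketch: counts how many of the best-scored sentences fit
 * within $max_len characters.
 */
function numSentencesSketch(array $sentence_scores, array $sentences,
    int $max_len): int
{
    $length = 0;
    $count = 0;
    // $sentence_scores is assumed sorted from highest to lowest score
    foreach ($sentence_scores as $index => $score) {
        $length += strlen($sentences[$index]) + 1; // +1 for a joining space
        if ($length > $max_len) {
            break;
        }
        $count++;
    }
    return $count;
}

$sentences = ['Alpha is first.', 'Beta is second.', 'Gamma is third.'];
$scores = [2 => 0.9, 0 => 0.7, 1 => 0.1];
echo numSentencesSketch($scores, $sentences, 32), "\n"; // prints 2
```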
computeCentroidIdfFromSentences(array $terms, array $sentences, string $formatted_doc, string $lang) : array
Computes a vector with components (number of occurrences of term) * (inverse sentence frequency of term) over all terms in the document, as well as the inverse sentence frequency of each term in the document.
array | $terms | distinct terms in a document |
array | $sentences | sentences of a document |
string | $formatted_doc | original document with some punctuation removed |
string | $lang | locale tag for document |
[vector of (number of occurrences of term) * (inverse sentence frequency of term), truncated to at most self::CENTROID_COMPONENTS components, array of inverse sentence frequencies for each term in the document]
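A rough sketch of the centroid and inverse sentence frequency computation follows. `centroidIdfSketch` and `CENTROID_COMPONENTS_SKETCH` are hypothetical names, the logarithmic IDF formula is an assumption, and the sketch counts occurrences directly in the sentences instead of using the `$formatted_doc` and `$lang` arguments of the real method.

```php
<?php
// Hypothetical stand-in for self::CENTROID_COMPONENTS
const CENTROID_COMPONENTS_SKETCH = 2;

/**
 * Hypothetical sketch: returns [centroid, idf], both keyed by term index.
 */
function centroidIdfSketch(array $terms, array $sentences): array
{
    // Word counts per sentence, computed once
    $per_sentence = [];
    foreach ($sentences as $sentence) {
        $per_sentence[] = array_count_values(
            preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    }
    $num_sentences = count($sentences);
    $centroid = [];
    $idf = [];
    foreach ($terms as $i => $term) {
        $occurrences = 0; // times the term appears in the whole document
        $containing = 0;  // number of sentences containing the term
        foreach ($per_sentence as $counts) {
            if (isset($counts[$term])) {
                $occurrences += $counts[$term];
                $containing++;
            }
        }
        // Assumed IDF form: log(sentences / sentences containing term)
        $idf[$i] = ($containing > 0) ?
            log($num_sentences / $containing) : 0.0;
        $centroid[$i] = $occurrences * $idf[$i];
    }
    arsort($centroid); // largest weights first, keys preserved
    $centroid = array_slice($centroid, 0, CENTROID_COMPONENTS_SKETCH, true);
    return [$centroid, $idf];
}

list($centroid, $idf) = centroidIdfSketch(['cat', 'dog', 'the'],
    ['the cat sat', 'the dog sat', 'the cat ran']);
print_r($centroid);
print_r($idf);
```

In this example "the" occurs in every sentence, so its inverse sentence frequency is 0 and it drops out of the truncated centroid.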
scoreSentencesVersusPageTerms(array $sentences, array $centroid, array $idf, array $terms) : array
Calculates a score for each sentence in an array as the dot product of the sentence's normalized tf-idf vector with the centroid vector.
array | $sentences | unpunctuated sentences from a source in the order they originally appeared in the source |
array | $centroid | an array of term_index => nt * idf scores for that term. Here nt is the number of times the term appears in the whole document and idf is the inverse document frequency of that term amongst the sentences |
array | $idf | array of pairs of the form term_index => inverse document frequency of the term amongst the sentences |
array | $terms | an array of terms from the sentences that term_indexes mentioned above index into |
scores for each sentence
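The scoring step can be sketched as follows. `scoreSentencesSketch` is a hypothetical name; whitespace tokenization, raw counts as term frequencies, and returning scores in sentence order are assumptions. The example inputs are hand-built to match the parameter shapes described above.

```php
<?php
/**
 * Hypothetical sketch: scores each sentence as the dot product of its
 * unit-length tf-idf vector with the centroid vector.
 */
function scoreSentencesSketch(array $sentences, array $centroid,
    array $idf, array $terms): array
{
    $scores = [];
    foreach ($sentences as $n => $sentence) {
        $counts = array_count_values(
            preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY));
        // Build the sentence's tf-idf vector over the known terms
        $vector = [];
        $norm = 0.0;
        foreach ($terms as $i => $term) {
            $weight = ($counts[$term] ?? 0) * ($idf[$i] ?? 0.0);
            if ($weight != 0) {
                $vector[$i] = $weight;
                $norm += $weight * $weight;
            }
        }
        $norm = sqrt($norm);
        $score = 0.0;
        foreach ($vector as $i => $weight) {
            // Terms truncated out of the centroid contribute nothing
            $score += ($weight / $norm) * ($centroid[$i] ?? 0.0);
        }
        $scores[$n] = $score;
    }
    return $scores;
}

$terms = ['cat', 'dog', 'the'];
$idf = [0 => log(1.5), 1 => log(3), 2 => 0.0];
$centroid = [1 => log(3), 0 => 2 * log(1.5)];
print_r(scoreSentencesSketch(
    ['the cat sat', 'the dog sat', 'the cat ran'], $centroid, $idf, $terms));
```

Only the sentence vector is normalized here; scaling the centroid by a constant would change every score by the same factor and so would not change the ranking.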