Add documentation for pos tagging, summarizers, address word plugins, and new page extraction lan, a=chris

Chris Pollett [2014-06-25 23:Jun:th]
Filename
en-US/pages/documentation.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 534f7f9..3f152c3 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -2347,6 +2347,16 @@ define('HHVM_PATH', '/usr/local/bin');
     requirements on disk space needed for a crawl; bigger numbers would
     tend to improve the search results. If whole pages are being cached,
     these downloaded bytes are stored in archives with the fetcher.
+    The Summarizer dropdown controls which summarizer is used on a page
+    during page processing. Yioop uses a summarizer to control what portions
+    of a page will be put into the index and are available at search time
+    for snippets. The two available summarizers are Basic, which picks
+    the page's meta title, meta description, h1 tags, etc. in a fixed order
+    until the summary size is reached; and Centroid, which computes an
+    "average sentence" for the document and adds phrases from the actual
+    document according to nearness to this average. If the Centroid summarizer
+    is used, Yioop also generates a word cloud for each document. Centroid
+    tends to produce slightly better results than Basic but is slower.
     The Max Page Summary Length in Bytes controls how many of the total
     bytes can be used to make a page summary which is sent to the
     queue server. It is only words in this summary which can actually be
@@ -2419,16 +2429,28 @@ define('HHVM_PATH', '/usr/local/bin');
     </p>
     <p>
     The Indexing Plugins checkboxes allow you to select which plugins
-    to use during the crawl. Yioop comes with two built-in plugins:
-    a WordFilterPlugin and RecipePlugin. One can also write or downlaod
-    additional plugins. If the plugin can be configured,
-    next to the checkbox will be a link to a configuration screen. For example,
-    clicking the RecipePlugin checkbox causes Yioop during a crawl to run the
+    to use during the crawl. Yioop comes with three built-in plugins:
+    AddressesPlugin, RecipePlugin, and WordFilterPlugin. One can also write or
+    download additional plugins. If the plugin can be configured,
+    next to the checkbox will be a link to a configuration screen. Let's
+    briefly look at each of these plugins in turn...</p>
+    <p>Checking the AddressesPlugin causes Yioop during a crawl
+    to try to calculate addresses for each page summary it creates. When
+    Yioop processes a page, it by default creates a summary of the page with
+    a TITLE and a DESCRIPTION as well as a few other fields. With the addresses
+    plugin activated, it will try to
+    extract data to three additional fields: EMAILS, PHONE_NUMBERS,
+    and ADDRESSES. If you want to test out how these behave,
+    pick some web page, view source on the web page, copy the source, and then
+    paste it into the Test Options Tab on the page options page (the Test
+    Options Tab is described later in this section).</p>
+    <p>Clicking the RecipePlugin checkbox causes Yioop during a crawl to run the
     code in indexing_plugins/recipe_plugin.php. This code tries to detect pages
     which are food recipes and separately extracts these recipes and clusters
     them by ingredient. It then add search meta words ingredient: and
     recipe:all to allow one to search recipes by ingredient or only documents
-    containing recipes.  Checking the WordFilterPlugin causes Yioop to run
+    containing recipes.</p>
+    <p>Checking the WordFilterPlugin causes Yioop to run
     code in indexing_plugins/wordfilter_plugin.php on each downloaded page.
     This code checks if the downloaded page has one of the words listed
     in the textarea one finds on the plugin's configure page. If it does,
@@ -2437,17 +2459,41 @@ define('HHVM_PATH', '/usr/local/bin');
     </p>
     <img src="resources/WordFilterConfigure.png"
         alt="Word Filter Configure Page" />
-    <p>Each line in the textarea consists of a word followed by a colon
-    followed by a comma separated list of what to do if that word is seen.
-    The line term0:NOTCONTAIN,JUSTFOLLOW says that if the downloaded page
+    <p>Each line in the textarea consists of a comma separated list of
+    literals followed by a colon followed by a comma separated list of what
+    to do if the literal condition is satisfied. This is called a
+    <b>rule</b>. A single literal in the
+    list of literals is an optional + or - followed by a sequence of non-space
+    characters. The text after the + or -, up until a # symbol, is called the
+    term of the literal. If the literal sign is + or if no sign is present,
+    then the literal holds for a document if it contains the term. If the
+    literal sign is - then the literal holds for a document if it does not
+    contain the term. If there is a decimal number between 0 and 1, say x,
+    after the # up to a comma or the first white-space character, then this
+    is modified so the literal
+    holds only if a fraction x of the document's length comes from the
+    literal's term. If rather than a decimal x were a positive natural number,
+    then the term would need to occur at least x times.
+    If all the literals in the comma separated list hold, then the
+    rule is said to hold, and the actions will apply.
+    The line -term0:JUSTFOLLOW says that if the downloaded page
     does not contain the word "term0" then do not index the page, but do
     follow outgoing links from the page. The line term1:NOPROCESS says
     if the document has the word "term1" then do not index it or follow links
-    from it. The last line term2:NOFOLLOW,NOSNIPPET says if the
+    from it. The last line +term2:NOFOLLOW,NOSNIPPET says if the
     page contains "term2" then do not follow any outgoing links. NOSNIPPET
     means that if the page is returned from search results, the link to
     the page should not have a snippet of text from that page beneath it.
-    In addition, to the commands just mentioned, WordFilterPlugin supports
+    As an example of a more complicated rule, consider:</p>
+    <pre>
+    surfboard#2,bikini#0.02:NOINDEX, NOFOLLOW
+    </pre>
+    <p>
+    Here for the rule to hold the condition surfboard#2 requires that the
+    term surfboard occurred at least twice in the document and the
+    condition bikini#0.02 requires that a 0.02 fraction (2 percent) of the
+    document's total length also come from copies of the word bikini. In
+    addition to the commands just mentioned, WordFilterPlugin supports
     standard robots.txt directives such as: NOINDEX, NOCACHE,
     NOARCHIVE, NOODP, NOYDIR, and NONE. More details about how indexing
     plugins work and how to write your own indexing plugin can be
@@ -2476,34 +2522,56 @@ define('HHVM_PATH', '/usr/local/bin');
     </p>
     <p>
     A command statement takes a key field argument for the page associative
-    array and does a function call to manipulate that page. Right now the
-    supported commands are to unset that field value, to add the field and
-    field value to the META_WORD array for the page and to split the field on
-    comma, view this as a search keywords => link text association, and add
-    this the  KEYWORD_LINKS array. This can be used to add a link to a keyword
-    search on cached pages in Yioop's index. These three command have the
-    syntax:</p>
+    array and does a function call to manipulate that page. Below is
+    a list of currently supported commands followed by comments on what
+    they do:</p>
     <pre>
-    unset(field)
-    addMetaWords(field)
-    addKeywordLink(field)
+    addMetaWords(field)     ;add the field and field value to the META_WORD
+                            ;array for the page
+    addKeywordLink(field)   ;split the field on a comma, view this as a search
+                            ;keywords => link text association, and add this to
+                            ;the KEYWORD_LINKS array.
+    setStack(field)         ;set which field value should be used as a stack
+    pushStack(field)        ;add the field value for field to the top of stack
+    popStack(field)         ;pop the top of the stack into the field value for
+                            ;field
+    setOutputFolder(dir)    ;if auxiliary output, rather than just to
+                            ;a Yioop index, is being done, then set the folder
+                            ;for this output to be dir
+    setOutputFormat(format) ;set the format of auxiliary output.
+                            ;Should be either CSV or SQL.
+                            ;SQL means that writeOutput will write an insert
+                            ;statement
+    setOutputTable(table)   ;if output is SQL then what table to use for the
+                            ;insert statements
+    toArray(field)          ;splits field value for field on a comma and
+                            ;assigns field value to be the resulting array
+    toString(field)         ;if field value is an array then implode that
+                            ;array using comma and store the result in field
+                            ;value
+    unset(field)            ;unset that field value
+    writeOutput(field)      ;use the contents of field value viewed as an array
+                            ;to fill in the columns of a SQL insert statement
+                            ;or CSV row
     </pre>
     <p>
     Page rule assignments can either be straight assignments with '=' or
     concatenation assignments with '.='. Let $page indicate the associative
     array that Yioop supplies the page rule processor.
-    There are three kinds of values that one can assign:
+    There are four kinds of values that one can assign:
     </p>
     <pre>
     field = some_other_field ; sets $page['field'] = $page['some_other_field']
     field = "some_string" ; sets $page['field'] to "some string"
     field = /some_regex/replacement_where_dollar_vars_allowed/
-    ; computes the results of replacing matches to some_regex in $page['field']
-    ; with replacement_where_dollar_vars_allowed
+        ; computes the results of replacing matches to some_regex
+        ;  in $page['field'] with replacement_where_dollar_vars_allowed
+    field = /some_regex/g ;sets $page['field'] to the array of all matches
+        ; of some_regex in $page['field']
     </pre>
     <p>For each of the above assignments we could have used ".=" instead of "=".
-    We next give a simple example and a more complicated example of page rules
-    and the context in which they were used:
+    We next give a simple example followed by a couple of more complicated
+    examples of page rules and the context in which they were used:
     </p>
     <p>In the first example, we just want to extract meaningful titles for mail
     log records that were read in using a TextArchiveBundleIterator. Here
@@ -2625,6 +2693,47 @@ define('HHVM_PATH', '/usr/local/bin');
     unset(thread)
     unset(link_thread)
     </pre>
+    <p>As a last example of page rules, suppose we wanted to crawl the
+    web and whenever we detected a page had an address we wanted to
+    write that address as a SQL insert statement to a series of text files.
+    We can do this using page rules and the AddressesPlugin.
+    First, we would check the AddressesPlugin and then we might use
+    page rules like:</p>
+    <pre>
+    summary = ADDRESSES
+    setStack(summary)
+    pushStack(DESCRIPTION)
+    pushStack(TITLE)
+    setOutputFolder(/Applications/MAMP/htdocs/crawls/data)
+    setOutputFormat(sql)
+    setOutputTable(SUMMARY);
+    writeOutput(summary)
+    </pre>
+    <p>
+    The first line says copy the contents of the ADDRESSES field of the page
+    into a new summary field. The next line says use the summary field as the
+    current stack. At this point the stack would be an array
+    with all the addresses found on the given page. So you could use a command
+    like popStack(first_address) to copy the first address in this array over
+    to a new variable first_address. In the above case what we do instead
+    is push the contents of the DESCRIPTION field onto the top of the stack.
+    Then we push the contents of the TITLE
+    field. The line</p>
+    <pre>
+    setOutputFolder(/Applications/MAMP/htdocs/crawls/data)
+    </pre>
+    <p>
+    sets /Applications/MAMP/htdocs/crawls/data as the folder that any
+    auxiliary output from the page_processor should go to.
+    setOutputFormat(sql) says we want to output SQL; the other possibility is
+    CSV. The line setOutputTable(SUMMARY); says the table name to use for
+    INSERT statements should be SUMMARY. Finally, the line
+    writeOutput(summary) would use the contents of the array entries of the
+    summary field as the column values for an INSERT statement into the SUMMARY
+    table. This writes a line to the file data.txt in
+    /Applications/MAMP/htdocs/crawls/data. If data.txt exceeds 10MB, it is
+    compressed into a file data.txt.0.gz and a new data.txt file is started.
+    </p>
     <h4>Search Time Tab</h4>
 <p>The Page Options Search Time tab looks like:</p>
 <img src='resources/PageOptionsSearch.png' alt='The Page Options Search form'/>
@@ -2632,8 +2741,9 @@ define('HHVM_PATH', '/usr/local/bin');
 which element and links you would like to have presented on the search
 landing and search results pages. The Word Suggest checkbox controls whether
 a dropdown of word suggestions should be presented by Yioop when a user
-starts typing in the Search box. The Subsearch checkbox controls whether the
-links for Image, Video, and News search appear in the top bar of Yioop
+starts typing in the Search box. It also controls whether spelling correction
+and thesaurus suggestions will appear. The Subsearch checkbox controls whether
+the links for Image, Video, and News search appear in the top bar of Yioop.
 You can actually configure what these links are in the
 <a href="#sources">Search Sources</a>
 activity. The checkbox here is a global setting for displaying them or
@@ -3349,6 +3459,57 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     of the transliteration. An example of doing this is given for the
     Telugu locale in Yioop.</p>
     <h4>Thesaurus Results and Part of Speech Tagging</h4>
+    <p>As mentioned in the <a href="#search-basic">Search Basics</a> topic,
+    for some queries Yioop displays a list of related queries to one side
+    of the search results. These are obtained from a "computer thesaurus".
+    In this subsection, we describe how to enable this facility for English
+    and how you could add this functionality for other languages.
+    If enabled, the thesaurus also can be used to modify search ranking
+    as described in the <a href="?c=main&p=ranking#reordering"
+    >Final Reordering</a> of the Yioop Ranking Mechanisms document.</p>
+    <p>In order to generate suggested related queries, Yioop first
+    tags the original query terms according to part of speech.
+    For en-US, this is done by calling a method:
+    tagTokenizePartOfSpeech($text) in
+    WORK_DIRECTORY/locale/en-US/resources/tokenizer.php. For en-US,
+    a simple Brill tagger (see Ranking document for more info) is implemented
+    to do this. After this method is called the terms in $text should have
+    a suffix ~part-of-speech where part-of-speech
+    is one of NN for noun, VB for verb, AJ for adjective, AV for adverb, or
+    some other value (which would be ignored by Yioop). For example,
+    the noun dog might become dog~NN after tagging. To localize to another
+    language this method in the corresponding tokenizer.php file would need
+    to be implemented.</p>
+    <p>The second method needed for Thesaurus results is
+    scoredThesaurusMatches($term, $word_type, $whole_query) which should
+    also be in tokenizer.php for the desired locale. Here $term is
+    a term (without a part-of-speech tag), $word_type is the part of speech
+    (one of the ones listed above), and $whole_query is the original query.
+    The output of this method should be an array of
+    (score =&gt; array of thesaurus terms) associations, with each score
+    representing one word sense of the term. In the case of English, this
+    method is implemented using <a href="http://wordnet.princeton.edu/"
+    >WordNet</a>. So for thesaurus results to work for English, WordNet
+    needs to be installed and in either the config.php file or local_config.php
+    you need to define the constant WORDNET_EXEC to the path to the
+    WordNet executable on your file system. On a Linux or OSX system,
+    this might be something like: /usr/local/bin/wn .</p>
+    <h4>Using Stop Words to improve Centroid Summarization</h4>
+    <p>While crawling, Yioop makes use of a summarizer to extract the important
+    portions of the web page both for indexing and for search result snippet
+    purposes. There are two summarizers that come with Yioop: a Basic
+    summarizer, which uses an ad hoc approach to finding the most important
+    parts of the document, and a centroid summarizer which tries to compute an
+    "average sentence" for the document and uses this to pick representative
+    sentences based on nearness to this average. The summarizer that is used can
+    be set under the Crawl Time tab of <a href="#page-options">Page Options</a>.
+    The centroid summarizer works better if certain common words (stop words)
+    from the document's language are removed. When using the centroid
+    summarizer, Yioop checks to see if tokenizer.php for the current locale
+    contains a method stopwordsRemover($page). If it does, it calls it. This
+    method takes a string of words and returns a string with all the stop words
+    removed. This method exists for en-US, but, if desired, could also be
+    implemented for other locales to improve centroid summarization.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id="advanced-topics">Advanced Topics</h2>
     <h3 id='customizing-code'>Modifying Yioop Code</h3>
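
Note on the stop word hunk: the contract described for stopwordsRemover($page) (take a string of words, return the string with stop words removed) can be illustrated with a minimal sketch. This is written in Python purely for illustration and is not Yioop's PHP code; the name stopwords_remover, the tiny STOP_WORDS set, and the punctuation handling are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for this
# sketch); Yioop's actual en-US list is not reproduced here.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def stopwords_remover(page):
    """Return page with every word that appears in STOP_WORDS dropped."""
    # Split on whitespace so the surviving words keep their punctuation.
    words = re.findall(r"\S+", page)
    kept = [
        word for word in words
        # Compare case-insensitively and ignore trailing punctuation
        # (both are assumptions, not documented Yioop behavior).
        if word.lower().strip(".,;:!?") not in STOP_WORDS
    ]
    return " ".join(kept)
```

A locale-specific version would mainly swap in a different word list; the string-in, string-out shape is the only part taken from the documentation text.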
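
Note on the WordFilterPlugin hunk: the rule semantics described there (signed literals, an optional # threshold, and actions that apply only when every literal holds) can be sketched as follows. This is a Python illustration under stated assumptions, not the plugin's PHP implementation: the function names are hypothetical, a term is matched as a case-insensitive substring, document length is measured in characters, and a - sign is treated as negating the whole literal including any threshold.

```python
def literal_holds(literal, document):
    """Decide whether one literal, e.g. -term0 or bikini#0.02, holds."""
    negated = literal.startswith("-")
    body = literal[1:] if literal[:1] in ("+", "-") else literal
    term, _, threshold = body.partition("#")
    text = document.lower()
    # Assumption: substring matching, non-overlapping occurrences.
    count = text.count(term.lower())
    if not threshold:
        holds = count > 0
    elif "." in threshold:
        # Decimal threshold: share of the document's length (in characters,
        # an assumption) that comes from copies of the term.
        share = count * len(term) / len(text) if text else 0.0
        holds = share >= float(threshold)
    else:
        # Natural number threshold: the term must occur at least that often.
        holds = count >= int(threshold)
    return not holds if negated else holds


def rule_actions(rule, document):
    """Return the rule's actions if all its literals hold, else []."""
    if ":" not in rule:
        return []
    condition, _, actions = rule.rpartition(":")
    literals = [lit.strip() for lit in condition.split(",") if lit.strip()]
    if all(literal_holds(lit, document) for lit in literals):
        return [act.strip() for act in actions.split(",") if act.strip()]
    return []
```

For instance, rule_actions("-term0:JUSTFOLLOW", page_text) yields the JUSTFOLLOW action only for pages that never mention term0, which matches the reading given in the documentation text.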