New classification documentation a=shawn

Shawn Tice [2013-05-10]

This change adds new images, which are kept in the separate seek_quarry
repository.

Signed-off-by: Chris Pollett <chris@pollett.org>

Filename
en-US/pages/documentation.thtml

diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 178a866..d495dc1 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -15,6 +15,7 @@
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
+        <li><a href="#classifiers">Classifying Web Pages</a></li>
         <li><a href="#page-options">Page Indexing and Search Options</a></li>
         <li><a href="#editor">Results Editor</a></li>
         <li><a href="#sources">Search Sources</a></li>
@@ -1925,11 +1926,154 @@ encoding = "ASCII";
     be clicked.
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
+
+    <h2 id='classifiers'>Classifying Web Pages</h2>
+    <p>Sometimes searching for text that occurs within a page isn't enough to
+    find what one is looking for. For example, the relevant set of documents
+    may have many terms in common, with only a small subset showing up on any
+    particular page, so that one would have to search for many disjoint terms
+    in order to find all relevant pages. Or one may not know which terms are
+    relevant, making it hard to formulate an appropriate query. Or the relevant
+    documents may share many key terms with irrelevant documents, making it
+    difficult to formulate a query that fetches one but not the other.  Under
+    these circumstances (among others), it would be useful to have meta words
+    already associated with the relevant documents, so that one could just
+    search for the meta word. The Classifiers activity provides a way to train
+    classifiers that recognize classes of documents; these classifiers can then
+    be used during a crawl to add appropriate meta words to pages determined to
+    belong to one or more classes.</p>
+
+    <p>Clicking on the Classifiers activity displays a text field where you can
+    create a new classifier, and a table of existing classifiers, where each
+    row corresponds to a classifier and provides some statistics and action
+    links. A classifier is identified by its class label, which is also used to
+    form the meta word that will be attached to documents. Each classifier can
+    only be trained to recognize instances of a single target class, so the
+    class label should be a short description of that class, containing only
+    alphanumeric characters and underscores (e.g., &quot;spam&quot;,
+    &quot;homepage&quot;, or &quot;menu&quot;). Typing a new class label into
+    the text box and hitting the Create button initializes a new classifier,
+    which will then show up in the table.</p>
+
+    <img src="resources/ClassifiersManage.png"
+        alt="The Classifiers manage page" />
+
+    <p>Once you have a fresh classifier, the natural thing to do is edit it by
+    clicking on the Edit action link. If you made a mistake, however, or no
+    longer want a classifier for some reason, then you can click on the Delete
+    action link to delete it; this cannot be undone. The Finalize action link
+    is used to prepare a classifier to classify new web pages, which cannot be
+    done until you've added some training examples. We'll discuss how to add
+    new examples next, then return to the Finalize link.</p>
+
+    <h3>Editing a Classifier</h3>
+
+    <p>Clicking on the Edit action link takes you to a new page where you can
+    change a classifier's class label, view some statistics, and provide
+    examples of positive and negative instances of the target class. The first
+    two options should be self-explanatory, but the last is somewhat involved.
+    A classifier needs labeled training examples in order to learn to recognize
+    instances of a particular class, and you help provide these by picking out
+    example pages from previous crawls and telling the classification system
+    whether they belong to the class or do not belong to the class. The Add
+    Examples section of the Edit Classifier page lets you select an existing
+    crawl to draw potential examples from, and optionally narrow down the
+    examples to those that satisfy a query. Once you've done this, clicking the
+    Load button will send a request to the server to load some pages from the
+    crawl and choose the next one to receive a label.  You'll be presented with
+    a record representing the selected document, similar to a search result,
+    with several action links along the side that let you mark this document as
+    either a positive or negative example of the target class, or skip this
+    document and move on to the next one:</p>
+
+    <img src="resources/ClassifiersEdit.png" alt="The Classifiers edit page" />
+
+    <p>When you select any of the action links, your choice is sent to the
+    server, which responds with a new example to label (so long as there are
+    more examples in the selected index). The old example record is shifted
+    down the page and its background color updated to reflect your
+    decision&mdash;green for a positive example, red for a negative one, and
+    gray for a skip; the statistics at the top of the page are updated
+    accordingly. The new example record replaces the old one, and the process
+    repeats. Each time a new label is sent to the server, it is added to the
+    training set that will ultimately be used to prepare the classifier to
+    classify new web pages during a crawl. Each time you label a set number of
+    new examples (10 by default), the classifier will also estimate its current
+    accuracy by splitting the current training set into training and testing
+    portions, training a simple classifier on the training portion, and testing
+    on the remainder (checking the classifier output against the known labels).
+    The new estimated accuracy, calculated as the proportion of the test pages
+    classified correctly, is displayed under the Statistics section. You can
+    also manually request an updated accuracy estimate by clicking the Update
+    action link next to the Accuracy field. Doing this will send a request to
+    the server that will initiate the same process described previously, and
+    after a delay, display the new estimate.</p>
+
+    <p>All of this happens without reloading the page, so avoid using the web
+    browser's Back button. If you do end up reloading the page somehow, then
+    the current example record and the list of previously labeled examples will
+    be gone, but none of your progress toward building the training set will be
+    lost.</p>
+
+    <h3>Finalizing a Classifier</h3>
+
+    <p>Editing a classifier adds new labeled examples to the training set,
+    providing the classifier with a more complete picture of the kinds of
+    documents it can expect to see in the future. In order to take advantage of
+    an expanded training set, though, you need to <em>finalize</em> the
+    classifier. This is broken out into a separate step because it involves
+    optimizing a function over the entire training set, which can be slow even
+    for a few hundred example documents. It wouldn't be practical to wait for
+    the classifier to re-train each time you add a new example, so you have to
+    explicitly tell the classifier that you're done adding examples for now by
+    clicking on the Finalize action link on the classifier management page.</p>
+
+    <p>Clicking this link will kick off a separate process that trains the
+    classifier in the background. When the page reloads, the Finalize link
+    should have changed to text that reads &quot;Finalizing...&quot; (but if
+    the training set is very small, training may complete almost immediately).
+    After starting finalization, it's fine to walk away for a bit, reload the
+    page, or carry out some unrelated task in the admin console. You shouldn't,
+    however, make further changes to the classifier's training set, or start a
+    new crawl that makes use of the classifier. When the classifier finishes
+    its training phase, the Finalizing message will be replaced by one that
+    reads &quot;Finalized&quot; (you'll have to reload the page, as it will not
+    update itself), indicating that the classifier is ready for use.</p>
+
+    <h3>Using a Classifier</h3>
+
+    <p>Using a classifier is as simple as selecting the classifier's label
+    on the Page Options activity, under the &quot;Classifiers to Apply&quot;
+    heading. When the next crawl starts, the classifier (and any other selected
+    classifiers) will be applied to each fetched page, and if a page is
+    determined to belong to a target class, it will have several meta words
+    added. As an example, if the target class is &quot;spam&quot;, and a page
+    is determined to belong to the class with probability .79, then the
+    page will have the following meta words added:</p>
+
+    <ul>
+        <li>class:spam</li>
+        <li>class:spam:50plus</li>
+        <li>class:spam:60plus</li>
+        <li>class:spam:70plus</li>
+        <li>class:spam:70</li>
+    </ul>
+
+    <p>These meta words allow one to search for all pages classified as spam at
+    any probability over the preset threshold of .50 (with class:spam), at any
+    probability over a specific multiple of .1 (e.g., over .6 with
+    class:spam:60plus), or within a specific range (e.g., .60&ndash;.69 with
+    class:spam:60). Note that no meta words are added if the probability falls
+    below the threshold, so no page will ever have the meta words
+    class:spam:10plus, class:spam:20plus, class:spam:20, and so on.</p>
+
+    <p><a href="#toc">Return to table of contents</a>.</p>
+
     <h2 id='page-options'>Page Indexing and Search Options</h2>
-    <p>Several properties about how web pages are indexed and
-    how pages are looked up at search time can be controlled
-    by clicking on Page Options. There are three tabs for this activity: Crawl Time,
-    Search Time, and Test Options. We will discuss each of these in turn.</p>
+    <p>Several properties about how web pages are indexed and how pages are
+    looked up at search time can be controlled by clicking on Page Options.
+    There are three tabs for this activity: Crawl Time, Search Time, and Test
+    Options. We will discuss each of these in turn.</p>
     <h3>Crawl Time Tab</h3>
     <p>Clicking on Page Options leads to the default Crawl Time Tab:</p>
 <img src='resources/PageOptionsCrawl.png' alt='The Page Options Crawl form'/>
@@ -1985,7 +2129,19 @@ encoding = "ASCII";
     check the unknown checkbox in the upper left of this list.
     </p>
     <p>
-    The indexing plugins checkboxes, allow you to select which plugins
+    The Classifiers to Apply checkboxes allow you to select the classifiers
+    that will be used to classify pages during a crawl. Each classifier (see
+    the <a href="#classifiers">Classifiers</a> section for details) is
+    represented in the list by its class label and a checkbox. Checking the box
+    indicates that the associated classifier should be used (made active)
+    during the next crawl. Each active classifier is run on each page
+    downloaded during a crawl, and if the page is determined to belong to the
+    class that the classifier has been trained to recognize, then a meta word
+    like &quot;class:<i>label</i>&quot;, where <i>label</i> is the class label,
+    is added to the page summary.
+    </p>
+    <p>
+    The Indexing Plugins checkboxes allow you to select which plugins
     to use during the crawl. For instance,
     clicking the RecipePlugin checkbox would cause Yioop to run the code
     in indexing_plugins/recipe_plugin.php. This code tries to detect pages

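The meta word scheme described under "Using a Classifier" in the diff above can be sketched as follows. This is an illustration in Python, not Yioop's PHP code: the function name classifier_meta_words, its parameters, and the handling of boundary values (a probability exactly at the threshold is treated as a match) are assumptions.

```python
def classifier_meta_words(label, probability, threshold=0.5):
    """Return the meta words for one page and one classifier.

    Hypothetical helper: it mirrors the scheme in the documentation text,
    but the name, signature, and boundary handling are assumptions.
    """
    # Pages below the threshold receive no meta words at all.
    if probability < threshold:
        return []

    # Round down to a multiple of 10 percent; the small epsilon guards
    # against floating-point error in the multiplication.
    decile = int(probability * 10 + 1e-9) * 10
    first = int(threshold * 10 + 1e-9) * 10

    # The bare class word matches any page over the threshold.
    words = ["class:" + label]

    # One "plus" word for every multiple of 10 from the threshold up to
    # the page's own decile.
    for d in range(first, decile + 1, 10):
        words.append("class:{}:{}plus".format(label, d))

    # One word naming the specific range the probability falls in.
    words.append("class:{}:{}".format(label, decile))
    return words


if __name__ == "__main__":
    for word in classifier_meta_words("spam", 0.79):
        print(word)
```

For the documentation's example, a page classified as spam with probability .79 produces class:spam, class:spam:50plus, class:spam:60plus, class:spam:70plus, and class:spam:70, while a page at .45 produces no meta words.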