diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 178a866..d495dc1 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -15,6 +15,7 @@
<li><a href="#userroles">Managing Users and Roles</a></li>
<li><a href="#crawls">Managing Crawls</a></li>
<li><a href="#mixes">Mixing Crawl Indexes</a></li>
+ <li><a href="#classifiers">Classifying Web Pages</a></li>
<li><a href="#page-options">Page Indexing and Search Options</a></li>
<li><a href="#editor">Results Editor</a></li>
<li><a href="#sources">Search Sources</a></li>
@@ -1925,11 +1926,154 @@ encoding = "ASCII";
be clicked.
</p>
<p><a href="#toc">Return to table of contents</a>.</p>
+
+ <h2 id='classifiers'>Classifying Web Pages</h2>
+ <p>Sometimes searching for text that occurs within a page isn't enough to
+ find what one is looking for. For example, the relevant set of documents
+ may have many terms in common, with only a small subset showing up on any
+ particular page, so that one would have to search for many disjoint terms
+ in order to find all relevant pages. Or one may not know which terms are
+ relevant, making it hard to formulate an appropriate query. Or the relevant
+ documents may share many key terms with irrelevant documents, making it
+ difficult to formulate a query that fetches one but not the other. Under
+ these circumstances (among others), it would be useful to have meta words
+ already associated with the relevant documents, so that one could just
+ search for the meta word. The Classifiers activity provides a way to train
+ classifiers that recognize classes of documents; these classifiers can then
+ be used during a crawl to add appropriate meta words to pages determined to
+ belong to one or more classes.</p>
+
+ <p>Clicking on the Classifiers activity displays a text field where you can
+ create a new classifier, and a table of existing classifiers, where each
+ row corresponds to a classifier and provides some statistics and action
+ links. A classifier is identified by its class label, which is also used to
+ form the meta word that will be attached to documents. Each classifier can
+ only be trained to recognize instances of a single target class, so the
+ class label should be a short description of that class, containing only
+ alphanumeric characters and underscores (e.g., "spam",
+ "homepage", or "menu"). Typing a new class label into
+ the text field and clicking the Create button initializes a new classifier,
+ which will then show up in the table.</p>
+
+ <img src="resources/ClassifiersManage.png"
+ alt="The Classifiers manage page" />
+
+ <p>Once you have a fresh classifier, the natural thing to do is edit it by
+ clicking on the Edit action link. If you made a mistake, however, or no
+ longer want a classifier for some reason, then you can click on the Delete
+ action link to delete it; this cannot be undone. The Finalize action link
+ prepares a classifier to classify new web pages, which requires that you
+ first add some training examples. We'll discuss how to add new examples
+ next, then return to the Finalize link.</p>
+
+ <h3>Editing a Classifier</h3>
+
+ <p>Clicking on the Edit action link takes you to a new page where you can
+ change a classifier's class label, view some statistics, and provide
+ examples of positive and negative instances of the target class. The first
+ two options should be self-explanatory, but the last is somewhat involved.
+ A classifier needs labeled training examples in order to learn to recognize
+ instances of a particular class, and you help provide these by picking out
+ example pages from previous crawls and telling the classification system
+ whether they belong to the class or do not belong to the class. The Add
+ Examples section of the Edit Classifier page lets you select an existing
+ crawl to draw potential examples from, and optionally narrow down the
+ examples to those that satisfy a query. Once you've done this, clicking the
+ Load button will send a request to the server to load some pages from the
+ crawl and choose the next one to receive a label. You'll be presented with
+ a record representing the selected document, similar to a search result,
+ with several action links along the side that let you mark this document as
+ either a positive or negative example of the target class, or skip this
+ document and move on to the next one:</p>
+
+ <img src="resources/ClassifiersEdit.png" alt="The Classifiers edit page" />
+
+ <p>When you select any of the action links, your choice is sent to the
+ server, and a new example to label is returned (so long as there are
+ more examples in the selected index). The old example record is shifted
+ down the page and its background color updated to reflect your
+ decision: green for a positive example, red for a negative one, and
+ gray for a skip. The statistics at the top of the page are updated
+ accordingly. The new example record replaces the old one, and the process
+ repeats. Each time a new label is sent to the server, it is added to the
+ training set that will ultimately be used to prepare the classifier to
+ classify new web pages during a crawl. Each time you label a set number of
+ new examples (10 by default), the classifier will also estimate its current
+ accuracy by splitting the current training set into training and testing
+ portions, training a simple classifier on the training portion, and testing
+ on the remainder (checking the classifier output against the known labels).
+ The new estimated accuracy, calculated as the proportion of the test pages
+ classified correctly, is displayed under the Statistics section. You can
+ also manually request an updated accuracy estimate by clicking the Update
+ action link next to the Accuracy field. Doing this will send a request to
+ the server that will initiate the same process described previously, and
+ after a delay, display the new estimate.</p>
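+ <p>The accuracy estimate described above can be sketched roughly as
+ follows. This is only an illustration in Python, not Yioop's own code: the
+ function names, the 20% test fraction, and the majority-vote stand-in for
+ the "simple classifier" are all assumptions made for the example.</p>

```python
import random


def estimate_accuracy(examples, train_fn, test_fraction=0.2, seed=0):
    """Hold-out accuracy estimate (hypothetical helper, not Yioop's API).

    examples: list of (features, label) pairs that have already been labeled.
    train_fn: callable that takes training pairs and returns a predict function.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    # Assumption: hold out a fixed fraction; the text does not give the ratio.
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    predict = train_fn(train)
    # Accuracy is the proportion of held-out pages classified correctly.
    correct = sum(1 for features, label in test if predict(features) == label)
    return correct / len(test)


def train_majority(train):
    """Stand-in 'simple classifier': always predicts the majority label."""
    positives = sum(1 for _, label in train if label)
    majority = positives * 2 >= len(train)
    return lambda features: majority
```

+ <p>Any function that turns labeled pairs into a predictor could be passed
+ in place of <i>train_majority</i>; the hold-out logic stays the same.</p>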
+
+ <p>All of this happens without reloading the page, so avoid using the web
+ browser's Back button. If you do end up reloading the page somehow, then
+ the current example record and the list of previously labeled examples will
+ be gone, but none of your progress toward building the training set will be
+ lost.</p>
+
+ <h3>Finalizing a Classifier</h3>
+
+ <p>Editing a classifier adds new labeled examples to the training set,
+ providing the classifier with a more complete picture of the kinds of
+ documents it can expect to see in the future. In order to take advantage of
+ an expanded training set, though, you need to <em>finalize</em> the
+ classifier. This is broken out into a separate step because it involves
+ optimizing a function over the entire training set, which can be slow for
+ even a few hundred example documents. It wouldn't be practical to wait for
+ the classifier to re-train each time you add a new example, so you have to
+ explicitly tell the classifier that you're done adding examples for now by
+ clicking on the Finalize action link on the classifier management page.</p>
+
+ <p>Clicking this link will kick off a separate process that trains the
+ classifier in the background. When the page reloads, the Finalize link
+ should have changed to text that reads "Finalizing..." (but if
+ the training set is very small, training may complete almost immediately).
+ After starting finalization, it's fine to walk away for a bit, reload the
+ page, or carry out some unrelated task in the admin console. You shouldn't,
+ however, make further changes to the classifier's training set, or start a
+ new crawl that makes use of the classifier. When the classifier finishes
+ its training phase, the Finalizing message will be replaced by one that
+ reads "Finalized" (you'll have to reload the page, as it will not
+ update itself), indicating that the classifier is ready for use.</p>
+
+ <h3>Using a Classifier</h3>
+
+ <p>Using a classifier is as simple as selecting the classifier's label
+ on the Page Options activity, under the "Classifiers to Apply"
+ heading. When the next crawl starts, the classifier (and any other selected
+ classifiers) will be applied to each fetched page, and if a page is
+ determined to belong to a target class, it will have several meta words
+ added. As an example, if the target class is "spam", and a page
+ is determined to belong to the class with probability .79, then the
+ page will have the following meta words added:</p>
+
+ <ul>
+ <li>class:spam</li>
+ <li>class:spam:50plus</li>
+ <li>class:spam:60plus</li>
+ <li>class:spam:70plus</li>
+ <li>class:spam:70</li>
+ </ul>
+
+ <p>These meta words allow one to search for all pages classified as spam at
+ any probability over the preset threshold of .50 (with class:spam), at any
+ probability over a specific multiple of .1 (e.g., over .6 with
+ class:spam:60plus), or within a specific range (e.g., .60–.69 with
+ class:spam:60). Note that no meta words are added if the probability falls
+ below the threshold, so no page will ever have the meta words
+ class:spam:10plus, class:spam:20plus, class:spam:20, and so on.</p>
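+ <p>The naming scheme above can be illustrated with a short Python sketch.
+ It is not Yioop's implementation; <i>class_meta_words</i> is a hypothetical
+ helper, it assumes a probability exactly at the threshold counts as a
+ match, and the text does not say what happens at a probability of exactly
+ 1, so that case is left unhandled.</p>

```python
def class_meta_words(label, probability, threshold=0.50):
    """Return the meta words for one page (hypothetical helper, not Yioop's API)."""
    # Assumption: a probability equal to the threshold still counts as a match.
    if probability < threshold:
        return []
    # The small epsilon guards against float error (e.g. 0.57 * 100 == 56.99...).
    percent = int(probability * 100 + 1e-9)
    decile = percent // 10 * 10
    start = int(threshold * 100 + 1e-9) // 10 * 10
    words = ["class:" + label]
    # One "plus" word for every multiple of .1 from the threshold up to the decile.
    for lower in range(start, decile + 1, 10):
        words.append("class:%s:%dplus" % (label, lower))
    # One word naming the specific range the probability falls in.
    words.append("class:%s:%d" % (label, decile))
    return words


print(class_meta_words("spam", 0.79))
```

+ <p>For the "spam" example with probability .79, this prints the same five
+ meta words listed above, and it returns an empty list for any probability
+ under .50.</p>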
+
+ <p><a href="#toc">Return to table of contents</a>.</p>
+
<h2 id='page-options'>Page Indexing and Search Options</h2>
- <p>Several properties about how web pages are indexed and
- how pages are looked up at search time can be controlled
- by clicking on Page Options. There are three tabs for this activity: Crawl Time,
- Search Time, and Test Options. We will discuss each of these in turn.</p>
+ <p>Several properties about how web pages are indexed and how pages are
+ looked up at search time can be controlled by clicking on Page Options.
+ There are three tabs for this activity: Crawl Time, Search Time, and Test
+ Options. We will discuss each of these in turn.</p>
<h3>Crawl Time Tab</h3>
<p>Clicking on Page Options leads to the default Crawl Time Tab:</p>
<img src='resources/PageOptionsCrawl.png' alt='The Page Options Crawl form'/>
@@ -1985,7 +2129,19 @@ encoding = "ASCII";
check the unknown checkbox in the upper left of this list.
</p>
<p>
- The indexing plugins checkboxes, allow you to select which plugins
+ The Classifiers to Apply checkboxes allow you to select the classifiers
+ that will be used to classify pages during a crawl. Each classifier (see
+ the <a href="#classifiers">Classifiers</a> section for details) is
+ represented in the list by its class label and a checkbox. Checking the box
+ indicates that the associated classifier should be used (made active)
+ during the next crawl. Each active classifier is run on each page
+ downloaded during a crawl, and if the page is determined to belong to the
+ class that the classifier has been trained to recognize, then a meta word
+ like "class:<i>label</i>", where <i>label</i> is the class label,
+ is added to the page summary.
+ </p>
+ <p>
+ The Indexing Plugins checkboxes allow you to select which plugins
to use during the crawl. For instance,
clicking the RecipePlugin checkbox would cause Yioop to run the code
in indexing_plugins/recipe_plugin.php. This code tries to detect pages