Chris Pollett [2013-07-24]
Revised documentation for Version 0.96, a=chris
Filename
en-US/pages/about.thtml
en-US/pages/coding.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
en-US/pages/ranking.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index bb4f9e6..c9cbcfe 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -1,9 +1,18 @@
 <h1>About SeekQuarry/Yioop</h1>
-<p>SeekQuarry is the parent site for <a href="http://www.yioop.com/">Yioop</a>.
-Both SeekQuarry and Yioop were written mainly by myself, <a
-href="http://www.cs.sjsu.edu/faculty/pollett">Chris Pollett</a>. The project
-began in Nov. 2009 and had its first publically available release in August,
-2010.
+<p>SeekQuarry, LLC is the company responsible for the
+<a href="http://www.yioop.com/">Yioop PHP Search Engine</a>
+project.
+SeekQuarry is owned by, and Yioop was mainly written by, myself, <a
+href="http://www.cs.sjsu.edu/faculty/pollett">Chris Pollett</a>. Development of
+Yioop began in Nov. 2009 and was first publicly
+released in August 2010. SeekQuarry maintains the documentation and official
+public code repository for Yioop. It is also responsible for the SeekQuarry
+and Yioop servers. SeekQuarry LLC receives revenue from
+<a href="?c=main&p=downloads#consulting" >consulting services</a> related to
+Yioop and from <a href="?c=main&p=downloads#contribute"
+>contributions</a> from people interested in the continued development of the
+Yioop Search Engine Software and in the documentary resources the SeekQuarry
+website provides.
 </p>

 <h1>The Yioop and SeekQuarry Names</h1>
@@ -26,15 +35,15 @@ giup sounds like in English. It was already taken. So then I
 combined the two to get Yioop.</p>

 <h1>Dictionary Data</h1>
-<p>
-<a href="http://en.wikipedia.org/wiki/Bloom_Filter">Bloom filters</a> for
-n grams on the Yioop test site were generated using
-<a href="http://dumps.wikimedia.org/other/pagecounts-raw/">Wikimedia
-Page View Statistics</a>.
-<a href="http://en.wikipedia.org/wiki/Trie">Trie</a>'s for word suggestion
-for all languages other than Vietnamese were built
+<p>The
+<a href="http://en.wikipedia.org/wiki/Bloom_Filter">Bloom filter</a> for
+Chinese word segmentation was developed using the word list
+<a href="http://www.mdbg.net/chindict/chindict.php?page=cedict"
+>http://www.mdbg.net/chindict/chindict.php?page=cedict</a> which has
+a Creative Commons license. <a href="http://en.wikipedia.org/wiki/Trie">Trie</a>s
+for word suggestion for all languages other than Vietnamese were built
 using the <a href="http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists"
->Wiktionary Frequency Lists</a>. These are available under a
+>Wiktionary Frequency Lists</a>. These are also available under a
 <a href="http://creativecommons.org/licenses/by-sa/3.0/">Creative
 Commons Share Alike 3.0 Unported License</a> as described on <a
 href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Wikipedia's
@@ -59,8 +68,8 @@ Ahmed Kamel Taha, and Sugi Widjaja. Thanks to Ravi Dhillon, Akshat Kukreti,
 Tanmayee Potluri, Shawn Tice, and Sandhya Vissapragada for
 creating patches for Yioop issues. Several of my master's students have done
 projects related to Yioop: Amith Chandranna, Priya Gangaraju,
-Vijaya Pamidi, Vijeth Patil, Vijaya Sinha, Tarun Pepira, Tanmayee Potluri, and
-Sandhya Vissapragada. Amith's code related to an
+Akshat Kukreti, Vijaya Pamidi, Vijeth Patil, Vijaya Sinha, Tarun Pepira,
+Tanmayee Potluri, Shawn Tice, and Sandhya Vissapragada. Amith's code related to an
 Online version of the HITs algorithm. It is not currently in the main branch of
 Yioop, but it is obtainable from
 <a href="http://www.cs.sjsu.edu/faculty/pollett/masters/Semesters/
@@ -85,10 +94,14 @@ concerned using Open Street Map data in Yioop. This code is also not currently
 in the main branch. Priya Gangaraju's code served as the
 basis for the plugin feature currently in Yioop. Shawn Tice's CS288
 project served as the basis of a rewrite of the archive crawl feature of Yioop
-for the multi-queue server setting. Sandhya Vissapragada's Master project served
-as the basis for the autosuggest and spell checking functionality in Yioop.
-The following other students have created text processors for Yioop: Nakul
-Natu (pptx), Vijeth Patil (epub), and Tarun Ramaswamy (xslx). Akshat Kukreti
-created the Italian language stemmer based on the Snowball version at
-<a href="http://tartarus.org/">http://tartarus.org/</a>.
+for the multi-queue server setting. His CS298 project served as the basis
+for the classifier feature within Yioop. Sandhya Vissapragada's Master's project
+served as the basis for the autosuggest and spell checking functionality in
+Yioop. The following other students have created text processors for Yioop:
+Nakul Natu (pptx), Vijeth Patil (epub), and Tarun Ramaswamy (xslx).
+Akshat Kukreti created the Italian language stemmer based on the Snowball
+version at <a href="http://tartarus.org/">http://tartarus.org/</a>. His CS298
+project served as the basis for the cache history feature and for the ETag
+support while crawling in Yioop.
+
 </p>
diff --git a/en-US/pages/coding.thtml b/en-US/pages/coding.thtml
index 81bafb8..591a7cf 100755
--- a/en-US/pages/coding.thtml
+++ b/en-US/pages/coding.thtml
@@ -120,9 +120,8 @@ $a = array(1, 2, 3, 4);
 for($i = 0; $i &lt; $num; $i++) {
 }
     </pre>
-    Some leeway may be given on this, if it helps make a line under 80
-    characters provided being under 80 characters in the instance in
-    question helps program clarity.
+    Some leeway may be given on this if it helps make a line under 80
+    characters -- provided being under 80 characters helps program clarity.
     </li>
     <li>Do not use unstable code layouts such as:
     <pre>
@@ -320,6 +319,8 @@ class MyClass
     This facilitates Yioop's auto-loading mechanism.
     </li>
     <li>Yioop code should not use PHP <a
+    href="http://php.net/manual/en/language.generators.overview.php"
+    >generators</a>, <a
     href="http://us3.php.net/manual/en/language.namespaces.php">namespaces</a>,
     <a href="http://php.net/manual/en/language.oop5.traits.php">traits</a>, or
     <a href="http://php.net/manual/en/functions.anonymous.php"
@@ -570,7 +571,8 @@ selector2,
     <ol>
     <li>Any web page output by Yioop should validate as <a
     href="http://www.w3.org/TR/html5/">HTML5</a>. This can be checked at
-    the site <a href="http://validator.w3.org">http://validator.w3.org</a>.</li>
+    the site <a href="http://validator.w3.org/">http://validator.w3.org/</a>.
+    </li>
     <li>Any web page output by Yioop should pass the Web accessibility checks
     of the <a href="http://wave.webaim.org/">WAVE Tool</a>.</li>
     <li>Web pages should render reasonably similarly in any version of
@@ -664,7 +666,7 @@ master's pages&lt;/a&gt;
     <ol>
     <li>Except in subclasses of DatasourceManager, Yioop PHP code should not
     directly call native PHP database functions. That is, functions with names
-    beginning with  db2_, mysql_, pg_, orcl_, sqlite_, etc., or similar
+    beginning with  db2_, mysql_, mysqli_, pg_, orcl_, sqlite_, etc., or similar
     PHP classes. A DatasourceManager object exists as the $db field
     variable of any subclass of Model.</li>
     <li>SQL should not appear in Yioop in any functions or classes other than
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index d495dc1..48fe758 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,34 +1,62 @@
 <div class="docs">
-<h1>Yioop Documentation v 0.94</h1>
+<h1>Yioop Documentation v 0.96</h1>
     <h2 id='toc'>Table of Contents</h2>
+    <ul>
+    <li><a href="#overview"><b>Overview</b></a>
     <ul>
         <li><a href="#quick">Getting Started</a></li>
-        <li><a href="#intro">Introduction</a></li>
-        <li><a href="#features">Feature List</a></li>
+        <li><a href="#intro">Introduction to Search Engines and Yioop</a></li>
+        <li><a href="#features">Yioop Feature List</a></li>
+    </ul>
+    </li>
+    <li><a href="#set-up"><b>Set-up</b></a>
+    <ul>
         <li><a href="#requirements">Requirements</a></li>
         <li><a href="#installation">Installation and Configuration</a></li>
         <li><a href="#upgrade">Upgrading Yioop</a></li>
         <li><a href="#files">Summary of Files and Folders</a></li>
-        <li><a href="#interface">Yioop Search and User Interface</a></li>
-        <li><a href="#mobile">Yioop Mobile Interface</a></li>
+    </ul>
+    </li>
+    <li><a href="#interface"><b>Search and User Interface</b></a>
+    <ul>
+        <li><a href="#search-basic">Search Basics</a></li>
+        <li><a href="#operators">Search Operators</a></li>
+        <li><a href="#result-formats">Result Formats</a></li>
+        <li><a href="#settings-signin">Settings and Signin</a></li>
+        <li><a href="#mobile">Mobile Interface</a></li>
         <li><a href="#passwords">Managing Accounts</a></li>
         <li><a href="#userroles">Managing Users and Roles</a></li>
-        <li><a href="#crawls">Managing Crawls</a></li>
+    </ul>
+    </li>
+    <li><a href="#crawl-results"><b>Crawling and Customizing Results</b></a>
+    <ul>
+        <li><a href="#crawls">Performing and Managing Crawls</a></li>
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
         <li><a href="#classifiers">Classifying Web Pages</a></li>
         <li><a href="#page-options">Page Indexing and Search Options</a></li>
         <li><a href="#editor">Results Editor</a></li>
         <li><a href="#sources">Search Sources</a></li>
         <li><a href="#machines">GUI for Managing Machines and Servers</a></li>
+    </ul>
+    </li>
+    <li><a href="#yioop-sites"><b>Building Sites with Yioop</b></a>
+    <ul>
         <li><a href="#localizing">Localizing Yioop to a New Language</a></li>
         <li><a href="#framework">Building a Site using Yioop as Framework</a>
         </li>
         <li><a href="#embedding">Embedding Yioop in an Existing Site</a></li>
-        <li><a href="#customizing">Customizing Yioop</a></li>
+    </ul>
+    </li>
+    <li><a href="#advanced-topics"><b>Advanced Topics</b></a>
+    <ul>
+        <li><a href="#customizing-code">Modifying Yioop Code</a></li>
         <li><a href="#commandline">Yioop Command-line Tools</a></li>
-        <li><a href="#references">References</a></li>
     </ul>
-    <h2 id="quick">Getting Started</h2>
+    </li>
+        <li><a href="#references"><b>References</b></a></li>
+    </ul>
+<h2 id="overview">Overview</h2>
+    <h3 id="quick">Getting Started</h3>
     <p>This document serves as a detailed reference for the
     Yioop search engine. If you want to get started using Yioop now,
     you probably want to first read the
@@ -38,7 +66,7 @@
     section followed by the more general <a
     href="#installation">Installation and Configuration</a> instructions.
     </p>
-    <h2 id="intro">Introduction</h2>
+    <h3 id="intro">Introduction</h3>
     <p>The Yioop search engine is designed to allow users
     to produce indexes of a web-site or a collection of
     web-sites. The number of pages a Yioop index can handle range from small
@@ -53,9 +81,8 @@
     eco-system, and when Yioop might be the right choice for your search
     engine needs. In the remainder of this document after the introduction,
     we discuss how to get and install Yioop; the files and folders used
-    in Yioop; user, role, search, subsearch, crawl,
-     and machine management in the Yioop system;
-    localization in the Yioop system; building a site using the Yioop
+    in Yioop; the various crawl, search, and administration facilities in
+    Yioop; localization in the Yioop system; building a site using the Yioop
     framework; embedding Yioop in an existing web-site;
     customizing Yioop; and the Yioop command-line tools.
     </p>
@@ -92,10 +119,10 @@
     common thread in the development of search systems: Search as an appliance,
     where you either have a separate search server and access it through either
     a web-based API or through function calls. Yioop has both a search
-    function API as well as a web API that returns
-    <a href="http://www.opensearch.org">Open Search RSS results</a>. These
-    can be used to embed Yioop within your existing site. If you want to
-    create a new search engine site, Yioop offers a web-based,
+    function API as well as a web API that can return
+    <a href="http://www.opensearch.org">Open Search RSS results</a> or
+    a JSON variant. These can be used to embed Yioop within your existing site.
+    If you want to create a new search engine site, Yioop offers a web-based,
     model-view-controller framework with a web-interface for localization
     that can serve as the basis for your app.
     </p>
@@ -139,8 +166,10 @@
     and a stemming algorithm such as [<a href="#P1980">P1980</a>] or a character
     n-gramming technique might be employed (a stemmer would extract the word
     jump from words such as jumps, jumping, etc; converting jumping to 3-grams
-    would make terms of length 3, i.e., jum, ump, mpi, pin, ing). Next a
-    statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
+    would make terms of length 3, i.e., jum, ump, mpi, pin, ing).
+    For some languages like Chinese, where spaces between words are not always
+    used, a segmenting algorithm like reverse maximal match might be used. Next
+    a statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
     (or at least the non-query time part of it) is computed to determine the
     importance of that word in that document compared to that word amongst
     all other documents. To do this calculation
@@ -190,9 +219,10 @@
     coordinator for crawls. It is called the <b>name server</b>. In addition
     to the name server, one might have several processes called
     <b>queue servers</b> that perform scheduling and indexing jobs, as well as
-    <b>fetcher</b> processes which are responsible for downloading pages.
-    Through the name server's web app, users can send messages to the
-    queue servers and fetchers. This interface writes message
+    <b>fetcher</b> processes which are responsible for downloading pages
+    and for page processing such as the stemming, char-gramming, and segmenting
+    mentioned above. Through the name server's web app, users can send messages
+    to the queue servers and fetchers. This interface writes message
     files that queue servers periodically look for. Fetcher processes
     periodically ping the name server to find the name of the current crawl
     as well as a list of queue servers. Fetcher programs then periodically
@@ -219,10 +249,10 @@
     <p>As an example
     of how this scales, a 2010 Mac Mini running a queue server
     program can schedule and index about 100,000 pages/hour. This corresponds
-    to the work of about 10 fetcher processes (which can be on the same
-    machine, if you have enough memory, or different ones). The checks by
-    fetchers on the name server are lightweight, so adding another machine with
-    a queue server and the corresponding additional fetchers allows one to
+    to the work of about 7 fetcher processes (which may be on different
+    machines -- roughly, you want 1GB and one core per fetcher). The checks by fetchers
+    on the name server are lightweight, so adding another machine with a queue
+    server and the corresponding additional fetchers allows one to
     effectively double this speed. This also has the benefit of speeding up
     query processing as when a query comes in, it gets split into queries for
     each of the queue server's web apps, but the query only "looks" slightly
@@ -231,7 +261,12 @@
     the number of queries that can be handled at a given time, Yioop installations
     can also be configured as "mirrors" which keep an exact copy of the
     data stored in the site being mirrored. When a query request comes into a
-    Yioop node, either it or any of its mirrors might handle it.
+    Yioop node, either it or any of its mirrors might handle it. Query
+    processing for multi-word queries can actually be a major bottleneck if
+    you don't have many machines and you do have a large index. To further
+    speed this up, Yioop uses a hybrid inverted index/suffix tree
+    approach to store word lookups. The suffix tree ideas are motivated by
+    [<a href="#PTSHVC2011">PTSHVC2011</a>].
     </p>
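+    <p>To make this bottleneck concrete, the following toy PHP fragment
+    sketches the posting list intersection a conventional inverted index must
+    perform for a conjunctive two-word query. It is purely illustrative and
+    is not Yioop's actual lookup code, which uses the hybrid scheme just
+    described:</p>
+<pre>
+// toy inverted index: term => sorted list of document ids
+$index = array(
+    "chris"   => array(1, 4, 7, 9),
+    "pollett" => array(2, 4, 9, 11),
+);
+// a conjunctive multi-word query intersects one posting list per term,
+// so its cost grows with both list length and the number of query terms
+function intersectPostings($lists)
+{
+    $result = array_shift($lists);
+    foreach ($lists as $list) {
+        $result = array_values(array_intersect($result, $list));
+    }
+    return $result;
+}
+print_r(intersectPostings(array($index["chris"], $index["pollett"])));
+// outputs the ids of the documents containing both terms: 4 and 9
+</pre>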
     <p>Since a  multi-million page crawl involves both downloading from the
     web rapidly over several days, Yioop supports the ability to dynamically
@@ -250,9 +285,13 @@
     To reduce the effect of this Yioop supports domain name caching.
     </p>
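+    <p>A minimal sketch of the idea behind such a cache follows; it is an
+    illustration only, not Yioop's implementation:</p>
+<pre>
+// remember host => IP lookups so repeated requests to the same host
+// skip a slow DNS resolution
+$dns_cache = array();
+function cachedDnsLookup($host)
+{
+    global $dns_cache;
+    if (!isset($dns_cache[$host])) {
+        $dns_cache[$host] = gethostbyname($host); // slow only on a miss
+    }
+    return $dns_cache[$host];
+}
+cachedDnsLookup("www.example.com"); // does a DNS lookup and caches it
+cachedDnsLookup("www.example.com"); // answered from the cache
+</pre>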
     <p>Despite its simpler one-round model, Yioop does a number of things to
-    improve the quality of its search results. For each link extracted from a
-    page, Yioop creates a micropage which it adds to its index. This includes
-    relevancy calculations for each word in the link as well as an
+    improve the quality of its search results. While indexing, Yioop
+    can make use of Lasso regression classifiers [<a href="#GLM2007">GLM2007</a>]
+    using data from earlier crawls to help label and/or rank documents
+    in the active crawl. Yioop also takes advantage of the link structure
+    that might exist between documents in a one-round way: For each link
+    extracted from a page, Yioop creates a micropage which it adds to its index.
+    This includes relevancy calculations for each word in the link as well as an
     [<a href="#APC2003">APC2003</a>]-based ranking of how important the
     link was. Yioop supports a number of iterators which can be thought of
     as implementing a stripped-down relational algebra geared towards
@@ -364,48 +403,15 @@
     This concludes the discussion of how Yioop fits into the current and
     historical landscape of search engines and indexes.
     </p>
-    <h2 id="features">Feature List</h2>
+    <h3 id="features">Feature List</h3>
     <p>
-    Here is short summary features of Yioop:
+    Here is a summary of the features of Yioop:
     </p>
     <ul>
+    <li><b>General</b>
+    <ul>
     <li>Yioop is an open-source, distributed crawler and search engine
     written in PHP.</li>
-    <li>It is capable of crawling and indexing small sites to sites or
-    collections of sites containing low hundreds of millions
-    of documents.</li>
-    <li>On a given machine it uses multi-curl to support many simultaneous
-    downloads of pages.</li>
-    <li>It has a web interface to select seed sites for crawls and to set what
-    sites crawls should not be crawled.</li>
-    <li>It obeys robots.txt files including Google and Bing extensions such
-    as the Crawl-delay and Sitemap directives as well as * and $ in allow and
-    disallow. It further supports the robots meta tag directives
-    NONE, NOINDEX, NOFOLLOW, NOARCHIVE, and NOSNIPPET. It also
-    supports anchor tags with rel="nofollow"
-    attributes. It also supports X-Robots-Tag HTTP headers.</li>
-    <li>Yioop supports crawl quotas for web sites. I.e., one can control
-    the number of urls/hour downloaded from a site.</li>
-    <li>Yioop can detect website congestion and slow down crawling
-    a site that it detects as congested.</li>
-    <li>It supports open web crawls, but through its web interface one can
-    configure it also to crawl only specifics site, domains, or collections
-    of sites and domains. One can customize a crawl using regex in disallowed
-    directives to crawl a site to a fixed depth.</li>
-    <li>It supports dynamically changing the allowed and disallowed
-    sites while a crawl is in progress.</li>
-    <li>It supports dynamically injecting new seeds site via a web
-    interface into the active crawl.</li>
-    <li>It has its own DNS caching mechanism.</li>
-    <li>Yioop supports the indexing of many different filetypes including:
-    HTML, Atom, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF,
-    sitemaps, SVG, XLSX, and XML. It has a web interface for controlling which
-    amongst these filetypes (or all of them) you want to index. It supports
-    also attempting to extract information from unknown filetypes.</li>
-    <li>Yioop supports subsearches geared towards presenting certain
-    kinds of media such as images, video, and news. The list of video and
-    news sites can be configured through the GUI. Yioop has a news_updater
-    process which can be used to automatically update news feeds hourly.</li>
     <li>Crawling, indexing, and serving search results can be done on a
     single machine or distributed across several machines.</li>
     <li>The fetcher/queue_server processes on several machines can be
@@ -413,61 +419,123 @@
     <li>Yioop installations can be created with a variety of topologies:
     one queue_server and many fetchers or several queue_servers and
     many fetchers.</li>
+    <li>Using web archives, crawls can be mirrored amongst several machines
+    to speed-up serving search results. This can be further sped-up
+    by using memcache or filecache.</li>
+    <li>Yioop comes with its own extendable model-view-controller
+    framework that you can use directly to create new sites that use
+    Yioop search technology. This framework also comes with a GUI
+    which makes it easy to localize strings and static pages.</li>
+    </ul>
+    </li>
+    <li><b>Search and User Interface</b>
+    <ul>
+    <li>Yioop supports subsearches geared towards presenting certain
+    kinds of media such as images, video, and news. The list of video and
+    news sites can be configured through the GUI. Yioop has a news_updater
+    process which can be used to automatically update news feeds hourly.</li>
     <li>Yioop determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
     <li>Yioop can be configured to display word suggestions as a user
     types a query. It can also suggest spell corrections for mis-typed
-    queries</li>
-    <li>Since version 0.70, Yioop indexes are positional rather than
+    queries. This feature can be localized.</li>
+    <li>Yioop has been optimized to work well with smart phone web browsers
+    and with tablet devices.</li>
+    <li>Yioop supports the ability to filter out urls from search
+    results after a crawl has been performed. It also has the ability
+    to edit summary information that will be displayed for urls.</li>
+    <li>A given Yioop installation might have several saved crawls and
+    it is very quick to switch between any of them and immediately start
+    doing text searches.</li>
+    <li>Besides the standard output of a web page with ten links, Yioop can
+    output query results in Open Search RSS format or a JSON variant of this
+    format, and Yioop data can also be queried via a function API.</li>
+    </ul>
+    </li>
+
+    <li><b>Indexing</b>
+    <ul>
+    <li>Yioop is capable of indexing anything from small sites up to sites or
+    collections of sites containing low hundreds of millions
+    of documents.</li>
+    <li>For indexes starting with v0.96, Yioop uses a hybrid inverted
+    index/suffix tree approach for word lookup to make multi-word
+    queries faster.</li>
+    <li>Yioop indexes are positional rather than
     bag-of-words indexes, and an index compression scheme called Modified9
     is used.</li>
-    <li>Yioop supports a web interface which makes
+    <li>Yioop has a web interface which makes
     it easy to combine results from several crawl indexes to create unique
     result presentations. These combinations can be done in a conditional
     manner using "if:" meta words.</li>
-    <li>Indexing occurs as crawling happens, so when a crawl is stopped,
-    it is ready to be used to handle search queries immediately.</li>
-    <li>Yioop supports an indexing plugin architecture to make it
-    possible to write one's own indexing modules that do further
-    post-processing.</li>
-    <li>Yioop has a web form that allows a user to control the recrawl
-    frequency for a page during a crawl.</li>
+    <li>Yioop supports the indexing of many different filetypes including:
+    HTML, Atom, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF,
+    sitemaps, SVG, XLSX, and XML. It has a web interface for controlling which
+    amongst these filetypes (or all of them) you want to index. It supports
+    also attempting to extract information from unknown filetypes.</li>
     <li>Yioop has a simple page rule language for controlling what content
     should be extracted from a page or record.</li>
+    <li>Indexing occurs as crawling happens, so when a crawl is stopped,
+    it is ready to be used to handle search queries immediately.</li>
+    <li>Yioop indexes can be used to create classifiers which then
+    can be used in labeling and ranking future indexes.</li>
+    <li>Yioop comes with stemmers for English and Italian, and a
+    word segmenter for Chinese. It uses char-gramming for other languages.
+    Yioop has a simple architecture for adding stemmers for other languages.
+    </li>
     <li>Yioop uses a web archive file format which makes it easy to
     copy crawl results amongst different machines. It has a command-line
     tool for inspecting these archives if they need to be examined
     in a non-web setting. It also supports command-line search querying
     of these archives.</li>
-    <li>Using web archives, crawls can be mirrored amongst several machines
-    to speed-up serving search results. This can be further sped-up
-    by using memcache or filecache.</li>
-    <li>Yioop supports the ability to filter out urls from search
-    results after a crawl has been performed. It also has the ability
-    to edit summary information that will be displayed for urls.</li>
-    <li>A given Yioop installation might have several saved crawls and
-    it is very quick to switch between any of them and immediately start
-    doing text searches.</li>
+    <li>Yioop supports an indexing plugin architecture to make it
+    possible to write one's own indexing modules that do further
+    post-processing.</li>
+    </ul>
+    </li>
+
+    <li><b>Web and Archive Crawling</b>
+    <ul>
+    <li>Yioop supports open web crawls, but through its web interface one can
+    configure it to crawl only specific sites, domains, or collections
+    of sites and domains. One can customize a crawl using regex in disallowed
+    directives to crawl a site to a fixed depth.</li>
+    <li>Yioop uses multi-curl to support many simultaneous
+    downloads of pages (see the sketch after this feature list).</li>
+    <li>Yioop obeys robots.txt files including Google and Bing extensions such
+    as the Crawl-delay and Sitemap directives as well as * and $ in allow and
+    disallow. It further supports the robots meta tag directives
+    NONE, NOINDEX, NOFOLLOW, NOARCHIVE, and NOSNIPPET. It also
+    supports anchor tags with rel="nofollow"
+    attributes. It also supports X-Robots-Tag HTTP headers.</li>
+    <li>Yioop has its own DNS caching mechanism.</li>
+    <li>Yioop supports crawl quotas for web sites. I.e., one can control
+    the number of urls/hour downloaded from a site.</li>
+    <li>Yioop can detect website congestion and slow down crawling
+    a site that it detects as congested.</li>
+    <li>Yioop supports dynamically changing the allowed and disallowed
+    sites while a crawl is in progress. Yioop also supports dynamically
+    injecting new seed sites into the active crawl via a web
+    interface.</li>
+    <li>Yioop has a web form that allows a user to control the recrawl
+    frequency for a page during a crawl.</li>
+    <li>Yioop keeps track of ETag: and Expires: HTTP headers to avoid
+    downloading content it already has in its index.</li>
     <li>Yioop supports importing data from ARC, WARC, database queries,
     MediaWiki XML, and ODP RDF files. It has a generic importing facility
     to import text records such as access log, mail log, usenet posts, etc.,
     which are either not compressed, or compressed
-    using gzip or bzip2. It also supports re-indexing of data from WebArchives
-    created since version 0.66.</li>
-    <li>Yioop comes with its own extendable model-view-controller
-    framework that you can use directly to create new sites that use
-    Yioop search technology. This framework also comes with a GUI
-    which makes it easy to localize strings and static pages.</li>
-    <li>Besides standard output of a web page with ten links it is possible
-    to get query results in Open Search RSS format and also to query
-    Yioop data via a function api.</li>
-    <li>Yioop has been optimized to work well with smart phone web browsers
-    and with tablet devices.</li>
-    <li>Yioop has built-in support for image and video specific search.</li>
+    using gzip or bzip2. It also supports re-indexing of data from WebArchives.
+    </li>
     </ul>
-    <p><a href="#toc">Return to table of contents</a>.</p>
+    </li>
+
+    </ul>
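+    <p>As a small illustration of the multi-curl item in the crawling list
+    above, the sketch below downloads several pages concurrently using PHP's
+    standard curl_multi API. It is a self-contained example, not Yioop's
+    fetcher code:</p>
+<pre>
+$urls = array("http://www.example.com/", "http://www.example.org/");
+$multi = curl_multi_init();
+$handles = array();
+foreach ($urls as $url) {
+    $curl = curl_init($url);
+    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
+    curl_multi_add_handle($multi, $curl);
+    $handles[$url] = $curl;
+}
+do { // drive all transfers until none are still active
+    curl_multi_exec($multi, $active);
+    curl_multi_select($multi);
+} while ($active > 0);
+foreach ($handles as $url => $curl) {
+    echo $url . ": " . strlen(curl_multi_getcontent($curl)) . " bytes\n";
+    curl_multi_remove_handle($multi, $curl);
+    curl_close($curl);
+}
+curl_multi_close($multi);
+</pre>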
+

-    <h2 id="requirements">Requirements</h2>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='set-up'>Set-up</h2>
+    <h3 id="requirements">Requirements</h3>
     <p>The Yioop search engine requires: (1) a web server, (2) PHP 5.3 or
     better (Yioop used only to serve search results from a pre-built index
     has been tested to work in PHP 5.2), (3) Curl libraries for downloading
@@ -532,14 +600,14 @@ files and both of these must be changed.</p>
     <p>As a final step, after installing the necessary software,
     <b>make sure to start/restart your web server and verify that
     it is running.</b></p>
-    <h3>Memory Requirements</h3>
+    <h4>Memory Requirements</h4>
     <p>In addition, to the prerequisite software listed above, Yioop also
     has certain memory requirements. By default bin/queue_server.php
-    requires 1400MB, bin/fetcher.php requires 850MB, and index.php requires
+    requires 1800MB, bin/fetcher.php requires 850MB, and index.php requires
     500MB. These  values are set near the tops of each of these files in turn
     with a line like:</p>
 <pre>
-ini_set("memory_limit","1600M");
+ini_set("memory_limit","1800M");
 </pre>
     <p>
     Often in a VM setting these requirements are somewhat steep. It is possible
@@ -552,7 +620,7 @@ ini_set("memory_limit","1600M");
     you should be able to trade-off memory requirements for speed.
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='installation'>Installation and Configuration</h2>
+    <h3 id='installation'>Installation and Configuration</h3>
 <p>
 The Yioop application can be obtained using the
 <a href="http://www.seekquarry.com/?c=main&p=downloads">download page at
@@ -728,7 +796,7 @@ the installation is complete.
 </p>
     <p><a href="#toc">Return to table of contents</a>.</p>

-<h2 id='upgrade'>Upgrading Yioop</h2>
+    <h3 id='upgrade'>Upgrading Yioop</h3>
 <p>If you have an older version of Yioop that you would like to upgrade,
  make sure to back up your data. Then download the latest
 version of Yioop and unzip it to the location you would like. Set the
@@ -738,7 +806,7 @@ instructions on this, if you have forgotten how you did this.
 Knowing the old Work Directory location should
 allow Yioop to complete the upgrade process.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='files'>Summary of Files and Folders</h2>
+    <h3 id='files'>Summary of Files and Folders</h3>
     <p>The Yioop search engine consists of three main
 scripts:</p>
 <dl>
@@ -765,7 +833,7 @@ an instance of Yioop on the ISP hosting your website. This website could
 serve search results without making use of either fetcher.php or
 queue_server.php. To perform a web crawl, however, you need to use both
 of these programs as well as the Yioop web site. This is explained in
-detail in the section Managing Crawls.
+detail in the section on <a href="#crawls">Performing and Managing Crawls</a>.
 </p>
 <p>The Yioop folder itself consists of several files and sub-folders.
 The file index.php as mentioned above is the main entry point into the Yioop
@@ -778,14 +846,20 @@ about who is crawling their sites. Here is a rough guide to what
 the Yioop folder's various sub-folders contain:
 <dl>
 <dt>bin</dt><dd>This folder is intended to hold command-line scripts
-which are used in conjunction with Yioop. In addition to the fetcher.php
-and queue_server.php script already mentioned, it contains arc_tool.php,
+and daemons which are used in conjunction with Yioop.
+In addition to the fetcher.php and queue_server.php scripts already mentioned,
+it contains arc_tool.php, classifier_tool.php, classifier_trainer.php,
 code_tool.php, mirror.php, news_updater.php and query_tool.php. arc_tool.php
 can be used to examine the contents of WebArchiveBundle's and
-IndexArchiveBundle's from the command line. code_tool.php is for use by
-developers to maintain the Yioop code-base in various ways. mirror.php can be
-used if you would like to create a mirror/copy of a Yioop installation.
-news_updater.php can be used to do hurly updates of news feed search sources
+IndexArchiveBundle's from the command line. classifier_tool.php
+is a command line tool for creating a classifier; it can be used to perform
+some of the tasks that can also be done through the <a
+href="#classifiers">Web Classifier Interface</a>. classifier_trainer.php is
+a daemon used in the finalization stage of building a classifier.
+code_tool.php is for use by developers to maintain the Yioop code-base in
+various ways. mirror.php can be used if you would like to create a
+mirror/copy of a Yioop installation. news_updater.php can be used to do hourly
+updates of news feed search sources
 in Yioop. Finally, query_tool.php can be used to
 run queries from the command-line.</dd>
 <dt>configs</dt><dd>This folder contains configuration files. You will
@@ -796,7 +870,7 @@ and how often, the queue_server and fetchers communicate, and which file types
 are supported by Yioop. configure_tool.php is a command-line tool which
 can perform some of the configurations needed to get a Yioop installation
 running. It is only necessary in some virtual private server settings --
-the prefered way to configure Yioop is through the web interface.
+the preferred way to configure Yioop is through the web interface.
 createdb.php can be used to create a bare instance of
 the Yioop database with a root admin user having no password. This script is
 not strictly necessary as the database should be creatable via the admin panel;
@@ -807,11 +881,10 @@ There it is renamed as crawl.ini and serves as the initial set of sites to crawl
 until you decide to change these. The file token_tool.php is a tool which can
 be used to help in term extraction during crawls and for making trie's
 which can be used for word suggestions for a locale. To help word extraction
-this tool can generate in a locale folder (see below) a word gram bloom filter.
-Word grams are sequences of words that should be treated as a unit, for example,
-Honda Accord. token_tool.php can use either a raw Wikipedia page count dump
-file, or an actual Wikipedia dump file to extract from titles or redirects
-these word grams. For trie construction this tool can use a file that lists
+this tool can generate in a locale folder (see below) a word bloom filter.
+This filter can be used to segment strings into words for languages such as
+Chinese that don't use spaces to separate words in sentences.
+For trie and segmenter filter construction, this tool can use a file that lists
 words one per line.
 </dd>
 <dt>controllers</dt><dd>The controllers folder contains all the controller
@@ -835,19 +908,18 @@ whose code gives an example of how to use the Yioop search function api.</dd>
 <dt>lib</dt><dd>This folder is short for library. It contains all the common
 classes for things like indexing, storing data to files, parsing urls, etc.
 lib contains six subfolders: <i>archive_bundle_iterators</i>,
-<i>compressors</i>, <i>index_bundle_iterators</i>, <i>indexing_plugins</i>,
-<i>processors</i>, and <i>stemmers</i>. The <i>archive_bundle_iterators</i>
+<i>classifiers</i>, <i>compressors</i>, <i>index_bundle_iterators</i>,
+<i>indexing_plugins</i>, and <i>processors</i>. The <i>archive_bundle_iterators</i>
 folder has iterators for iterating over the objects of various kinds of
 web archive file formats, such as arc, wiki-media, etc.
 These iterators are used to iterate over such archives during
-a recrawl. The <i>compressors</i> folder contains classes that might be used
-to compress objects in a web_archive. The <i>index_bundle_iterator</i>
-folder contains a variety of iterators useful for iterating over lists of
-documents which might be returned during a query to the search engine.
-The <i>processors</i> folder contains processors to extract page summaries for
-a variety of different mimetypes. The <i>stemmers</i> folder is where word
-stemmers for different languages would appear. Right now only an
-English porter stemmer is present in this folder.</dd>
+a recrawl. The <i>classifiers</i> folder contains code for training
+classifiers used by Yioop. The <i>compressors</i> folder contains classes that
+might be used to compress objects in a web_archive. The
+<i>index_bundle_iterator</i> folder contains a variety of iterators useful
+for iterating over lists of documents which might be returned during a query
+to the search engine. The <i>processors</i> folder contains processors to
+extract page summaries for a variety of different mimetypes.</dd>
 <dt>locale</dt><dd>This folder contains the default locale data which comes
 with the Yioop system. A locale encapsulates data associated with a
 language and region. A locale is specified by an
@@ -874,12 +946,10 @@ language in question for spell check purposes, and roman_array for mapping
 between roman alphabet and the character system of the locale in question;
 <i>suggest-trie.txt.gz</i>, a <a href="http://en.wikipedia.org/wiki/Trie"
 >Trie data structure</a> used for search bar word suggestions;
-and <i>tokenizer.php</i>, which either specifies the number of characters for
-this language to constitute a char gram or contains a stemmer class used to stem
-terms for this language. This folder might also contain a Bloom filter file
-with a name like all_word_grams.ftr which would be used to do word gramming of
-sequences of words that should be treated as a unit, for example, "Honda Accord"
-or "Bill Clinton".
+and <i>tokenizer.php</i>, which can specify the number of characters for
+this language to constitute a char gram, or might contain a segmenter to split
+strings into words for this language, or a stemmer class used to stem terms for
+this language.
 </dd>
 <dt>models</dt><dd>This folder contains the subclasses of Model used by
 Yioop Models are used to encapsulate access to secondary storage.
@@ -935,7 +1005,7 @@ the WORK DIRECTORY's sub-folder contain:
 <dt>app</dt><dd>This folder is used to contain your overrides to
 the views, controllers, models, resources, etc. For example, if you
 wanted to change how the search results were rendered, you could
-ass a views/search_view.php file to the app folder and Yioop would use
+add a views/search_view.php file to the app folder and Yioop would use
 it rather than the one in the Yioop base directory's views folder.
 Using the app dir makes it easier to have customizations that won't get
 messed up when you upgrade Yioop.</dd>
@@ -955,6 +1025,11 @@ QueueBundleUNIX_TIMESTAMP folders.</dd>
 being used then a seek_quarry.db file is stored in the data folder. In Yioop,
 the database is used to manage users, roles, locales, and crawls. Data for
 crawls themselves are NOT stored in the database.</dd>
+<dt>locale</dt><dd>This is generally a copy of the locale folder mentioned
+earlier. In fact, it is the version that Yioop will try to use first.
+It contains any customizations that have been done to locales for this instance
+of Yioop.
+</dd>
 <dt>log</dt><dd>When the fetcher and queue_server are run as daemon processes
 log messages are written to log files in this folder. Log rotation is also done.
 These log files can be opened in a text editor or console app.</dd>
@@ -991,7 +1066,18 @@ also see folders 0-temp, 1-temp, etc.</dd>
     <p><a href="#toc">Return to table of contents</a>.</p>


-    <h2 id='interface'>The Yioop Search and User Interface</h2>
+    <h2 id='interface'>Search and User Interface</h2>
+<p>At this point one hopefully has installed Yioop.
+If you used one of the <a href="?c=main&p=install">install guides</a>,
+you may also have performed a simple crawl. We are now going to
+describe some of the basic search features of Yioop as well as the Yioop
+administration interface. We will describe
+how to perform crawls with Yioop in more detail in the
+<a href="?c=main&p=documentation#crawl-results"
+>Crawling and Customizing Results</a> chapter. If you do not have a crawl
+available, you can test some of these features on the <a
+href="http://www.yioop.com/">Yioop Demo Site</a>.</p>
+<h3 id='search-basic'>Search Basics</h3>
 <p>
 The main search form for Yioop looks like:
 </p>
@@ -1023,6 +1109,11 @@ through the Manage Locale activity. A typical search results might look like:
 </p>
 <img src='resources/SearchResults.png' alt='Example Search Results'
 width="70%"/>
+<p>Hovering over the Score of a search result reveals its component scores.
+These might include Rank, Relevance, and Proximity, as well as any Use to Rank
+classifier scores.</p>
+<img src='resources/ScoreToolTip.png' alt='Example Score Components Tool Tip'/>
+
 <p>If one slightly mistypes a query term, Yioop can sometimes suggest
 a spelling correction:</p>
 <img src='resources/SearchSpellCorrect.png' alt='Example Search Results
@@ -1131,6 +1222,8 @@ news items in the current language will be displayed, roughly ranked by
 recentness. If one has RSS media sources which are set to be from
 different locales, then this will be taken into account on this blank query
 News page.</p>
+<p><a href="#toc">Return to table of contents</a>.</p>
+<h3 id='operators'>Search Operators</h3>
 <p>Turning now to the topic of how to enter a query in Yioop:
 A basic query to the Yioop search form is typically a sequence of
 words separated by whitespace. This will cause Yioop to compute a
@@ -1293,6 +1386,7 @@ would  multiply scores satisfying Chris Pollett  and on wikipedia.org by
 <p>Although we didn't say it next to each query form above, if it makes sense,
 there is usually an <i>all</i> variant to a form. For example, os:all returns
 all documents from servers for which os information appeared in the headers.</p>
+<h3 id='result-formats'>Result Formats</h3>
 <p>
 In addition to using the search form interface to query Yioop, it is also
 possible to query Yioop and get results in Open Search RSS format. To
@@ -1300,7 +1394,16 @@ do that you can either directly type a URL into your browser of the form:</p>
 <pre>
 http://my-yioop-instance-host/?f=rss&amp;q=query+terms
 </pre>
-<p>Or you can write AJAX code that makes requests of URLs in this format.</p>
+<p>Or you can write AJAX code that makes requests of URLs in this format.
+Although there is no official Open Search JSON format, one can get a JSON
+object with the same structure as the RSS search results using a
+query to Yioop such as:
+</p>
+<pre>
+http://my-yioop-instance-host/?f=json&amp;q=query+terms
+</pre>
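+<p>For instance, the following short PHP fragment (with the same placeholder
+host name as above) fetches and decodes such a JSON response:</p>
+<pre>
+// query a Yioop instance for results in JSON format;
+// replace the host below with your own Yioop instance
+$query = urlencode("query terms");
+$response = file_get_contents(
+    "http://my-yioop-instance-host/?f=json&amp;q=" . $query);
+$results = json_decode($response, true);
+print_r($results); // structure parallels the Open Search RSS results
+</pre>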
+    <p><a href="#toc">Return to table of contents</a>.</p>
+<h3 id='settings-signin'>Settings and Signin</h3>
 <p>In
 the corner of the page with the main search form is a Settings-Signin element:
 </p>
@@ -1347,7 +1450,9 @@ Over the next several sections we will discuss each of the Yioop admin
 activities in turn. Before we do that we make a couple remarks about using
 Yioop from a mobile device.
 </p>
-    <h2 id='mobile'>Yioop Mobile Interface</h2>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+
+    <h3 id='mobile'>Mobile Interface</h3>
     <p>Yioop's user interface is designed to display reasonably well as is
     on tablet devices such as the iPad. For smart phones, such as
     iPhone, Android, Blackberry, or Windows Phone, Yioop has a separate
@@ -1369,7 +1474,7 @@ Yioop from a mobile device.
     except for the above minor changes, these instructions will also apply to
     the mobile setting.
     </p>
-    <h2 id='passwords'>Managing Accounts</h2>
+    <h3 id='passwords'>Managing Accounts</h3>
     <p>By default, when a user first signs in to the Yioop admin
     panel the current activity is the Manage Account activity. For now,
     this activity just lets users change their password using the form
@@ -1378,7 +1483,7 @@ Yioop from a mobile device.
 <img src='resources/ChangePassword.png' alt='Change Password Form'/>

     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='userroles'>Managing Users and Roles</h2>
+    <h3 id='userroles'>Managing Users and Roles</h3>
     <p>The manage user and manage role activities have similar looking
     forms. The Manage User activity looks like:</p>
 <img src='resources/ManageUser.png' alt='The Manage User form'/>
@@ -1399,8 +1504,8 @@ Yioop from a mobile device.

     <p><a href="#toc">Return to table of contents</a>.</p>

-
-    <h2 id='crawls'>Managing Crawls</h2>
+<h2 id='crawl-results'>Crawling and Customizing Results</h2>
+    <h3 id='crawls'>Performing and Managing Crawls</h3>
     <p>The Manage Crawl activity in Yioop looks like:</p>
 <img src='resources/ManageCrawl.png' alt='Manage Crawl Form'/>
     <p>
@@ -1444,7 +1549,7 @@ Yioop from a mobile device.
     fetchers, and because the on screen display refreshes only every 20 seconds
     or so.
     </p>
-    <h3 id="prereqs">Prerequisites for Crawling</h3>
+    <h4 id="prereqs">Prerequisites for Crawling</h4>
     <p>Before you can start a new crawl, you need to run at least one
     queue_server.php script and you need to run at least one fetcher.php script.
     These can be run either from the same Yioop installation or from
@@ -1510,7 +1615,7 @@ php fetcher.php stop</pre>
     is not possible to resume the crawl. We have now described what is
     necessary to perform a crawl; we now return to how to set the
     options for how the crawl is conducted.</p>
-    <h3>Common Crawl and Search Configurations</h3>
+    <h4>Common Crawl and Search Configurations</h4>
     <p>When testing Yioop, it is quite common just to have one instance
     of the fetcher and one instance of the queue_server running, both on
     the same machine and same installation of Yioop. In this subsection
@@ -1550,7 +1655,7 @@ php fetcher.php start 5
     respectively. It is completely possible to copy these subfolders to
     a SSD and use symlinks to them under the original crawl directory to
     enhance Yioop's search performance.</p>
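+    <p>For instance, the move-and-symlink idea might look as follows in PHP;
+    the paths here are made up for illustration:</p>
+<pre>
+// move an index bundle to a faster drive, then symlink it back into
+// place so Yioop still finds it at the expected location
+$old = "/var/yioop/cache/IndexDataUNIX_TIMESTAMP";
+$ssd = "/ssd/yioop/IndexDataUNIX_TIMESTAMP";
+rename($old, $ssd); // may require a real copy across file systems
+symlink($ssd, $old);
+</pre>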
-    <h3>Specifying Crawl Options and Modifying Options of the Active Crawl</h3>
+    <h4>Specifying Crawl Options and Modifying Options of the Active Crawl</h4>
     <p>As we pointed out above, next to the Start Crawl button is an Options
     link. Clicking on this link lets you set various aspects of how
     the next crawl should be conducted. If there is
@@ -1577,7 +1682,7 @@ php fetcher.php start 5
     <a href="http://rdf.dmoz.org/"
     >Open Directory Project RDF file</a>. In the next subsection, we describe
     new web crawls and then return to the archive crawls subsection after that.</p>
-    <h4>Web Crawl Options</h4>
+    <h5>Web Crawl Options</h5>
     <p>
     On the web crawl tab, the first form field, "Get Crawl Options From",
     allows one to read in crawl options either from the default_crawl.ini file
@@ -1666,7 +1771,7 @@ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
     for the Options form. </p>

-    <h4 id="archive-crawl">Archive Crawl Options</h4>
+    <h5 id="archive-crawl">Archive Crawl Options</h5>
     <p>We now consider how to do crawls of previously obtained archives.
     From the initial crawl options screen, clicking on the Archive Crawl
     tab gives one the following form:</p>
@@ -1864,8 +1969,7 @@ encoding = "ASCII";
     </p>

     <p><a href="#toc">Return to table of contents</a>.</p>
-
-    <h2 id='mixes'>Mixing Crawl Indexes</h2>
+    <h3 id='mixes'>Mixing Crawl Indexes</h3>
     <p>Once you have performed a few crawls with Yioop, you can use the Mix
     Crawls activity to create mixtures of your crawls.
     This section describes how to create crawl mixes which are processed
@@ -1927,7 +2031,7 @@ encoding = "ASCII";
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>

-    <h2 id='classifiers'>Classifying Web Pages</h2>
+    <h3 id='classifiers'>Classifying Web Pages</h3>
     <p>Sometimes searching for text that occurs within a page isn't enough to
     find what one is looking for. For example, the relevant set of documents
     may have many terms in common, with only a small subset showing up on any
@@ -1966,7 +2070,7 @@ encoding = "ASCII";
     done until you've added some training examples. We'll discuss how to add
     new examples next, then return to the Finalize link.</p>

-    <h3>Editing a Classifier</h3>
+    <h4>Editing a Classifier</h4>

     <p>Clicking on the Edit action link takes you to a new page where you can
     change a classifier's class label, view some statistics, and provide
@@ -2015,7 +2119,7 @@ encoding = "ASCII";
     be gone, but none of your progress toward building the training set will be
     lost.</p>

-    <h3>Finalizing a Classifier</h3>
+    <h4>Finalizing a Classifier</h4>

     <p>Editing a classifier adds new labeled examples to the training set,
     providing the classifier with a more complete picture of the kinds of
@@ -2026,7 +2130,9 @@ encoding = "ASCII";
     even a few hundred example documents. It wouldn't be practical to wait for
     the classifier to re-train each time you add a new example, so you have to
     explicitly tell the classifier that you're done adding examples for now by
-    clicking on the Finalize action link on the classifier management page.</p>
+    clicking on the Finalize action link either next to the Load button
+    on the edit classifier page or next to the given classifier's name on
+    the classifier management page.</p>

     <p>Clicking this link will kick off a separate process that trains the
     classifier in the background. When the page reloads, the Finalize link
@@ -2037,15 +2143,18 @@ encoding = "ASCII";
     however, make further changes to the classifier's training set, or start a
     new crawl that makes use of the classifier. When the classifier finishes
     its training phase, the Finalizing message will be replaced by one that
-    reads &quot;Finalized&quot; (you'll have to reload the page, as it will not
-    update itself), indicating that the classifier is ready for use.</p>
+    reads &quot;Finalized&quot;, indicating that the classifier is ready for
+    use.</p>

-    <h3>Using a Classifier</h3>
+    <h4>Using a Classifier</h4>

-    <p>Using a classifier is as simple as selecting the classifier's label
-    on the Page Options activity, under the &quot;Classifiers to Apply&quot;
+    <p>Using a classifier is as simple as checking the
+    "Use to Classify" or "Use to Rank" checkboxes next to the classifier's label
+    on the Page Options activity, under the &quot;Classifiers and Rankers&quot;
     heading. When the next crawl starts, the classifier (and any other selected
-    classifiers) will be applied to each fetched page, and if a page is
+    classifiers) will be applied to each fetched page. If "Use to Rank"
+    is checked then the classifier score for that page will be recorded. If
+    "Use to Classify" is checked and if a page is
     determined to belong to a target class, it will have several meta words
     added. As an example, if the target class is &quot;spam&quot;, and a page
     is determined to belong to the class with probability .79, then the
@@ -2069,12 +2178,12 @@ encoding = "ASCII";

     <p><a href="#toc">Return to table of contents</a>.</p>

-    <h2 id='page-options'>Page Indexing and Search Options</h2>
+    <h3 id='page-options'>Page Indexing and Search Options</h3>
     <p>Several properties about how web pages are indexed and how pages are
     looked up at search time can be controlled by clicking on Page Options.
     There are three tabs for this activity: Crawl Time, Search Time, and Test
     Options. We will discuss each of these in turn.</p>
-    <h3>Crawl Time Tab</h3>
+    <h4>Crawl Time Tab</h4>
     <p>Clicking on Page Options leads to the default Crawl Time Tab:</p>
 <img src='resources/PageOptionsCrawl.png' alt='The Page Options Crawl form'/>
     <p>This tab controls some aspects about how a page is processed and indexed
@@ -2090,14 +2199,22 @@ encoding = "ASCII";
     The Byte Range to Download dropdown controls how many bytes out of
     any given web page should be downloaded. Smaller numbers reduce the
     requirements on disk space needed for a crawl; bigger numbers would
-    tend to improve the search results. The Cache whole crawled pages
+    tend to improve the search results. If whole pages are being cached,
+    these downloaded bytes are stored in archives with the fetcher.
+    The Max Page Summary Length in Bytes controls how many of the total
+    bytes can be used to make a page summary which is sent to the
+    queue server. It is only words in this summary which can actually be
+    looked up in search results. Care should be taken in making this
+    value larger as it can increase both the RAM requirements
+    (you might have to change the memory_limit variable at the start of
+    queue_server.php to prevent crashing) while crawling, and it can slow
+    the crawl process down. The Cache whole crawled pages
     checkbox says whether, when crawling, to keep both the
     whole web page downloaded as well as the summary extracted from the
     web page (checked) or just to keep the page summary (unchecked).
-    The next dropdown,
-    Allow Page Recrawl After, controls how many days that Yioop keeps
-    track of all the URLs that it has downloaded from. For instance, if one
-    sets this dropdown to 7, then after seven days Yioop will clear its
+    The next dropdown, Allow Page Recrawl After, controls how many days
+    Yioop keeps track of all the URLs that it has downloaded from. For instance,
+    if one sets this dropdown to 7, then after seven days Yioop will clear its
     Bloom Filter files used to store which urls have been downloaded, and it
     would be allowed to recrawl these urls again if they happened in links. It
     should be noted that all of the information from before the seven
@@ -2129,16 +2246,30 @@ encoding = "ASCII";
     check the unknown checkbox in the upper left of this list.
     </p>
     <p>
-    The Classifiers to Apply checkboxes allow you to select the classifiers
-    that will be used to classify pages during a crawl. Each classifier (see
+    The Classifiers and Rankers checkboxes allow you to select the classifiers
+    that will be used to classify or rank pages. Each classifier (see
     the <a href="#classifiers">Classifiers</a> section for details) is
-    represented in the list by its class label and a checkbox. Checking the box
+    represented in the list by its class label and two checkboxes. Checking
+    the box under "Use to Classify"
     indicates that the associated classifier should be used (made active)
-    during the next crawl. Each active classifier is run on each page
-    downloaded during a crawl, and if the page is determined to belong to the
-    class that the classifier has been trained to recognize, then a meta word
-    like &quot;class:<i>label</i>&quot;, where <i>label</i> is the class label,
-    is added to the page summary.
+    during the next crawl for classifying; checking "Use to Rank"
+    indicates that the classifier should be used (made active) and its score
+    for the document should be stored so that it can be used as part of
+    the search time score. Each active classifier is run on each page
+    downloaded during a crawl. If "Use to Classify" was checked and the page
+    is determined to belong to the class that the classifier has been trained
+    to recognize, then a meta word like &quot;class:<i>label</i>&quot;,
+    where <i>label</i> is the class label,
+    is added to the page summary. For faster access to pages that contain
+    a single term and a label, for example, pages that contain "rich" and
+    are labeled as "non-spam", Yioop actually uses the first character
+    of the label "non-spam" and embeds it as part of the term ID of "rich" on
+    "non-spam" pages with the word "rich". To ensure this speed-up can be
+    used it is useful to make sure ones classifier labels begin with different
+    first characters. If "Use to Rank" is checked then when a classifier is
+    run on the page, the score from the classifier is recorded. When a search
+    is done that might retrieve this page, this score is then used as one
+    component of the overall score that this page receives for the query.
     </p>
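+    <p>As a small illustrative sketch (not code from Yioop itself), one
+    could check a set of hypothetical labels for the distinct first
+    character property this speed-up relies on with a few lines of PHP:</p>
+    <pre>
+&lt;?php
+// hypothetical classifier labels; only their first characters
+// matter for the speed-up described above
+$labels = array("non-spam", "recipe", "weather");
+$first_chars = array();
+foreach ($labels as $label) {
+    $first_chars[] = $label[0];
+}
+if (count(array_unique($first_chars)) != count($labels)) {
+    echo "Some labels share a first character; the single " .
+        "term and label speed-up may not apply.\n";
+}
+    </pre>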
     <p>
     The Indexing Plugins checkboxes allow you to select which plugins
@@ -2152,7 +2283,9 @@ encoding = "ASCII";
     called by Yioop for each active plugin on each page downloaded. The second
     method is called during the stop crawl process of Yioop.
     </p>
-    <p>We now return to the Page Field Extraction Rules textarea. Commands
+    <h4 id='extraction'>Page Field Extraction Language</h4>
+    <p>We now return to the Page Field Extraction Rules textarea of
+    the Page Options - Crawl Time tab. Commands
     in this area allow a user to control what data is extracted from
     a summary of a page. The textarea allows you to do things like modify the
     summary, title, and other fields extracted from a page summary;
@@ -2160,15 +2293,15 @@ encoding = "ASCII";
     which will appear when a cache of a page is shown. Page Rules are
     especially useful for extracting data from generic text archives and
     database archives. How to import such archives is described in the
-    Archive Crawls sub-section of <a href="#crawls">Managing Crawls</a>.
-    The input to the page rule processor is an asscociative array that results
-    from Yioop doing initial processing on a page. To see what this array looks
-    like one can take a web page and paste it into the form on the Test Options
-    tab.  There are two types of page rule statements that a user can define:
-    command statements and assignment statements. In addition, a semicolon
-    ';' can be used to indicate the rest of a line is a comment. Although
-    the initial textarea for rules might appear small. Most modern
-    browsers allow one to resize this area by dragging on th
+    Archive Crawls sub-section of <a href="#crawls">Performing and
+    Managing Crawls</a>. The input to the page rule processor is an
+    associative array that results from Yioop doing initial processing on a
+    page. To see what this array looks like one can take a web page and paste
+    it into the form on the Test Options tab.  There are two types of page rule
+    statements that a user can define: command statements and assignment
+    statements. In addition, a semicolon ';' can be used to indicate the rest
+    of a line is a comment. Although the initial textarea for rules might
+    appear small, most modern browsers allow one to resize this area by dragging on the
     lower right hand corner of the area. This makes it relatively easy
     to see large sets of rules.
     </p>
@@ -2323,7 +2456,7 @@ encoding = "ASCII";
     unset(thread)
     unset(link_thread)
     </pre>
-    <h3>Search Time Tab</h3>
+    <h4>Search Time Tab</h4>
 <p>The Page Options Search Time tab looks like:</p>
 <img src='resources/PageOptionsSearch.png' alt='The Page Options Search form'/>
 <p>The Search Page Elements and Links control group is used to tell
@@ -2366,7 +2499,7 @@ Server Alpha controls the number alpha.
 </p>
 <p>The Save button of course saves any changes you
     make on this form.</p>
-    <h3>Test Options Tab</h3>
+    <h4>Test Options Tab</h4>
 <p>The Page Options Test Options tab looks like:</p>
 <img src='resources/PageOptionsTest.png' alt='The Page Options Test form'/>
 <p>In the Type dropdown one can select a
@@ -2385,7 +2518,7 @@ from the page together with their positions in the document. Finally,
a list of meta words that the document has is given. Either extracted
terms or meta-words could be used to look up this document in a Yioop index.</p>

-    <h2 id='editor'>Results Editor</h2>
+    <h3 id='editor'>Results Editor</h3>
     <p>Sometimes after a large crawl one finds that there are some results
     one does not want to appear or that the
     summary for some result is lacking. The Result Editor activity allows
@@ -2430,7 +2563,7 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     http://www.cs.sjsu.edu/faculty/pollett/ would not appear in search
     results.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='sources'>Search Sources</h2>
+    <h3 id='sources'>Search Sources</h3>
     <p>The Search Sources activity is used to manage the media sources
     available to Yioop, and also to control the subsearch links displayed
     on the top navigation bar. The Search Sources activity looks like:</p>
@@ -2511,7 +2644,7 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     one can then in Manage Locales navigate to other locales, and fill
     in translations for them as well, if desired.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='machines'>GUI for Managing Machines and Servers</h2>
+    <h3 id='machines'>GUI for Managing Machines and Servers</h3>
     <p>Rather than use the command line as described in the
     <a href="#prereqs">Prerequisites for Crawling</a> section, it is possible
     to start/stop and view the log files of queue servers and fetcher
@@ -2555,7 +2688,8 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     the server/fetcher. This switch is green if the server/fetcher is running
     and red otherwise. A similar On/Off switch is present to turn on
     and off mirroring on a machine that is acting as a mirror.</p>
-    <h2 id='localizing'>Localizing Yioop to a New Language</h2>
+    <h2 id='yioop-sites'>Building Sites with Yioop</h2>
+    <h3 id='localizing'>Localizing Yioop to a New Language</h3>
     <p>The Manage Locales activity can be used to configure Yioop
     for use with different languages and for different regions. If you decide
     to customize your Yioop installation by adding files to
@@ -2634,36 +2768,49 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     So you cannot find these ids in the source code. The tooltip trick
     mentioned above does not work for database string ids.</p>

-    <h3>Adding a stemmer or supporting character
-    n-gramming for your language</h3>
+    <h4>Adding a stemmer, segmenter or supporting character
+    n-gramming for your language</h4>
     <p>Depending on the language you are localizing to, it may make sense
     to write a stemmer for words that will be inserted into the index.
     A stemmer takes inflected or sometimes derived words and reduces
     them to their stem. For instance, jumps and jumping would be reduced to
     jump in English. As Yioop crawls it attempts to detect the language of
     a given web page it is processing. If a stemmer exists for this language
-    it will call the stemmer's stem($word) method on each word it extracts
-    from the document before inserting information about it into the index.
-    Similarly, if an end-user is entering a simple conjunctive search query
-    and a stemmer exists for his language settings, then the query terms will
+    it will call the Tokenizer class's stem($word) method on each word it
+    extracts from the document before inserting information about it into the
+    index. Similarly, if an end-user is entering a simple conjunctive search
+    query and a stemmer exists for his language settings, then the query terms will
     be stemmed before being looked up in the index. Currently, Yioop comes
-    with only an English language stemmer that uses the Porter Stemming
-    Algorithm [<a href="#P1980">P1980</a>]. This stemmer is located in the
+    with only English and Italian language stemmers. The English stemmer
+    uses the Porter Stemming Algorithm [<a href="#P1980">P1980</a>]; the
+    Italian stemmer is based on the algorithm presented at
+    <a href="http://snowball.tartarus.org/">snowball.tartarus.org</a>.
+    Stemmers should be written as a static method of the tokenizer class in the
     file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php .
-    The [<a href="#P1980">P1980</a>] link
+    The snowball.tartarus.org link
     points to a site that has source code for stemmers for many other languages
    (unfortunately, not written in PHP). It would not be hard to port these
    to PHP and then modify the tokenizer.php file of the
     appropriate locale folder. For instance, one
     could modify the file
     WORK_DIRECTORY/locale/fr-FR/resources/tokenizer.php
-    to contain a class FrStemmer with method
+    to contain a class FrTokenizer with a static method
     stem($word) if one wanted to add a stemmer for French.
     </p>
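+    <p>As an illustration, a skeletal version of such a class might look
+    as follows; the stemming logic shown is a deliberately naive stand-in,
+    not the real Snowball French algorithm:</p>
+    <pre>
+&lt;?php
+class FrTokenizer
+{
+    /* a real port would implement the Snowball French stemming
+       steps here; as a placeholder this just strips a final
+       plural "s" */
+    static function stem($word)
+    {
+        if (mb_substr($word, -1, 1, "UTF-8") == "s") {
+            $word = mb_substr($word, 0,
+                mb_strlen($word, "UTF-8") - 1, "UTF-8");
+        }
+        return $word;
+    }
+}
+    </pre>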
-    <p>In addition to supporting the ability to add stemmers, Yioop also
-    supports a default technique which can be used in lieu of a stemmer
-    called character n-grams. When used this technique segments text into
-    sequences of n characters which are then stored in Yioop as a term.
+    <p>The class inside tokenizer.php can also be used by Yioop to
+    do word segmentation. This is the process of splitting a string of words
+    without spaces in some language into its component words. Yioop
+    comes with an example segmenter for the zh-CN (Chinese) locale. It works
+    by starting at the end of the string and trying to greedily find the
+    longest word that matches the tail of the portion of the
+    string that has not yet been processed (reverse maximal match). To do this
+    it makes use of a word Bloom filter as part of how it detects if a string
+    is a word or not. We describe how to make such a filter using token_tool.php
+    in a moment.</p>
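+    <p>To make the idea concrete, here is a minimal sketch of reverse
+    maximal matching, assuming a plain isWord() lookup in place of the
+    Bloom filter Yioop really uses, and a hypothetical maximum word
+    length of 7 characters:</p>
+    <pre>
+&lt;?php
+// stand-in for a Bloom filter membership test
+function isWord($string, $dictionary)
+{
+    return in_array($string, $dictionary);
+}
+function reverseMaximalMatch($text, $dictionary, $max_len = 7)
+{
+    $words = array();
+    $end = mb_strlen($text, "UTF-8");
+    while ($end > 0) {
+        $start = max(0, $end - $max_len);
+        // shrink from the left until the tail is a known word
+        // or only a single character remains
+        while ($end - $start > 1 &&
+            !isWord(mb_substr($text, $start, $end - $start, "UTF-8"),
+            $dictionary)) {
+            $start++;
+        }
+        array_unshift($words, mb_substr($text, $start, $end - $start,
+            "UTF-8"));
+        $end = $start; // continue with the still unprocessed prefix
+    }
+    return $words;
+}
+    </pre>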
+    <p>In addition to supporting the ability to add stemmers and segmenters,
+    Yioop also supports a default technique which can be used in lieu of a
+    stemmer called character n-grams. When used this technique segments text
+    into sequences of n characters which are then stored in Yioop as a term.
     For instance if n were 3 then the word "thunder" would be split
     into "thu", "hun", "und", "nde", and "der" and each of these would be
    associated with the document that contained the word thunder.
@@ -2677,10 +2824,10 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     the length of string to use in doing char-gramming. If you add a
     language to Yioop and want to use char gramming merely add a tokenizer.php
     to the corresponding locale folder with such a line in it.</p>
-    <h3 id="token_tool">Using token_tool.php to improve search performance and
-    relevance for your language</h3>
-    <p>configs/token_tool is used to create suggest word dictionaries and 'n'
-    word gram filter files for the Yioop search engine. To create either of
+    <h4 id="token_tool">Using token_tool.php to improve search performance and
+    relevance for your language</h4>
+    <p>configs/token_tool.php is used to create suggest word dictionaries and
+    word filter files for the Yioop search engine. To create either of
     these items, the user puts a source file in Yioop's WORK_DIRECTORY/prepare
     folder. Suggest word dictionaries are used to supply the content of the
     dropdown of search terms that appears as a user is entering a query in
@@ -2702,39 +2849,19 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
    occurrences of that word in the document.
     </p>
     <p>
-    token_tool.php can also be used to make filter files. A filter file is used
-    to detect when words in a language should be treated as a unit when
-    extracting text during a crawl. For example, Bill Clinton is 2 word gram
-    which should be treated as unit because it is a particular person.
+    token_tool.php can also be used to make filter files used by a word
+    segmenter. To make a filter file
     token_tool.php is run from the command line as:
     </p>
     <pre>
-    php token_tool.php filter wiki_file lang locale n extract_type <?php
-    ?>max_to_extract
+    php token_tool.php segment-filter dictionary_file locale
     </pre>
     <p>
-    where wiki_file is a wikipedia xml file or a bz2  compressed xml file whose
-    urls or wiki page count dump file which will be used to determine the
-    n-grams, lang is an Wikipedia language tag,  locale is the IANA language
-    tag of locale to store the results for (if different from lang, for example,
-    en-US versus en for  lang), n is the number of words in a row to consider,
-    extract_type is where from Wikipedia source to extract:
+    Here dictionary_file should be a text file with one word per line, and
+    locale is the IANA language tag of the locale to store the results for.
     </p>
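+    <p>For example, if one had placed a word list in a file
+    my_word_list.txt (a hypothetical name) in the prepare folder, one
+    might run a command like:</p>
+    <pre>
+    php token_tool.php segment-filter my_word_list.txt zh-CN
+    </pre>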
-    <pre>
-    0 = title's,
-    1 = redirect's,
-    2 = page count dump wikipedia data,
-    3 = page count dump wiktionary data.
-    </pre>
-    <p>The filter file produced by the above command can be found in:</p>
-    <pre>
-    WORK_DIRECTORY/locale/which-locale-tag-you-used/resources/n_word_grams.ftr
-    </pre>
-    <p>Rather than pick a specific n such as 2 or 3 you can also use the
-    keyword <tt>all</tt> to get max_to_extract many n-grams of up to
-    arbitrary length.</p>

-    <h3>Obtaining data sets for token_tool.php</h3>
+    <h4>Obtaining data sets for token_tool.php</h4>
     <p>
     Many word lists with frequencies are obtainable on the web for free
     with Creative Commons licenses. A good starting point is:</p>
@@ -2744,38 +2871,8 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     </pre>
     <p>A little script-fu can generally take such a list and output it with the
     line format of "word/phrase space frequency"  needed by
-    token_tool.php</p>
-    <p>
-    For filter files, raw page count dumps can be found at:</p>
-    <pre>
-    <a href="http://dumps.wikimedia.org/other/pagecounts-raw/"
-    >http://dumps.wikimedia.org/other/pagecounts-raw/</a>
-    </pre>
-    <p>These probably give the best n-gram or all gram results, usually
-    in a matter of minutes; nevertheless, this tool does support trying to
-    extract  similar data from Wikipedia dumps. This can take hours.</p>
-    <p>For Wikipedia dumps, one can go to</p>
-    <pre>
-    <a href="http://dumps.wikimedia.org/enwiki/"
-    >http://dumps.wikimedia.org/enwiki/</a>
-    </pre>
-    <p>
-    and obtain a dump of the English Wikipedia (similar for other languages).
-    This page lists all the dumps according to date they were taken. Choose any
-    suitable date or the latest. A link with a label such as 20120104/,
-    represents a dump taken on  01/04/2012. Click this link to go in turn to a
-    page which has many links based on the type of content you are looking for.
-    For this tool you are interested in files under "Recombine all pages,
-    current versions only".</p>
-    <p>
-    Beneath this we might find a link with a name like:</p>
-    <pre>
-    enwiki-20120104-pages-meta-current.xml.bz2
-    </pre>
-    <p>
-    which is a file that could be processed by this tool.
-    </p>
-    <h3>Spell correction and romanized input with locale.js</h3>
+    token_tool.php, and into the one word per line format used for filter
+    files.</p>
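+    <p>As a sketch of such script-fu, assuming a hypothetical source file
+    raw_list.txt whose lines instead had the frequency first, one could
+    reverse the order with a few lines of PHP:</p>
+    <pre>
+&lt;?php
+// rewrite "frequency word" lines as "word frequency" lines
+foreach (file("raw_list.txt") as $line) {
+    list($freq, $word) = explode(" ", trim($line), 2);
+    echo "$word $freq\n";
+}
+    </pre>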
+    <h4>Spell correction and romanized input with locale.js</h4>
     <p>Yioop supports the ability to suggest alternative queries
     after a search is performed. These queries are mainly restricted to
     fixing typos in the original query. In order to calculate
@@ -2806,7 +2903,7 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     of the transliteration. An example of doing this is given for the
     Telugu locale in Yioop.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='framework'>Building a Site using Yioop as Framework</h2>
+    <h3 id='framework'>Building a Site using Yioop as Framework</h3>
     <p>The Yioop code base can serve as the code base for new custom search
     web sites. The web-app portion of Yioop uses a model-view-controller (MVC)
     framework. In this set-up, sub-classes of the Model class should handle
@@ -2966,7 +3063,7 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     <a href="#localizing">Localizing Yioop</a>.
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='embedding'>Embedding Yioop in an Existing Site</h2>
+    <h3 id='embedding'>Embedding Yioop in an Existing Site</h3>
     <p>One use-case for Yioop is to serve search results for your
     existing site. There are three common ways to do this: (1)
     On your site have a web-form or links with your installation of Yioop
@@ -2983,7 +3080,7 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     access methods (2) or (3) and don't want users to be able to access the
     Yioop search results via its built in web form. We will now spend a moment
     to look at each of these access methods in more detail...</p>
-    <h3>Accessing Yioop via a Web Form</h3>
+    <h4>Accessing Yioop via a Web Form</h4>
     <p>A very minimal code snippet for such a
     form would be:</p>
     <pre>
@@ -3026,12 +3123,12 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     web page that is returned; URL is the url of the web page you want to look
     up in the cache.
     </p>
-    <h3>Accessing Yioop and getting and OpenSearch RSS Response</h3>
-    <p>The same basic urls as above can return RSS results simply by appending
-    to the end of the them &ampf=rss. This of course only makes sense for
-    usual and related url queries -- cache queries return web-pages not
-    a list of search results. Here is an example of what a portion of an RSS
-    result might look like:</p>
+    <h4>Accessing Yioop and getting an OpenSearch RSS or JSON Response</h4>
+    <p>The same basic urls as above can return RSS or JSON results simply by
+    appending &amp;f=rss or &amp;f=json to the end of them. This of course
+    only makes sense for usual and related url queries -- cache queries return
+    web-pages not a list of search results. Here is an example of what a portion
+    of an RSS result might look like:</p>
 <pre>
 &lt;?xml version="1.0" encoding="UTF-8" ?&gt;
 &lt;rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
@@ -3069,7 +3166,7 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     <p>Notice the opensearch: tags tell us the totalResults, startIndex and
     itemsPerPage. The opensearch:Query tag tells us what the search terms
     were.</p>
-    <h3>Accessing Yioop via the Function API</h3>
+    <h4>Accessing Yioop via the Function API</h4>
     <p>The last way we will consider to get search results out of Yioop is
     via its function API. The Yioop Function API consists of the following
     three methods in controllers/search_controller.php :
@@ -3093,7 +3190,8 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     these methods as well as how to extract results from what is returned
     can be found in the file examples/search_api.php .</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='customizing'>Customizing Yioop</h2>
+    <h2 id="advanced-topics">Advanced Topics</h2>
+    <h3 id='customizing-code'>Modifying Yioop Code</h3>
     <p>One advantage of an open-source project is that you have complete
     access to the source code. Thus, Yioop can be modified to fit in
     with your existing project. You can also freely add new features onto
@@ -3105,7 +3203,7 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     as look at the <a href="http://www.seekquarry.com/yioop-docs/">online
     Yioop code documentation</a>.</p>

-    <h3>Handling new File Types</h3>
+    <h4>Handling new File Types</h4>
     <p>One relatively easy enhancement to Yioop is to improve
     the way it processes an existing file type or to get it to process
     new file types. Yioop was written from scratch without dependencies
@@ -3136,7 +3234,7 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     <p>If your processor is cool, only relies on code you wrote, and you
     want to contribute it back to Yioop, please feel free to
     e-mail it to chris@pollett.org .</p>
-    <h3>Writing an Indexing Plugin</h3>
+    <h4>Writing an Indexing Plugin</h4>
     <p>An indexing plugin provides a way that an advanced end-user
     can extend the indexing capabilities of Yioop. Bundled with
     Yioop is an example recipe indexing plugin which
@@ -3208,7 +3306,7 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     </pre>
     <p>This completes the discussion of how to write an indexing plugin.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='commandline'>Yioop Command-line Tools</h2>
+    <h3 id='commandline'>Yioop Command-line Tools</h3>
     <p>In addition to <a href="#token_tool">token_tool.php</a> which we
     describe in the section on localization, Yioop comes with several useful
     command-line tools and utilities. We next describe these in roughly
@@ -3221,11 +3319,14 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     WebArchiveBundle's and IndexArchiveBundles's</a></li>
     <li><a href="#query_tool">bin/query_tool.php: Used to query an index from
     the command-line</a></li>
-    <li><a href="#code_tool">bin/code _tool.php: Used to help code Yioop
+    <li><a href="#code_tool">bin/code_tool.php: Used to help code Yioop
     and to help make clean patches for Yioop.</a>
+    <li><a href="#classifier_tool">bin/classifier_tool.php: Used to make Yioop
+    a Yioop classifier from the command line rather than using the GUI
+    interface.</a>
     </li>
     </ul>
-    <h3 id="configure_tool">Configuring Yioop from the Command-line</h3>
+    <h4 id="configure_tool">Configuring Yioop from the Command-line</h4>
     <p>In a multiple queue server and fetcher setting, one might have web access
     only to the name server machine -- all the other machines might be on
     virtual private servers to which one has only command-line access. Hence,
@@ -3279,8 +3380,8 @@ Please choose an option:
     subsearch is unchecked. This means the RSS feeds won't be downloaded
     hourly on such machines. If one unchecks this, they can be.
     </p>
-    <h3 id="arc_tool">Examining the contents of WebArchiveBundle's and
-    IndexArchiveBundles's</h3>
+    <h4 id="arc_tool">Examining the contents of WebArchiveBundle's and
+    IndexArchiveBundles's</h4>
     <p>
     The command-line script bin/arc_tool.php can be used to examine the
     contents of a WebArchiveBundle or an IndexArchiveBundle. This tool gives
@@ -3397,7 +3498,7 @@ still has more than one tier (tiers are the result of incremental
 log-merges which are made during the crawling process). The
 mergetiers command merges these tiers into one large tier which is
then usable by Yioop for query processing.</p>
-    <h3 id="query_tool">Querying an Index from the command-line</h3>
+    <h4 id="query_tool">Querying an Index from the command-line</h4>
 <p>The command-line script bin/query_tool.php can be used to query
 indices in the Yioop WORK_DIRECTORY/cache. This tool can be used
 on an index regardless of whether or not Apache is running. It can be
@@ -3451,7 +3552,7 @@ all of the Yioop meta words should work so you can do queries like
 "my_query i:timestamp_of_index_want". Query results depend on the
 kind of language stemmer/char-gramming being used, so French results might be
better if one specifies fr-FR than if one relies on the default en-US.</p>
-<h3 id="code_tool"> A Tool for Coding and Making Patches for Yioop</h3>
+<h4 id="code_tool"> A Tool for Coding and Making Patches for Yioop</h4>
 <p>bin/code_tool.php can perform several useful tasks to help developers
 program for the Yioop environment. Below is a brief summary of its
 functionality:</p>
@@ -3492,6 +3593,49 @@ php code_tool.php replace path pattern replace_string effect</dt><dd>
     Prints all lines matching the regular expression pattern in the
     folder or file path.</dd>
 </dl>
+<h4 id='classifier_tool'>A Command-line Tool for Making Yioop Classifiers</h4>
+<p>bin/classifier_tool.php is used to automate the building and testing of
+classifiers, providing an alternative to the web interface when a labeled
+training set is available.
+</p>
+<p>
+classifier_tool.php takes an activity to perform, the name of a dataset to use,
+and a label for the constructed classifier. The activity is the name of one
+of the 'run*' functions implemented by the tool's class, without the common 'run'
+prefix (e.g., 'TrainAndTest'). The dataset is specified as the common prefix
+of two indexes that have the suffixes "Pos" and "Neg", respectively.  So if
+the prefix were "DATASET", then this tool would look for the two existing
+indexes "DATASET Pos" and "DATASET Neg" from which to draw positive and
+negative examples. Each document in these indexes should be a positive or
+negative example of the target class, according to whether it's in the "Pos"
+or "Neg" index. Finally, the label is just the label to be used for the
+constructed classifier.
+</p>
+<p>
+Beyond these options (set with the -a, -d, and -l flags), a number of other
+options may be set to alter parameters used by an activity or a classifier.
+These options are set using the -S, -I, -F, and -B flags, which correspond
+to string, integer, float, and boolean parameters respectively. These flags
+may be used repeatedly, and each expects an argument of the form NAME=VALUE,
+where NAME is the name of a parameter, and VALUE is a value parsed according
+to the flag. The NAME should match one of the keys of the options member of
+the tool's class, where a period ('.') may be used to specify nesting.  For
+example:
+</p>
+<pre>
+    -I debug=1         # set the debug level to 1
+    -B cls.use_nb=0    # tell the classifier not to use Naive Bayes
+</pre>
+<p>
+To build and evaluate a classifier for the label 'spam', trained using the
+two indexes "DATASET Neg" and "DATASET Pos", and a maximum of the top 25
+most informative features, one could run:
+</p>
+<pre>
+php bin/classifier_tool.php -a TrainAndTest -d 'DATASET' -l 'spam'
+    -I cls.chi2.max=25
+</pre>
+
     <h2 id="references">References</h2>
     <dl>
 <dt id="APC2003">[APC2003]</dt>
@@ -3521,7 +3665,7 @@ In: Seventh International World-Wide Web Conference
 (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.</dd>
 <dt id='BCC2010'>[BCC2010]</dt>
 <dd>S. Büttcher, C. L. A. Clarke, and G. V. Cormack.
-<a href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307"
+<a href="http://mitpress.mit.edu/books/information-retrieval"
 >Information Retrieval: Implementing and Evaluating Search Engines</a>.
 MIT Press. 2010.</dd>
 <dt id="DG2004">[DG2004]</dt>
@@ -3534,6 +3678,13 @@ OSDI'04: Sixth Symposium on Operating System Design and Implementation. 2004<dd>
 <a href="http://research.google.com/archive/gfs-sosp2003.pdf
 ">The Google File System</a>.
 19th ACM Symposium on Operating Systems Principles. 2003.</dd>
+<dt id='GLM2007'>[GLM2007]</dt>
+<dd>
+A. Genkin, D. Lewis, and D. Madigan. <a
+href="http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf"
+>Large-scale Bayesian logistic regression for text categorization</a>.
+Technometrics. Volume 49. Issue 3. pp. 291--304, 2007.
+</dd>
 <dt id='H2002'>[H2002]</dt>
 <dd>T. Haveliwala.
 <a href="
@@ -3570,6 +3721,14 @@ Cambridge University Press. 2008.</dd>
 <a href="http://iwaw.europarchive.org/04/Mohr.pdf"
 >Introduction to Heritrix, an archival quality web crawler</a>.
 4th International Web Archiving Workshop. 2004. </dd>
+<dt id="PTSHVC2011">[PTSHVC2011]</dt>
+<dd>Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon,
+Jeffrey Scott Vitter, Sabrina Chandrasekaran.
+<a href="http://www.cs.nthu.edu.tw/~wkhon/papers/PTSHVC11.pdf">Inverted indexes
+for phrases and strings</a>. Proceedings of the
34th Annual International ACM SIGIR Conference on Research
+and Development in Information Retrieval. pp 555--564. 2011.
+</dd>
 <dt id='P1997a'>[P1997a]</dt>
 <dd>J. Peek.
 Summary of the talk: <a href="
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index c474ed2..fce2c38 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -1,16 +1,43 @@
+<div>
 <h1>Downloads</h1>
 <h2>Yioop Releases</h2>
 <p>The two most recent versions of Yioop are:</p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/a=archive&amp;p=yioop
+&amp;h=f2c1e5fa9ee3dab2fe3d614c8fb07ee14982037e&amp;
+hb=18bba06ecc0804809bc494e0cc532d8ec69ab227&amp;t=zip"
+    >Version 0.96-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&amp;p=yioop
 &amp;h=714e33c174a3201c0b35118df05faeaccf71c34a&amp;
 hb=ba6ab2a825d58af3fa7465ae26bdc9e292a49468&amp;t=zip"
     >Version 0.941-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop
-&amp;h=da73fb8ad24ba67201a3cccaa6290d711f505ef3&amp;
-hb=fb79c4c0b11379bee3b8c4c803f9f938a9001c16&amp;t=zip"
-    >Version 0.921-ZIP</a></li>
 </ul>
+<h2 id='contribute'>Show Your Support</h2>
+<p>SeekQuarry, LLC is a company owned by Chris Pollett,
+the principal developer of Yioop. If you like Yioop and would
+like to show support for this project, please
+consider making a contribution.</p>
+<div>
+<form action="https://www.paypal.com/cgi-bin/webscr" method="post"
+target="_top" style="float:left; margin-left:1.5in; margin-right:1.5in;">
+<div class="center">Paypal</div>
+<input type="hidden" name="cmd" value="_s-xclick" />
+<input type="hidden" name="hosted_button_id" value="3B94XKR9GTPNG" />
+<input type="image"
+    src="resources/btn_donateCC_LG.gif"
+    style="border:0" name="submit"
+    alt="PayPal - The safer, easier way to pay online!" />
+</form>
+<div>
+<p style="text-indent:0.2in">Flattr</p>
+<p style="position:relative;top:-0.1in"><a
+href="http://flattr.com/thing/1671104/SeekQuarryYioop"
+target="_blank"><img src="resources/flattr-badge-large.png"
+alt="Flattr this" title="Flattr this" border="0" /></a></p>
+</div>
+
+</div>
+
 <h2>Installation</h2>
 <p>The <a href="?c=main&amp;p=install">Install Guides</a>
 explain how to get Yioop to work in some common settings.
@@ -27,7 +54,18 @@ your old Yioop Installation. See the Installation section above for links
 to instructions on this, if you have forgotten how you did this.
Knowing the old Work Directory location should
 allow Yioop to complete the upgrade process.</p>
-<h2>Git Repository / Contributing</h2>
+<h2 id='consulting'>Consulting Services</h2>
+<p>
+Consulting services are available for Yioop. These can involve
+help with installing, upgrading, or tuning Yioop. They can also
+involve paying for new features to be added to the next iteration of Yioop
+or for customizations not to be included in the main code base.</p>
+<p>Please
+<a href="mailto:chris@pollett.org">contact us for a quote</a> with a
+brief description of the services you need.
+</p>
+
+<h2>Git Repository / Contributing Code</h2>
 <p>The Yioop git repository allows anonymous read-only access. If you would
 like to contribute to Yioop, just do a clone of the most recent code,
 make your changes, do a pull, and make a patch. For example, to clone the
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index 7686569..76f4152 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -1,16 +1,17 @@
 <h1>Open Source Search Engine Software!</h1>
 <p>SeekQuarry is the parent site for <a href="http://www.yioop.com/">Yioop</a>.
 Yioop is a <a href="http://gplv3.fsf.org/">GPLv3</a>, open source, PHP search
-engine. Yioop can be configured as either a general purpose
-search engine for the whole web or it can be configured to provide search
-results for a set of urls or domains.
+engine.
 </p>
 <h2>Goals</h2>
 <p>Yioop was designed with the following goals in mind:</p>
 <ul>
 <li><b>Make it easier to obtain personal crawls of the web.</b> Only a web
 server such as Apache and PHP 5.3 or better is needed. Configuration can be
-done using a GUI interface.</li>
+done using a GUI interface. Yioop can be configured as either a general purpose
+search engine for the whole web or it can be configured to provide search
+results for a set of urls or domains. It can crawl a variety of
+file formats, and can be used as a news feed crawler.</li>
 <li><b>Support distributed crawling of the web, if desired.</b> To download
 many web pages quickly, it is useful to have more than one machine when crawling
 the web. If you have several machines at home, simply install the software
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index a1c7a65..e4dd424 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -27,16 +27,19 @@
     has all the terms will make it into the top ten. To keep things simple
     we will assume that the query is being performed on a single Yioop
     index rather than a crawl mix of several indexes. We will also ignore
-    how news feed search items get incorporated into results.
+    how news feed items get incorporated into results.
     </p>
-    <p>At its heart, Yioop currently relies on three main scores
+    <p>At its heart, Yioop relies on three main scores
     for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox).
     Proximity scores are only used if the query has two or more terms.
     We will describe later how these three scores are calculated.
     For now one can think that the Doc Rank roughly indicates how important
     the document as a whole is, Relevance measures how important the search
     terms are to the document, and Proximity measures how close the search terms
-    appear to each other on the document.
+    appear to each other on the document. In addition to these three basic
+    scores, a user might select, when they perform a crawl, that a
+    classifier be used for ranking purposes. After our initial discussion,
+    we will say how we incorporate classifier scores.
     </p>
     <p>
     On a given query, Yioop does not scan its whole posting lists to find
@@ -64,7 +67,19 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     Yioop computes the top ten of
     these `n` documents with respect to `mbox(RRF)(d)` and returns these
     documents.</p>
-    <p> To get a feeling for how the `mbox(RRF)(d)` formula works, consider some
+    <p>It is relatively straightforward to extend the `mbox(RRF)(d)` formula
+    to handle scores coming from classifiers: One just adds additional
+    reciprocal terms for each classifier score. For example, if
+    `mbox(CL)_1, ..., mbox(CL)_n` were the scores from the classifiers being used
+    for ranking, then the formula would become:</p>
+<p class="center">
+`mbox(RRF)(d) := 200(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} +
+frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
+    mbox(Rank)_(mbox(Prox))(d)} +
+    sum_{i=1}^n frac{1}{59 + mbox(Rank)_(mbox(CL)_i)(d)}).`
+</p>
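+<p>As a quick illustrative sketch (not Yioop's actual code), the extended
+formula above could be computed from an array of ranks as:</p>
+<pre>
+&lt;?php
+// $ranks holds a document's rank under Doc Rank, Relevance,
+// Proximity, and each classifier used for ranking
+function reciprocalRankFusion($ranks)
+{
+    $score = 0;
+    foreach ($ranks as $rank) {
+        $score += 1 / (59 + $rank);
+    }
+    return 200 * $score;
+}
+// ranked first on all three base scores and one classifier:
+echo reciprocalRankFusion(array(1, 1, 1, 1)); // 200*(4/60) = 13.33...
+</pre>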
+    <p> To get a feeling for how the `mbox(RRF)(d)` formula works, let's
+    return to the non-classifiers case and consider some
     particular example situations:
     If a document ranked 1 with respect to each score, then
     `mbox(RRF)(d) = 200(3/(59+1)) = 10`.  If a document
@@ -130,8 +145,8 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     <p> Let's examine the fetcher's role in determining what terms get
     indexed, and hence, what documents can be retrieved using those
     terms. After receiving a batch of urls, the fetcher downloads pages in
-    batches of a hundred pages at a time. When the fetcher requests a URL for
-    download it sends a range request header asking for the first
+    batches of a hundred pages at a time. When the fetcher requests a
+    URL for download it sends a range request header asking for the first
     PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Only the data in these
     bytes has any chance of becoming terms which are indexed. The reason for
     choosing a fixed, relatively small size is so that one can index a large
@@ -142,7 +157,8 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     so after receiving the page, the fetcher discards any data after the first
     PAGE_RANGE_REQUEST many bytes -- this data won't be indexed. Constants
     that we mention such as PAGE_RANGE_REQUEST can be found in
-    configs/config.php .
+    configs/config.php. This particular constant can actually be set from
+    the admin panel under the Page Options - Crawl Time tab.
     For each page in the batch of a hundred urls downloaded, the
     fetcher proceeds through a sequence of processing steps to:</p>
     <ol>
@@ -150,6 +166,8 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     <li>Use the page processor to extract a summary for the document.</li>
     <li>Apply any indexing plugins for the page processor to generate
     auxiliary summaries and/or modify the extracted summary.</li>
+    <li>Run classifiers on the summary and add any class labels and rank
+    scores.</li>
+    </li>
     <li>Calculate a hash from the downloaded page minus tags and
     non-word characters to be used for deduplication.</li>
     <li>Prune the number of links extracted from the document down to
@@ -192,7 +210,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     href="?c=main&p=documentation#page-options">Page Options Section</a>
     of the Yioop documentation. Before describing how the
     "mini-inverted index" processing step is done, let's examine
-    Steps 1,2, and 5 above in a little more detail as they are very important
+    Steps 1, 2, and 6 above in a little more detail as they are very important
     in determining what actually is indexed. Based usually on
     the HTTP headers, a
     <a href="http://en.wikipedia.org/wiki/Internet_media_type">mimetype</a>
@@ -277,8 +295,8 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     useful text means that the link is more likely to be helpful to find
     the document.</p>
     <p>
-    Now that we have finished discussing Steps 1,2, and 5, let's describe what
-    happens when building a mini-inverted index. For the four to- five hundred
+    Now that we have finished discussing Steps 1, 2, and 6, let's describe what
+    happens when building a mini-inverted index. For the four to five hundred
     summaries that we have at the start of the mini-inverted index
     step, we make associative arrays of the form:
     </p>
@@ -292,16 +310,24 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
               ...)
     ...
     </pre>
-    <p>Term IDs are 8 byte strings consisting of the XOR of the two halves
-    of the 16 byte md5 hash of the term. Summary map numbers are
-    offsets into a table which can be used to look up a summary. These
-    numbers are in increasing order of when the page was put into the
-    mini-inverted index. To calculate a position of a term, the
+    <p>Term IDs are 20 byte strings. A term might represent a single
+    word or might represent a phrase. The first 8 bytes of a term ID are the
+    first 8 bytes of the md5 hash of the term's first word.
+    The next byte is used to indicate whether the term is a word or a phrase.
+    If it is a word, the remaining bytes are used to encode what kind of page
+    the word occurs on (media:text, media:image, ... safe:true, safe:false, and
+    some classifier labels if relevant). If it is a phrase, the remaining
+    bytes encode various length hashes of the remaining words in the
+    phrase. Summary map numbers are offsets into a table which can be used to
+    look up a summary. These numbers are in increasing order of when the page
+    was put into the mini-inverted index. To calculate a position of a term, the
     summary is viewed as a single string consisting of
-    terms extracted from the url concatenated with the summary title
+    words extracted from the url concatenated with the summary title
     concatenated with the summary description. One counts
-    the number of terms from the start of this string. For example, suppose
-    we had two summaries:</p>
+    the number of words from the start of this string. Phrases start at the
+    position of their first word. Let's consider the case where we only
+    have words and no phrases and we are ignoring the meta word info
+    such as media: and safe:. Then suppose we had two summaries:</p>
     <pre>
     Summary 1:
     URL: http://test.yioop.com/
@@ -310,7 +336,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +

     Summary 2: http://test.yioop2.com/
     Title: Troll Story
-    Description: Once there was a lazy troll, P&amp;A, who lived on my
+    Description: Once there was a lazy troll, P&amp;A, who lived on my
         discussion board.
     </pre>
     <p>The mini-inverted index might look like:</p>
@@ -347,23 +373,62 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     are stemmed when put into the mini-inverted index.
     Also, observe acronyms, abbreviations, emails, and urls, such as
     P&amp;A, will be manipulated before being put into the index. For
-    some Asian languages such as Chinese where spaces might not be placed
+    some languages such as Japanese where spaces might not be placed
     between words, char-gramming is done instead. If two character
     char-gramming is used, the string:
-    您要不要吃? becomes 您要 要不 不要 要吃 吃? A user query 要不要 will,
-    before look-up, be converted to the conjunctive query 要不 不要 and so
-    would match a document containing 您要不要吃? Yioop can also be
-    <a href="?c=main&p=documentation#token_tool">configured to make use of a
-    Bloom filter</a> containing n-word grams for a language. This is typically
-    done for n-word grams coming from Wikipedia page titles. So for example,
-    if the document had "Rolling Stones" beginning at the position 7. This
-    would be recognized as an n-word gram in such a Bloom filter and
-    three terms would be extracted [roll stone] at position 7, [roll] at
-    position 7, and [stone] at position 8. In this way, a query for just
-    roll will match this document, as will one for just stone. On the other
-    hand, a query for rolling stones will also match and will make use of
-    the position list for [roll stone], so only documents with these two
-    terms adjacent would be returned.
+     源氏物語 (Tale of Genji) becomes 源氏 氏物 物語. A user query 源氏物 will,
+    before look-up, be converted to the conjunctive query 源氏 氏物 and so
+    would match a document containing 源氏物語.
+    </p>
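+    <p>A minimal sketch (not Yioop's actual implementation) of such two
+    character char-gramming:</p>
+    <pre>
+&lt;?php
+function charGrams($text, $n = 2)
+{
+    $grams = array();
+    $len = mb_strlen($text, "UTF-8");
+    for ($i = 0; $len - $i >= $n; $i++) {
+        $grams[] = mb_substr($text, $i, $n, "UTF-8");
+    }
+    return $grams;
+}
+print_r(charGrams("源氏物語")); // 源氏, 氏物, 物語
+    </pre>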
+    <p>The effect of the meta word
+    portion of a term ID in the single word term case is to split the space of
+    documents containing a word like "dog" into disjoint subsets. This can
+    be used to speed up queries like "dog media:image", "dog media:video".
+    The media tag for a page can only be one of media:text, media:image,
+    media:video; it can't be more than one. A query of just "dog" will
+    actually be calculated as a disjoint union of the fixed, finitely many
+    single word term IDs which begin with the same 8 byte hash as "dog".
+    A query of "dog media:image" will look up all term IDs with the
+    same "dog" hash and "media:image" hash portion of the term ID. These
+    term IDs will correspond to disjoint sets of documents which are
+    processed in order of doc offset.</p>
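+    <p>As a small illustration of the hash prefix involved (the exact
+    packing of the remaining 12 bytes is omitted here), the 8 byte word
+    hash described above could be computed as:</p>
+    <pre>
+&lt;?php
+// first 8 bytes of the raw 16 byte md5 hash of a word
+$prefix = substr(md5("dog", true), 0, 8);
+echo bin2hex($prefix), "\n";
+    </pre>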
+    <p>Term IDs for phrases are used to speed up queries in the case
+    of multi-word queries. On a query like "earthquake soccer", Yioop uses
+    these term IDs to see how many documents have this exact phrase. If this
+    is greater than a threshold (10), Yioop just does an exact phrase
+    look up using these term IDs. If the number of query words is greater than
+    five, Yioop always uses this mechanism to do look up. Yioop does not store
+    phrase term IDs for every phrase it has ever found on some document in its
+    index. Instead, it follows the basic approach of
+    [<a href="#PTSHVC2011">PTSHVC2011</a>]. The main difference is that it
+    stores data directly in its inverted index rather than using their two ID
+    approach. To get the idea of this approach, consider the stemmed
+    document:
+    </p>
+    <pre>
+jack be nimbl jack be quick jack jump the candlestick
+    </pre>
+    <p>The words that immediately follow each occurrence of "jack be" (nimbl,
+    quick) in this document are not all the same. Phrases with this property
+    are called <b>maximal</b>. The whole document
+    "jack be nimbl jack be quick jack jump the candlestick"
+    is also maximal and there is no prefix of it larger than "jack be" which
+    is maximal. We would call this string <b>conditionally maximal</b>
+    for "jack be". When processing a document, Yioop builds a
+    <a href="http://en.wikipedia.org/wiki/Suffix_tree">suffix tree</a> for
+    it in linear time using Ukkonen's algorithm [<a href="#U1995">U1995</a>].
+    It uses this tree to quickly build a list of maximal phrases of up to
+    12 words and any prefixes for which they are conditionally maximal.
+    Only such maximal phrases will be given term IDs and stored in the index.
+    The term ID for such a phrase begins with the 8 byte hash of the prefix
+    for which it is maximal. This is followed by hashes of various lengths
+    for the remaining terms. The format used is specified in the
+    documentation of utility.php's crawlHashPath function. To do an exact
+    lookup of a phrase like "jack be nimbl", it suffices to look up
+    phrase term IDs which have their first 8 bytes either the hash of
+    "jack", "jack be", or "jack be nimbl". Yioop only uses phrase term IDs
+    for lookup of documents, not for calculations like proximity, where it uses
+    the actual words that make up the phrase to get a score.
     </p>
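+    <p>To make the notion of maximality concrete, here is a naive quadratic
+    sketch that finds maximal phrases of a word list directly from the
+    definition (it ignores the document-end case, under which the whole
+    document above also counts as maximal); Yioop instead gets this
+    information in linear time from the suffix tree:</p>
+    <pre>
+&lt;?php
+function maximalPhrases($words, $max_len = 12)
+{
+    $n = count($words);
+    $maximal = array();
+    for ($start = 0; $start &lt; $n; $start++) {
+        $limit = min($max_len, $n - $start);
+        for ($len = 1; $len &lt;= $limit; $len++) {
+            $phrase = array_slice($words, $start, $len);
+            $followers = array();
+            // collect the word after each occurrence of the phrase
+            for ($i = 0; $i + $len &lt; $n; $i++) {
+                if (array_slice($words, $i, $len) == $phrase) {
+                    $followers[] = $words[$i + $len];
+                }
+            }
+            if (count(array_unique($followers)) > 1) {
+                $maximal[implode(" ", $phrase)] = true;
+            }
+        }
+    }
+    return array_keys($maximal);
+}
+$doc = "jack be nimbl jack be quick jack jump the candlestick";
+print_r(maximalPhrases(explode(" ", $doc))); // includes "jack be"
+    </pre>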
     <p>It should be recalled that links are treated as their own little
     documents and so will be treated as separate documents when making the
@@ -422,8 +487,8 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     <dt>posting_doc_shards</dt><dd>This contains a sequence of
     inverted index files, shardNUM, called IndexShard's. shardX holds the
     postings lists for the Xth block of NUM_DOCS_PER_GENERATION many
-    summaries. NUM_DOCS_PER_GENERATION default to 50000 if the queue server is
-    on a machine with at least 1Gb of memory. shardX also has postings for the
+    summaries. NUM_DOCS_PER_GENERATION defaults to 40000 if the queue server is
+    on a machine with at least 2Gb of memory. shardX also has postings for the
     link documents that were acquired while acquiring these summaries.</dd>
     <dt>generation.txt</dt><dd>Contains a serialized PHP object which
     says what is the active shard -- the X such that shardX will receive
@@ -511,7 +576,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     use a simple queue. This would yield roughly a breadth-first traversal of
     the web starting from the seed sites. Since high quality pages are often a
     small number of hops from any page on the web, there is some evidence
-    [<a href="NW2001">NW2001</a>] that this lazy strategy is not too
+    [<a href="#NW2001">NW2001</a>] that this lazy strategy is not too
     bad for crawling  according to document importance. However, there
     are better strategies. When Page Importance is chosen in the
     Crawl Order dropdown for a Yioop crawl, the Scheduler on each queue server
@@ -888,6 +953,15 @@ results.</p>
 In: Proceedings of the 12th international conference on World Wide Web.
 pp. 280-290. 2003.
 </dd>
+
+<dt id="BY2008">[BY2008]</dt>
+<dd>A. M. Z. Bidoki and Nasser Yazdani.
+<a href="http://goanna.cs.rmit.edu.au/~aht/tiger/DistanceRank.pdf"
+>DistanceRank: An intelligent ranking algorithm for web pages</a>.
+Information Processing and Management. Vol. 44. Iss. 2. pp. 877--892.
+March, 2008.
+</dd>
+
 <dt id='BP1998'>[BP1998]</dt>
 <dd>Brin, S. and Page, L.
 <a  href="http://infolab.stanford.edu/~backrub/google.html"
@@ -919,13 +993,7 @@ and Development in Information Retrieval. pp.758--759. 2009.
 ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
 </dd>

-<dt id="BY2008">[BY2008]</dt>
-<dd>A. M. Z. Bidoki and Nasser Yazdani.
-<a href="http://goanna.cs.rmit.edu.au/~aht/tiger/DistanceRank.pdf"
->DistanceRank: An intelligent ranking algorithm for web pages</a>.
-Information Processing and Management. Vol. 44. Iss. 2. pp. 877--892.
-March, 2008.
-</dd>
+

 <dt id="NW2001">[NW2001]</dt>
 <dd>Marc Najork and Janet L. Wiener.
@@ -936,6 +1004,21 @@ Proceedings of the 10th international conference on World Wide Web.
 pp 114--118. 2001.
 </dd>

+<dt id="PTSHVC2011">[PTSHVC2011]</dt>
+<dd>Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon,
+Jeffrey Scott Vitter, Sabrina Chandrasekaran.
+<a href="http://www.cs.nthu.edu.tw/~wkhon/papers/PTSHVC11.pdf">Inverted indexes
+for phrases and strings</a>. Proceedings of the
+34th Annual International ACM SIGIR Conference on Research
+and Development in Information Retrieval. pp 555--564. 2011.
+</dd>
+
+<dt id="U1995">[U1995]</dt>
+<dd>Ukkonen, E. <a
+href="http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf">On-line
+construction of suffix trees</a>.
+Algorithmica. Vol. 14. Iss. 3. pp. 249--260. 1995.</dd>
+
 <dt id="VLZ2012">[VLZ2012]</dt>
 <dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
 <a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"