Version 0.98 of documentation

Chris Pollett [2013-12-01 18:Dec:st]
Filename
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/ranking.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 831f931..4441fe5 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop Documentation v 0.96</h1>
+<h1>Yioop Documentation v 0.98</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
     <li><a href="#overview"><b>Overview</b></a>
@@ -25,7 +25,7 @@
         <li><a href="#settings-signin">Settings and Signin</a></li>
         <li><a href="#mobile">Mobile Interface</a></li>
         <li><a href="#passwords">Managing Accounts</a></li>
-        <li><a href="#userroles">Managing Users and Roles</a></li>
+        <li><a href="#userrolegroups">Managing Users, Roles, and Groups</a></li>
     </ul>
     </li>
     <li><a href="#crawl-results"><b>Crawling and Customizing Results</b></a>
@@ -58,7 +58,7 @@
 <h2 id="overview">Overview</h2>
     <h3 id="quick">Getting Started</h3>
     <p>This document serves as a detailed reference for the
-    Yioop search engine. If you want to get started using Yioop now,
+    Yioop search engine. If you want to get started using Yioop now,
     you probably want to first read the
     <a href="?c=main&p=install">Installation
     Guides</a> page. If you cannot find your particular machine configuration
@@ -509,6 +509,7 @@
     supports anchor tags with rel="nofollow"
     attributes. It also supports X-Robots-Tag HTTP headers.</li>
     <li>Yioop has its own DNS caching mechanism.</li>
+    <li>Yioop supports crawling TOR networks (.onion urls).</li>
     <li>Yioop supports crawl quotas for web sites. I.e., one can control
     the number of urls/hour downloaded from a site.</li>
     <li>Yioop can detect website congestion and slow down crawling
@@ -609,6 +610,11 @@ files and both of these must be changed.</p>
 <pre>
 ini_set("memory_limit","1800M");
 </pre>
+    <p>For the index.php file, you may need to set the limit as well in
+    your php.ini file for the instance of PHP used by your web server. If
+    the value is too low for the index.php Web app you might see messages
+    in the Fetcher logs that begin with:
+    "Trouble sending to the scheduler at url..."</p>
     <p>
     Often in a VM setting these requirements are somewhat steep. It is possible
     to get Yioop to work in environments like EC2 (be aware this might
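Reviewer sketch for the memory_limit note in the hunk above: a minimal way to check whether a given PHP instance meets the 1800M figure. The function names memoryLimitToBytes and memoryLimitAtLeast are hypothetical helpers, not part of Yioop. Run it through the web server (not only the command line) to see the value that index.php actually gets.

```php
<?php
// Convert a php.ini shorthand size such as "1800M" to bytes.
// The special value "-1" (no limit) is returned as -1.
function memoryLimitToBytes($value)
{
    $value = trim($value);
    if ($value === "-1") {
        return -1;
    }
    $number = (int) $value; // leading digits of the string
    switch (strtoupper(substr($value, -1))) {
        case "G":
            return $number * 1024 * 1024 * 1024;
        case "M":
            return $number * 1024 * 1024;
        case "K":
            return $number * 1024;
        default:
            return $number;
    }
}

// True if $limit is unlimited or at least as large as $needed.
function memoryLimitAtLeast($limit, $needed)
{
    $limit_bytes = memoryLimitToBytes($limit);
    return $limit_bytes === -1 ||
        $limit_bytes >= memoryLimitToBytes($needed);
}

// Report on the PHP instance that is running this script.
echo memoryLimitAtLeast(ini_get("memory_limit"), "1800M")
    ? "memory_limit looks sufficient\n"
    : "memory_limit is below 1800M\n";
```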
@@ -989,7 +995,7 @@ calls the render methods of the current View, and finally outputs scripts and
 the necessary closing document tags.
 </dd>
 </dl>
-In addition, to the Yioop application folder, Yioop makes use of a
+In addition to the Yioop application folder, Yioop makes use of a
 WORK DIRECTORY. The location of this directory is set during the configuration
 of a Yioop installation. Yioop stores crawls, and other data local
 to a particular Yioop installation in files and folders in this directory.
@@ -1483,17 +1489,26 @@ Yioop from a mobile device.
 <img src='resources/ChangePassword.png' alt='Change Password Form'/>

     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h3 id='userroles'>Managing Users and Roles</h3>
-    <p>The manage user and manage role activities have similar looking
-    forms. The Manage User activity looks like:</p>
+    <h3 id='userrolegroups'>Managing Users, Roles, and Groups</h3>
+    <p>The manage user, manage group, and manage role activities have similar
+    looking forms as well as related functions. Users are people who have
+    accounts to connect with a Yioop installation. Users, once logged in,
+    may engage in various Yioop activities such as Manage Crawls, Mix Crawls,
+    and so on. A user is not directly assigned which activities they have
+    permissions on. Instead, they derive their permissions from which roles
+    they have been directly assigned and by which groups they belong to.
+    The Manage User activity looks
+    like:</p>
 <img src='resources/ManageUser.png' alt='The Manage User form'/>
-    <p>As one can see this activity has three forms associated with it.
+    <p>As one can see this activity has three forms associated with it:
     The first form can be used to add a new user with a given password
     to the Yioop system. The second form allows existing users to be deleted.
     The last form allows one to add roles to or delete roles from an existing
     user. Here the word "role" means a set of activities.
-    Adding a role to an user means allows that
-    user when signed in to the admin panel can access that activity. Roles
+    Adding a role to a user allows that
+    user when signed in to the admin panel to carry out any activity
+    in the role.</p>
+    <p>Roles
     are managed through the Manage Role activity, which looks like:</p>
 <img src='resources/ManageRole.png' alt='The Manage Role form'/>
    <p>
@@ -1501,6 +1516,24 @@ Yioop from a mobile device.
    an existing role, and finally choose an existing role and add or delete
    activities from it.
    </p>
+    <p>Groups are collections of users that have access to a set of roles.
+    Groups are managed through the Manage Groups activity which looks like:</p>
+<img src='resources/ManageGroups.png' alt='The Manage Group form'/>
+    <p>The first two forms in this activity allow one to create an empty group
+    and to delete a group. Selecting a group in the View Groups drop down
+    displays two more drop downs: Add User, which allows one to
+    add more users to the selected group; and Add Role which allows one to add
+    roles to the group. As users are added to the group they appear in
+    a table below the Add User dropdown. By default, the group is populated
+    with at least the name of the user who created the group. This user
+    is the so-called admin user -- the only user with permissions to
+    change group members. Next to the admin user is a Transfer Admin
+    link that lets the admin give the admin privileges to someone else.
+    Next to other users added to a group is a delete link which can be
+    used to remove them from the group. As roles are added to the group
+    they appear in a table beneath the Add Role dropdown. Again, there
+    is a Delete link listed next to each added role which can be used to
+    delete that role.</p>

     <p><a href="#toc">Return to table of contents</a>.</p>

@@ -1681,7 +1714,10 @@ php fetcher.php start 5
     >MediaWiki xml dump</a>,
     <a href="http://rdf.dmoz.org/"
     >Open Directory Project RDF file</a>, . In the next subsection, we describe
-    new web crawls and then return to archive crawls subsection after that.</p>
+    new web crawls and then return to archive crawls subsection after that.
+    Finally, we have a short section on some advanced crawl options which can
+    only be set in config.php or local_config.php. You will probably not need
+    these features but we mention them for completeness.</p>
     <h5>Web Crawl Options</h5>
     <p>
     On the web crawl tab, the first form field, "Get Crawl Options From",
@@ -1714,7 +1750,9 @@ php fetcher.php start 5
     good robots.txt file, but will ban you from interacting with their site
     if they get too much traffic from you. The Seed sites textarea allows
     you to specify a list of urls that the crawl should start from. The
-    crawl will begin using these urls.
+    crawl will begin using these urls. This list can include ".onion" urls
+    if you want to crawl <a href="http://en.wikipedia.org/wiki/Tor_network"
+    >TOR networks</a>.
     </p>
     <p>
     The format for sites, domains, and urls are the same for each of these
@@ -1889,7 +1927,9 @@ column_separator = '##';
     where the arc_type is TextArchiveBundle. First, suppose we wanted
     to index access log file records that look like:</p>
     <pre>
-127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST /git/yioop2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; YioopBot; +http://localhost/git/yioop/bot.php)"
+127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST /git/yioop2/ HTTP/1.1" 200 - \
+    "-" "Mozilla/5.0 (compatible; YioopBot; \
+    +http://localhost/git/yioop/bot.php)"
     </pre>
     <p>Here each record is delimited by a newline and the character encoding is
     UTF-8. The records are stored in
@@ -1967,7 +2007,34 @@ encoding = "ASCII";
     .arc.gz; for MediaWiki, files of extension .xml.bz2;
     and for ODP-RDF, files of extension .rdf.u8.gz .
     </p>
-
+    <h5>Crawl Options of config.php or local_config.php</h5>
+    <p>There are a couple of flags which can be set in the config.php
+    or in a local_config.php file that affect web crawling which we now
+    mention for completeness. As was mentioned before, when Yioop is crawling
+    it makes use of Etag: and Expires: HTTP headers received during web page
+    download to determine when a page can be recrawled. This assumes
+    one has not completely turned off recrawling under the
+    <a href="#page-options">Page
+    Indexing and Search Options activity</a>. To turn Etag and Expires
+    checking off, one can add to a local_config.php file the line:
+    </p>
+    <pre>
+define("USE_ETAG_EXPIRES", false);
+    </pre>
+    <p>Yioop can be run using the <a
+    href="https://github.com/facebook/hhvm/">Hip Hop Virtual Machine from
+    Facebook</a>. This will tend to make Yioop run faster and use less memory
+    than running it under the standard PHP interpreter. Hip Hop can be used on
+    various Linux flavors and to some degree runs under OSX (the queue server
+    and fetcher will run, but the web app doesn't). If you want to use
+    Hip Hop on Mac OSX, and if you install it via Homebrew,
+    then you will need to set a force variable and set the path for Hip Hop in
+    your local_config.php file with lines like:</p>
+    <pre>
+define('FORCE_HHVM', true);
+define('HHVM_PATH', '/usr/local/bin');
+    </pre>
+    <p>The above lines are only needed on OSX to run Hip Hop.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h3 id='mixes'>Mixing Crawl Indexes</h3>
     <p>Once you have performed a few crawls with Yioop, you can use the Mix
@@ -2273,16 +2340,39 @@ encoding = "ASCII";
     </p>
     <p>
     The Indexing Plugins checkboxes allow you to select which plugins
-    to use during the crawl. For instance,
-    clicking the RecipePlugin checkbox would cause Yioop to run the code
-    in indexing_plugins/recipe_plugin.php. This code tries to detect pages
+    to use during the crawl. Yioop comes with two built-in plugins:
+    a WordFilterPlugin and RecipePlugin. One can also write or download
+    additional plugins. If the plugin can be configured,
+    next to the checkbox will be a link to a configuration screen. For example,
+    clicking the RecipePlugin checkbox causes Yioop during a crawl to run the
+    code in indexing_plugins/recipe_plugin.php. This code tries to detect pages
     which are food recipes and separately extracts these recipes and clusters
-    them by ingredient. The extract recipe pages is done by the pageProcessing
-    callback in the RecipePlugin class of recipe_plugin.php; the clustering
-    is done in RecipePlugin's postProcessing method. The first method is
-    called by Yioop for each active plugin on each page downloaded. The second
-    method is called during the stop crawl process of Yioop.
+    them by ingredient. It then adds search meta words ingredient: and
+    recipe:all to allow one to search recipes by ingredient or only documents
+    containing recipes. Checking the WordFilterPlugin causes Yioop to run
+    code in indexing_plugins/wordfilter_plugin.php on each downloaded page.
+    This code checks if the downloaded page has one of the words listed
+    in the textarea one finds on the plugin's configure page. If it does,
+    then the plugin follows the actions listed for pages that contain that
+    term. Below is an example WordFilterPlugin configure page:
     </p>
+    <img src="resources/WordFilterConfigure.png"
+        alt="Word Filter Configure Page" />
+    <p>Each line in the textarea consists of a word followed by a colon
+    followed by a comma separated list of what to do if that word is seen.
+    The line term0:NOTCONTAIN,JUSTFOLLOW says that if the downloaded page
+    does not contain the word "term0" then do not index the page, but do
+    follow outgoing links from the page. The line term1:NOPROCESS says
+    if the document has the word "term1" then do not index it or follow links
+    from it. The last line term2:NOFOLLOW,NOSNIPPET says if the
+    page contains "term2" then do not follow any outgoing links. NOSNIPPET
+    means that if the page is returned from search results, the link to
+    the page should not have a snippet of text from that page beneath it.
+    In addition to the commands just mentioned, WordFilterPlugin supports
+    standard robots meta tag directives such as: NOINDEX, NOCACHE,
+    NOARCHIVE, NOODP, NOYDIR, and NONE. More details about how indexing
+    plugins work and how to write your own indexing plugin can be
+    found in the <a href="#customizing-code">Modifying Yioop</a> section.</p>
     <h4 id='extraction'>Page Field Extraction Language</h4>
     <p>We now return to the Page Field Extraction Rules textarea of
     the Page Options - Crawl Time tab. Commands
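Reviewer sketch for the WordFilterPlugin rule format added in the hunk above. This is a minimal illustration, assuming the semantics stated in the new text: a rule applies when its term occurs on the page, or, if NOTCONTAIN is listed, when the term is absent. The functions parseFilterRules and actionsForPage are hypothetical and are not taken from wordfilter_plugin.php. Matching on lowercased whole words is also an assumption.

```php
<?php
// Parse textarea lines such as "term0:NOTCONTAIN,JUSTFOLLOW" into
// an array mapping each term to its list of actions.
function parseFilterRules($text)
{
    $rules = array();
    foreach (preg_split('/\R/', $text) as $line) {
        $parts = explode(":", trim($line), 2);
        if (count($parts) < 2 || $parts[0] === "") {
            continue; // skip blank or malformed lines
        }
        $actions = array_map("trim", explode(",", strtoupper($parts[1])));
        $actions = array_filter($actions, function ($action) {
            return $action !== "";
        });
        $rules[strtolower($parts[0])] = array_values($actions);
    }
    return $rules;
}

// Return the actions that apply to $page_text. A rule fires when its
// term is present, or, if NOTCONTAIN is listed, when it is absent.
function actionsForPage($rules, $page_text)
{
    $words = preg_split('/\W+/u', strtolower($page_text), -1,
        PREG_SPLIT_NO_EMPTY);
    $fired = array();
    foreach ($rules as $term => $actions) {
        $present = in_array((string) $term, $words, true);
        $negated = in_array("NOTCONTAIN", $actions, true);
        if ($present !== $negated) {
            foreach ($actions as $action) {
                if ($action !== "NOTCONTAIN") {
                    $fired[$action] = true;
                }
            }
        }
    }
    return array_keys($fired);
}

// The three example rules from the documentation text.
$rules = parseFilterRules("term0:NOTCONTAIN,JUSTFOLLOW\n" .
    "term1:NOPROCESS\nterm2:NOFOLLOW,NOSNIPPET");
print_r(actionsForPage($rules, "A page mentioning term0 and term2."));
```

How the returned action names map onto indexing and link-following behavior is left to the plugin; the sketch only covers parsing and rule selection.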
@@ -2727,9 +2817,9 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
     var $view = array("search");
     </pre>
     <p>
-    Then Yioop would first look for a file: APP_DIR/models/search_view.php
+    Then Yioop would first look for a file: APP_DIR/views/search_view.php
     to include, if it cannot find such a file then it tries to include
-    BASE_DIR/models/search_view.php. So to change the behavior of an existing
+    BASE_DIR/views/search_view.php. So to change the behavior of an existing
     BASE_DIR file one just has a modified copy of the file in the appropriate
     place in your APP_DIR. This holds in general for other program files
     such as views and plugins. It doesn't hold for resources such as images --
@@ -3238,9 +3328,12 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     <h4>Writing an Indexing Plugin</h4>
     <p>An indexing plugin provides a way that an advanced end-user
     can extend the indexing capabilities of Yioop. Bundled with
-    Yioop is an example recipe indexing plugin which
-    can serve as a guide for writing your own plugin. It is
-    found in the folder lib/indexing_plugins. This recipe
+    Yioop are two example indexing plugins. These are found in the
+    lib/indexing_plugins folder. If you decide to write your own plugin or
+    want to install a third-party plugin you can put it in the folder:
+    WORK_DIRECTORY/app/lib/indexing_plugins. The recipe indexing plugin
+    can serve as a guide for writing your own plugin if you don't need
+    your plugin to have a configure screen.  The recipe
     plugin is used to detect food recipes which occur on pages during a crawl.
     It creates "micro-documents" associated with found recipes. These
     are stored in the index during the crawl under the meta-word "recipe:all".
@@ -3258,12 +3351,13 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     written using indexing plugins. To make your own plugin, you
     would need to write a subclass of the class IndexingPlugin with a
     file name of the form mypluginname_plugin.php. Then you would need
-    to put this file in the folder lib/indexing_plugins. In the file
-    configs/config.php you would need to add the string "mypluginname" to
-    the array $INDEXING_PLUGINS. To properly subclass IndexingPlugin,
-    your class needs to implement four methods:
+    to put this file in the folder WORK_DIRECTORY/app/lib/indexing_plugins.
+    RecipePlugin subclasses IndexingPlugin and implements
+    the following four methods:
     pageProcessing($page, $url), postProcessing($index_name),
-    getProcessors(), getAdditionalMetaWords(). If your plugin needs
+    getProcessors(), getAdditionalMetaWords(), overriding their default
+    behavior of returning NULL. We explain what each of these
+    is for in a moment. If your plugin needs
     to use any page processor or model classes, you should modify the
     $processors and $model instance array variables of your plugin to
     list the ones you need. During a web crawl, after a fetcher has downloaded
@@ -3305,7 +3399,58 @@ var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
     array("recipe:" => HtmlProcessor::MAX_DESCRIPTION_LEN,
             "ingredient:" => HtmlProcessor::MAX_DESCRIPTION_LEN);
     </pre>
-    <p>This completes the discussion of how to write an indexing plugin.</p>
+    <p>The WordFilterPlugin illustrates how one can write an indexing
+    plugin with a configure screen. It overrides the base class'
+    pageSummaryProcessing(&$summary) and getProcessors() methods as well as
+    implements the methods saveConfiguration($configuration),
+    loadConfiguration(),
+    setConfiguration($configuration), configureHandler(&$data), and
+    configureView(&$data). The purpose of getProcessors() was already
+    mentioned under the recipe plugin description above.
+    pageSummaryProcessing(&$summary) is called by a page processor after
+    a page has been processed and a summary generated. WordFilterPlugin
+    uses this callback to check if the title or the description in
+    this summary has any of the words the filter is filtering for and if
+    so takes the appropriate action. loadConfiguration(),
+    saveConfiguration($configuration), and setConfiguration($configuration)
+    are three methods to handle persistence for any plugin data that the user
+    can change. The first two operate on the name server, the last might
+    operate on a queue_server or a fetcher. loadConfiguration() is called
+    by configureHandler(&$data) to read in any current configuration,
+    unserialize it and modify it according to any data sent by the user.
+    saveConfiguration($configuration) would then be called by
+    configureHandler(&$data) to
+    serialize and write any $configuration data that needs to be
+    stored by the plugin. For WordFilterPlugin, a list of filter terms
+    and actions are what is saved by saveConfiguration($configuration) and
+    loaded by loadConfiguration. When a crawl is started or when a fetcher
+    contacts the name server, plugin configuration data is sent by the name
+    server. The method setConfiguration($configuration) is used to initialize
+    the local copy of a fetcher's or queue_server's process with the
+    configuration settings from the name server. For WordFilterPlugin,
+    the filter terms and actions are stored in a field variable by this
+    function.</p>
+    <p>As has already been hinted at by the configuration discussion above,
+    configureHandler(&$data) plays the role of a controller for an
+    index plugin. It is in fact called by the AdminController activity
+    pageOptions if the configure link for a plugin has been
+    clicked. In addition to managing the load and save configuration process,
+    it also sets up any data needed by configureView(&$data).
+    For WordFilterPlugin, this involves setting a variable $data["filter_words"]
+    so that configureView(&$data) has access to a list of filter words
+    and actions to draw. Finally, the last method of the WordFilterPlugin
+    we describe, configureView(&$data), outputs using $data
+    the HTML that will be seen in the configure screen. This HTML will
+    appear in a div tag on the final page. It is initially styled so that
+    it is not displayed. Clicking on the configure link will cause the
+    div tag data to be displayed in a light box in the center of the screen.
+    For WordFilterPlugin, this method draws a title and textarea form with the
+    currently filtered terms in it. It makes use of Yioop's tl() functions so
+    that the text of the title can be localized to different languages.
+    This form has hidden fields c=admin, a=pageOptions, and
+    option-type=crawl_time, so that the AdminController will know to call
+    pageOptions, and pageOptions will know in turn to give the plugin's
+    configureHandler method a chance to handle this data.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h3 id='commandline'>Yioop Command-line Tools</h3>
     <p>In addition to <a href="#token_tool">token_tool.php</a> which we
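Reviewer sketch for the configuration methods described in the hunk above. This is a minimal stand-alone illustration of the load, save, and set pattern, and it is not the WordFilterPlugin source. The class SketchFilterPlugin, its file-based storage, and the posted_rules form field are all hypothetical. Only the method names and the $data["filter_words"] key come from the documentation text.

```php
<?php
// Hypothetical stand-in for a configurable plugin. It does not extend
// Yioop's IndexingPlugin, and the storage location is an assumption.
class SketchFilterPlugin
{
    // Path of the file holding the serialized configuration
    public $config_file;
    // Local copy of the settings used while processing pages
    public $filter_rules = array();

    public function __construct($config_file)
    {
        $this->config_file = $config_file;
    }

    // Read and unserialize the stored configuration (empty if none yet)
    public function loadConfiguration()
    {
        if (!file_exists($this->config_file)) {
            return array();
        }
        $data = unserialize(file_get_contents($this->config_file));
        return is_array($data) ? $data : array();
    }

    // Serialize and write the configuration
    public function saveConfiguration($configuration)
    {
        file_put_contents($this->config_file, serialize($configuration));
    }

    // Initialize this process's local copy from settings sent to it
    public function setConfiguration($configuration)
    {
        $this->filter_rules = $configuration;
    }

    // Controller step: load, apply user-sent data, save, and prepare
    // $data for the view
    public function configureHandler(&$data)
    {
        $configuration = $this->loadConfiguration();
        if (isset($data["posted_rules"])) { // hypothetical form field
            $configuration = $data["posted_rules"];
            $this->saveConfiguration($configuration);
        }
        $data["filter_words"] = $configuration;
    }
}

// Example: save rules on the configuring side, then initialize a worker.
$plugin = new SketchFilterPlugin(sys_get_temp_dir() . "/sketch_filter.txt");
$data = array("posted_rules" => array("term1" => array("NOPROCESS")));
$plugin->configureHandler($data);
$plugin->setConfiguration($plugin->loadConfiguration());
print_r($plugin->filter_rules);
```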
@@ -3384,36 +3529,47 @@ Please choose an option:
     <h4 id="arc_tool">Examining the contents of WebArchiveBundle's and
     IndexArchiveBundles's</h4>
     <p>
-    The command-line script bin/arc_tool.php can be used to examine the
-    contents of a WebArchiveBundle or an IndexArchiveBundle. This tool gives
-    a print out of the web pages or summaries contained in such bundles. It can
-    also be used to give information from the headers of these bundles. Finally,
-    it can be used to re-index an IndexArchiveBundle's dictionary based
-    on the contents of the partial dictionaries in each of the bundles
-    posting_doc_shards. arc_tool is run from the command-line with the syntaxes:
+    The command-line script bin/arc_tool.php can be used to examine and
+    manipulate the contents of a WebArchiveBundle or an IndexArchiveBundle.
+    Below is a summary of the different command-line uses of
+    arc_tool.php:
     </p>
-    <pre>
-php arc_tool.php info bundle_name //return info about
-//documents stored in archive.
+    <dl>
+<dt>php arc_tool.php dict bundle_name word</dt>
+    <dd>returns index dictionary records for word stored in index archive
+    bundle.</dd>
+
+<dt>php arc_tool.php info bundle_name</dt>
+    <dd>returns info about documents stored in archive.</dd>

-php arc_tool.php list //returns a list
-//of all the archives in the Yioop crawl directory, including
-//non-Yioop archives in the cache/archives sub-folder.
+<dt>php arc_tool.php list</dt>
+    <dd>returns a list of all the archives in the Yioop crawl directory,
+       including non-Yioop archives in the /archives sub-folder.</dd>
+
+<dt>php arc_tool.php mergetiers bundle_name max_tier</dt>
+    <dd>merges tiers of word dictionary into one tier up to max_tier</dd>
+
+<dt>php arc_tool.php posting bundle_name generation offset<br />
+&nbsp;&nbsp;&nbsp;&nbsp;or<br />
+php arc_tool.php posting bundle_name generation offset num</dt>
+    <dd>returns info about the posting (num many postings) in bundle_name at
+       the given generation and offset</dd>

-php arc_tool.php mergetiers bundle_name max_tier
-//merges tiers of word dictionary into one tier up to max_tier
+<dt>php arc_tool.php rebuild bundle_name</dt>
+    <dd>Re-extracts words from summaries files in bundle_name into index shards
+        then builds a new dictionary</dd>

-php arc_tool.php reindex bundle_name
-//reindex the word dictionary in bundle_name
+<dt>php arc_tool.php reindex bundle_name</dt>
+    <dd>Reindex the word dictionary in bundle_name using existing index shards</dd>

-php arc_tool.php shard bundle_name generation
-//Prints information about the number of words and frequencies of words
-// within the generation'th index shard in the bundle
+<dt>php arc_tool.php shard bundle_name generation</dt>
+    <dd>Prints information about the number of words and frequencies of words
+       within the generation'th index shard in the bundle</dd>

-php arc_tool.php show bundle_name start num //outputs
-//items start through num from bundle_name
-//or name of non-Yioop archive crawl folder.
-   </pre>
+<dt>php arc_tool.php show bundle_name start num</dt>
+    <dd>outputs items start through num from bundle_name or name of
+       non-Yioop archive crawl folder</dd>
+   </dl>
    <p>The bundle name can be a full path name, a relative path from
    the current directory, or it can be just the bundle directory's file
    name in which case WORK_DIRECTORY/cache will be assumed to be the
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index c17049e..ac8857d 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -4,13 +4,13 @@
 <p>The two most recent versions of Yioop are:</p>
 <ul>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&amp;
+p=yioop&amp;h=bf48e34ffa1fc707b13107a4981e4ef9c6952048&amp;
+hb=f5fe1a90318da6cf242484f15bb3661b02d64fca&amp;t=zip"
+    >Version 0.98-ZIP</a></li>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&amp;
 p=yioop&amp;h=71864bbd75c0a877c10e97840e9b628a6c7ad416&amp;
 hb=f32456c48a7f9d2f03374ae37f695fd9492191a5&amp;t=zip"
     >Version 0.961-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&amp;p=yioop
-&amp;h=714e33c174a3201c0b35118df05faeaccf71c34a&amp;
-hb=ba6ab2a825d58af3fa7465ae26bdc9e292a49468&amp;t=zip"
-    >Version 0.941-ZIP</a></li>
 </ul>
 <h2 id='contribute'>Show Your Support</h2>
 <p>Seekquarry, LLC is a company owned by Chris Pollett,
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 9741843..72ce53f 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -20,7 +20,9 @@
     A typical query to Yioop is a collection of terms without the use
     of the OR operator, '|', or the use of the exact match operator, double
     quotes. On such a query, called a <b>conjunctive query</b>,
-    Yioop tries to return documents which contain all of the query terms.
+    Yioop tries to return documents which contain all of the query terms.
+    If the sequence of words is particularly common, Yioop will try to return
+    results which have that string with the same word order.
     Yioop further tries to return these documents in descending order of score.
     Most users only look at the first ten of the results returned. This article
     tries to explain the different factors which influence whether a page that
@@ -44,10 +46,12 @@
     </p>
     On a given query, Yioop does not scan its whole posting lists to find
     every document that satisfies the query. Instead, it scans until it finds
-    a fixed number of documents, say `n`, satisfying the query. It then
-    computes the three scores for each of these `n` documents. For a document
-    `d` from these `n` documents, it determines the rank of `d` with respect to
-    the Doc Rank score, the rank of `d` with respect to the Relevance score,
+    a fixed number of documents, say `n`, satisfying the query or until
+    a timeout is exceeded. In the case of a timeout, `n` is just the number
+    of documents found by the timeout. It then computes the three scores for
+    each of these `n` documents. For a document `d` from these `n` documents, it
+    determines the rank of `d` with respect to the Doc Rank score, the rank
+    of `d` with respect to the Relevance score,
     and the rank of `d` with respect
     to the Proximity score. It finally computes a score for each  of these
     `n` documents using these three rankings and
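Reviewer sketch for the rank combination described in the hunk above. It assumes the combination is a sum of 1 / (59 + rank) terms, as the formula fragment in the next hunk header suggests, with ranks starting at 1 for the best document. Any scaling factor in the real formula is omitted. The function names and the sample scores are hypothetical.

```php
<?php
// Rank documents (1 = best) by one score, highest score first.
function ranksByScore($scores)
{
    arsort($scores); // sort descending, keeping document keys
    $ranks = array();
    $rank = 1;
    foreach ($scores as $doc => $score) {
        $ranks[$doc] = $rank++;
    }
    return $ranks;
}

// Combine several score lists by summing 1 / (59 + rank) per document.
// Returns document => fused score, best document first.
function fuseRankings($score_lists)
{
    $fused = array();
    foreach ($score_lists as $scores) {
        foreach (ranksByScore($scores) as $doc => $rank) {
            if (!isset($fused[$doc])) {
                $fused[$doc] = 0.0;
            }
            $fused[$doc] += 1.0 / (59 + $rank);
        }
    }
    arsort($fused);
    return $fused;
}

// Made-up Doc Rank, Relevance, and Proximity scores for three documents.
$fused = fuseRankings(array(
    array("a" => 9.1, "b" => 7.5, "c" => 3.0),
    array("a" => 0.2, "b" => 0.9, "c" => 0.4),
    array("a" => 1.0, "b" => 0.8, "c" => 0.1),
));
print_r($fused);
```

Because only ranks enter the sum, the three scores never need to be on a common scale.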
@@ -392,16 +396,24 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
     same "dog" hash and "media:image" hash portion of the term ID. These
     term IDs will correspond to disjoint sets of documents which are
     process in order of doc offset.</p>
-    <p>Term IDs for phrases are used to speed up queries in the case
-    of multi-word queries. On a query like "earthquake soccer", Yioop uses
-    these term IDs to see how many documents have this exact phrase. If this
-    is greater than a threshold (10), Yioop just does an exact phrase
-    look up using these term IDs. If the number of query words is greater than
-    five, Yioop always uses this mechanism to do look up. Yioop does not store
+    <p>In the worst case, a conjunctive query takes time proportional
+    to the shortest posting list. To try to get a better guarantee on the
+    runtime of queries, Yioop tries to use term IDs for phrases to
+    speed up queries in the case of multi-word queries. On a query like
+    "earthquake soccer", Yioop uses these term IDs to see how many documents
+    have this exact phrase. If this is greater than a threshold (10), Yioop
+    just does an exact phrase look up using these term IDs. If the number of
+    query words is greater than three, Yioop always uses this mechanism to do
+    look up. If the threshold is not met, Yioop checks if the threshold is met
+    by all but the last word, or by all but the first word. If so, it does
+    the simpler conjunctive query of the phrase plus the single word.
+    </p>
+    <p>
+    Yioop does not store
     phrase term IDs for every phrase it has ever found on some document in its
     index. Instead, it follows the basic approach of
-    [<a href="#PTSHVC2011">PTSHVC2011</a>]. The main difference is that it
-    stores data directly in its inverted index rather than their two ID
+    [<a href="#PTSHVC2011">PTSHVC2011</a>]. The main difference is that it
+    stores data directly in its inverted index rather than their two ID
     approach. To get the idea of this approach, consider the stemmed
     document:
     </p>
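Reviewer sketch for the phrase lookup decision described in the hunk above. It is a minimal illustration under stated assumptions: $count_docs is a hypothetical callable that reports how many documents contain the given words as an exact phrase, and falling back to a plain conjunctive query when no sub-phrase meets the threshold is an assumption. The threshold of 10 and the more-than-three-words rule come from the text.

```php
<?php
// Decide how to look up a multi-word query. Returns an array with a
// "type" (single, phrase, or conjunctive) and the "terms" to look up.
function choosePhraseStrategy($words, $count_docs, $threshold = 10)
{
    $num_words = count($words);
    if ($num_words < 2) {
        return array("type" => "single", "terms" => $words);
    }
    // Long queries, or phrases that are common enough, use phrase lookup
    if ($num_words > 3 || $count_docs($words) > $threshold) {
        return array("type" => "phrase",
            "terms" => array(implode(" ", $words)));
    }
    // Try the phrase made of all but the last word, plus the last word
    $all_but_last = array_slice($words, 0, -1);
    if ($count_docs($all_but_last) > $threshold) {
        return array("type" => "conjunctive", "terms" =>
            array(implode(" ", $all_but_last), $words[$num_words - 1]));
    }
    // Try the first word, plus the phrase made of all but the first word
    $all_but_first = array_slice($words, 1);
    if ($count_docs($all_but_first) > $threshold) {
        return array("type" => "conjunctive", "terms" =>
            array($words[0], implode(" ", $all_but_first)));
    }
    // Assumed fallback: ordinary conjunctive query over single words
    return array("type" => "conjunctive", "terms" => $words);
}

// Made-up phrase counts standing in for an index lookup.
$counts = array("earthquake soccer" => 3, "san francisco" => 500);
$count_docs = function ($words) use ($counts) {
    $phrase = implode(" ", $words);
    return isset($counts[$phrase]) ? $counts[$phrase] : 0;
};
print_r(choosePhraseStrategy(
    array("san", "francisco", "earthquake"), $count_docs));
```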
@@ -743,11 +755,16 @@ that we did after generating summaries to extract terms.</li>
 </ol>
 <p>After going through the above steps, Yioop builds an iterator
 object from the resulting terms to iterate over summaries and link
-entries that contain all of the terms. In the single queue server setting
+entries that contain all of the terms. As described in the section
+<a href="#fetchers">Fetchers and their Effect on Search Ranking</a>, some or all
+of these terms might be whole phrases to reduce the need for computing expensive
+conjunctive queries. In the single queue server setting
 one iterator would be built for each term and these iterators
 would be added to an intersect iterator that would return documents
-on which all the terms appear. These iterators are then fed into
-a grouping iterator, which groups links and summaries that refer
+on which all the terms appear. This intersect iterator has a timer associated
+with it to prevent it from running too long in the case of a conjunctive query
+of terms with long posting lists with small intersection. These iterators are
+then fed into a grouping iterator, which groups links and summaries that refer
 to the same document url. Recall that after downloading pages on the fetcher,
 we calculated a hash from the downloaded page minus tags. Documents
 with the same hash are also grouped together by the group iterator.
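Reviewer sketch for the timed intersect iterator mentioned in the hunk above. This is a minimal illustration of intersecting sorted posting lists under a time budget, and it is not Yioop's iterator code. The function intersectWithTimeout is hypothetical, and posting lists are simplified to arrays of document offsets in ascending order. The deadline is checked once per candidate document, so a very long skip inside a single list is not interrupted.

```php
<?php
// Intersect posting lists (arrays of doc offsets sorted ascending),
// returning whatever was found once $max_seconds have elapsed.
function intersectWithTimeout($posting_lists, $max_seconds)
{
    $posting_lists = array_values($posting_lists);
    $results = array();
    if (count($posting_lists) == 0) {
        return $results;
    }
    $deadline = microtime(true) + $max_seconds;
    $positions = array_fill(0, count($posting_lists), 0);
    while (microtime(true) < $deadline) {
        // The largest current offset is the next candidate document
        $target = null;
        foreach ($posting_lists as $i => $list) {
            if ($positions[$i] >= count($list)) {
                return $results; // a list is exhausted
            }
            $current = $list[$positions[$i]];
            if ($target === null || $current > $target) {
                $target = $current;
            }
        }
        // Advance every list that is behind the candidate
        $all_match = true;
        foreach ($posting_lists as $i => $list) {
            while ($positions[$i] < count($list) &&
                $list[$positions[$i]] < $target) {
                $positions[$i]++;
            }
            if ($positions[$i] >= count($list)) {
                return $results;
            }
            if ($list[$positions[$i]] != $target) {
                $all_match = false;
            }
        }
        if ($all_match) {
            $results[] = $target;
            foreach (array_keys($positions) as $i) {
                $positions[$i]++;
            }
        }
    }
    return $results; // timed out: partial results only
}

print_r(intersectWithTimeout(
    array(array(1, 3, 5, 7), array(3, 4, 5, 8), array(0, 3, 5, 9)), 5.0));
```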