
Chris Pollett [2011-07-30]
Version 0.70 of the documentation, a=chris
Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index d9bca05..f486d48 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -31,5 +31,15 @@ Several people helped
 with localization: Mary Pollett, Thanh Bui, Youn Kim, Sugi Widjaja,
 Chao-Hsin Shih, Sujata Dongre, and Jonathan Ben-David. Thanks to
 Ravi Dhillon for finding and helping with the fixes for Issue 15
-and Commit 632e46.
+and Commit 632e46. Several of my master's students have done projects
+related to Yioop!: Amith Chandranna, Priya Gangaraju, and Vijaya Pamidi.
+Amith's code related to an online version of the HITS algorithm
+is not currently in the main branch of Yioop!, but it is
+obtainable from <a href="http://www.cs.sjsu.edu/faculty/pollett/masters/<?php
+?>Semesters/Spring10/amith/index.shtml">Amith Chandranna's student page</a>.
+Vijaya developed a Firefox web traffic extension for Yioop!
+Her code is also obtainable from <a href="http://www.cs.sjsu.edu/faculty/<?php
+?>pollett/masters/Semesters/Fall10/vijaya/index.shtml">Vijaya Pamidi's
+master's pages</a>. Priya's code served as the
+basis for the plugin feature currently in Yioop!
 </p>
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index b6a6bc0..382632e 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,18 +1,18 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.68</h1>
+<h1>Yioop! Documentation v 0.70</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
         <li><a href="#required">Requirements</a></li>
         <li><a href="#installation">Installation and Configuration</a></li>
         <li><a href="#files">Summary of Files and Folders</a></li>
-        <li><a href="#interface">The Yioop! Search and User  Interface</a></li>
+        <li><a href="#interface">The Yioop! Search and User Interface</a></li>
         <li><a href="#passwords">Managing Accounts</a></li>
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
         <li><a href="#localizing">Localizing Yioop! to a New Language</a></li>
-        <li><a href="#hacking">Hacking Yioop!</a></li>
+        <li><a href="#hacking">Customizing Yioop!</a></li>
         <li><a href="#references">References</a></li>
     </ul>

@@ -28,7 +28,7 @@
     exist today, how Yioop! fits into this eco-system, and when Yioop!
     might be the right choice for your search engine needs. In the remainder
     of this document after the introduction, we will discuss how to get
-    and install Yioop!, the files and folders used in the Yioop!,
+    and install Yioop!, the files and folders used in Yioop!,
     user, role, and crawl management in the Yioop! system, localization in
     the Yioop! system, and finally hacking Yioop!
     </p>
@@ -37,7 +37,7 @@
     in understanding Yioop! capabilities.</p>
     <p>In 1994, Web Crawler, one of the earliest
     still widely-known search engines, only had an
-    index of about 50,000 pages which were maintained in an Oracle database.
+    index of about 50,000 pages which was stored in an Oracle database.
     Today, databases are still used to create indexes for small to medium size
     sites. An example of such a search engine written in PHP is
     <a href="http://www.sphider.eu/">Sphider</a>. Given that a database is
@@ -90,7 +90,7 @@
     web adjacency matrix to an initial guess of the page ranks. This problem
     naturally decomposes into rounds. Within a round the Google matrix is
     applied to the current page ranks estimates of a set of sites. This
-    operation is reasonable easy to distribute to many machines. Computing how
+    operation is reasonably easy to distribute to many machines. Computing how
     relevant a word is to a document is another
     task that benefits from multi-round, distributed computation. When a document
     is processed by indexers on multiple machines, words are extracted and a
@@ -192,7 +192,10 @@
     deduplication: It might be the case that the pages of many different URLs
     have essentially the same content. Yioop! creates a hash of the web page
     content of each downloaded url. Amongst urls with the same hash only the
-    one that is linked to the most will be returned after grouping.
+    one that is linked to the most will be returned after grouping. Finally,
+    if a user wants to do more sophisticated post-processing such as clustering
+    or computing page rank, Yioop! supports a straightforward architecture
+    for indexing plugins.
     </p>
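    <p>The following is a minimal sketch of the kind of hash-based grouping
    just described; it is not Yioop!'s actual code, and the page array
    fields and the use of md5() are assumptions made for the example.</p>
<pre>
// group pages by a hash of their content; for each hash keep only the
// page with the most in-links (field names here are hypothetical)
function dedup_by_content_hash($pages)
{
    $best = array();
    foreach ($pages as $page) {
        $hash = md5($page["content"]);
        if (!isset($best[$hash]) ||
            $page["num_inlinks"] > $best[$hash]["num_inlinks"]) {
            $best[$hash] = $page;
        }
    }
    return array_values($best);
}
</pre>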
     <p>
     There are several open source crawlers which can scale to crawls in the
@@ -235,8 +238,8 @@
     format</a> are often used by TREC conferences to store test data sets such
     as <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
     <a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">ClueWeb Dataset</a>.
-    In addition, it is used by <a href="http://grub.org/">grub.org</a>, a
-    distributed, open-source, search engine project in C#.
+    In addition, it was used by grub.org (hopefully only on a
+    temporary hiatus), a distributed, open-source search engine project in C#.
     Another important format for archiving web pages is the XML format used by
     <a href="http://www.wikipedia.org/">Wikipedia</a> for archiving MediaWiki
     wikis. Wikipedia offers <a
@@ -280,11 +283,17 @@
     deploy.</li>
     <li>It determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
+    <li>Since version 0.70, Yioop! indexes are positional rather than
+    bag-of-words indexes, and an index compression scheme called Modified9
+    is used.</li>
     <li>Yioop! supports a GUI interface which makes
     it easy to combine results from several crawl indexes to create unique
     result presentations.</li>
     <li>Indexing occurs as crawling happens, so when a crawl is stopped,
     it is ready to be used to handle search queries immediately.</li>
+    <li>Yioop! supports an indexing plugin architecture to make it
+    possible to write one's own indexing modules that do further
+    post-processing.</li>
     <li>Yioop! has a GUI form that allows users to specify meta words
     to be injected into an index based on whether a downloaded document matches
     a url pattern.</li>
@@ -347,7 +356,7 @@ extension=php_curl.dll
     <h3>Memory Requirements</h3>
     <p>In addition to the prerequisite software listed above, Yioop! also
     has certain memory requirements. By default bin/queue_server.php
-    requires 110MB, bin/fetcher.php requires 800MB, and index.php requires
+    requires 1200MB, bin/fetcher.php requires 800MB, and index.php requires
     200MB. These values are set near the tops of each of these files in turn
     with a line like:</p>
 <pre>
@@ -573,11 +582,20 @@ which is initially copied into the WORK_DIRECTORY to serve as the database
 of allowed users for the Yioop! system.</dd>
 <dt>lib</dt><dd>This folder is short for library. It contains all the common
 classes for things like indexing, storing data to files, parsing urls, etc.
-lib contains two main subfolders: processors and index_bundle_iterators.
-The processors folder contains processors to extract page summaries for
-a variety of different mimetypes. The index_bundle_iterator folder contains
-a variety of iterators useful for iterating over lists of documents
-which might be returned during a query to the search engine.</dd>
+lib contains six subfolders: <i>archive_bundle_iterators</i>,
+<i>compressors</i>, <i>index_bundle_iterators</i>, <i>indexing_plugins</i>,
+<i>processors</i>, and <i>stemmers</i>. The <i>archive_bundle_iterators</i>
+folder has iterators for iterating over the objects of various kinds of
+web archive file formats, such as arc, MediaWiki XML, etc.
+These iterators are used to iterate over such archives during
+a recrawl. The <i>compressors</i> folder contains classes that might be used
+to compress objects in a web_archive. The <i>index_bundle_iterator</i>
+folder contains a variety of iterators useful for iterating over lists of
+documents which might be returned during a query to the search engine.
+The <i>processors</i> folder contains processors to extract page summaries for
+a variety of different mimetypes. The <i>stemmers</i> folder is where word
+stemmers for different languages would appear. Right now only an
+English Porter stemmer is present in this folder.</dd>
 <dt>locale</dt><dd>This folder contains the default locale data which comes
 with the Yioop! system. A locale encapsulates data associated with a
 language and region. A locale is specified by an
@@ -701,25 +719,39 @@ width="70%"/>
 <p>For each result back from the query, the title is a link to the page
 that matches the query term. This is followed by a brief summary of
 that page with the query words bolded. Then the document rank, relevancy,
-and overall scores are listed. Each of these results is a grouped statistic,
-several "micro index entry" are grouped together to create each. So even though
+proximity, and overall scores are listed. Each of these results
+is a grouped statistic: several "micro index entries" are grouped
+together/summed to create each one. So even though
 a given "micro index entry" might have a document rank between 1 and 10, their
-sum could be a larger value. After these scores there are three links:
+sum could be a larger value. Further, the overall score is a
+generalized inner product of the scores of the "micro index entries",
+so the separated scores will not typically sum to the overall score.
+After these scores there are three links:
 Cached, Similar, and InLinks. Clicking on Cached will display Yioop's downloaded
-copy of the page in question. It will list the time of download and highlight
+copy of the page in question. We will describe this in more detail
+in a moment. Clicking on Similar causes Yioop! to locate the five
+words with the highest relevancy scores for that document and then to perform
+a search on those words. Clicking on InLinks will take you to a page
+consisting of all the links that Yioop! found to the document in question.
+Finally, clicking on an IP address link returns all documents that were
+crawled from that IP address.</p>
+<img src='resources/Cache.png' alt='Example Cache Results'
+width="70%"/>
+<p>As the above illustrates, on a cache link click,
+Yioop! will list the time of download and highlight
 the query terms. It should be noted that cached copies of web pages are
 stored on the fetcher which originally downloaded the page. The IndexArchive
 associated with a crawl is stored on the queue server and can be moved
 around to any location by simply moving the folder. However, if an archive
 is moved off the network on which fetcher lives, then the look up of a
-cached page might fail. Clicking on Similar causes Yioop! to locate the five
-words with the highest relevancy scores for that document and then to perform
-a search on those words. Clicking on InLinks will take you to a page
-consisting of all the links that Yioop! found to the document in question.
-Finally, clicking on an IP address link returns all documents that were
-crawled from that IP address.
-</p>
-<p>A basic query to the Yioop! search form is a typically a sequence of
+cached page might fail. On the cached page there is a "Toggle
+extracted summary" link. Clicking this will show the title, summary, and
+links that were extracted from the full page and indexed. No other terms
+on the page could be used to locate the page via a search query. This
+can be viewed as an "SEO" view of the page.</p>
+<img src='resources/CacheSEO.png' alt='Example Cache SEO Results'
+width="70%"/>>
+<p>A basic query to the Yioop! search form is typically a sequence of
 words separated by whitespace. This will cause Yioop! to compute a
 "conjunctive query", it will look up only those documents which contain all of
 the terms listed. Yioop! also supports a variety of other search box
@@ -965,6 +997,41 @@ php fetcher.php stop</pre>
     is not possible to resume the crawl. We have now described what is
     necessary to perform a crawl; we now return to how to set the
     options for how the crawl is conducted.</p>
+    <h3>Common Crawl and Search Configurations</h3>
+    <p>When testing Yioop!, it is quite common just to have one instance
+    of the fetcher and one instance of the queue_server running, both on
+    the same machine. In this subsection we wish to briefly describe some
+    other setups which are possible and also some configs/config.php
+    settings that can affect the crawl and search speed. The most obvious
+    config.php setting which can affect the crawl speed is
+    NUM_MULTI_CURL_PAGES. A fetcher, when performing downloads, opens this
+    many simultaneous connections, gets the pages corresponding to them,
+    processes them, then proceeds to download the next batch of
+    NUM_MULTI_CURL_PAGES pages. Yioop! uses the fact that there are gaps
+    in this loop where no downloading is being done to ensure robots.txt
+    Crawl-delay directives are being honored (a Crawl-delayed host will
+    only be scheduled to at most one fetcher at a time). The downside of this
+    is that your internet connection might not be used to its fullest ability
+    to download pages. Thus, rather than increasing NUM_MULTI_CURL_PAGES,
+    it can make sense to install multiple copies of Yioop! on a machine
+    and run the fetcher program in each to maximize download speeds for that
+    machine. The most general crawl configuration for Yioop! is thus
+    typically a single queue_server and multiple machines each running multiple
+    copies of the fetcher software.
+    </p>
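    <p>As a concrete illustration, a hypothetical excerpt from
    configs/config.php might look as below. The use of define() and the
    value 100 are assumptions made for this sketch; check the config.php
    shipped with your copy of Yioop! for the actual declaration and
    default.</p>
<pre>
/** number of simultaneous connections a fetcher opens per download batch */
define('NUM_MULTI_CURL_PAGES', 100);
</pre>
    <p>Raising this value lets a fetcher keep more downloads in flight at
    once but, as described above, running additional fetcher copies is often
    the safer way to increase throughput.</p>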
+    <p>Once a crawl is complete, one can see its contents in the folder
+    WORK_DIRECTORY/cache/IndexDataUNIX_TIMESTAMP. Putting the WORK_DIRECTORY
+    on a solid-state drive can, as you might expect, greatly speed up how fast
+    search results will be served. Unfortunately, for even a single
+    crawl of ten million or so pages, the corresponding IndexDataUNIX_TIMESTAMP
+    folder might be around 200 GB. Two main sub-folders of
+    IndexDataUNIX_TIMESTAMP largely determine the search performance of
+    Yioop! when handling queries on a crawl. These are the dictionary subfolder
+    and the posting_doc_shards subfolder, where the former has the greater
+    influence. On a ten million page crawl these might be 5GB and 30GB
+    respectively. It is completely possible to copy these subfolders to
+    an SSD and use symlinks to them under the original crawl directory to
+    enhance Yioop!'s search performance.</p>
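    <p>As a rough sketch of the symlinking step just described (the paths
    and timestamp below are made up for the example, and it is assumed the
    two subfolders have already been copied to the SSD):</p>
<pre>
// replace the original dictionary and posting_doc_shards folders with
// symlinks to copies that now live on a solid-state drive
$crawl = "/var/www/yioop_data/cache/IndexData1311699000";
$ssd = "/mnt/ssd/IndexData1311699000";
foreach (array("dictionary", "posting_doc_shards") as $sub) {
    if (is_dir("$ssd/$sub") && is_dir("$crawl/$sub")) {
        rename("$crawl/$sub", "$crawl/$sub.old"); // keep until verified
        symlink("$ssd/$sub", "$crawl/$sub");
    }
}
</pre>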
     <h3>Specifying Crawl Options</h3>
     <p>As we pointed out above, next to the Start Crawl button is an Options
     link. Clicking on this link, should display the following activity:</p>
@@ -1037,7 +1104,7 @@ php fetcher.php stop</pre>
     <p>When configuring a new instance of Yioop! the file default_crawl.ini
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
     for the Options form. </p>
-    <p>The last part of the Edit Crawl Options form allows you to create
+    <p>The next part of the Edit Crawl Options form allows you to create
     user-defined "meta-words". In Yioop! terminology, a meta-word is a word
     which wasn't in a downloaded document, but which is added to the
     inverted-index as if it had been in the document. The addition of
@@ -1056,6 +1123,17 @@ php fetcher.php stop</pre>
     with the document. Meta-words are useful to create shorthands for
     searches on certain kinds of sites like dictionary sites and wikis.
     </p>
+    <p>The last part of the Edit Crawl Options form allows you to select which
+    indexing plugins you would like to use during the crawl. For instance,
+    clicking the RecipePlugin checkbox would cause Yioop! to run the code
+    in indexing_plugins/recipe_plugin.php. This code tries to detect pages
+    which are food recipes and separately extracts these recipes and clusters
+    them by ingredient. The extraction of recipes is done by the pageProcessing
+    callback in the RecipePlugin class of recipe_plugin.php; the clustering
+    is done in RecipePlugin's postProcessing method. The first method is
+    called by Yioop! for each active plugin on each page downloaded. The second
+    method is called during the stop crawl process of Yioop!
+    </p>
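    <p>For reference, the set of plugins Yioop! knows about is given by the
    $INDEXING_PLUGINS array in configs/config.php (see the indexing plugin
    discussion later in this document). A hypothetical excerpt, where the
    exact list in your configuration may differ:</p>
<pre>
/** "recipe" corresponds to lib/indexing_plugins/recipe_plugin.php */
$INDEXING_PLUGINS = array("recipe");
</pre>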
     <h4>Archive Crawl Options</h4>
     <p>We now consider how to do crawls of previously obtained archives.
     From the initial crawl options screen clicking on the Archive Crawl
@@ -1238,7 +1316,7 @@ OdpRdfArchiveBundle
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>

-    <h2 id='hacking'>Hacking Yioop!</h2>
+    <h2 id='hacking'>Customizing Yioop!</h2>
     <p>One advantage of an open-source project is that you have complete
     access to the source code. Thus, you can modify Yioop! to fit in
     with your existing project or feel free to add new features to
@@ -1375,113 +1453,203 @@ ASCII
 &lt;head&gt;
     &lt;base href="http://www.ucanbuyart.com/" /&gt;
    &lt;/pre&gt;
-....
-
+</pre>
+    <h3>Writing an Indexing Plugin</h3>
+    <p>An indexing plugin provides a way that an advanced end-user
+    can extend the indexing capabilities of Yioop! Bundled with
+    Version 0.70 of Yioop! is an example recipe indexing plugin which
+    can serve as a guide for writing your own plugin. It is
+    found in the folder lib/indexing_plugins. This recipe
+    plugin is used to detect food recipes which occur on pages during a crawl.
+    It creates "micro-documents" associated with found recipes. These
+    are stored in the index during the crawl under the meta-word "recipe:all".
+    After the crawl is over, the recipe plugin's postProcessing method is
+    called. It looks up all the documents associated with the word "recipe:all".
+    It extracts ingredients from these and does clustering of recipes
+    based on ingredient. It finally injects new meta-words of the form
+    "ingredient:some_food_ingredient", which can be used to retrieve recipes
+    most closely associated with a given ingredient. As it is written,
+    the recipe plugin assumes that all the recipes can be read into memory
+    in one go, but one could easily imagine reading through the list of
+    recipes in batches that fit in memory.
+    </p>
+    <p>The recipe plugin illustrates the kinds of things that can be
+    written using indexing plugins. To make your own plugin, you
+    would need to write a subclass of the class IndexingPlugin with a
+    file name of the form mypluginname_plugin.php. Then you would need
+    to put this file in the folder lib/indexing_plugins. In the file
+    configs/config.php you would need to add the string "mypluginname" to
+    the array $INDEXING_PLUGINS. To properly subclass IndexingPlugin,
+    your class needs to implement four methods:
+    pageProcessing($page, $url), postProcessing($index_name),
+    getProcessors(), getAdditionalMetaWords(). If your plugin needs
+    to use any page processor or model classes, you should modify the
+    $processors and $model instance array variables of your plugin to
+    list the ones you need. During a web crawl, after a fetcher has downloaded
+    a batch of web pages, it uses a page's mimetype to determine a page
+    processor class to extract summary data from that page. The page processors
+    that Yioop! implements can be found in the folder lib/processors. They
+    have file names of the form someprocessorname_processor.php. As a crawl
+    proceeds, your plugin will typically be called to do further processing
+    of a page after some of these processors have handled it. The static method
+    getProcessors() should return an array of the form array(
+    "someprocessorname1", "someprocessorname2", ...), listing the processors
+    that your plugin will do additional processing of documents for.
+    A page processor has a method handle($page, $url) called by Yioop!
+    with a string $page of a downloaded document and a string $url of where it
+    was downloaded from. This method first calls the process($page, $url)
+    method of the processor to do initial summary extraction and then calls
+    method pageProcessing($page, $url) of each indexing_plugin associated with
+    the given processor. A pageProcessing($page, $url) method is expected
+    to return an array of subdoc arrays found on the given page. Each subdoc
+    array should have a CrawlConstants::TITLE and a CrawlConstants::DESCRIPTION.
+    The handle method of a processor will add to each subdoc the
+    fields: CrawlConstants::LANG, CrawlConstants::LINKS, CrawlConstants::PAGE,
+    CrawlConstants::SUBDOCTYPE. The SUBDOCTYPE is the name of the plugin.
+    The resulting "micro-document" is inserted by Yioop! into the index
+    under the word nameofplugin:all . After the crawl is over, Yioop!
+    will call the postProcessing($index_name) method of each indexing plugin
+    that was in use. Here $index_name is the timestamp of the crawl. Your
+    plugin can do whatever post processing it wants in this method.
+    For example, the recipe plugin does searches of the index and uses
+    the results of these searches to inject new meta-words into the index.
+    In order for Yioop! to be aware of the meta-words you are adding, you
+    need to implement the method getAdditionalMetaWords().
+    Also, the web snippet you might want in the search results for things
+    like recipes might be longer or shorter than a typical result snippet.
+    The getAdditionalMetaWords() method also tells Yioop! this information.
+    For example, for the recipe plugin, getAdditionalMetaWords() returns
+    the associative array:</p>
+    <pre>
+    array("recipe:" => HtmlProcessor::MAX_DESCRIPTION_LEN,
+            "ingredient:" => HtmlProcessor::MAX_DESCRIPTION_LEN);
+    </pre>
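    <p>To tie the pieces above together, here is a rough, hypothetical
    skeleton of such a plugin. Only the four method names, the
    CrawlConstants field names, and HtmlProcessor::MAX_DESCRIPTION_LEN come
    from the description above; the class name, the choice of the html
    processor, the static declarations, and the method bodies are
    placeholder assumptions.</p>
    <pre>
/** save as lib/indexing_plugins/myplugin_plugin.php and add "myplugin"
    to the $INDEXING_PLUGINS array in configs/config.php */
class MypluginPlugin extends IndexingPlugin
{
    /** called on each downloaded page handled by one of the processors
        returned by getProcessors(); returns an array of subdoc arrays */
    function pageProcessing($page, $url)
    {
        $subdocs = array();
        $subdocs[] = array(
            CrawlConstants::TITLE => "a title extracted from $url",
            CrawlConstants::DESCRIPTION => "text extracted from the page"
        );
        return $subdocs;
    }
    /** called once when the crawl is stopped; $index_name is the crawl
        timestamp; a plugin might query the index here and inject new
        meta-words, as the recipe plugin does */
    function postProcessing($index_name)
    {
    }
    /** processors whose documents this plugin further processes */
    static function getProcessors()
    {
        return array("html");
    }
    /** meta-words this plugin adds and the snippet length to use for them */
    static function getAdditionalMetaWords()
    {
        return array("myplugin:" => HtmlProcessor::MAX_DESCRIPTION_LEN);
    }
}
    </pre>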
+    <p>This completes the discussion of how to write an indexing plugin.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>

     <h2 id="references">References</h2>
     <dl>
-<dt id="APC2003">[APC2003]</dt><dd>Serge Abiteboul and Mihai Preda and
-Gregory Cobena.
+<dt id="APC2003">[APC2003]</dt>
+<dd>Serge Abiteboul and Mihai Preda and Gregory Cobena.
 <a href="http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf"
 >Adaptive on-line page importance computation</a>.
 In: Proceedings of the 12th international conference on World Wide Web.
 pp.280-290. 2003.
 </dd>
-<dt id="B1970">[B1970]</dt><dd> Bloom, Burton H. <a
-href="http://dx.doi.org/10.1145%2F362686.362692"
->Space/time trade-offs in hash coding with allowable errors</a>. Communications
-of the ACM Volume 13 Issue 7. pp. 422–426. 1970.
+<dt id="B1970">[B1970]</dt>
+<dd>Bloom, Burton H.
+<a href="http://dx.doi.org/10.1145%2F362686.362692"
+>Space/time trade-offs in hash coding with allowable errors</a>.
+Communications of the ACM Volume 13 Issue 7. pp. 422–426. 1970.
 </dd>
-<dt id="BSV2004">[BSV2004]</dt><dd>
-Paolo Boldi and  Massimo Santini and Sebastiano Vigna. <a
-href="http://vigna.dsi.unimi.it/ftp/papers/ParadoxicalPageRank.pdf"
+<dt id="BSV2004">[BSV2004]</dt>
+<dd>Paolo Boldi and  Massimo Santini and Sebastiano Vigna.
+<a href="http://vigna.dsi.unimi.it/ftp/papers/ParadoxicalPageRank.pdf"
 >Do Your Worst to Make the Best:
 Paradoxical Effects in PageRank Incremental Computations</a>.
 Algorithms and Models for the Web-Graph. pp. 168–180. 2004. </dd>
-<dt id='BP1998'>[BP1998]</dt><dd>Brin, S. and Page, L. <a
-    href="http://infolab.stanford.edu/~backrub/google.html"
+<dt id='BP1998'>[BP1998]</dt>
+<dd>Brin, S. and Page, L.
+<a  href="http://infolab.stanford.edu/~backrub/google.html"
     >The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>.
 In: Seventh International World-Wide Web Conference
 (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.</dd>
-<dt id='BCC2010'>[BCC2010]</dt><dd>S. Büttcher, C. L. A. Clarke,
-and G. V. Cormack.
+<dt id='BCC2010'>[BCC2010]</dt>
+<dd>S. Büttcher, C. L. A. Clarke, and G. V. Cormack.
 <a href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307"
 >Information Retrieval: Implementing and Evaluating Search Engines</a>.
 MIT Press. 2010.</dd>
-<dt id="DG2004">[DG2004]</dt><dd>Jeffrey Dean and Sanjay Ghemawat.
+<dt id="DG2004">[DG2004]</dt>
+<dd>Jeffrey Dean and Sanjay Ghemawat.
 <a href="http://labs.google.com/papers/mapreduce-osdi04.pdf"
 >MapReduce: Simplified Data Processing on Large Clusters</a>.
 OSDI'04: Sixth Symposium on Operating System Design and Implementation. 2004.</dd>
-<dt id="GGL2003">[GGL2003]</dt><dd>Sanjay Ghemawat, Howard Gobioff, and
-Shun-Tak Leung.
-<a href="http://labs.google.com/papers/mapreduce-osdi04.pdf">The
-Google File System</a>. 19th ACM Symposium on Operating Systems Principles.
-2003.</dd>
-<dt id='H2002'>[H2002]</dt><dd>T. Haveliwala.
+<dt id="GGL2003">[GGL2003]</dt>
+<dd>Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
+<a href="http://labs.google.com/papers/mapreduce-osdi04.pdf
+">The Google File System</a>.
+19th ACM Symposium on Operating Systems Principles. 2003.</dd>
+<dt id='H2002'>[H2002]</dt>
+<dd>T. Haveliwala.
 <a href="
-http://infolab.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf">
-Topic-Sensitive PageRank</a>. Proceedings of the Eleventh International
+http://infolab.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf"
+>Topic-Sensitive PageRank</a>. Proceedings of the Eleventh International
 World Wide Web Conference (Honolulu, Hawaii). 2002.</dd>
-<dt id="KSV2010">[KSV2010]</dt><dd>Howard Karloff, Siddharth Suri, and
-Sergei Vassilvitskii.
+<dt id="KSV2010">[KSV2010]</dt>
+<dd>Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii.
 <a href="http://www.siam.org/proceedings/soda/2010/SODA10_076_karloffh.pdf"
 >A Model of Computation for MapReduce</a>. Proceedings of the ACM
 Symposium on Discrete Algorithms. 2010. pp. 938-948.</dd>
 <dt id="KC2004">[KC2004]</dt><dd>Rohit Khare and Doug Cutting.
 <a href="http://www.master.netseven.it/files/262-Nutch.pdf"
 >Nutch: A flexible and scalable open-source web search engine</a>.
- CommerceNet Labs Technical Report 04. 2004.</dd>
-<dt id="LDH2010">[LDH2010]</dt><dd>Jimmy Lin and Chris Dyer.
+CommerceNet Labs Technical Report 04. 2004.</dd>
+<dt id="LDH2010">[LDH2010]</dt>
+<dd>Jimmy Lin and Chris Dyer.
 <a href="http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf"
 >Data-Intensive Text Processing with MapReduce</a>.
 Synthesis Lectures on Human Language Technologies.
 Morgan and Claypool Publishers. 2010.</dd>
-<dt id="LM2006">[LM2006]</dt><dd>Amy N. Langville and Carl D. Meyer. <a
-    href="http://press.princeton.edu/titles/8216.html">Google's
-PageRank and Beyond</a>. Princton University Press. 2006.</dd>
-<dt id="MKSR2004">[MRS2008]</dt><dd>C. D. Manning, P. Raghavan and H. Schütze.
+<dt id="LM2006">[LM2006]</dt>
+<dd>Amy N. Langville and Carl D. Meyer.
+<a  href="http://press.princeton.edu/titles/8216.html"
+>Google's PageRank and Beyond</a>.
+Princeton University Press. 2006.</dd>
+<dt id="MKSR2004">[MRS2008]</dt>
+<dd>C. D. Manning, P. Raghavan and H. Schütze.
 <a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html"
 >Introduction to Information Retrieval</a>.
 Cambridge University Press. 2008.</dd>
-<dt id="MKSR2004">[MKSR2004]</dt><dd>G. Mohr, M. Kimpton, M. Stack,
-and I.Ranitovic. <a href="http://iwaw.europarchive.org/04/Mohr.pdf"
+<dt id="MKSR2004">[MKSR2004]</dt>
+<dd>G. Mohr, M. Kimpton, M. Stack, and I. Ranitovic.
+<a href="http://iwaw.europarchive.org/04/Mohr.pdf"
 >Introduction to Heritrix, an archival quality web crawler</a>.
 4th International Web Archiving Workshop. 2004. </dd>
-<dt id='P1997a'>[P1997a]</dt><dd>J. Peek. Summary of the talk:
-<a href="http://www.usenix.org/publications/library/proceedings/ana97/
+<dt id='P1997a'>[P1997a]</dt>
+<dd>J. Peek.
+Summary of the talk: <a href="
+http://www.usenix.org/publications/library/proceedings/ana97/
 summaries/monier.html">The AltaVista Web Search Engine</a> by Louis Monier.
-USENIX Annual Technical Conference Anaheim, California. ;login: Volume 22.
+USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22.
 Number 2. April 1997.</dd>
-<dt id='P1997b'>[P1997b]</dt><dd>J. Peek. Summary of the talk:
-<a href="http://www.usenix.org/publications/library/proceedings/
+<dt id='P1997b'>[P1997b]</dt>
+<dd>J. Peek.
+Summary of the talk: <a href="
+http://www.usenix.org/publications/library/proceedings/
 ana97/summaries/brewer.html">The Inktomi Search Engine</a> by Eric Brewer.
-USENIX Annual Technical Conference Anaheim, California. ;login: Volume 22.
+USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22.
 Number 2. April 1997.</dd>
-<dt id="P1994">[P1994]</dt><dd>B. Pinkerton.
+<dt id="P1994">[P1994]</dt>
+<dd>B. Pinkerton.
 <a href="http://web.archive.org/web/20010904075500/http://archive.ncsa.uiuc.edu/
 SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html"
 >Finding what people want: Experiences with the WebCrawler</a>.
 In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.
 1994.</dd>
-<dt id="P1980">[P1980]</dt><dd>M.F. Porter.
+<dt id="P1980">[P1980]</dt>
+<dd>M.F. Porter.
 <a href="http://tartarus.org/~martin/PorterStemmer/def.txt"
 >An algorithm for suffix stripping.</a>
-Program. Volume 14 Issue 3. 1980. pp 130−137.  On the same website, there
-are <a
+Program. Volume 14 Issue 3. 1980. pp 130−137.
+On the same website, there are <a
 href="http://snowball.tartarus.org/">stemmers for many other languages</a>.</dd>
-<dt id='PDGQ2006'>[PDGQ2006]</dt><dd>Rob Pike, Sean Dorward, Robert Griesemer,
-Sean Quinlan. <a href="http://labs.google.com/papers/sawzall-sciprog.pdf"
->Interpreting the Data: Parallel Analysis with Sawzall</a>. Scientific
-Programming Journal. Special Issue on Grids and Worldwide Computing Programming
-Models and Infrastructure. Volume 13. Issue 4. 2006. pp. 227-298.</dd>
-<dt id="W2009">[W2009]</dt><dd>Tom White. <a href="http://www.amazon.com/gp/
+<dt id='PDGQ2006'>[PDGQ2006]</dt>
+<dd>Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan.
+<a href="http://labs.google.com/papers/sawzall-sciprog.pdf"
+>Interpreting the Data: Parallel Analysis with Sawzall</a>.
+Scientific Programming Journal. Special Issue on Grids and Worldwide Computing
+Programming Models and Infrastructure. Volume 13. Issue 4. 2006. pp. 227-298.</dd>
+<dt id="W2009">[W2009]</dt>
+<dd>Tom White.
+<a href="http://www.amazon.com/gp/
 product/1449389732/ref=pd_lpo_k2_dp_sr_1?pf_rd_p=486539851&
 pf_rd_s=lpo-top-stripe-1&pf_rd_t=201&pf_rd_i=0596521979&pf_rd_m=ATVPDKIKX0DER&
-pf_rd_r=0N5VCGFDA7V7MJXH69G6">Hadoop: The Definitive Guide.</a>.
+pf_rd_r=0N5VCGFDA7V7MJXH69G6">Hadoop: The Definitive Guide</a>.
 O'Reilly. 2009.</dd>
-<dt id="ZCTSR2004">[ZCTSR2004]</dt><dd>Hugo Zaragoza, Nick Craswell,
-Michael Taylor, Suchi Saria, and Stephen Robertson. <a
+<dt id="ZCTSR2004">[ZCTSR2004]</dt>
+<dd>Hugo Zaragoza, Nick Craswell, Michael Taylor,
+Suchi Saria, and Stephen Robertson.
+<a
 href="http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf"
 >Microsoft Cambridge at TREC-13: Web and HARD tracks</a>.
 In Proceedings of the 13th Annual Text Retrieval Conference. 2004.</dd>
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index d4dec9a..b249e03 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,11 +2,12 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2bcaab620e468206b752e7a45925f3a4cd37d111&hb=48f31c80c530bfd5fae9a38f6f78299eacf3af48&t=zip"
+    >Version 0.70-ZIP</a></li>
+</li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2c08046b95bb12ad08cc97323e5932a83130fe2d&hb=ac7fb82687b8724230040162e97774f18333d7a7&t=zip"
     >Version 0.68-ZIP</a></li>
 </li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2c56413a3897c12d8f9ef29c102ebc552b2e668f&hb=512cd7bfd0c373c5ae689c65e8e948ddb38963c1&t=zip"
-    >Version 0.66-ZIP</a></li>
 </ul>
 <h2>Git Repository</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would to
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index 081e94d..bbb977d 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -8,29 +8,30 @@ results for a set of urls or domains.
 <h2>Goals</h2>
 <p>Yioop! was designed with the following goals in mind:</p>
 <ul>
-<li><b>To lower the barrier of entry for people wanting to obtain personal
-crawls of the web.</b> At present, it requires only a web server such as Apache
-and command line access to a default build of PHP 5.3 or better. Configuration
-can be done using a GUI interface.</li>
-<li><b>To allow for distributed crawling of the web.</b> To get a snapshot of
+<li><b>Make it easier to obtain personal crawls of the web.</b> Only a web
+server such as Apache and command line access to a default build of PHP 5.3
+or better are needed. Configuration can be done using a GUI interface.</li>
+<li><b>Support distributed crawling of the web, if desired.</b> To download
 many web pages quickly, it is useful to have more than one machine when crawling
 the web. If you have several machines at home, simply install the software
 on all the machines you would like to use in a web crawl. In the configuration
 interface give the URL of the machine you would like to serve search results
 from. Start the queue server on that machine and start fetchers on each of the
 other machines.</li>
-<li><b>To be reasonably fast and online.</b> The Yioop engine is "online" in the
-sense that it creates a word index and document ranking as it crawls rather
-than ranking as a separate step. The point is to keep the processing done by any
+<li><b>Be fast and online.</b> Yioop! is "online" in the
+sense that it creates a word index and document ranking as it crawls rather
+than ranking as a separate step. This keeps the processing done by any
 machine as low as possible so you can still use them for what you bought them
 for. Nevertheless, it is reasonably fast: four Lenovo Q100 fetchers and
 a 2006 MacMini queue server can crawl and index a million pages every couple
-days.</li>
-<li><b>To make it easy to archive crawls.</b> Crawls are stored in timestamped
+days. A single 2010 Mac Mini running four fetchers on the same machine
+can also achieve this rate. More fetchers, of course, allow for faster crawls.
+</li>
+<li><b>Make it easy to archive crawls.</b> Crawls are stored in timestamped
 folders, which can be moved around, zipped, etc. Through the admin interface you
 can select amongst crawls which exist in a crawl folder as to which crawl you
 want to serve from.</li>
-<li><b>To make it easy to crawl archives.</b> There are many sources of
+<li><b>Make it easy to crawl archives.</b> There are many sources of
 raw web data available today such as files that use the Internet Archive's
 arc format, Open Directory Project RDF data, Wikipedia xml dumps, etc. Yioop!
 can index these formats directly, allowing one to get an index for these