Chris Pollett [2012-02-03]
Updating docs for Version 0.82, a=chris
Filename
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 71230cc..fbba680 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.80</h1>
+<h1>Yioop! Documentation v 0.82</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -15,7 +15,9 @@
         <li><a href="#filter">Search Filter</a></li>
         <li><a href="#machines">GUI for Managing Machines and Servers</a></li>
         <li><a href="#localizing">Localizing Yioop! to a New Language</a></li>
-        <li><a href="#embedding">Embedding Yioop! in a Site</a></li>
+        <li><a href="#framework">Building a Site using Yioop! as Framework</a>
+        </li>
+        <li><a href="#embedding">Embedding Yioop! in an Existing Site</a></li>
         <li><a href="#customizing">Customizing Yioop!</a></li>
         <li><a href="#commandline">Yioop! Command-line Tools</a></li>
         <li><a href="#references">References</a></li>
@@ -24,26 +26,26 @@
     <h2 id="intro">Introduction</h2>
     <p>The Yioop! search engine is designed to allow users
     to produce indexes of a web-site or a collection of
-    web-sites whose total number of pages are in the tens of millions.
-    In contrast, a search-engine like Google maintains an index of tens of
-    billions of pages. Nevertheless, since you, the user, have control over the
-    exact sites which are being indexed with Yioop! you have much better control
-    over the kinds of results that a search will return. Yioop! provides
-    a traditional web interface to do queries, an rss api, and a function api.
-    In this section we will discuss some of the different search engine
-    technologies which exist today, how Yioop! fits into this eco-system, and
-    when Yioop! might be the right choice for your search engine needs. In the
-    remainder of this document after the introduction, we will discuss how to
-    get and install Yioop!, the files and folders used in Yioop!,
-    user, role, crawl, and machine management in the Yioop! system,
-    localization in the Yioop! system, embedding Yioop! in an existing web-site,
+    web-sites whose total number of pages is in the tens or low hundreds
+    of millions. In contrast, a search-engine like Google maintains an index
+    of tens of billions of pages. Nevertheless, since you, the user, have
+    control over the exact sites which are being indexed with Yioop!, you have
+    much better control over the kinds of results that a search will return.
+    Yioop! provides a traditional web interface to do queries, an RSS API,
+    and a function API. In this section we discuss some of the different
+    search engine technologies which exist today, how Yioop! fits into this
+    eco-system, and when Yioop! might be the right choice for your search
+    engine needs. In the remainder of this document after the introduction,
+    we discuss how to get and install Yioop!, the files and folders used
+    in Yioop!, user, role, crawl, and machine management in the Yioop! system,
+    localization in the Yioop! system, building a site using the Yioop!
+    framework, embedding Yioop! in an existing web-site,
     customizing Yioop!, and the Yioop! command-line tools.
     </p>
     <p>Since the mid-1990s a wide variety of search engine technologies
     have been explored. Understanding some of this history is useful
-    in understanding Yioop! capabilities.</p>
-    <p>In 1994, Web Crawler, one of the earliest
-    still widely-known search engines, only had an
+    in understanding Yioop! capabilities. In 1994, Web Crawler, one of the
+    earliest still widely-known search engines, only had an
     index of about 50,000 pages which was stored in an Oracle database.
     Today, databases are still used to create indexes for small to medium size
     sites. An example of such a search engine written in PHP is
@@ -56,8 +58,9 @@
     of database systems although techniques like table sharding can help to
     some degree. The Yioop! engine uses a database to manage some things
     like users and roles, but uses its own web archive format and indexing
-    technologies to handle crawl data.</p>
-    <p>When the site that is being indexed consists of dynamic pages rather than
+    technologies to handle crawl data. This is one of the reasons that
+    Yioop! can scale to larger indexes.</p>
+    <p>When a site that is being indexed consists of dynamic pages rather than
     the largely static page situation considered above, and those dynamic
     pages get most of their text content from a table column or columns,
     different search index approaches are often used. Many database management
@@ -67,11 +70,15 @@
     href="http://www.sphinxsearch.com/">Sphinx</a>. However, for these
     approaches to work the text you are indexing needs to be in a database
     column or columns. Nevertheless, these approaches illustrate another
-    common thread in the development of search systems: search as appliance,
+    common thread in the development of search systems: search as an appliance,
     where you either have a separate search server and access it through either
     a web-based API or through function calls. Yioop! has both a search
     function API as well as a web API that returns
-    <a href="http://www.opensearch.org">Open Search RSS results</a>.
+    <a href="http://www.opensearch.org">Open Search RSS results</a>. These
+    can be used to embed Yioop! within your existing site. If you want to
+    create a new search engine site, Yioop! offers a web-based,
+    model-view-controller framework with a web interface for localization
+    that can serve as the basis for your app.
     </p>
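+    <p>For instance, a PHP script on another site might pull Yioop! results
+    via the RSS API. The snippet below is only a sketch: the base url is
+    hypothetical and the f=rss query parameter is an assumption -- see the
+    <a href="#embedding">Embedding Yioop!</a> section for the exact query
+    format:</p>
+<pre>
+&lt;?php
+// Sketch: fetch Open Search RSS results from a Yioop! instance using curl.
+// The base url and the f=rss parameter are assumptions for illustration.
+$base_url = "http://localhost/yioop/"; // hypothetical Yioop! installation
+$ch = curl_init($base_url . "?q=" . urlencode("open source") . "&f=rss");
+curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+$rss = curl_exec($ch);
+curl_close($ch);
+$feed = simplexml_load_string($rss); // parse the RSS response
+foreach ($feed-&gt;channel-&gt;item as $item) {
+    echo $item-&gt;title . " " . $item-&gt;link . "\n";
+}
+?&gt;
+</pre>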
     <p>
     By 1997 commercial sites like Inktomi and AltaVista already had
@@ -155,31 +162,55 @@
     <p>Yioop! tries to exploit
     these advances to use a simplified distributed model which might
     be easier to deploy in a smaller setting. Each node in a Yioop! system
-    is assumed to have a web server running. One machine
-    acts as a coordinator for crawls and runs a process in addition to the web
-    server called a queue_server. Users can send messages
-    to the queue_server using a web interface. This interface writes a message
-    file that the queue_server periodically looks for. All other nodes
-    run a fetcher program. This fetcher program periodically makes a request
-    to the coordinating computer's web server asking for messages and schedules.
-    A schedule is data to process and a message has control information about
-    what kind of processing should be done. The queue_server is responsible
-    for generating schedule files, but unlike the map-reduce model, schedules
-    might be sent to any fetcher. As a fetcher processes a schedule, it
-    periodically POSTs the result of its computation back to the coordinating
-    computer's web server. The data is then written to a set of received
+    is assumed to have a web server running. One of the Yioop! nodes'
+    web apps is configured to act as a
+    coordinator for crawls. It is called the <b>name server</b>. In addition
+    to this, one might have several processes called <b>queue servers</b> that
+    perform scheduling and indexing jobs, as well as <b>fetcher</b> processes
+    which are responsible for downloading pages. Through the name server's
+    web app, users can send messages to the queue_servers and fetchers.
+    This interface writes message
+    files that queue_servers periodically look for. Fetcher processes
+    periodically ping the name server to find the name of the current crawl
+    as well as a list of queue servers. Fetcher programs then periodically
+    make requests in a round-robin fashion to the queue servers for messages
+    and schedules. A schedule is data to process and a message has control
+    information about what kind of processing should be done. A given
+    queue_server is responsible for generating schedule files for data with a
+    certain hash value, for example, urls with host names
+    that hash to that queue server's id. As a fetcher processes a schedule, it
+    periodically POSTs the result of its computation back to the responsible
+    queue server's web server. The data is then written to a set of received
     files. The queue_server as part of its loop looks for received files
     and merges their results into the index so far. So the model is in a
     sense one round: URLs are sent to the fetchers, summaries of downloaded
-    pages are sent back to the queue server and merged into the index.
-    As soon as the crawl is over one can do text searches on the crawl.
+    pages are sent back to the queue servers and merged into their indexes.
+    As soon as the crawl is over one can do text search on the crawl.
     Deploying this  computation model is relatively simple: The web server
     software needs to be installed on each machine, the Yioop! software (which
     has the fetcher, queue_server, and web app components) is copied to
-    the desired location under the web server's document folder, each
-    fetcher is configured to know who the queue_server is, and finally,
-    the fetcher's programs are run on each fetcher machine and the queue_server
-    is run of the coordinating machine. Since a multi-million page crawl might
+    the desired location under the web server's document folder, each instance
+    of Yioop! is configured to know who the name server is, and finally,
+    the fetcher programs and queue server programs are started.
+    </p>
+    <p>As an example
+    of how this scales, a 2010 Mac Mini running a queue server
+    program can schedule and index about 100,000 pages/hour. This corresponds
+    to the work of about 10 fetcher processes (which can be on the same
+    machine, if you have enough memory, or different ones). The checks by
+    fetchers on the name server are lightweight, so adding another machine with
+    a queue server and the corresponding additional fetchers allows one to
+    effectively double this speed. This also has the benefit of speeding up
+    query processing: when a query comes in, it gets split into queries for
+    each of the queue servers' web apps, but the query only "looks" slightly
+    more than half as far into the posting list as would occur in a single
+    queue server setting. To further increase query throughput, that is,
+    the number of queries that can be handled at a given time, Yioop!
+    installations
+    can also be configured as "mirrors" which keep an exact copy of the
+    data stored in the site being mirrored. When a query request comes into a
+    Yioop! node, either it or any of its mirrors might handle it.
+    </p>
+    <p>Since a multi-million page crawl might
     take several days Yioop! supports the ability to dynamically change its
     crawl parameters as a crawl is going on.  This allows a user on request
     from a web admin to disallow Yioop! from continuing to crawl a site without
@@ -188,9 +219,9 @@
     This can help if someone suggests to you a site that might otherwise not
     be found by Yioop! given its original list of seed sites.
     </p>
-    <p>Despite its simpler model, Yioop! does a number of things to improve the
-    quality of its search results. For each link extracted from a page,
-    Yioop! creates a micropage which it adds to its index. This includes
+    <p>Despite its simpler one-round model, Yioop! does a number of things to
+    improve the quality of its search results. For each link extracted from a
+    page, Yioop! creates a micropage which it adds to its index. This includes
     relevancy calculations for each word in the link as well as an
     [<a href="#APC2003">APC2003</a>]-based ranking of how important the
     link was. Yioop! supports a number of iterators which can be thought of
@@ -253,7 +284,7 @@
     Web Archives is to allow one to store
     many small files compressed as one big file. They also make data from web
     crawls very portable, making them easy to copy from one location to another.
-    Like Nutch and Heritrix, Yioop! also has a command line tool for quickly
+    Like Nutch and Heritrix, Yioop! also has a command-line tool for quickly
     looking at the contents of such archive objects.
     </p>
     <p>The <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC
@@ -294,7 +325,8 @@
     <li>Yioop! is an open-source, distributed crawler and search engine
     written in PHP.</li>
     <li>It is capable of crawling and indexing small sites to sites or
-    collections of sites containing millions of documents.</li>
+    collections of sites containing tens of millions or low hundreds of
+    millions
+    of documents.</li>
     <li>On a given machine it uses multi-curl to support many simultaneous
     downloads of pages.</li>
     <li>It has a web interface to select seed sites for crawls and to set what
@@ -307,7 +339,7 @@
     <li>Yioop! supports dynamically changing the allowed and disallowed
     sites while a crawl is in progress.</li>
 <li>Yioop! supports dynamically injecting new seed sites via a web
-    interface into the active crawl</li>
+    interface into the active crawl.</li>
     <li>It supports the indexing of many different filetypes including:
     HTML, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps,
     SVG, XLSX, and XML. It has a web interface for controlling which amongst
@@ -318,6 +350,9 @@
     deploy.</li>
     <li>The fetcher/queue_server processes on several machines can be
     managed through the web interface of a main Yioop! instance.</li>
+    <li>Yioop! installations can be created with a variety of topologies:
+    one queue_server and many fetchers or several queue_servers and
+    many fetchers.</li>
     <li>It determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
     <li>Since version 0.70, Yioop indexes are positional rather than
@@ -340,7 +375,7 @@
     <li>Yioop! uses a web archive file format which makes it easy to
     copy crawl results amongst different machines. It has a command-line
     tool for inspecting these archives if they need to examined
-    in a non-web setting. It also supports command line search querying
+    in a non-web setting. It also supports command-line search querying
     of these archives.</li>
     <li>Using web archives, crawls can be mirrored amongst several machines
     to speed-up serving search results. This can be further sped-up
@@ -351,6 +386,10 @@
     <li>Yioop! supports importing data from ARC, MediaWiki XML, and ODP
     RDF files, it also supports re-indexing of data from WebArchives created
     since version 0.66.</li>
+    <li>Yioop! comes with its own extendable model-view-controller
+    framework that you can use directly to create new sites that use
+    Yioop! search technology. This framework also comes with a GUI
+    which makes it easy to localize strings and static pages.</li>
     <li>Besides standard output of a web page with ten links it is possible
     to get query results in Open Search RSS format and also to query
     Yioop! data via a function api.</li>
@@ -362,7 +401,9 @@
     better (Yioop! used only to serve search results from a pre-built index
     has been tested to work in PHP 5.2), (3) Curl libraries for downloading
     web pages. To be a little more specific Yioop! has been tested with
-    Apache 2.2; however, it should work with other webservers. For PHP,
+    Apache 2.2 and I've been told Version 0.82 works with lighttpd.
+    It should work with other webservers, although it might take some
+    finessing. For PHP,
     you need a build of PHP that incorporates multi-byte string (mb_ prefixed)
     functions, Curl, Sqlite (or at least PDO with Sqlite driver),
     the GD graphics library and the command-line interface. If you are using
@@ -383,9 +424,8 @@ extension=php_curl.dll

 </p>
 <p>
-    If you
-    are using the Ubuntu variant of Linux, the following lines would get the
-    software you need:
+    If you are using the Ubuntu variant of Linux, the following lines would
+    get the software you need:
     </p>
     <pre>
     sudo apt-get install curl
@@ -403,7 +443,9 @@ extension=php_curl.dll
     allows you through a web interface to start/stop and look at the
     log files for each of the queue_servers, and fetchers that you want
     Yioop! to manage. If it is not configured then these tasks would need
-    to be done via the command line. On OSX and Linux, Manage Machines
+    to be done via the command line. <b>Also, if you do not use
+the Manage Machines interface, your Yioop! site can make use of only one
+    queue_server.</b> On OSX and Linux, Manage Machines
     needs to be able to schedule "at" batch jobs (type man at to find out
     more about these). On OSX to enable
     this ability, you might need to type:</p>
@@ -507,7 +549,7 @@ be checked in a production environment.
 </p>
 <p>The <b>Search Access</b> field set has three check boxes:
 Web, RSS, and API. These control whether a user can use the
-web interface to get query results, whether RSS repsonses to queries
+web interface to get query results, whether RSS responses to queries
 are permitted, or whether or not the function based search API is
 available. Using the Web Search interface
 and formatting a query url to get an RSS response are
@@ -515,7 +557,9 @@ describe in the <a href="#interface">Yioop! Search and User Interface
 section</a>. The Yioop! Search Function API is described in the
 section <a href="#embedding">Embedding Yioop!</a>, you can also look
 in the examples folder at the file search_api.php to see an example
-of how to use it.</p>
+of how to use it. <b>If you intend to use Yioop!
+in a configuration with multiple queue servers (not fetchers), then
+the RSS check box needs to be checked.</b></p>
 <p>The <b>Database Set-up</b> fieldset is used to specify what database management
 system should be used, how it should be connected to, and what user name
 and password should be used for the connection. At present sqlite2
@@ -549,15 +593,15 @@ of each search result. Finally, the IP address checkbox toggles
 whether a link for pages with the same ip address should be displayed as part
 of each search result.</p>

-<p>The <b>Queue Server Set-up</b> fieldset is used to tell Yioop! which machine
-is going to act as a queue server during a crawl and what secret string
+<p>The <b>Name Server Set-up</b> fieldset is used to tell Yioop! which machine
+is going to act as a name server during a crawl and what secret string
 to use to make sure that communication is being done between
 legitimate queue_servers and fetchers of your installation. You can
 choose anything for your secret string as long as you use the same
 string amongst all of the machines in your Yioop! installation.
-The reason why you have to set the queue_server url is that each machine that
-is going to run a fetcher to download web pages, needs to know who the
-queue server is so they can request a batch of urls to download. There are a
+The reason why you have to set the name server url is that each machine that
+is going to run a fetcher to download web pages needs to know who the
+queue servers are so they can request a batch of urls to download. There are a
 few different ways this can be set-up:
 </p>
 <ol>
@@ -569,7 +613,7 @@ http://localhost/path_to_yioop/ or
 http://127.0.0.1/path_to_yioop/, where you appropriately modify
 "path_to_yioop".</li>
 <li>Otherwise, if you are doing a crawl on multiple machines, use
-the url of Yioop! on the machine that will act as the queue_server.</li>
+the url of Yioop! on the machine that will act as the name server.</li>
 </ol>
 <p>In communicating between the fetcher and the server, Yioop! uses
 curl. Curl can be particular about redirects in the case where posted
@@ -619,7 +663,7 @@ scripts:</p>
     (multiple machines can act as fetchers) and the
     queue_server.</dd>
 </dl>
-<p>The file index.php is essentially used when you browse to an installation
+<p>The file index.php is used when you browse to an installation
 of a Yioop! website. The description of how to use a Yioop! web site is
 given in the sections starting from The Yioop! User Interface section.
 The files fetcher.php and queue_server.php are only connected with crawling
@@ -644,10 +688,12 @@ the Yioop! folder's various sub-folders contain:
 <dl>
 <dt>bin</dt><dd>This folder is intended to hold command-line scripts
 which are used in conjunction with Yioop! In addition to the fetcher.php
-and queue_server.php script already mentioned, it contains arc_tool.php
-and query_tool.php. The former  can be used to examine the contents of
-WebArchiveBundle's and IndexArchiveBundle's from the command line; the latter
-can be used to run queries from the command-line.</dd>
+and queue_server.php script already mentioned, it contains arc_tool.php,
+mirror.php, and query_tool.php. arc_tool.php can be used to examine the contents
+of WebArchiveBundle's and IndexArchiveBundle's from the command line.
+mirror.php can be used if you would like to create a mirror/copy of a Yioop
+installation.  Finally, query_tool.php can be used to run queries
+from the command-line.</dd>
 <dt>configs</dt><dd>This folder contains configuration files. You will
 probably not need to edit any of these files directly as you can set the most
 common configuration settings from with the admin panel of Yioop! The file
@@ -659,6 +705,9 @@ not strictly necessary as the database should be creatable via the admin panel.
 The file default_crawl.ini is copied to WORK_DIRECTORY after you set this
 folder in the admin/configure panel. There it is renamed as crawl.ini and
 serves as the initial set of sites to crawl until you decide to change these.
+bigram_builder.php is a tool which can be used to extract, from the urls in a
+Wikipedia dump for a language, pairs of words that should be treated as
+a logical unit during a crawl.
 </dd>
 <dt>controllers</dt><dd>The controllers folder contains all the controller
 classes used by the web component of the Yioop! search engine. Most requests
@@ -764,6 +813,13 @@ crawl.ini, bot.txt, and robot_table.txt. Here is a rough guide to what
 the WORK DIRECTORY's sub-folders contain:
     </p>
 <dl>
+<dt>app</dt><dd>This folder is used to contain your overrides to
+the views, controllers, models, resources, etc. For example, if you
+wanted to change how the search results were rendered, you could
+add a views/search_view.php file to the app folder and Yioop! would use
+it rather than the one in the Yioop! base directory's views folder.
+Using the app dir makes it easier to have customizations that won't get
+messed up when you upgrade Yioop!</dd>
 <dt>cache</dt><dd>The directory is used to store folders of the form
 ArchiveUNIX_TIMESTAMP, IndexDataUNIX_TIMESTAMP, and QueueBundleUNIX_TIMESTAMP.
 ArchiveUNIX_TIMESTAMP folders hold complete caches of web pages that have been
@@ -1056,6 +1112,10 @@ activities in turn.
     <p>The Manage Crawl activity in Yioop! looks like:</p>
 <img src='resources/ManageCrawl.png' alt='Manage Crawl Form'/>
     <p>
+    This activity will actually list slightly different kinds of peak memory
+    usages depending on whether the queue_servers are run from a terminal
+    or through the web interface. The screenshot above was done when a
+    single queue_server was being run from the terminal.
     The first form in this activity allows you to name and start a new
     web crawl. Next to the Start New Crawl button is an Options link, which
     allows one to set the parameters under which the crawl will execute. We
@@ -1093,16 +1153,22 @@ activities in turn.
     or so.
     </p>
     <h3 id="prereqs">Prerequisites for Crawling</h3>
-    <p>Before you can start a new crawl, you need to run the queue_server.php
-    script on the machine that is going to act as the queue_server and
-    you need to run the fetcher.php script either on the same machine
-    or on at least one other machine with Yioop! installed and which has
-    been configured with the queue_server url. This can be done either via the
-    command-line or through a web interface. As described in the
+    <p>Before you can start a new crawl, you need to run at least one
+    queue_server.php script and you need to run at least one fetcher.php script.
+    These can be run either from the same Yioop! installation or from
+    separate machines or folders with Yioop! installed. Each installation of
+    Yioop! that is going to participate in a crawl should be configured with the
+    same name server and server key. Running these scripts can be done either
+    via the command line or through a web interface. As described in the
     <a href="#requirements">Requirements</a> section you might need to do some
-    addtional initial set up if you want to take the web interface approach.
-    In this section we descibe how to start the queue_server.php and fetcher.php
-    scripts via the command line; the <a href="#machines"
+    additional initial set up if you want to take the web interface approach.
+    On the other hand, the command-line approach only works if you are using
+    only one queue server. You can still have more than one fetcher, but
+    the crawl speed in this case probably won't go faster after ten to
+    twelve fetchers. Also, in the command-line approach the queue server and
+    name server should be the same instance of Yioop! In the remainder of this
+    section we describe how to start the queue_server.php and
+    fetcher.php scripts via the command line; the <a href="#machines"
     >GUI for Managing Machines and Servers</a> section describes how to do
     it via a web interface. To begin open a
     command shell and cd into the bin subfolder of the Yioop! folder. To
@@ -1117,7 +1183,9 @@ php fetcher.php terminal</pre>
     properly set in your PATH environment variable. If this is not the case,
     you would need to type the path to php followed by php then the rest of
     the line. If you want to stop these programs after starting them simply
-    type CTRL-C. On *nix (Unix, Linux, Mac) systems, it is possible to run
+    type CTRL-C. Assuming you have done the additional configuration
+    mentioned above that is needed for the GUI approach to managing these
+    programs, it is also possible to run
     the queue_server and fetcher programs as daemons. To do this one could
     type respectively:
     </p>
@@ -1153,7 +1221,8 @@ php fetcher.php stop</pre>
     <h3>Common Crawl and Search Configurations</h3>
     <p>When testing Yioop!, it is quite common just to have one instance
     of the fetcher and one instance of the queue_server running, both on
-    the same machine. In this subsection we wish to briefly describe some
+    the same machine and same installation of Yioop! In this subsection
+    we wish to briefly describe some
     other configurations which are possible and also some configs/config.php
     configurations that can affect the crawl and search speed. The most obvious
     config.php setting which can affect the crawl speed is
@@ -1172,21 +1241,20 @@ php fetcher.php stop</pre>
 <pre >
 php fetcher.php start 5
 </pre>
-would start instance 5 of the fetcher program. The most general crawl
-configuration for Yioop! is thus
-    typically a single queue_server and multiple machines each running multiple
-    copies of the fetcher software.
+    would start instance 5 of the fetcher program.
     </p>
     <p>Once a crawl is complete, one can see its contents in the folder
-    WORK DIRECTORY/cache/IndexDataUNIX_TIMESTAMP. Putting the WORK_DIRECTORY
+    WORK DIRECTORY/cache/IndexDataUNIX_TIMESTAMP. In the multi-queue server
+    setting each queue server machine would have such a folder containing
+    the data for the hosts that queue server crawled. Putting the WORK_DIRECTORY
     on a solid-state drive can, as you might expect, greatly speed-up how fast
-    search results will be served. Unfortunately, for even a single
-    crawl of ten million or so pages, the corresponding IndexDataUNIX_TIMESTAMP
-    folder might be around 200 GB. Two main sub-folders of
-    IndexDataUNIX_TIMESTAMP largely determine the search performance of
+    search results will be served. Unfortunately, if a given queue server
+    is storing ten million or so pages, the corresponding
+    IndexDataUNIX_TIMESTAMP folder might be around 200 GB. Two main sub-folders
+    of IndexDataUNIX_TIMESTAMP largely determine the search performance of
     Yioop! handling queries from a crawl. These are the dictionary subfolder
     and the posting_doc_shards subfolder, where the former has the greater
-    influence. On a ten million page crawl these might be 5GB and 30GB
+    influence. For the ten million page situation these might be 5GB and 30GB
     respectively. It is completely possible to copy these subfolders to
     a SSD and use symlinks to them under the original crawl directory to
     enhance Yioop!'s search performance.</p>
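+    <p>For example, on a Linux system, from inside the
+    IndexDataUNIX_TIMESTAMP folder, one might type something like the
+    following (the SSD mount point and timestamp below are made up for
+    illustration):</p>
+<pre>
+mkdir -p /mnt/ssd/IndexData1328000000
+mv dictionary posting_doc_shards /mnt/ssd/IndexData1328000000/
+ln -s /mnt/ssd/IndexData1328000000/dictionary dictionary
+ln -s /mnt/ssd/IndexData1328000000/posting_doc_shards posting_doc_shards
+</pre>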
@@ -1458,7 +1526,7 @@ OdpRdfArchiveBundle
     results.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='machines'>GUI for Managing Machines and Servers</h2>
-    <p>Rather than use the command-line as described in the
+    <p>Rather than use the command line as described in the
     <a href="#prereqs">Prerequisites for Crawling</a> section, it is possible
     to start/stop and view the log files of queue servers and fetcher
     through the Manage Machines activity. In order to do this, the additional
@@ -1468,27 +1536,36 @@ OdpRdfArchiveBundle
 <img src='resources/ManageMachines.png' alt='The Manage Machines form'/>
     <p>The Add machine form at the top of the page allows one to add a new
     machine to be controlled by this Yioop! instance. The Machine
-    Name field let's you give this machine an easy to remember name,
-    the Machine URL, should be the URL to the installed Yioop! instance on
-    that machine, the Has Queue Server checkbox is used to say whether
-    that machine will be running a queue server or not, and the
-    Number of Fetchers drop-down allows you to say how many fetcher instances
-    you want to be able to manage for that machine. The Delete Machine
-    form allows you to remove a machine that you either misconfigured
-    or that you no longer want to manage through this Yioop! instance.
-    To modify a machine that you have already added, you should delete it
-    and re-add it using the setting you want. The Machine Information
+    Name field lets you give this machine an easy to remember name.
+    The Machine URL field should be filled in with the URL to the
+    installed Yioop! instance. The Is Mirror checkbox says whether you want
+    the given Yioop! installation to act as a mirror for another Yioop!
+    installation. Checking it will reveal a drop-down menu that allows you
+    to choose which installation amongst the previously entered machines
+    you want to mirror. The Has Queue Server checkbox is used to say whether
+    the given Yioop! installation will be running a queue server or not.
+    Finally, the Number of Fetchers drop-down allows you to say how many
+    fetcher instances you want to be able to manage for that machine.
+    The Delete Machine form allows you to remove a machine that you either
+    misconfigured or that you no longer want to manage through this Yioop!
+    instance. To modify a machine that you have already added, you should
+    delete it and re-add it using the setting you want. The Machine Information
     section of the Manage Machines activity consists of boxes for
     each machine that you have added. Each box lists the queue server,
     if any, and each of the fetchers you requested to be able to manage.
     Next to these there is a link to the log file for that server/fetcher
     and below this there is an On/Off switch for starting and stopping
     the server/fetcher. This switch is green if the server/fetcher is running
-    and red otherwise.</p>
+    and red otherwise. A similar On/Off switch is present to turn on
+    and off mirroring on a machine that is acting as a mirror.</p>
     <h2 id='localizing'>Localizing Yioop! to a New Language</h2>
-    <p>The Manage Locales activity can be used to configure Yioop
-    for use with different languages and for different regions. The
-    basic form looks like:</p>
+    <p>The Manage Locales activity can be used to configure Yioop!
+    for use with different languages and for different regions. If you decide
+    to customize your Yioop! installation by adding files to
+    WORK_DIRECTORY/app as described in the <a href="#framework">Building a
+    Site using Yioop! as a Framework</a> section, then the localization
+    tools described in this section can also be used to localize your custom
+    site. Clicking the Manage Locales activity one sees a page like:</p>
 <img src='resources/ManagingLocales.png' alt='The Manage Locales form'/>
     <p>
     The first form on this activity allows you to create a new locale --
@@ -1518,17 +1595,40 @@ OdpRdfArchiveBundle
     link. This should display the following form:</p>
 <img src='resources/EditingLocaleStrings.png' alt='The Edit Locales form'/>
     <p>In the above case, the link for English was clicked. The Back link
-    in the corner can be used to written to the previous form. The
+    in the corner can be used to return to the previous form.
+    The Static Pages drop-down has a list of all the static pages
+    (.thtml files)
+    which are in either the folder WORK_DIRECTORY/locale/current-tag/pages
+    (in this case, current-tag is en-US) or the folder
+    WORK_DIRECTORY/locale/default-tag/pages where default-tag is the IANA tag
+    for the default language of the Yioop! installation. Selecting a page
+    allows one to edit it within Yioop!. The idea is that one might have
+    created a couple of static pages in the default locale pages folder
+    and a localizer can use this interface to see what is written in these
+    files. Yioop! automatically creates these files in the directory the
+    localizer is localizing for, and the localizer can translate their contents
+    into the appropriate language. Beneath this drop-down, the
     Edit Locale page mainly consists of a two column table: the right column
     being string ids, the left column containing what should be their
     translation into the given locale. If no translation exists yet,
-    the field will be displayed in red. If you make a set of translations,
-    be sure to submit the form associated with this table by scrolling to
-    the bottom of the page and clicking the Submit link. This saves your
-    translations; otherwise, your work will be lost if you navigate away
-    from this page. One aid to translating is if you hover your mouse
-    over a field that needs translation, then its translation in the
-    default locale (usually English) is displayed. If you want to find
+    the field will be displayed in red. String ids are extracted by Yioop!
+    automatically from controller, view, helper, layout, and element class files
+    which are either in the Yioop! installation itself or in the installation's
+    WORK_DIRECTORY/app folder. Yioop! looks for tl() function calls to extract
+    ids from these files, for example, on seeing tl('search_view_query_results')
+    Yioop! would extract the id search_view_query_results; on seeing
+    tl('search_view_calculated', $data['ELAPSED_TIME']) Yioop! would extract
+    the id, 'search_view_calculated'. In the second case, the translation is
+    expected to have a %s in it for the value of
+    $data['ELAPSED_TIME']. Note %s is used regardless of the type, say
+    int, float, string, etc., of $data['ELAPSED_TIME']. tl() can handle
+    additional arguments; whenever an additional argument is supplied, an
+    additional %s would be expected somewhere in the translation string.
+    If you make a set of translations, be sure to submit the form associated
+    with this table by scrolling to the bottom of the page and clicking the
+    Submit link. This saves your translations; otherwise, your work will be
+    lost if you navigate away from this page. One aid to translating is if you
+    hover your mouse over a field that needs translation, then its translation
+    in the default locale (usually English) is displayed. If you want to find
     where in the source code a string id comes from the ids follow
     the rough convention file_name_approximate_english_translation.
     So you would expect to find admin_controller_login_successful
@@ -1582,10 +1682,229 @@ OdpRdfArchiveBundle
     the entry for the language tag from $CHARGRAMS. If you add a
     language to Yioop! and want to use char gramming, merely add an
     additional entry to this array.</p>
+    <h3>Adding a bigram filter for your language</h3>
+    <p>
+    Bigrams are pairs of words which always occur together in the same
+    sequence in a user query, ex: "honda accord". Yioop! can treat these
+    pairs of words as a single word to increase the speed and relevance
+    of retrieval. The configs/bigram_builder.php script can be used to
+    create a bigram filter file for the Yioop search engine to detect
+    such words in documents and queries. The input to this script is
+    an xml file which contains a large collection of such bigrams. One
+    common source of a large set of bigrams is an XML dump of Wikipedia.
+    Wikipedia dumps are available for download online free of cost. The
+    bigram filter file is specific to a language, therefore, the user has to
+    create a separate filter file for each language that is to use this
+    functionality. The configs/bigram_builder.php script can be run multiple
+    times to create different filter files by specifying different input xml
+    files and different languages as command-line arguments. Xml dumps of
+    Wikipedia for different specific languages are available to download, and it
+    is these language specific dumps which serve as input to this script.
+    </p>
+    <p>
+    To illustrate the use of bigram_builder.php, here are the steps to use it
+    in the case of wanting to create an English language bigram filter file.
+    </p>
+    <p><b>Step 1</b>: Go to <a href="http://dumps.wikimedia.org/enwiki/"
+    >http://dumps.wikimedia.org/enwiki/</a> and obtain a dump
+    of the English Wikipedia. This page lists all the dumps according
+    to the date they were taken. Choose any suitable date or the latest. A
+    link with a label such as 20120104/ represents a dump taken on
+    01/04/2012. Click this link to go in turn to a
+    page which has many links based on type of content you are looking for.
+    We are interested in content titled
+    "Recombine all pages, current versions only". Beneath this we might find a
+    link with a name like:<br />
+    <b>enwiki-20120104-pages-meta-current.xml.bz2</b><br />
+    This is a bz2 compressed xml file containing all the English pages of
+    Wikipedia. Download this file to the search_filters
+    folder of your Yioop! work directory associated with your profile. This
+    file is of the order of 7GB. The bigram_builder.php script, though,
+    uncompresses this file while making a filter, so you should have around
+    100GB free while creating the bigram filter (you can free up the space
+    after the process is over).</p>
+
+    <p><b>Step 2</b>: Run this script from the php command line as follows:</p>
+    <pre>
+    php bigram_builder.php enwiki-20120104-pages-meta-current.xml.bz2 en
+    </pre>
+    <p>
+    This creates a bigram filter en_bigrams.ftr for English in the same
+    directory. Yioop! will automatically detect the filter file and use
+    it the next time you crawl as well as when anyone performs an English
+    language query. It should be noted that Yioop! works perfectly
+    well if you don't create any bigram filters; however, bigram filters
+    enhance Yioop!'s ability to return relevant results for some languages
+    as well as speed up certain two-word queries. If you have a site you
+    crawled before creating a bigram filter and then make a bigram filter,
+    data from this older crawl might actually be served less well because
+    attempted bigram lookups will fail.
+    </p>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='framework'>Building a Site using Yioop! as Framework</h2>
+    <p>The Yioop! code base can serve as the basis for new custom search
+    web sites. The web-app portion of Yioop! uses a model-view-controller (MVC)
+    framework. In this set-up, sub-classes of the Model class should handle
+    file I/O and database functions, sub-classes of the View class should be
+    responsible for rendering outputs, and sub-classes of the Controller class
+    do calculations on data received from the web and from the models to give
+    the views the data they finally need to render. In the remainder of this
+    section we describe how this framework is implemented in Yioop! and
+    how to add code to the WORK_DIRECTORY/app folder to customize things for
+    your site. In this discussion we will use APP_DIR to refer to
+    WORK_DIRECTORY/app and BASE_DIR to refer to the directory where Yioop!
+    is installed.</p>
+
+    <p>The index.php script is the first script run by the Yioop! web app.
+    It has an array $available_controllers which lists the controllers
+    available to the script. The names of the controllers in this array are
+    lower case. Based on whether the $_REQUEST['c'] variable is in this array
+    index.php either loads the file {$_REQUEST['c']}_controller.php or loads
+    whatever the default controller is. index.php also checks for the existence
+    of APP_DIR/index.php and loads it if it exists. This gives
+    the app developer a chance to change the available controllers and which
+    controller is set for a given request. A controller file should have in it
+    a class which extends the class Controller. Controller files should always
+    have names of the form somename_controller.php and the class inside them
+    should be named SomenameController. Notice it is Somename rather than
+    SomeName. These general naming conventions are used for models, views, etc.
+    Any Controller subclass has the fields $models, $views, and
+    $indexing_plugins. For the base class these are empty,
+    but for a subclass you create you can set them to be arrays listing the
+    names of the models, views, and indexing_plugins your class uses. Yioop!
+    tries to load each of the classes listed in these arrays. For example
+    if MyController defined:</p>
+    <pre>
+    var $views = array("search");
+    </pre>
+    <p>
+    Then Yioop! would first look for a file: APP_DIR/views/search_view.php
+    to include; if it cannot find such a file, then it tries to include
+    BASE_DIR/views/search_view.php. So to change the behavior of an existing
+    BASE_DIR file one just places a modified copy of the file in the appropriate
+    place in your APP_DIR. This holds in general for other program files
+    such as views and plugins. It doesn't hold for resources such as images --
+    we'll discuss those in a moment. Notice because it looks in APP_DIR
+    first, you can go ahead and create new controllers, models, views, etc
+    which don't exist in BASE_DIR and by setting the variables up right get
+    Yioop! to load them. When an instance of the controller
+    class Yioop! is using for a request is created, Yioop! also creates
+    an instance of each View, Model and IndexingPlugin associated with that
+    controller and sets them as field variables. To refer to the instance of
+    SearchView in an instance $mycontroller of MyController we could use the
+    variable $mycontroller-&gt;searchView. For models, we would write
+    expressions like</p>
+<pre>
+    $mycontroller-&gt;mymodelnameModel
+</pre>
+    <p>and for plugins,</p>
+<pre>
+    $mycontroller-&gt;mypluginnamePlugin
+</pre>
+    <p>Notice in each expression the name of the
+    particular model or plugin is lower case. Given this way of referring
+    to models, a controller can invoke a model's methods to get data out
+    of the file system or from a database with expressions like:</p>
+<pre>
+    $mycontroller-&gt;mymodelnameModel-&gt;someMethod();
+</pre>
+    <p>
+    In the above, if the code was within a method in the controller class
+    itself, we would typically write things like:</p>
+<pre>
+    $this-&gt;mymodelnameModel-&gt;someMethod();
+</pre>
+    <p>A Controller must implement the abstract method
+    processRequest. The index.php script after finishing its bootstrap process
+    calls the processRequest method of the Controller it chose to
+    load. If this was your controller, the code in your controller
+    should make use of data gotten out of
+    the loaded models as well as data from the web request to do some
+    calculations. The results of these calculations you would typically
+    put into an associative array $data and then call the base Controller method
+    displayView($view, $data). Here $view is whichever loaded view object
+    you would like to display.
+    </p>
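+    <p>To make this concrete, here is a minimal sketch of a custom
+    controller. The class name, the model and view chosen, and the $data
+    fields are all made up for illustration:</p>
+<pre>
+&lt;?php
+/** Hypothetical file APP_DIR/controllers/my_controller.php */
+class MyController extends Controller
+{
+    // views and models Yioop! should load for this controller
+    var $views = array("search");
+    var $models = array("crawl");
+
+    function processRequest()
+    {
+        $data = array();
+        // compute data for the view, possibly using $this-&gt;crawlModel
+        $data['PAGE_TITLE'] = "My Custom Site";
+        // render the loaded SearchView with the computed data
+        $this-&gt;displayView($this-&gt;searchView, $data);
+    }
+}
+?&gt;
+</pre>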
+    <p>
+    To complete the picture of how Yioop! eventually produces a web page or
+    other output, we now describe how subclasses of the View class work.
+    Subclasses of View have four fields
+    $pages, $layout, $helpers, and $elements. In the base class, $pages,
+    $helpers, and $elements are empty arrays and the $layout is an empty
+    string. A subclass of View has at most one Layout and it is used
+    for rendering the header and footer of the page. It is included and
+    instantiated by setting $layout to be the name of the layout one wants to
+    load. For example, $layout="web"; would load either the
+    file APP_DIR/views/layouts/web_layout.php or
+    BASE_DIR/views/layouts/web_layout.php. This file is expected to have in it
+    a class WebLayout extending Layout. The constructor of a Layout
+    takes as argument a view which it sets to an instance variable.
+    The way layouts get drawn is
+    as follows: When the controller calls displayView($view, $data), this method
+    does some initialization and then calls the render($data) of the base
+    View class. This in turn calls the render($data) method of whatever
+    Layout was on the view. This render method then draws the header and then
+    calls $this->view->renderView($data); to draw the view, and finally
+    draws the footer.
+    </p>
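+    <p>As a sketch, a hypothetical layout file
+    APP_DIR/views/layouts/my_layout.php, selected by setting
+    $layout = "my"; in a view, might look like:</p>
+<pre>
+&lt;?php
+/** Hypothetical layout; e() is Yioop!'s shorthand for echo */
+class MyLayout extends Layout
+{
+    function render($data)
+    {
+        // draw the header
+        e("&lt;html&gt;&lt;head&gt;&lt;title&gt;My Site&lt;/title&gt;&lt;/head&gt;&lt;body&gt;");
+        // draw the view between the header and footer
+        $this-&gt;view-&gt;renderView($data);
+        // draw the footer
+        e("&lt;/body&gt;&lt;/html&gt;");
+    }
+}
+?&gt;
+</pre>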
+    <p>
+    The files loaded by the constructor of View for
+    each of $pages, $helpers, and $elements follow the same kind of pattern
+    as described above for Controller. The files loaded in the case of
+    $helpers are expected to be sub-classes of Helper and those of $elements
+    are expected to be sub-classes of Element. For helpers, a given view,
+    $view, with $helpers = array("somehelper"); would get an instance
+    variable $view-&gt;somehelperHelper and similarly for elements. Each
+    file loaded because of the $pages array, on the other hand, is expected
+    to be a static portion of a web page in
+    WORK_DIRECTORY/locale/current-IANA-tag/pages.
+    For example, $pages=array("about"); would look for an about.thtml file
+    in this folder, load it and assign the string contents
+    to $page_objects["about"]. Using e(), Yioop!'s shorthand for echo, a view
+    could render this page with the command:</p>
+    <pre>
+    e($this->page_objects["about"]);
+    </pre>
+    <p>
+    Elements have render($data) methods and can be used to draw out portions
+    of pages which may be common across Views. Helpers, on the other hand,
+    are typically used to render UI elements. For example, OptionsHelper
+    has a render($id, $name, $options, $selected) method and is used to
+    draw select drop-downs.
+    </p>
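+    <p>For instance, a hypothetical element file
+    APP_DIR/views/elements/footer_element.php, loaded by a view that has
+    $elements = array("footer"); and drawn with
+    $this-&gt;footerElement-&gt;render($data);, might be:</p>
+<pre>
+&lt;?php
+/** Hypothetical element for a footer shared by several views */
+class FooterElement extends Element
+{
+    function render($data)
+    {
+        e("&lt;div class='footer'&gt;Copyright My Custom Site&lt;/div&gt;");
+    }
+}
+?&gt;
+</pre>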
+    <p>When rendering a View or Element one often has css, scripts, images,
+    videos, objects, etc. In BASE_DIR, the targets of these tags would typically
+    be stored in the css, scripts, or resources folders.
+    The APP_DIR/css, APP_DIR/scripts, and APP_DIR/resources folders are
+    a natural place for them in your customized site. One wrinkle,
+    however, is that APP_DIR, unlike BASE_DIR, doesn't have to be under
+    your web server's DOCUMENT_ROOT. So how does one refer in a link
+    to these folders? To do this, one uses Yioop!'s ResourceController class
+    which can be invoked by a link like:</p>
+    <pre>
+    &lt;img src="?c=resource&a=get&n=myicon.png&f=resources" /&gt;
+    </pre>
+    <p>
+    Here c=resource specifies the controller, a=get specifies the activity --
+    to get a file, n=myicon.png specifies we want the file myicon.png --
+    the value of n is cleaned to make sure it is a filename before being used,
+    and f=resources specifies the folder -- f is allowed to be one of
+    css, script, or resources. This would get the file
+    APP_DIR/resources/myicon.png.
+    </p>
+    <p>
+    This completes our description of the Yioop! framework and how to
+    build a new site using it. It should be pointed out that code in
+    the APP_DIR can be localized using the same mechanism as in BASE_DIR.
+    More details on this can be found in the section on
+    <a href="#localizing">Localizing Yioop!</a>.
+    </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='embedding'>Embedding Yioop! in an Existing Site</h2>
     <p>One use-case for Yioop! is to use it to serve search result for your
-    site. There are three common ways to do this: (1)
+    existing site. There are three common ways to do this: (1)
     On your site have a web-form or links with your installation of Yioop!
     as their target and let Yioop! format the results. (2) Use the
     same kind of form or links, but request an OpenSearch RSS Response from
@@ -1600,7 +1919,7 @@ OdpRdfArchiveBundle
     access methods (2) or (3) and don't want users to be able to access the
     Yioop! search results via its built in web form. We will now spend a moment
     to look at each of these access methods in more detail...</p>
-    <h3>Accessing Yioop! via an Existing Web Form</h3>
+    <h3>Accessing Yioop! via a Web Form</h3>
     <p>A very minimal code snippet for such a
     form would be:</p>
     <pre>
@@ -1765,8 +2084,8 @@ xmlns:atom="http://www.w3.org/2005/Atom"

     <h3>Writing an Indexing Plugin</h3>
     <p>An indexing plugin provides a way that an advanced end-user
-    can extend the indexing capabilities of Yioop! Bundled with
-    Version 0.70 of Yioop! is an example recipe indexing plugin which
+    can extend the indexing capabilities of Yioop!. Bundled with
+    Yioop! is an example recipe indexing plugin which
     can serve as a guide for writing your own plugin. It is
     found in the folder lib/indexing_plugins. This recipe
     plugin is used to detect food recipes which occur on pages during a crawl.
@@ -1848,17 +2167,20 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     posting_doc_shards. arc_tool is run from the command-line with the syntaxes:
     </p>
     <pre>
-php arc_tool.php list //returns a list
-//of all the archives in the Yioop! crawl directory.
-
 php arc_tool.php info bundle_name //return info about
 //documents stored in archive.

-php arc_tool.php show bundle_name start num //outputs
-//items start through num from bundle_name
+php arc_tool.php list //returns a list
+//of all the archives in the Yioop! crawl directory.
+
+php arc_tool.php mergetiers bundle_name max_tier
+//merges tiers of word dictionary into one tier up to max_tier

 php arc_tool.php reindex bundle_name
 //reindex the word dictionary in bundle_name
+
+php arc_tool.php show bundle_name start num //outputs
+//items start through num from bundle_name
    </pre>
    <p>The bundle name can be a full path name, a relative path from
    the current directory, or it can be just the bundle directory's file
@@ -1928,8 +2250,14 @@ Final Merge Tiers

 Reindex complete!!
 </pre>
+<p>The mergetiers command is like a partial reindex. It assumes all the shard
+words have been added to the dictionary, but that the dictionary
+still has more than one tier (tiers are the result of incremental
+log-merges which are made during the crawling process). The
+mergetiers command merges these tiers into one large tier which is
+then usable by Yioop! for query processing.</p>
     <h3>Querying an Index from the command-line</h3>
-<p>    The command-line script bin/query_tool.php can be use to query
+<p>The command-line script bin/query_tool.php can be used to query
 indices in the Yioop! WORK_DIRECTORY/cache. This tool can be used
 on an index regardless of whether or not Apache is running. It can be
 used for long running queries that might timeout when run within a browser
@@ -2017,7 +2345,7 @@ MIT Press. 2010.</dd>
 OSDI'04: Sixth Symposium on Operating System Design and Implementation. 2004<dd>
 <dt id="GGL2003">[GGL2003]</dt>
 <dd>Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
-<a href="http://labs.google.com/papers/mapreduce-osdi04.pdf
+<a href="http://research.google.com/archive/gfs-sosp2003.pdf
 ">The Google File System</a>.
 19th ACM Symposium on Operating Systems Principles. 2003.</dd>
 <dt id='H2002'>[H2002]</dt>
@@ -2086,7 +2414,7 @@ On the same website, there are <a
 href="http://snowball.tartarus.org/">stemmers for many other languages</a>.</dd>
 <dt id='PDGQ2006'>[PDGQ2006]</dt>
 <dd>Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan.
-<a href="http://labs.google.com/papers/sawzall-sciprog.pdf"
+<a href="http://research.google.com/archive/sawzall-sciprog.pdf"
 >Interpreting the Data: Parallel Analysis with Sawzall</a>.
 Scientific Programming Journal. Special Issue on Grids and Worldwide Computing
 Programming Models and Infrastructure.Volume 13. Issue 4. 2006. pp.227-298.</dd>
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 641a6ff..bfb852a 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,11 +2,11 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=0527f37f13a31be590ad11b6ea6b515b6ea2e0e8&hb=a7fb21c4b438ebf20f77abeb84313c022004aa0d&t=zip"
+     >Version 0.82-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2d8b87c682039c1698b8c88b67e6bf45a6554efd&hb=83621ce1cf93dd62e22738fec4b3d3026dcc077c&t=zip"
     >Version 0.80-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=db58568d4957782dc85f875be3592b2b951e53a3&hb=d28a3af2b3574c17fb8425d340984fb02fcfb4a5&t=zip" >Version 0.78-ZIP</a></li>
-</li>
-
 </li>
 </ul>
 <h2>Git Repository / Contributing</h2>
@@ -20,6 +20,6 @@ Create/update an issue in the <a href="/mantis/">Yioop! issue tracker</a>
 describing what your patch solves and upload the patch. To contribute
 localizations, you can use the GUI interface in your own
 copy of Yioop! to enter in your localizations. Next locate in the locale
-folder of your Yioop! work directory, the locale tag of the
+folder of your Yioop! work directory the locale tag of the
 language you added translations for. Within this folder is a configure.ini
 file, just make an issue in the issue tracker and upload this file there.</p>
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index 7c050cb..360aa04 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -16,24 +16,23 @@ many web pages quickly, it is useful to have more than one machine when crawling
 the web. If you have several machines at home, simply install the software
 on all the machines you would like to use in a web crawl. In the configuration
 interface give the URL of the machine you would like to serve search results
-from. Start the queue server on that machine and start fetchers on each of the
-other machines.</li>
-<li><b>Be fast and online.</b> The Yioop is "online" in the
+from. Start at least one queue server and as many fetchers as desired on
+the other machines.</li>
+<li><b>Be fast and online.</b> Yioop is "online" in
 that it creates a word index and document ranking as it crawls rather
 than ranking as a separate step. This keeps the processing done by any
 machine as low as possible so you can still use them for what you bought them
-for. Nevertheless, it is reasonably fast: four Lenova Q100 fetchers and
-a 2006 MacMini queue server can crawl and index a million pages every couple
-days. A single 2010 Mac Mini running four fetchers on the same machine
-can also achieve this rate. More fetchers, of course, allows for faster crawls
+for. Nevertheless, it is reasonably fast: a set-up consisting of two Mac Minis,
+each with 8GB RAM, a queue_server, and six fetchers can
+reasonably be expected to crawl around 2 million pages/day.
 </li>
 <li><b>Make it easy to archive crawls.</b> Crawls are stored in timestamped
-folders, which can be moved around zipped, etc. Through the admin interface you
+folders that can be moved around, zipped, etc. Through the admin interface you
 can select amongst crawls which exist in a crawl folder as to which crawl you
 want to serve from.</li>
 <li><b>Make it easy to crawl archives.</b> There are many sources of
 raw web data available today such as files that use the Internet Archive's
 arc format, Open Directory Project RDF data, Wikipedia xml dumps, etc. Yioop!
 can index these formats directly, allowing one to get an index for these
-high-value content sites without needing to do an exhaustive crawl.</li>
+high-value sites without needing to do an exhaustive crawl.</li>
 </ul>