Revising docs for version 0.88, a=chris

Chris Pollett [2012-06-26]
Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index c4a3293..b50a4ff 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -1,6 +1,6 @@
 <h1>About SeekQuarry/Yioop!</h1>
 <p>SeekQuarry is the parent site for <a href="http://www.yioop.com/">Yioop!</a>.
-Both SeekQuarry and Yioop! were written by <a
+Both SeekQuarry and Yioop! were written mainly by me, <a
 href="http://www.cs.sjsu.edu/faculty/pollett">Chris Pollett</a>. The project
 began in Nov. 2009 and had its first publicly available release in August,
 2010.
@@ -54,20 +54,27 @@ with localization: Mary Pollett, Jonathan Ben-David,
 Thanh Bui, Sujata Dongre, Animesh Dutta,
  Youn Kim, Akshat Kukreti, Vijeth Patil, Chao-Hsin Shih,
 and Sugi Widjaja. Thanks to
-Ravi Dhillon Tanmayee Potluri, Shawn Tice, and Sandhya Vissapragada for
-creating patches for Yioop! issues.Several of my master's students have done
-projects related to Yioop!: Amith Chandranna, Priya Gangaraju, Vijaya Pamidi and
-Vijaya Sinha. Amith's code related to an Online version of the HITs algorithm
-is not currently in the main branch of Yioop!, but it is
+Ravi Dhillon, Tanmayee Potluri, Shawn Tice, and Sandhya Vissapragada for
+creating patches for Yioop! issues. Several of my master's students have done
+projects related to Yioop!: Amith Chandranna, Priya Gangaraju, Vijaya Pamidi,
+Vijeth Patil, and Vijaya Sinha. Amith's code related to an online version of
+the HITS algorithm is not currently in the main branch of Yioop!, but it is
 obtainable from <a href="http://www.cs.sjsu.edu/faculty/pollett/masters/<?php
 ?>Semesters/Spring10/amith/index.shtml">Amith Chandranna's student page</a>.
 Vijaya Pamidi developed a Firefox web traffic extension for Yioop!
 Her code is also obtainable from <a href="http://www.cs.sjsu.edu/faculty/<?php
 ?>pollett/masters/Semesters/Fall10/vijaya/index.shtml">Vijaya Pamidi's
 master's pages</a>. <a href="http://www.cs.sjsu.edu/faculty/pollett/<?php
+?>masters/Semesters/Fall11/vijeth/index.shtml">Vijeth Patil's Project</a>
+involved adding support for Twitter and RSS feeds, so that additional
+real-time search results can supplement the standard search results. This is
+not currently in the main branch. <a href="http://www.cs.sjsu.edu/faculty/pollett/<?php
 ?>masters/Semesters/Spring11/amith/index.shtml">Vijaya Sinha's Project</a>
-concerned using Open Street Map data in Yioop!. Priya's code served as the
-basis for the plugin feature currently in Yioop! The following other
+concerned using Open Street Map data in Yioop!. This code is not currently
+in the main branch. Priya's code served as the
+basis for the plugin feature currently in Yioop! Shawn Tice's CS288
+project served as the basis of a rewrite of the archive crawl feature of Yioop!
+for the multi-queue server setting. The following other
 students have created text processors for Yioop!: Nakul Natu (pptx),
 Vijeth Patil (epub), and Tarun Pepira (xlsx).
 </p>
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index e04b34d..f9db41e 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.86</h1>
+<h1>Yioop! Documentation v 0.88</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -7,6 +7,7 @@
         <li><a href="#installation">Installation and Configuration</a></li>
         <li><a href="#files">Summary of Files and Folders</a></li>
         <li><a href="#interface">The Yioop! Search and User Interface</a></li>
+        <li><a href="#mobile">Yioop! Mobile Interface</a></li>
         <li><a href="#passwords">Managing Accounts</a></li>
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
@@ -26,8 +27,9 @@
     <h2 id="intro">Introduction</h2>
     <p>The Yioop! search engine is designed to allow users
     to produce indexes of a web-site or a collection of
-    web-sites whose total number of pages are in the tens or low hundreds
-    of millions. In contrast, a search-engine like Google maintains an index
+    web-sites. A Yioop! index can handle anything from a small site to
+    collections containing tens or hundreds of millions of pages. In contrast,
+    a search-engine like Google maintains an index
     of tens of billions of pages. Nevertheless, since you, the user, have
     control over the exact sites which are being indexed with Yioop!, you have
     much better control over the kinds of results that a search will return.
@@ -69,8 +71,9 @@
     is to use a stand-alone full text index server such as <a
     href="http://www.sphinxsearch.com/">Sphinx</a>. However, for these
     approaches to work the text you are indexing needs to be in a database
-    column or columns. Nevertheless, these approaches illustrate another
-    common thread in the development of search systems: search as an appliance,
+    column or columns, or have an easy-to-define "XML mapping". Nevertheless,
+    these approaches illustrate another
+    common thread in the development of search systems: Search as an appliance,
     where you either have a separate search server and access it through either
     a web-based API or through function calls. Yioop! has both a search
     function API as well as a web API that returns
@@ -169,8 +172,8 @@
     <b>queue servers</b> that perform scheduling and indexing jobs, as well as
     <b>fetcher</b> processes which are responsible for downloading pages.
     Through the name server's web app, users can send messages to the
-    queue_servers and fetchers. This interface writes message
-    files that queue_servers periodically looks for. Fetcher processes
+    queue servers and fetchers. This interface writes message
+    files that queue servers periodically look for. Fetcher processes
     periodically ping the name server to find the name of the current crawl
     as well as a list of queue servers. Fetcher programs then periodically
     make requests in a round-robin fashion to the queue servers for messages
@@ -294,7 +297,7 @@
     Like Nutch and Heritrix, Yioop! also has a command-line tool for quickly
     looking at the contents of such archive objects.
     </p>
-    <p>The <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC
+    <p>The <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC
     format</a> is one example of an archival file format for web data. Besides
     at the Internet Archive, ARC and its successor
     <a href="
@@ -303,7 +306,7 @@
     as <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
     <a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">ClueWeb Dataset</a>.
     In addition, it was used by grub.org (hopefully, only on a
-    temporary hiatus), a distributed, open-source, search engine project in C#.
+    temporary hiatus), a distributed, open-source search engine project in C#.
     Another important format for archiving web pages is the XML format used by
     <a href="http://www.wikipedia.org/">Wikipedia</a> for archiving MediaWiki
     wikis. Wikipedia offers <a
@@ -388,7 +391,7 @@
     <li>Yioop! has a web form that allows a user to control the recrawl
     frequency for a page during a crawl.</li>
     <li>Yioop! has a web form that allows users to specify meta words
-    to be injected into an index based on whether a downloaded document matches
+    to be injected into an index based on whether a downloaded document matches
     a url pattern.</li>
     <li>Yioop! uses a web archive file format which makes it easy to
     copy crawl results amongst different machines. It has a command-line
@@ -414,23 +417,26 @@
     <li>Besides standard output of a web page with ten links it is possible
     to get query results in Open Search RSS format and also to query
     Yioop! data via a function api.</li>
+    <li>Yioop! has been optimized to work well with smart phone web browsers
+    and with tablet devices.</li>
+    <li>Yioop! has built-in support for image- and video-specific search.</li>
     </ul>
     <p><a href="#toc">Return to table of contents</a>.</p>

     <h2 id="requirements">Requirements</h2>
-    <p>The Yioop! search engine requires: (1) a web server, (2) PHP 5.3 or
+    <p>The Yioop! search engine requires: (1) a web server, (2) PHP 5.3 or
     better (Yioop! used only to serve search results from a pre-built index
-    has been tested to work in PHP 5.2), (3) Curl libraries for downloading
-    web pages. To be a little more specific Yioop! has been tested with
-    Apache 2.2 and I've been told Version 0.82 or newer works with lighttpd.
+    has been tested to work in PHP 5.2), (3) Curl libraries for downloading
+    web pages. To be a little more specific, Yioop! has been tested with
+    Apache 2.2 and I've been told Version 0.82 or newer works with lighttpd.
     It should work with other webservers, although it might take some
     finessing. For PHP,
-    you need a build of PHP that incorporates multi-byte string (mb_ prefixed)
-    functions, Curl, Sqlite (or at least PDO with Sqlite driver),
-    the GD graphics library and the command-line interface. If you are using
-    Mac OSX Snow Leopard or Lion, the version of Apache2 and PHP that come
-    with it suffice. For Windows, Mac, and Linux, another easy way to get the
-    required software is to download a Apache/PHP/MySql suite such as
+    you need a build of PHP that incorporates multi-byte string (mb_ prefixed)
+    functions, Curl, Sqlite (or at least PDO with Sqlite driver),
+    the GD graphics library and the command-line interface. If you are using
+    Mac OSX Snow Leopard or Lion, the version of Apache2 and PHP that come
+    with it suffice. For Windows, Mac, and Linux, another easy way to get the
+    required software is to download an Apache/PHP/MySql suite such as
     <a href="http://www.apachefriends.org/en/xampp.html">XAMPP</a>. On Windows
 machines, find the php.ini file under the php folder in your Xampp
     folder and change the line:</p>
@@ -542,7 +548,7 @@ page looks like:
 For this step, as a security precaution, you must connect via localhost. If you
 are in a web hosting environment (for example, if you are using cPanel
 to set up Yioop!) where it is difficult to connect using localhost, you can
-add a file, configs/local_configs.php, with the following content:</p>
+add a file, configs/local_config.php, with the following content:</p>
 <pre>
 &lt;?php
 define('NO_LOCAL_CHECK', 'true');
@@ -600,9 +606,9 @@ in the examples folder at the file search_api.php to see an example
 of how to use it. <b>If you intend to use Yioop!
 in a configuration with multiple queue servers (not fetchers), then
 the RSS check box needs to be checked.</b></p>
-<p>The <b>Database Set-up</b> fieldset is used to specify what database management
-system should be used, how it should be connected to, and what user name
-and password should be used for the connection. At present sqlite2
+<p>The <b>Database Set-up</b> fieldset is used to specify what database
+management system should be used, how it should be connected to, and what
+user name and password should be used for the connection. At present sqlite2
 (called just sqlite), sqlite3, and Mysql databases are supported. The
 database is used to store information about what users are allowed to
 use the admin panel and what activities and roles these users have. Unlike
@@ -624,11 +630,12 @@ changing database information you might have to sign in again.
 you which element and links you would like to have presented on the search
 landing and search results pages. The Word Suggest check box controls whether
 a drop down of word suggestions should be presented by Yioop! when a user
-starts typing in the Search box.The Signin checkbox controls whether to display
-the link to the page for users to sign in to Yioop!  The Cache checkbox toggles
-whether a link to the cache of a search item should be displayed as part
-of each search result. The Similar checkbox toggles
-whether a link to similar search items should be displayed as part
+starts typing in the Search box. The Subsearch checkbox controls whether the
+links for Image and Video search appear in the top bar of Yioop! The Signin
+checkbox controls whether to display the link to the page for users to sign in
+to Yioop! The Cache checkbox toggles whether a link to the cache of a search
+item should be displayed as part of each search result. The Similar checkbox
+toggles whether a link to similar search items should be displayed as part
 of each search result. The Inlinks checkbox toggles
 whether a link for inlinks to a search item should be displayed as part
 of each search result. Finally, the IP address checkbox toggles
@@ -747,11 +754,11 @@ not strictly necessary as the database should be creatable via the admin panel;
 however, it can be useful if the database isn't working for some reason.
 Also, in the configs folder is the file default_crawl.ini. This file is
 copied to WORK_DIRECTORY after you set this folder in the admin/configure panel.
-There it is renamed as crawl.ini and serves as the initial set of sites to crawl
+There it is renamed as crawl.ini and serves as the initial set of sites to crawl
 until you decide to change these. The file token_tool.php is a tool which can
 be used to help in term extraction during crawls and for making tries
 which can be used for word suggestions for a locale. To help word extraction
-this tool can generate in a locale folder (see below) a word gram bloom filter.
+this tool can generate in a locale folder (see below) a word gram bloom filter.
 Word grams are sequences of words that should be treated as a unit, for example,
 Honda Accord. token_tool.php can use either a raw Wikipedia page count dump
 file, or an actual Wikipedia dump file to extract from titles or redirects
@@ -904,7 +911,7 @@ to say what it has just crawled, the web app writes data into these
 folders to be processed later by the queue_server. The UNIX_TIMESTAMP
 is used to keep track of which crawl the data is destined for. IndexData
 folders contain mini-inverted indexes (word document records) which are
-to be added to the global inverted index (called the dictionary) for that crawl.
+to be added to the global inverted index (called the dictionary) for that crawl.
 RobotData folders contain information that came from robots.txt files.
 Finally, ScheduleData folders have data about found urls that could
 eventually be scheduled to crawl. Within each of these three kinds of folders
@@ -984,6 +991,33 @@ on the page are used to locate the page via a search query. This
 can be viewed as an "SEO" view of the page.</p>
 <img src='resources/CacheSEO.png' alt='Example Cache SEO Results'
 width="70%"/>
+<p>In addition to a straightforward web search, one can also do image and
+video search by clicking on the Images or Video link in the top bar
+of Yioop search pages. Below are some examples of what these look like
+for a search on "Obama":</p>
+<img src='resources/ImageSearch.png' alt='Example Image Search Results'
+width="70%"/>
+<img src='resources/VideoSearch.png' alt='Example Video Search Results'
+width="70%"/>
+<p>When Yioop! crawls a page it adds one of the following meta
+words to the page: media:text, media:image, or media:video. A usual
+web search just takes the search terms provided to perform a search.
+An Images or Video search tacks media:image or media:video onto the search
+terms. Detection of images is done via mimetype at initial page download
+time. At this time a thumbnail is generated. When search results are presented
+it is this cached thumbnail that is shown, so image search does not leak
+information to third party sites. On any search results page with images,
+Yioop! tries to group the images into a thumbnail strip. This is true of
+both normal and image search result pages. In the case of image search result
+pages, except for not-yet-downloaded pages, this results in almost all of
+the results being the thumbnail strip. Video page detection is not done
+through mimetype as popular sites like YouTube, Vimeo, and others vary in
+how they use Flash or video tags to embed video on a web page. Yioop!
+uses the format of the URL from particular web sites to guess if the page
+contains a video or not. To get a thumbnail for the video it uses the
+API of the particular site in question. <b>This could leak information to third
+party sites about your search.</b>
+</p>
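+<p>For example (hypothetical queries, given only to illustrate the meta
+words just described), the following two searches would restrict results
+for a search on "obama" to images and to videos, respectively:</p>
+<pre>
+obama media:image
+obama media:video
+</pre>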
 <p>A basic query to the Yioop! search form is typically a sequence of
 words separated by whitespace. This will cause Yioop! to compute a
 "conjunctive query", it will look up only those documents which contain all of
@@ -1071,9 +1105,9 @@ is useful for checking if a particular page is in the index.
 whose language can be determined to match the given language tag.
 For example, <i>lang:en-US</i>.</li>
 <li><b>media:kind</b> returns summaries of all documents found
-of the given media kind. Currently, the text and images are the two
+of the given media kind. Currently, text, image, and video are the three
 supported media kinds. So one can add to the
-search terms <em>media:images</em> to get only image results matching
+search terms <em>media:image</em> to get only image results matching
 the query keywords.</li>
 <li><b>mix:name</b> or <b>m:name</b> tells Yioop! to use the crawl mix "name"
 when computing the results of the query. The section on mixing crawl indexes has
@@ -1084,6 +1118,15 @@ the spaces with plusses, <i>m:cool+mix</i>.</li>
 returns summaries of all documents which were last modified on the given date.
 For example, <i>modified:2010-02</i> returns all documents which were last
 modified in February, 2010.</li>
+<li><b>no:some_command</b> is used to tell Yioop! not to perform some
+default transformation of the search terms. For example, <i>no:guess</i>
+tells Yioop! not to try to guess the semantics of the search before
+doing the search. This would mean, for instance, that Yioop! would not
+rewrite the query <i>yahoo.com</i> into <i>site:yahoo.com</i>.
+<i>no:network</i> tells Yioop! to only return search results from the
+current machine and not to send the query to all machines in the Yioop!
+instance. <i>no:cache</i> says to recompute the query and not to make
+use of memcache or file cache.</li>
 <li><b>numlinks:some_number</b> returns summaries of all documents
 which had some_number of outgoing links. For example, numlinks:5.</li>
 <li><b>os:operating_system</b>  returns summaries of all documents
@@ -1093,6 +1136,14 @@ served on servers using the given operating system. For example,
 whose path component begins with path_component_of_url. For example,
 path:/phpBB would return all documents whose path started with phpBB,
 path:/robots.txt would return summaries for all robots.txt files.</li>
+<li><b>robot:user_agent_name</b> returns robots.txt pages that contained
+that user_agent_name (after lower casing). For example, <i>robot:yioopbot</i>
+would return all robots.txt pages explicitly having a rule for YioopBot.</li>
+<li><b>safe:boolean_value</b> is used to provide "safe" or "unsafe"
+search results. Yioop! has a crude, "hand-tuned", linear classifier for
+whether a site contains pornographic content. If one adds safe:true to
+a search, only those pages found which were deemed non-pornographic will
+be returned. Adding safe:false has the opposite effect.</li>
 <li><b>server:web_server_name</b> returns summaries of all documents
 served on that kind of web server. For example, <i>server:apache</i>.</li>
 <li><b>site:url</b>, <b>site:host</b>, or <b>site:domain</b> returns all of
@@ -1171,8 +1222,31 @@ Currently, for the root account, the Activity element looks like:
 <img src='resources/ActivityElement.png' alt='The Activity Element'/>
 <p>
 Over the next several sections we will discuss each of the Yioop! admin
-activities in turn.
+activities in turn. Before we do that, we make a couple of remarks about using
+Yioop! from a mobile device.
 </p>
+    <h2 id='mobile'>Yioop! Mobile Interface</h2>
+    <p>Yioop!'s user interface is designed to display reasonably well as-is
+    on tablet devices such as the iPad. For smart phones, such as
+    iPhone, Android, Blackberry, or Windows Phone, Yioop! has a separate
+    user interface. For search, settings, and login, this looks fairly
+    similar to the non-mobile user interface:</p>
+<img src='resources/MobileSearch.png' alt='Mobile Search Landing Page'
+    style="width:280px; height:280px"/>
+<img src='resources/MobileSettings.png' alt='Mobile Settings Page'
+    style="width:280px;height:280px"/>
+<img src='resources/MobileSignin.png' alt='Mobile Admin Panel Login'
+    style="width:280px;height:280px"/>
+    <p>For Admin pages, each activity is controlled in an analogous fashion
+    to the non-mobile setting, but the Activity element has been replaced
+    with a drop-down:</p>
+<img src='resources/MobileAdmin.png' alt='Example Mobile Admin Activity'
+    style="width:280px;height:280px"/>
+    <p>We now resume our discussion of how to use each of the Yioop! admin
+    activities for the default, non-mobile setting, simply noting that
+    except for the above minor changes, these instructions will also apply to
+    the mobile setting.
+    </p>
     <h2 id='passwords'>Managing Accounts</h2>
     <p>By default, when a user first signs in to the Yioop! admin
     panel the current activity is the Manage Account activity. For now,
@@ -1415,7 +1489,8 @@ php fetcher.php start 5
     </p>
     <p>
     The format for sites, domains, and urls are the same for each of these
-    textareas, except that the Seed site area can only take urls and in
+    textareas, except that the Seed site area can only take urls (or urls
+    and title/descriptions) and in
     the Disallowed Sites/Sites with Quotas one can give a url
     followed by #. Otherwise,
     in this common format, there should be one site, url, or domain per
@@ -1436,10 +1511,12 @@ php fetcher.php start 5
     Such a site includes https://www.somewhere.com/foo/anything_more .
     Yioop! also recognizes * and $ within urls. So http://my.site.com/*/*/
     would match http://my.site.com/subdir1/subdir2/rest and
-    http://my.site.com/*/*/$ would require the last symbold in the url
+    http://my.site.com/*/*/$ would require the last symbol in the url
     to be '/'. This kind of pattern matching can be useful in the
-    Allowed To Crawl Sites area to restrict the depth of a crawl to
-    within a url to a certain fixed depth.</p>
+    Disallowed Sites area to restrict the depth of a crawl within a url
+    to a certain fixed depth: you can allow crawling a site, but disallow
+    the downloading of pages with more than a certain number of '/' in
+    them.</p>
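+    <p>For instance (a hypothetical set-up), one might put
+    http://my.site.com/ in the Allowed To Crawl Sites area and add the
+    pattern below to the Disallowed Sites area, so that pages three or more
+    directory levels deep are never downloaded:</p>
+    <pre>
+http://my.site.com/*/*/*/
+    </pre>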
     <p>In the Disallowed Sites/Sites with Quotas, a number after a # sign
     indicates that at most that many
     pages should be downloaded from that site in any given hour. For example,
@@ -1449,6 +1526,18 @@ php fetcher.php start 5
     </pre>
     <p>indicates that at most 100 pages are to be downloaded from
     http://www.ucanbuyart.com/ per hour.</p>
+    <p>In the seed site area one can specify title and page descriptions
+    for pages that Yioop! would otherwise be forbidden to crawl by the
+    robots.txt file. For example,</p>
+    <pre>
+http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
+    </pre>
+    <p>tells Yioop! to generate a placeholder page for
+    http://www.facebook.com/ with title "Facebook" and description
+    "A famous social media site" rather than to attempt to download
+    the page. The <a href="#editor">Results Editor</a> activity can only
+    be used to affect pages which are in a Yioop! index. This technique
+    allows one to add arbitrary pages to the index.</p>
     <p>When configuring a new instance of Yioop! the file default_crawl.ini
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
     for the Options form. </p>
@@ -1495,7 +1584,8 @@ php fetcher.php start 5
     You might want to re-crawl an existing Yioop! crawl if you want to add
     new meta-words or if you are migrating a crawl from an older version
     of Yioop! for which the index isn't readable by your newer version of
-    Yioop! You might want to do an archive crawl of other file formats
+    Yioop! (You can even re-crawl a re-crawl if you want.)
+    You might want to do an archive crawl of other file formats
     if you want Yioop! to be able to provide search results of their content.
     Once you have selected the archive you want to crawl, you can add meta
     words as discussed in the previous section and then save your options
@@ -1505,32 +1595,40 @@ php fetcher.php start 5
     an archive that was made with several fetchers, each of the fetchers
     that was used in the creation process should be running.</p>
     <p>To get Yioop to detect arc, MediaWiki, and ODP RDF files you need
-    to create an PROFILE_DIR/cache/IndexData(timestamp) folder on the queue
-    server machine containing the single file arc_description.txt. This
-    text file's contents should just be the name you would like for your
-    data. In the Archive Crawl drop-down this name will appear with the
+    to create a PROFILE_DIR/cache/archives folder on the name
+    server machine. Yioop! checks subfolders of this for
+    files with the name arc_description.ini. For example, to do a Wikimedia
+    archive crawl, one could make a subfolder
+    PROFILE_DIR/cache/archives/my_wiki_media_files and put in it a
+    file arc_description.ini in the format to be discussed in a moment.
+    The arc_description.ini file's contents are used to give a description
+    for the archive crawl that will be displayed in the archive drop-down
+    as well as specify the kind of archives the folder contains. An
+    example arc_description.ini might look like:</p>
+    <pre>
+arc_type = 'MediaWikiArchiveBundle';
+description = 'English Wikipedia 2012';
+    </pre>
+    <p>In the Archive Crawl drop-down the description will appear with the
     prefix ARCFILE:: and you can then select it as the source to crawl.
-    To actually crawl anything though for each fetcher machine that you would
-    like to take part in the archive crawl, you should make a folder
-    PROFILE_DIR/cache/Archive(same_timestamp). In this folder you should
-    have a text file arc_type.txt saying what kind of archive bundle this is.
-    If you want to archive crawl arc files you would have a single line:</p>
+    Currently, there are three supported arc_types. For folders containing
+    files in Internet Archive arc format one can use:
     <pre>
 ArcArchiveBundle
     </pre>
-    <p>For Media Wiki xml, the line would be:</p>
+    <p>For Media Wiki xml, one uses the arc_type:</p>
     <pre>
 MediaWikiArchiveBundle
     </pre>
-    <p>And for Open Directory RDF, the line would be:</p>
+    <p>And for Open Directory RDF, the arc_type would be:</p>
     <pre>
 OdpRdfArchiveBundle
     </pre>
-    <p>Then in this folder (not a subdirectory thereof) you would also put
-    instances of the files in question that you would like to archive crawl.
-    So for arc files, these would be files of extension .arc.gz; for MediaWiki,
-    files of extension .xml.bz2; and for ODP-RDF, files of extension .rdf.u8.gz
-    .
+    <p>In addition to the arc_description.ini file, remember that the subfolder
+    should also contain instances of the files in question that you would like
+    to archive crawl. So for arc files, these would be files of extension
+    .arc.gz; for MediaWiki, files of extension .xml.bz2;
+    and for ODP-RDF, files of extension .rdf.u8.gz.
     </p>
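+    <p>Putting this together, the subfolder from the MediaWiki example
+    above might contain the following (the dump file names here are
+    made up):</p>
+    <pre>
+PROFILE_DIR/cache/archives/my_wiki_media_files/
+    arc_description.ini
+    enwiki-part1.xml.bz2
+    enwiki-part2.xml.bz2
+    </pre>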

     <p><a href="#toc">Return to table of contents</a>.</p>
@@ -1630,12 +1728,18 @@ OdpRdfArchiveBundle
     can select it and click load to get it to display in the url editing
     form. The purpose of the url editing form is to allow a user to change
     the title and description for a url that appears on a search results
-    page. It does not affect whether the page is looked up for a given query,
-    only its final appearance. By filling out the three fields of the
+    page. By filling out the three fields of the
     url editing form, or by loading values into them through the previous form
-    and changing them, then clicking save updates the appearance of the summary
-    for that url. To return to using the default summary, one only fills
+    and changing them, and then clicking save, you update the appearance of
+    the summary for that url. To return to using the default summary, one only fills
     out the url field, leaves the other two blank, and saves.
+    This form does not affect whether the page is looked up for a given query,
+    only its final appearance. It can only be used to edit the appearance
+    of pages which appear in the index, not to add pages to the index. Also,
+    the edit will affect the appearance of that page for all indexes managed
+    by Yioop! If you know there is a page that won't be crawled by
+    Yioop!, but would like it to appear in an index, please look at the crawl
+    options section of the <a href="#crawls">Manage Crawls</a> documentation.
     </p>
     <p>To understand the filter websites form, recall the disallowed sites
     crawl option allows a user to specify they
@@ -1758,14 +1862,14 @@ OdpRdfArchiveBundle
     If you make a set of translations, be sure to submit the form associated
     with this table by scrolling to the bottom of the page and clicking the
     Submit link. This saves your translations; otherwise, your work will be
-    lost if you navigate away from this page. One aid to translating is if you
-    hover your mouse over a field that needs translation, then its translation
+    lost if you navigate away from this page. One aid to translating: if you
+    hover your mouse over a field that needs translation, its translation
     in the default locale (usually English) is displayed. If you want to find
     where in the source code a string id comes from the ids follow
     the rough convention file_name_approximate_english_translation.
     So you would expect to find admin_controller_login_successful
     in the file controllers/admin_controller.php . String ids with the
-    prefix db_ (such as the names of activities) are stored in the database.
+    prefix db_ (such as the names of activities) are stored in the database.
     So you cannot find these ids in the source code. The tooltip trick
     mentioned above does not work for database string ids.</p>

@@ -1786,7 +1890,7 @@ OdpRdfArchiveBundle
     Algorithm [<a href="#P1980">P1980</a>]. This stemmer is located in the
     file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php .
     The [<a href="#P1980">P1980</a>] link
-    points to a site that has source code for stemmers for many other languages
+    points to a site that has source code for stemmers for many other languages
    (unfortunately, not written in PHP). It would not be hard to port these
    to PHP and then modify the tokenizer.php file of the
     appropriate locale folder. For instance, one
@@ -1838,7 +1942,8 @@ OdpRdfArchiveBundle
     token_tool.php is run from the command line as:
     </p>
     <pre>
-    php token_tool.php filter wiki_file lang locale n extract_type max_to_extract
+    php token_tool.php filter wiki_file lang locale n extract_type <?php
+    ?>max_to_extract
     </pre>
     <p>
    where wiki_file is a wikipedia xml file or a bz2 compressed xml file whose
@@ -2103,12 +2208,14 @@ OdpRdfArchiveBundle
     other kinds of queries: Related sites queries and cache look-up queries.
     The related query format is:</p>
 <pre>
-    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;a=related&amp;arg=URL
+    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;<?php
+    ?>a=related&amp;arg=URL
 </pre>
     <p>where URL is the url that you are looking up related URLs for. To do a
     look up of the Yioop! cache of a web page the url format is:</p>
 <pre>
-    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;q=QUERY&amp;a=cache&amp;arg=URL
+    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;<?php
+    ?>q=QUERY&amp;a=cache&amp;arg=URL
 </pre>
     <p>Here the terms listed in QUERY will be styled in different colors in the
     web page that is returned; URL is the url of the web page you want to look
@@ -2128,7 +2235,8 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     &lt;channel&gt;
         &lt;title&gt;PHP Search Engine - Yioop! : art&lt;/title&gt;
         &lt;language&gt;en-US&lt;/language&gt;
-        &lt;link&gt;http://localhost/git/yioop/?f=rss&amp;amp;q=art&amp;amp;its=1317152828&lt;/link&gt;
+        &lt;link&gt;http://localhost/git/yioop/?f=rss&amp;amp;q=art&amp;<?php
+    ?>amp;its=1317152828&lt;/link&gt;
         &lt;description&gt;Search results for: art&lt;/description&gt;
         &lt;opensearch:totalResults&gt;1105&lt;/opensearch:totalResults&gt;
         &lt;opensearch:startIndex&gt;0&lt;/opensearch:startIndex&gt;
@@ -2325,7 +2433,8 @@ php arc_tool.php info bundle_name //return info about
 //documents stored in archive.

 php arc_tool.php list //returns a list
-//of all the archives in the Yioop! crawl directory.
+//of all the archives in the Yioop! crawl directory, including
+//non-Yioop! archives in the cache/archives sub-folder.

 php arc_tool.php mergetiers bundle_name max_tier
 //merges tiers of word dictionary into one tier up to max_tier
@@ -2335,6 +2444,7 @@ php arc_tool.php reindex bundle_name

 php arc_tool.php show bundle_name start num //outputs
 //items start through num from bundle_name
+//or name of non-Yioop archive crawl folder.
    </pre>
    <p>The bundle name can be a full path name, a relative path from
    the current directory, or it can be just the bundle directory's file
@@ -2343,13 +2453,22 @@ php arc_tool.php show bundle_name start num //outputs
    They are not all from the same session:</p>
    <pre>
 |chris-polletts-macbook-pro:bin:108&gt;php arc_tool.php list
-Archive1191586964
-IndexData1191586964
+Found Yioop Archives:
+=====================
+0-Archive1334468745
+0-Archive1336527560
+IndexData1334468745
+IndexData1336527560
+
+Found Non-Yioop Archives:
+=========================
+english-wikipedia2012
 chris-polletts-macbook-pro:bin:109&gt;

 ...

-|chris-polletts-macbook-pro:bin:158&gt;php arc_tool.php info /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731
+|chris-polletts-macbook-pro:bin:158&gt;php arc_tool.php info <?php
+?>/Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731

 Bundle Name: IndexData1293767731
 Bundle Type: IndexArchiveBundle
@@ -2372,7 +2491,8 @@ Meta Words:
    http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/

 |chris-polletts-macbook-pro:bin:159&gt;
-|chris-polletts-macbook-pro:bin:202&gt;php arc_tool.php show /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3
+|chris-polletts-macbook-pro:bin:202&gt;php arc_tool.php show <?php
+?>/Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3

 BEGIN ITEM, LENGTH:21098
 [URL]
@@ -2394,7 +2514,8 @@ ASCII
    &lt;/pre&gt;
 ...

-|chris-polletts-macbook-pro:bin:117&gt;php arc_tool.php reindex IndexData1317414152
+|chris-polletts-macbook-pro:bin:117&gt;php arc_tool.php reindex <?php
+?>IndexData1317414152

 Shard 0
 [Sat, 01 Oct 2011 11:05:17 -0700] Adding shard data to dictionary files...
@@ -2429,9 +2550,11 @@ The following shows how one could do a query on "Chris Pollett":

 ============
 TITLE: ECCC - Pointers to
-URL: http://eccc.hpi-web.de/static/pointers/personal_www_home_pages_of_complexity_theorists/
+URL: http://eccc.hpi-web.de/static/pointers/<?php
+?>personal_www_home_pages_of_complexity_theorists/
 IPs: 141.89.225.3
-DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located
+DESCRIPTION: Homepage of the Electronic Colloquium on Computational <?php
+?>Complexity located
 at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of
 complexity people 2011 2010 2009 2011...1994 POINTE
 Rank: 3.9551158411
@@ -2442,9 +2565,11 @@ Score: 4.14

 ============
 TITLE: ECCC - Pointers to
-URL: http://www.eccc.uni-trier.de/static/pointers/personal_www_home_pages_of_complexity_theorists/
+URL: http://www.eccc.uni-trier.de/static/pointers/<?php
+?>personal_www_home_pages_of_complexity_theorists/
 IPs: 141.89.225.3
-DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located
+DESCRIPTION: Homepage of the Electronic Colloquium on Computational <?php
+?>Complexity located
 at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of
 complexity people 2011 2010 2009 2011...1994 POINTE
 Rank: 3.886318974
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index f7fbb82..a1c4548 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,10 +2,10 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=1be2b50b8436998ce8d2d41f5db3b470610aa817&hb=6fc863b1aaf26d8a0abf49a2aad9c7ce440ea307&t=zip"
+    >Version 0.88-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2bb7f54c7f52d4eebf605430400088de1c0505cf&hb=876e9b0380d96d975d55cbcf11fbfd1ad03a6278&t=zip"
     >Version 0.861-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2941365a784374115a3c64af686ce12c17d28cd8&hb=7d142c2546364229e4796f6a77cdda66285b6ed5&t=zip"
-    >Version 0.84-ZIP</a></li>
 </ul>
 <h2>Git Repository / Contributing</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index e115b39..5dd23e4 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -23,8 +23,8 @@ that it creates a word index and document ranking as it crawls rather
 than ranking as a separate step. This keeps the processing done by any
 machine as low as possible so you can still use them for what you bought them
 for. Nevertheless, it is reasonably fast: A test set-up consisting of three
-Mac Mini's each with 8GB RAM, a queue_server, and six fetchers was able
-to crawl 100 million pages in 5 weeks.
+Mac Minis, each with 8GB RAM, a queue_server, and five fetchers adds
+100 million pages to its index every four weeks.
 </li>
 <li><b>Make it easy to archive crawls.</b> Crawls are stored in timestamped
 folders that can be moved around, zipped, etc. Through the admin interface you