Updated docs for version 0.94

Chris Pollett [2013-04-05 03:Apr:th]
Updated docs for version 0.94
Filename
en-US/pages/coding.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
en-US/pages/install.thtml
diff --git a/en-US/pages/coding.thtml b/en-US/pages/coding.thtml
index 0b8c4b1..288239f 100755
--- a/en-US/pages/coding.thtml
+++ b/en-US/pages/coding.thtml
@@ -101,7 +101,7 @@
         "http://en.wikipedia.org/wiki/Byte_order_mark">byte order mark</a>.</li>
     <li>All non-binary files in Yioop should follow the convention of using
     four spaces for tabs (rather than tab characters). Further,
-    all lines should be less than  or equal to80 columns in length.
+    all lines should be less than or equal to 80 columns in length.
     Lines should not end with
     trailing white-space characters. It is recommended to use an
     editor which can display white-space characters and which can display
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index ea476d5..d574864 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,9 +1,10 @@
 <div class="docs">
-<h1>Yioop Documentation v 0.92</h1>
+<h1>Yioop Documentation v 0.94</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#quick">Preface: Quick Start Guides</a></li>
         <li><a href="#intro">Introduction</a></li>
+        <li><a href="#features">Feature List</a></li>
         <li><a href="#requirements">Requirements</a></li>
         <li><a href="#installation">Installation and Configuration</a></li>
         <li><a href="#files">Summary of Files and Folders</a></li>
@@ -13,7 +14,7 @@
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
-        <li><a href="#page-options">Options for Pages that are Indexed</a></li>
+        <li><a href="#page-options">Page Indexing and Search Options</a></li>
         <li><a href="#editor">Results Editor</a></li>
         <li><a href="#sources">Search Sources</a></li>
         <li><a href="#machines">GUI for Managing Machines and Servers</a></li>
@@ -103,7 +104,7 @@
     engine problem. On each machine there would, in addition, be several
     search related processes, and for crawling, hundreds of simultaneous
     threads would be active to manage open connections to remote machines.
-    Without threading downloading millions of pages would be very slow.
+    Without threading, downloading millions of pages would be very slow.
     Yioop is written in <a href="http://www.php.net/">PHP</a>. This
     language is the `P' in the very popular
     <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29">LAMP</a>
@@ -116,7 +117,7 @@
     As the problem of managing many machines becomes more difficult as
     the number of machines grows, Yioop further has a web interface for
     turning on and off the processes related to crawling on remote machines
-    managed by Yioop</p>
+    managed by Yioop.</p>
     <p>There are several aspects of a search engine besides
     downloading web pages that benefit from
     a distributed computational model. One of the reasons Google was able
@@ -331,8 +332,10 @@
     large scale useful data sets that can be easily licensed. Raw data dumps
     do not contain indexes of the data though. This makes sense because indexing
     technology is constantly improving and it is always possible to re-index
-    old data. Yioop supports importing and indexing data from ARC,
-    MediaWiki XML dumps, and Open Directory RDF. It also
+    old data. Yioop supports importing and indexing data from ARC, WARC,
+    database queries results, MediaWiki XML dumps, and Open Directory RDF.
+    Yioop further has a generic text importer which can be used to index
+    log records, mail, Usenet posts, etc. Yioop also
     supports re-indexing of old Yioop data files created after version 0.66,
     and indexing crawl mixes. This means using Yioop
     you can have searchable access to many data sets as well as have the
@@ -350,26 +353,29 @@
     hourly basis.</p>
     <p>
     This concludes the discussion of how Yioop fits into the current and
-    historical landscape of search engines and indexes. Here is short summary
-    features of Yioop that should make sense after and be taken away from
-    this introduction:
+    historical landscape of search engines and indexes.
+    </p>
+    <h2 id="features">Feature List</h2>
+    <p>
+    Here is a short summary of the features of Yioop:
     </p>
     <ul>
     <li>Yioop is an open-source, distributed crawler and search engine
     written in PHP.</li>
     <li>It is capable of crawling and indexing small sites to sites or
-    collections of sites containing ten million or low hundred of millions
+    collections of sites containing low hundreds of millions
     of documents.</li>
     <li>On a given machine it uses multi-curl to support many simultaneous
     downloads of pages.</li>
     <li>It has a web interface to select seed sites for crawls and to set what
     sites crawls should not be crawled.</li>
-    <li>It obeys robots.txt file including Google and Bing extensions such
+    <li>It obeys robots.txt files including Google and Bing extensions such
     as the Crawl-delay and Sitemap directives as well as * and $ in allow and
-    disallow. It further supports robots meta tag NONE, NOINDEX, NOFOLLOW,
-    NOARCHIVE, and NOSNIPPET and anchor tags with rel="nofollow"
+    disallow. It further supports the robots meta tag directives
+    NONE, NOINDEX, NOFOLLOW, NOARCHIVE, and NOSNIPPET. It also
+    supports anchor tags with rel="nofollow"
     attributes. It also supports X-Robots-Tag HTTP headers.</li>
-    <li>Yioop supports crawl quotas for web sites. i.e., one can control
+    <li>Yioop supports crawl quotas for web sites. I.e., one can control
     the number of urls/hour downloaded from a site.</li>
     <li>Yioop can detect website congestion and slow down crawling
     a site that it detects as congested.</li>
@@ -383,23 +389,24 @@
     interface into the active crawl.</li>
     <li>It has its own DNS caching mechanism.</li>
     <li>Yioop supports the indexing of many different filetypes including:
-    HTML, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps,
-    SVG, XLSX, and XML. It has a web interface for controlling which amongst
-    these filetypes (or all of them) you want to index.</li>
+    HTML, Atom, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF,
+    sitemaps, SVG, XLSX, and XML. It has a web interface for controlling which
+    amongst these filetypes (or all of them) you want to index. It also
+    supports attempting to extract information from unknown filetypes.</li>
     <li>Yioop supports subsearches geared towards presenting certain
     kinds of media such as images, video, and news. The list of video and
-    news sites can be configured through the GUI. News sites are updated
-    hourly.</li>
+    news sites can be configured through the GUI. Yioop has a news_updater
+    process which can be used to automatically update news feeds hourly.</li>
     <li>Crawling, indexing, and serving search results can be done on a
     single machine or distributed across several machines.</li>
     <li>The fetcher/queue_server processes on several machines can be
     managed through the web interface of a main Yioop instance.</li>
-    <li>Yioop installations can be screated with a variety of topologies:
+    <li>Yioop installations can be created with a variety of topologies:
     one queue_server and many fetchers or several queue_servers and
     many fetchers.</li>
-    <li>It determines search results using a number of iterators which
+    <li>Yioop determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
-    <li>Yioop can be configured to display word suggestion as a user
+    <li>Yioop can be configured to display word suggestions as a user
     types a query. It can also suggest spell corrections for mis-typed
     queries</li>
     <li>Since version 0.70, Yioop indexes are positional rather than
@@ -416,9 +423,8 @@
     post-processing.</li>
     <li>Yioop has a web form that allows a user to control the recrawl
     frequency for a page during a crawl.</li>
-    <li>Yioop has a web form that allows users to specify meta words
-    to be injected into an index based on whether a downloaded document matches
-    a url pattern.</li>
+    <li>Yioop has a simple page rule language for controlling what content
+    should be extracted from a page or record.</li>
     <li>Yioop uses a web archive file format which makes it easy to
     copy crawl results amongst different machines. It has a command-line
     tool for inspecting these archives if they need to examined
@@ -433,9 +439,12 @@
     <li>A given Yioop installation might have several saved crawls and
     it is very quick to switch between any of them and immediately start
     doing text searches.</li>
-    <li>Yioop supports importing data from ARC, MediaWiki XML, and ODP
-    RDF files, it also supports re-indexing of data from WebArchives created
-    since version 0.66.</li>
+    <li>Yioop supports importing data from ARC, WARC, database queries,
+    MediaWiki XML, and ODP RDF files. It has a generic importing facility
+    to import text records such as access logs, mail logs, Usenet posts, etc.,
+    which are either uncompressed or compressed
+    using gzip or bzip2. It also supports re-indexing of data from WebArchives
+    created since version 0.66.</li>
     <li>Yioop comes with its own extendable model-view-controller
     framework that you can use directly to create new sites that use
     Yioop search technology. This framework also comes with a GUI
@@ -460,7 +469,7 @@
     you need a build of PHP that incorporates multi-byte string (mb_ prefixed)
     functions, Curl, Sqlite (or at least PDO with Sqlite driver),
     the GD graphics library and the command-line interface. If you are using
-    Mac OSX Snow Leopard or Lion, the version of Apache2 and PHP that come
+    Mac OSX Snow Leopard or newer, the version of Apache2 and PHP that come
     with it suffice. For Windows, Mac, and Linux, another easy way to get the
     required software is to download a Apache/PHP/MySql suite such as
     <a href="http://www.apachefriends.org/en/xampp.html">XAMPP</a>. On Windows
@@ -474,15 +483,14 @@ to
 extension=php_curl.dll
 </pre>
 <p>
-you will also want to increase the value of post_max_size from:
+The php.ini file has a post_max_size setting which you might want to
+increase, for example, to:
 </p>
 <pre>
-post_max_size = 8M
-to
 post_max_size = 32M
 </pre>
 <p>Yioop will work with the post_max_size set to as little as two
-megabytes byte will be faster with the larger post capacity.</p>
+megabytes, but will be faster with the larger post capacity.</p>
 <p>If you are using WAMP, similar changes
 as with XAMPP must be made, but be aware that WAMP has two php.ini
 files and both of these must be changed.</p>
@@ -510,38 +518,8 @@ files and both of these must be changed.</p>
     Yioop to manage. If it is not configured then these task would need
     to be done via the command line. <b>Also, if you do not use
     the Manage Machine interface your Yioop site can make use of only one
-    queue_server.</b> On OSX and Linux, Manage Machines
-    needs to be able to schedule "at" batch jobs (type man at to find out
-    more about these). On OSX to enable
-    this ability, you might need to type:</p>
-<pre>
-sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.atrun.plist
-</pre>
-    <p>On a Linux machine, "at" will typically be enabled, however, you
-    might need to give your web server access to schedule "at" jobs. To do
-    this, you should check that the web server user is not in the file
-    /etc/at.deny . On Ubuntu Linux, Apache by default runs as www-data.
-    On OSX it runs as _www, but by default the at.deny file is not set up
-    so you probably don't need to edit it. If you are using XAMPP on either
-    of these platforms you need to ensure that Apache is not running as
-    nobody. Edit the $XAMPP/etc/httpd.conf file and set the User and Group
-    to a real user.</p>
-    <p>Some versions of Linux like Centos, have the web-server user (apache
-    for Centos) configured with noshell as the shell and make use of
-    SELinux to provide mandatory access control. Both of these can prevetn
-    at jobs from being scheduled by the web server. You can use
-    the command <tt>usermod -s /bin/sh apache</tt> to set the shell and edit
-    the SELinux domain of the web server to fix these issues in this case.</p>
-    <p>To get Manage Machines to work on a PC you need to first install
-    PsTools from Microsoft.<br />
-<a href="http://technet.microsoft.com/en-us/sysinternals/bb896649">
-http://technet.microsoft.com/en-us/sysinternals/bb896649</a>.<br />
-    Depending on how your machine is configured this can be a security risk, so
-    do some research before deciding if you really want to do this. After
-    installing PsTools you next need to edit your Environment Variables
-    and add both the path to psexec and php to your PATH variable. You can
-    find the place to set these vairables, by clicking on the Start Menu,
-    then Control Panel, System and Security, Advanced Systems and Settings.</p>
+    queue_server.</b></p>
+
     <p>As a final step, after installing the necessary software,
     <b>make sure to start/restart your web server and verify that
     it is running.</b></p>
@@ -552,14 +530,16 @@ http://technet.microsoft.com/en-us/sysinternals/bb896649</a>.<br />
     500MB. These  values are set near the tops of each of these files in turn
     with a line like:</p>
 <pre>
-ini_set("memory_limit","1400M");
+ini_set("memory_limit","1600M");
 </pre>
     <p>
-    If you want to reduce these memory requirements, it is advisable to also
-    reduce the values for some variables in the configs/config.php file.
-    For instance, one might reduce the values of NUM_DOCS_PER_GENERATION,
-    SEEN_URLS_BEFORE_UPDATE_SCHEDULER, NUM_URLS_QUEUE_RAM,
-    MAX_FETCH_SIZE, and URL_FILTER_SIZE. Experimenting with these values
+    In a VM setting, these requirements are often somewhat steep. It is possible
+    to get Yioop to work in environments like EC2 (be aware this might
+    violate your service agreement).
+    To reduce these memory requirements, one can manually adjust the variables
+    NUM_DOCS_PER_GENERATION, SEEN_URLS_BEFORE_UPDATE_SCHEDULER,
+    NUM_URLS_QUEUE_RAM, MAX_FETCH_SIZE, and URL_FILTER_SIZE in the
+    configs/config.php file. Experimenting with these values
     you should be able to trade-off memory requirements for speed.
     </p>
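+    <p>As a rough illustration (the constant names come from the discussion
+    above, but the values below are purely hypothetical and the exact syntax
+    and defaults in configs/config.php may differ), a reduced-memory set-up
+    might use smaller values such as:</p>
<pre>
+/* hypothetical reduced values -- check configs/config.php for the
+   actual syntax and defaults before editing */
+define('NUM_DOCS_PER_GENERATION', 10000);
+define('SEEN_URLS_BEFORE_UPDATE_SCHEDULER', 1000);
+define('NUM_URLS_QUEUE_RAM', 100000);
+define('MAX_FETCH_SIZE', 1000);
+define('URL_FILTER_SIZE', 1000000);
+</pre>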
     <p><a href="#toc">Return to table of contents</a>.</p>
@@ -619,8 +599,25 @@ allows you to configure the debug, search access,
 database, queue server, and robot settings. It will look
 something like:
 </p>
-<img src='resources/ConfigureScreenForm2.png' alt='The configure form'/>
-<p>The <b>Debug Display</b> fieldset has three check boxes: Error Info, Query
+<img src='resources/ConfigureScreenForm2.png' alt='Basic configure form'/>
+<p>These settings suffice if you are only doing single machine crawling.
+The <b>Crawl Robot Set-up</b> fieldset is used
+to provide websites that you crawl with information about who is crawling them.
+The field Crawl Robot Name is used to give the name of your robot. You should
+choose a common name for all of the fetchers in your set-up, but the name
+should be unique to your web-site. It is bad form to pretend to be someone
+else's robot, for example, the googlebot. As Yioop crawls, it sends each
+web-site it crawls a User-Agent string; this string contains the url back to
+the bot.php file in the Yioop folder. bot.php is supposed to provide a detailed
+description of your robot. The contents of the Robot Description textarea are
+supposed to provide this description and are inserted between &lt;body&gt;
+&lt;/body&gt; tags on the bot.php page.
+</p>
+<p>You might need to click Toggle Advance Settings if you are doing Yioop
+development or if you are crawling in a multi-machine setting. The
+advanced settings look like:</p>
+<img src='resources/ConfigureScreenForm3.png' alt='Advanced configure form'/>
+<p>The <b>Debug Display</b> fieldset has three checkboxes: Error Info, Query
 Info, and Test Info. Checking Error Info will mean that when the Yioop
 web app runs, any PHP Errors, Warnings, or Notices will be displayed
 on web pages. This is useful if you need to do debugging, but should not
@@ -632,7 +629,7 @@ systems library classes if the browser is navigated to
 http://YIOOP_INSTALLATION/tests/. None of these debug settings should
 be checked in a production environment.
 </p>
-<p>The <b>Search Access</b> fieldset has three check boxes:
+<p>The <b>Search Access</b> fieldset has three checkboxes:
 Web, RSS, and API. These control whether a user can use the
 web interface to get query results, whether RSS responses to queries
 are permitted, or whether or not the function based search API is
@@ -644,10 +641,12 @@ section <a href="#embedding">Embedding Yioop</a>, you can also look
 in the examples folder at the file search_api.php to see an example
 of how to use it. <b>If you intend to use Yioop
 in a configuration with multiple queue servers (not fetchers), then
-the RSS check box needs to be checked.</b></p>
+the RSS checkbox needs to be checked.</b></p>
 <p>The <b>Database Set-up</b> fieldset is used to specify what database
 management system should be used, how it should be connected to, and what
-user name and password should be used for the connection. At present sqlite2
+user name and password should be used for the connection. At present
+<a href="http://www.php.net/manual/en/intro.pdo.php">PDO</a>
+(PHP's generic DBMS interface),  sqlite2
 (called just sqlite), sqlite3, and Mysql databases are supported. The
 database is used to store information about what users are allowed to
 use the admin panel and what activities and roles these users have. Unlike
@@ -655,7 +654,17 @@ many database systems, if
 an sqlite or sqlite3 database is being used then the connection is always
 a file on the current filesystem and there is no notion of login
 and password, so in this case only the name of the database is asked for.
-For sqlite, the database is stored in WORK_DIRECTORY/data. When switching
+For sqlite, the database is stored in WORK_DIRECTORY/data.</p>
+<p>If you would like to use a different DBMS than Sqlite or Mysql, then
+the easiest way is to select PDO as the Database System and for the
+Hostname give the DSN with the appropriate DBMS driver.
+For example, for Postgres one might have something like:</p>
+<pre>
+pgsql:host=localhost;port=5432;dbname=testdb;user=bruce;password=mypass
+</pre>
+<p>You can put the username and password either in the DSN or in the Username
+and Password fields.</p>
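+<p>For instance, the Postgres example above could equivalently be entered as
+the Hostname</p>
+<pre>
+pgsql:host=localhost;port=5432;dbname=testdb
+</pre>
+<p>with bruce and mypass placed in the Username and Password fields instead.</p>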
+<p>When switching
 database information, Yioop checks first if a usable database with the user
 supplied data exists. If it does, then it uses it; otherwise, it tries to
 create a new database. Yioop comes with a small sqlite demo database in the
@@ -665,26 +674,7 @@ which has privileges on all activities. Since different databases associated
 with a Yioop installation might have different user accounts set-up after
 changing database information you might have to sign in again.
 </p>
-<p>The <b>Search Page Elements and Links</b> fieldset is used to tell
-you which element and links you would like to have presented on the search
-landing and search results pages. The Word Suggest check box controls whether
-a dropdown of word suggestions should be presented by Yioop when a user
-starts typing in the Search box. The Subsearch checkbox controls whether the
-links for Image, Video, and News search appear in the top bar of Yioop
-You can actually configure what these links are in the
-<a href="#sources">Search Sources</a>
-activity. The checkbox here is a global setting for displaying them or
-not. In addition, if this is unchecks then the hourly activity of
-downloading any RSS media sources for the News subsearch will be turned
-off. The Signin  checkbox controls whether to display the link to the page
-for users to sign in  to Yioop  The Cache checkbox toggles whether a link to
-the cache of a search item should be displayed as part of each search result.
-The Similar checkbox toggles whether a link to similar search items should be
-displayed as part of each search result. The Inlinks checkbox toggles
-whether a link for inlinks to a search item should be displayed as part
-of each search result. Finally, the IP address checkbox toggles
-whether a link for pages with the same ip address should be displayed as part
-of each search result.</p>
+

 <p>The <b>Name Server Set-up</b> fieldset is used to tell Yioop which machine
 is going to act as a name server during a crawl and what secret string
@@ -721,19 +711,7 @@ Filecache box, tells Yioop to cache search query results in temporary files.
 Memcached probably gives a better performance boost than Filecaching, but
 not all hosting environments have Memcached available.
 </p>
-<p>
-The last fieldset is the <b>Crawl Robot Set-up</b> fieldset. This is used
-to provide websites that you crawl with information about who is crawling them.
-The field Crawl Robot Name is used to say the name of your robot. You should
-choose a common name for all of the fetchers in your set-up, but the name
-should be unique to your web-site. It is bad form to pretend to be someone
-else's robot, for example, the googlebot. As Yioop crawls it sends the web-site
-it crawls a User-Agent string, this string contains the url back to the bot.php
-file in the Yioop folder. bot.php is supposed to provide a detailed description
-of your robot. The contents of textarea Robot Description is supposed to
-provide this description and is inserted between &lt;body&gt; &lt;/body&gt;
-tags on the bot.php page.
-</p>
+

 <p>
 After filling in all the fieldsets and submitting the form,
@@ -782,11 +760,13 @@ the Yioop folder's various sub-folders contain:
 <dt>bin</dt><dd>This folder is intended to hold command-line scripts
 which are used in conjunction with Yioop. In addition to the fetcher.php
 and queue_server.php script already mentioned, it contains arc_tool.php,
-code_tool.php, mirror.php, and query_tool.php. arc_tool.php can be used to
-examine the contents of WebArchiveBundle's and IndexArchiveBundle's from the
-command line. code_tool.php is for use by developers to maintain the Yioop
-code-base in various ways. mirror.php can be used if you would like to create
-a mirror/copy of a Yioop installation. Finally, query_tool.php can be used to
+code_tool.php, mirror.php, news_updater.php and query_tool.php. arc_tool.php
+can be used to examine the contents of WebArchiveBundle's and
+IndexArchiveBundle's from the command line. code_tool.php is for use by
+developers to maintain the Yioop code-base in various ways. mirror.php can be
+used if you would like to create a mirror/copy of a Yioop installation.
+news_updater.php can be used to do hourly updates of news feed search sources
+in Yioop. Finally, query_tool.php can be used to
 run queries from the command-line.</dd>
 <dt>configs</dt><dd>This folder contains configuration files. You will
 probably not need to edit any of these files directly as you can set the most
@@ -889,7 +869,8 @@ than one table or across serveral files. The models folder has
 within it a datasources folder. A datasource is an abstraction layer
 for the particular filesystem and database system that is being used
 by a Yioop installation. At present, datasources have been defined
-for sqlite, sqlite3, and mysql databases.</dd>
+for PDO (PHP's generic DBMS interface), sqlite, sqlite3, and mysql databases.
+</dd>
 <dt>resources</dt><dd>Used to store binary resources such as graphics, video,
 or audio. For now, just stores the Yioop logo.</dd>
 <dt>scripts</dt><dd>This folder contains the Javascript files used by Yioop.
@@ -1088,7 +1069,7 @@ width="70%"/>
 <img src='resources/NewsSearch.png' alt='Example News Search Results'
 width="70%"/>
 <p>When Yioop crawls a page it adds one of the following meta
-words to the page media:text, media:image, or media:video. RSS feed
+words to the page: media:text, media:image, or media:video. RSS (or Atom) feed
 sources that have been added to Media Sources under the <a href="#sources"
 >Search Sources</a>
 activity are downloaded from each hour. Each RSS item on such a downloaded
@@ -1127,7 +1108,7 @@ one would go to results for all Yahoo News articles. This is equivalent
 to doing a search on: media:news:Yahoo+News . If one clicks on the News
 subsearch, not having specified a query yet, then all stored
 news items in the current language will be displayed, roughly ranked by
-recentness. If one has RSS media sources of which are set to be from
+recentness. If one has RSS media sources which are set to be from
 different locales, then this will be taken into account on this blank query
 News page.</p>
 <p>Turning now to the topic of how to enter a query in Yioop:
@@ -1552,8 +1533,8 @@ php fetcher.php start 5
     <h3>Specifying Crawl Options and Modifying Options of the Active Crawl</h3>
     <p>As we pointed out above, next to the Start Crawl button is an Options
     link. Clicking on this link, let's you set various aspect of how
-    the next crawl should be conducted. As we mentioned before, if there is
-    a currently processing crawl there will be an options link under its stop
+    the next crawl should be conducted. If there is
+    a currently processing crawl, there will be an options link under its stop
     button. Both of these links lead to similar pages, however, for an active
     crawl fewer parameters can be changed. So we will only describe the first
     link. We do mention here though that under the active crawl options page
@@ -1566,14 +1547,16 @@ php fetcher.php start 5
     <p>There are two kinds of crawls that can be performed by Yioop
     either a crawl of sites on the web or a crawl of data that has been
     previously stored in a supported archive format such as data that was
-    crawled by Versions 0.66 and above of Yioop,
+    crawled by Versions 0.66 and above of Yioop, data coming from a database
+    or text archive via Yioop's importing methods described below,
     <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">Internet
-    Archive arc file</a>,
+    Archive ARC file</a>, <a href="http://archive-access.sourceforge.net/warc/"
+    >ISO WARC Files</a>,
     <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"
-    >MediaWiki xml dump</a>, and
+    >MediaWiki xml dump</a>,
     <a href="http://rdf.dmoz.org/"
-    >Open Directory Project RDF file</a>. We will first concentrate on
-    new web crawls and then return to archive crawls later.</p>
+    >Open Directory Project RDF file</a>. In the next subsection, we describe
+    new web crawls and then return to archive crawls in the subsection after
+    that.</p>
     <h4>Web Crawl Options</h4>
     <p>
     On the web crawl tab, the first form field, "Get Crawl Options From",
@@ -1662,37 +1645,7 @@ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
     <p>When configuring a new instance of Yioop the file default_crawl.ini
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
     for the Options form. </p>
-    <p>The next part of the Edit Crawl Options form allows you to create
-    user-defined "meta-words". In Yioop terminology, a meta-word is a word
-    which wasn't in a downloaded document, but which is added to the
-    inverted-index as if it had been in the document. The addition of
-    user-defined meta-words is specified by giving a pattern matching rule
-    based on the url. Unlike the sites field, for these fields we allow more
-    general regular expressions .For instance, in the figure above, the word
-    column has buyart and the url pattern column has:
-    <pre>
-    http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/
-    </pre>
-    When a url matches the pattern, a word is added in the inverted index
-    corresponding to the meta-word for that document. So when the page
-    <pre>
-    http://www.ucanbuyart.com/artistproducts/baitken/0/6/
-    </pre>
-    is crawled, the word u:buyart:artistproducts:baitkin:0:6 will be associated
-    with the document. Meta-words are useful to create shorthands for
-    searches on certains kinds of sites like dictionary sites, and wikis.
-    </p>
-    <p>The last part of the Edit Crawl Options form allows you to select which
-    indexing plugins you would like to use during the crawl. For instance,
-    clicking the RecipePlugin checkbox would cause Yioop to run the code
-    in indexing_plugins/recipe_plugin.php. This code tries to detect pages
-    which are food recipes and separately extracts these recipes and clusters
-    them by ingredient. The extract recipe pages is done by the pageProcessing
-    callback in the RecipePlugin class of recipe_plugin.php; the clustering
-    is done in RecipePlugin's postProcessing method. The first method is
-    called by Yioop for each active plugin on each page downloaded. The second
-    method is called during the stop crawl process of Yioop
-    </p>
+
     <h4 id="archive-crawl">Archive Crawl Options</h4>
     <p>We now consider how to do crawls of previously obtained archives.
     From the initial crawl options screen, clicking on the Archive Crawl
@@ -1704,13 +1657,17 @@ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
     </p>These include both previously done Yioop crawls, previously
     down recrawls (prefixed with RECRAWL::), Yioop Crawl Mixes (prefixed with
     MIX::), and crawls
-    of other file formats such as: arc, MediaWiki XML, and ODP RDF, which
+    of other file formats such as: arc, warc, database data,
+    MediaWiki XML, and ODP RDF, which
     have been appropriately prepared in the PROFILE_DIR/cache folder
-    (prefixed with ARCFILE::).
+    (prefixed with ARCFILE::). In addition, Yioop also has a generic text file
+    archive importer (also prefixed with ARCFILE::).</p>
+    <p>
     You might want to re-crawl an existing Yioop crawl if you want to add
-    new meta-words or if you are migrating a crawl from an older version
-    of Yioop for which the index isn't readable by your newer version of
-    Yioop. For similar reasons, you
+    new meta-words or new cache page links, extract fields in a different
+    manner, or if you are migrating a crawl
+    from an older version of Yioop for which the index isn't readable by
+    your newer version of Yioop. For similar reasons, you
     might want to recrawl a previously re-crawled crawl. When you
     archive crawl a crawl mix, Yioop does a search on the keyword
     <tt>site:any</tt> using the crawl mix in question. The results are then
@@ -1721,14 +1678,14 @@ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
     You might want to do an archive crawl of other file formats
     if you want Yioop to be able to provide search results of their content.
     Once you have selected the archive you want to crawl, you can add meta
-    words as discussed in the previous section and then save your options.
-    Afterwards, you go back to the Create Crawl screen to start your crawl.
+    words as discussed in the Crawl Time Tab Page Rule portion of the
+    <a href="#page-options">Page Options</a> section.
+    Afterwards, go back to the Create Crawl screen to start your crawl.
     As with a Web Crawl, for an archive crawl you need both the queue_server
     running and a least one fetcher running to perform a crawl.</p>
-    <p>To re-crawl
-    a previously created web archive that was made using several fetchers,
-    each of the fetchers that was used in the creation process should be
-    running. This is because the data used in the recrawl will come locally
+    <p>To re-crawl a previously created web archive that was made using several
+    fetchers, each of the fetchers that was used in the creation process should
+    be running. This is because the data used in the recrawl will come locally
     from the machine of that fetcher. For other kinds of archive crawls and mix
     crawls, which fetchers one uses, doesn't matter because archive crawl data
     comes through the name server. You might also notice that the number of
@@ -1738,37 +1695,148 @@ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
     data sent to appropriate queue_servers but was not yet processed by
     these queue servers. So it was waiting in a schedules folder to be
     processed in the event the crawl was resumed.</p>
-    <p>To get Yioop to detect arc, MediaWiki, and ODP RDF files you need
-    to create an PROFILE_DIR/cache/archives folder on the name
-    server machine. Yioop checks subfolders of this for
+    <p>To get Yioop to detect arc, database data,
+    MediaWiki, ODP RDF, or generic text
+    archive files, you need to create a PROFILE_DIR/cache/archives folder on the
+    name server machine. Yioop checks subfolders of this for
     files with the name arc_description.ini. For example, to do a Wikimedia
     archive crawl, one could make a subfolder
     PROFILE_DIR/cache/archives/my_wiki_media_files and put in it a
     file arc_description.ini in the format to be discussed in a moment.
-    The arc_description.ini file's contents are used to give a description
-    for the archive crawl that will be displayed in the archive dropdown
-    as well as specify the kind of archives the folder contains. An
-    example arc_description.ini might look like:</p>
+    In addition to the arc_description.ini, you would also put in this
+    folder all the archive files (or links to them) that you would like to
+    index. When indexing, Yioop will process each archive file in turn.
+    Returning to the arc_description.ini file, its contents
+    are used to give a description
+    of the archive crawl that will be displayed in the archive dropdown
+    as well as to specify the kind of archives the folder contains and how to
+    extract them. An example arc_description.ini might look like:</p>
     <pre>
 arc_type = 'MediaWikiArchiveBundle';
 description = 'English Wikipedia 2012';
     </pre>
     <p>In the Archive Crawl dropdown the description will appear with the
     prefix ARCFILE:: and you can then select it as the source to crawl.
-    Currently, there are three supported arc_types. For folders containing
-    file in Internet Archive arc format one can use:</p>
+    Currently, the supported arc_types are: ArcArchiveBundle,
+    DatabaseBundle, MediaWikiArchiveBundle,
+    OdpRdfArchiveBundle, TextArchiveBundle, and WarcArchiveBundle.
+    For the ArcArchiveBundle, OdpRdfArchiveBundle, MediaWikiArchiveBundle,
+    and WarcArchiveBundle arc_types, generally a two line arc_description.ini
+    file like above suffices. We now describe how to import from the
+    other kinds of formats in a little more detail. In general, the
+    arc_description.ini will tell Yioop how to get string items (in an
+    associative array with a minimal amount of additional information) from the
+    archive in question. Processing on these string items can then be controlled
+    using Page Rules, described in the <a href="#page-options">Page Options</a>
+    section.
+    </p>
+    <p>An example arc_description.ini where the arc_type is DatabaseBundle
+    might be:</p>
+    <pre>
+arc_type = 'DatabaseBundle';
+description = 'DB Records';
+dbms = "mysql";
+db_host = "localhost";
+db_name = "MYGREATDB";
+db_user = "someone";
+db_password = "secret";
+encoding = "UTF-8";
+sql = "SELECT MYCOL1, MYCOL2 FROM MYTABLE1 M1, MYTABLE2 M2 WHERE M1.FOO=M2.BAR";
+field_value_separator = '|';
+column_separator = '##';
+    </pre>
+    <p>Possible values for <i>dbms</i> are pdo, mysql, sqlite, sqlite3.
+    If pdo is chosen, then db_host should be a <a
+    href="http://www.php.net/manual/en/pdo.connections.php">PHP DSN</a>
+    specifying which DBMS driver to use. db_name is the name of the database
+    you would like to connect to, db_user is the database username,
+    db_password is the password for that user, and encoding is the
+    character set of rows that the database query will return.</p>
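+    <p>For example (a hypothetical variant of the file above), to read the
+    same kind of records out of a Postgres database using the pdo driver, one
+    might change the dbms and db_host lines to something like:</p>
+    <pre>
+dbms = "pdo";
+db_host = "pgsql:host=localhost;port=5432;dbname=MYGREATDB";
+    </pre>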
+    <p>The sql variable is used to give a query whose result
+    rows will be the items indexed by Yioop. Yioop indexes string "pages",
+    so to make these rows into a string each column result will be
+    made into a string: <i>field field_value_separator value</i>. Here
+    <i>field</i> is the name of the column, <i>value</i> is the value for that
+    column in the given result row. Columns are concatenated together
+    separated by the value of column_separator. The resulting string is
+    then sent to Yioop's TextProcessor page processor.</p>
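+    <p>To illustrate (with hypothetical row data), using the separators from
+    the example file above, a result row with MYCOL1 equal to 'dog' and MYCOL2
+    equal to 'cat' would become a string like:</p>
+    <pre>
+MYCOL1|dog##MYCOL2|cat
+    </pre>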
+    <p>We next give a few examples of arc_description.ini files
+    where the arc_type is TextArchiveBundle. First, suppose we wanted
+    to index access log file records that look like:</p>
+    <pre>
+127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST /git/yioop2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; YioopBot; +http://localhost/git/yioop/bot.php)"
+    </pre>
+    <p>Here each record is delimited by a newline and the character encoding is
+    UTF-8. The records are stored in
+    files with the extension .log and these files are uncompressed. We then
+    might use the following arc_description.ini file:</p>
+    <pre>
+arc_type = 'TextArchiveBundle';
+description = 'Log Files';
+compression = 'plain';
+file_extension = 'log';
+end_delimiter = "\n";
+encoding = "UTF-8";
+    </pre>
+    <p>In addition to compression = 'plain', Yioop supports gzip and bzip2.
+    The end_delimiter is a regular expression indicating how to know when
+    a record ends. To process a TextArchiveBundle Yioop needs either
+    an end_delimiter or a start_delimiter (or both) to be specified. As another
+    example, for a mail.log file with entries of the form:</p>
     <pre>
-ArcArchiveBundle
+From pollett@mathcs.sjsu.edu Wed Aug  7 10:59:04 2002 -0700
+Date: Wed, 7 Aug 2002 10:59:04 -0700 (PDT)
+From: Chris Pollett &lt;pollett@mathcs.sjsu.edu&gt;
+X-Sender: pollett@eniac.cs.sjsu.edu
+To: John Doe &lt;johndoe@mail.com&gt;
+Subject: Re: a message
+In-Reply-To: &lt;5.1.0.14.0.20020723093456.00ac9c00@mail.com&gt;
+Message-ID: &lt;Pine.GSO.4.05.10208071057420.9463-100000@eniac.cs.sjsu.edu&gt;
+MIME-Version: 1.0
+Content-Type: TEXT/PLAIN; charset=US-ASCII
+Status: O
+X-Status:
+X-Keywords:
+X-UID: 17
+
+Hi John,
+
+I got your mail.
+
+Chris
     </pre>
-    <p>For Media Wiki xml, one uses the arc_type:</p>
+    <p>The following might be used:</p>
     <pre>
-MediaWikiArchiveBundle
+arc_type = 'TextArchiveBundle';
+description = 'Mail Logs';
+compression = 'plain';
+file_extension = 'log';
+start_delimiter = "\n\nFrom\s";
+encoding = "ASCII";
     </pre>
-    <p>And for Open Directory RDF, the arc_type would be:</p>
+    <p>Notice here we are splitting records using a start delimiter. Also,
+    we have chosen ASCII as the character encoding. As a final example,
+    we show how to import tar gzip files of Usenet records as found,
+    in the <a
+    href="http://archive.org/details/utzoo-wiseman-usenet-archive"
+    >UTzoo Usenet Archive 1981-1991</a>. Further discussion on how to
+    process this collection is given in the Page Options section.</p>
     <pre>
-OdpRdfArchiveBundle
+arc_type = 'TextArchiveBundle';
+description = 'Utzoo Usenet Archive';
+compression = 'gzip';
+file_extension = 'tgz';
+start_delimiter = "\0\0\0\0Path:";
+end_delimiter = "\n\0\0\0\0";
+encoding = "ASCII";
     </pre>
-    <p>In addition, to the arc_description.ini file, remember that the subfolder
+    <p>Notice in the above we set the compression to be gzip. Then we have
+    Yioop act on the raw tar file. In tar files, content objects
+    are separated by long paddings of null's. Usenet posts begin with
+    Path, so to keep things simple we grab records which begin with
+    a sequence of null's followed by Path and end with another sequence of
+    null's.</p>
+    <p>As a final reminder for this section, remember that, in addition
+    to the arc_description.ini file, the subfolder
     should also contain instances of the files in question that you would like
     to archive crawl. So for arc files, these would be files of extension
     .arc.gz; for MediaWiki, files of extension .xml.bz2;
@@ -1838,14 +1906,32 @@ OdpRdfArchiveBundle
     be clicked.
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-    <h2 id='page-options'>Options for Pages that are Indexed</h2>
-    <p>Several properties about how web pages are indexed can be controlled
-    by clicking on Page Options. This leads to a form which looks like:</p>
-<img src='resources/PageOptions.png' alt='The Page Options form'/>
-    <p>The Byte Range to Download dropdown controls how many bytes out of
+    <h2 id='page-options'>Page Indexing and Search Options</h2>
+    <p>Several properties about how web pages are indexed and
+    how pages are looked up at search time can be controlled
+    by clicking on Page Options. There are three tabs for this activity:
+    Crawl Time, Search Time, and Test Options. We will discuss each of these
+    in turn.</p>
+    <h3>Crawl Time Tab</h3>
+    <p>Clicking on Page Options leads to the default Crawl Time Tab:</p>
+<img src='resources/PageOptionsCrawl.png' alt='The Page Options Crawl form'/>
+    <p>This tab controls some aspects about how a page is processed and indexed
+    at crawl time. The form elements before Page Field Extraction Rules
+    are relatively straightforward and we will discuss these briefly
+    below. The Page Rules textarea allows you to specify additional commands
+    for how you would like text to be extracted from a page document summary.
+    The description of this language will take the remainder of this
+    subsection.
+    </p>
+    <p>The Get Options From dropdown allows one to load in
+    crawl time options that were used in a previous crawl. Beneath this,
+    the Byte Range to Download dropdown controls how many bytes out of
     any given web page should be downloaded. Smaller numbers reduce the
     requirements on disk space needed for a crawl; bigger numbers would
-    tend to improve the search results. The next dropdown,
+    tend to improve the search results. The Cache whole crawled pages
+    checkbox says whether, when crawling, to keep both the
+    whole downloaded web page as well as the summary extracted from the
+    web page (checked) or just to keep the page summary (unchecked).
+    The next dropdown,
     Allow Page Recrawl After, controls how many days that Yioop keeps
     track of all the URLs that it has downloaded from. For instance, if one
     sets this dropdown to 7, then after seven days Yioop will clear its
@@ -1857,15 +1943,273 @@ OdpRdfArchiveBundle
     get a fresher version of page it already has, this also has the benefit
     of speeding up longer crawls as Yioop doesn't need to check as many
     Bloom filter files. In particular, it might just use one and keep it in
-    memory. The Page File Types to Crawl checkboxes allow you to decide
-    which file extensions you want Yioop to download during a crawl. Finally,
-    the Title Weight, Description Weight, Link Weight field are used by
-    Yioop to decide how to weight each portion of a document when it returns
-    query results to you. The Save button of course saves any changes you
+    memory.</p>
+    <p>The Page File Types to Crawl checkboxes allow you to decide
+    which file extensions you want Yioop to download during a crawl.
+    This check is done before any download is attempted, so Yioop at that
+    point can only guess the <a href="http://en.wikipedia.org/wiki/MIME">MIME
+    Type</a>, as it hasn't received this information from the server yet.
+    An example of a url with a file extension is:</p>
+    <pre>
+    http://flickr.com/humans.txt
+    </pre>
+    <p>
+    which has the extension txt. So if txt is unchecked, then Yioop won't
+    try to download this page even though Yioop can process plain text files.
+    A url like:</p>
+    <pre>
+    http://flickr.com/
+    </pre>
+    <p>
+    has no file extension and will be assumed to have an html extension.
+    To crawl sites which have a file extension that is not in the above list,
+    check the unknown checkbox in the upper left of this list.
+    </p>
+    <p>
+    The indexing plugins checkboxes allow you to select which plugins
+    to use during the crawl. For instance,
+    clicking the RecipePlugin checkbox would cause Yioop to run the code
+    in indexing_plugins/recipe_plugin.php. This code tries to detect pages
+    which are food recipes and separately extracts these recipes and clusters
+    them by ingredient. The extraction of recipe pages is done by the
+    pageProcessing callback in the RecipePlugin class of recipe_plugin.php; the
+    clustering
+    is done in RecipePlugin's postProcessing method. The first method is
+    called by Yioop for each active plugin on each page downloaded. The second
+    method is called during the stop crawl process of Yioop.
+    </p>
+    <p>We now return to the Page Field Extraction Rules textarea. Commands
+    in this area allow a user to control what data is extracted from
+    a summary of a page. The textarea allows you to do things like modify the
+    summary, title, and other fields extracted from a page summary;
+    extract new meta words from a summary; and add links
+    which will appear when a cache of a page is shown. Page Rules are
+    especially useful for extracting data from generic text archives and
+    database archives. How to import such archives is described in the
+    Archive Crawls sub-section of <a href="#crawls">Managing Crawls</a>.
+    The input to the page rule processor is an associative array that results
+    from Yioop doing initial processing on a page. To see what this array looks
+    like one can take a web page and paste it into the form on the Test Options
+    tab.  There are two types of page rule statements that a user can define:
+    command statements and assignment statements. In addition, a semicolon
+    ';' can be used to indicate the rest of a line is a comment. Although
+    the initial textarea for rules might appear small, most modern
+    browsers allow one to resize this area by dragging on the
+    lower right hand corner of the area. This makes it relatively easy
+    to see large sets of rules.
+    </p>
+    <p>
+    A command statement takes a key field argument for the page associative
+    array and does a function call to manipulate that page. Right now the
+    supported commands are to unset that field value; to add the field and
+    field value to the META_WORD array for the page; and to split the field on
+    commas, view this as a search keywords => link text association, and add
+    this to the KEYWORD_LINKS array. The last command can be used to add a link
+    to a keyword search on cached pages in Yioop's index. These three commands
+    have the
+    syntax:</p>
+    <pre>
+    unset(field)
+    addMetaWords(field)
+    addKeywordLink(field)
+    </pre>
+    <p>
+    Page rule assignments can either be straight assignments with '=' or
+    concatenation assignments with '.='. Let $page indicate the associative
+    array that Yioop supplies to the page rule processor.
+    There are three kinds of values that one can assign:
+    </p>
+    <pre>
+    field = some_other_field ; sets $page['field'] = $page['some_other_field']
+    field = "some_string" ; sets $page['field'] to "some string"
+    field = /some_regex/replacement_where_dollar_vars_allowed/
+    ; computes the results of replacing matches to some_regex in $page['field']
+    ; with replacement_where_dollar_vars_allowed
+    </pre>
+    <p>For each of the above assignments we could have used ".=" instead of "=".
+    We next give a simple example and a more complicated example of page rules
+    and the context in which they were used:
+    </p>
+    <p>In the first example, we just want to extract meaningful titles for mail
+    log records that were read in using a TextArchiveBundleIterator. Here
+    after initial page processing a whole email would end up in the
+    DESCRIPTION field of the $page associative array given to the page
+    rule processor. So we use the following two rules:</p>
+    <pre>
+    TITLE = DESCRIPTION
+    TITLE = /(.|\n|\Z)*?Subject:[\t ](.+?)\n(.|\n|\Z)*/$2/
+    </pre>
+    <p>We initially set the TITLE to be the whole record, then use
+    a regex to extract out the correct portion of the subject line.
+    The pattern between the first two slashes matches the whole record, and
+    the pattern inside the second pair of parentheses (.+?) matches the subject
+    text. The $2 in the replacement portion says to replace the value of TITLE
+    with just this matched subject text.</p>
+    <p>The next example was used to do a quick first
+    pass processing of records from the <a href="
+    http://archive.org/details/utzoo-wiseman-usenet-archive">UTzoo Archive
+    of Usenet Posts from 1981-1991</a>. What each block does is
+    described in the comments below.</p>
+    <pre>
+    ;
+    ; Set the UI_FLAGS variable. This variable in a summary controls
+    ; which of the header elements should appear on cache pages.
+    ; UI_FLAGS should be set to a string with a comma separated list
+    ; of the options one wants. In this case, we use: yioop_nav, says that
+    ; we do want to display header; version, says that we want to display
+    ; when a cache item was crawled by Yioop; and summaries, says to display
+    ; the toggle extracted summaries link and associated summary data.
+    ; Other possible UI_FLAGS are history, whether to display the history
+    ; dropdown to other cached versions of item; highlight, whether search
+    ; keywords should be highlighted in cached items
+    ;
+    UI_FLAGS = "yioop_nav,version,summaries"
+    ;
+    ; Use Post Subject line for title
+    ;
+    TITLE = DESCRIPTION
+    TITLE = /(.|\n)*?Subject:([^\n]+)\n(.|\n)*/$2/
+    ;
+    ; Add a link with a blank keyword search so cache pages have
+    ; link back to yioop
+    ;
+    link_yioop = ",Yioop"
+    addKeywordLink(link_yioop)
+    unset(link_yioop) ;using unset so don't have link_yioop in final summary
+    ;
+    ; Extract y-M and y-M-j dates as meta word u:date:y-M and u:date:y-M-j
+    ;
+    date = DESCRIPTION
+    date = /(.|\n)*?Date:([^\n]+)\n(.|\n)*/$2/
+    date = /.*,\s*(\d*)-(\w*)-(\d*)\s*.*/$3-$2-$1/
+    addMetaWord(date)
+    date = /(\d*)-(\w*)-.*/$1-$2/
+    addMetaWord(date)
+    ;
+    ; Add a link to articles containing u:date:y-M meta word. The link text
+    ; is Date:y-M
+    ;
+    link_date = "u:date:"
+    link_date .= date
+    link_date .= ",Date:"
+    link_date .= date
+    addKeywordLink(link_date)
+    ;
+    ; Add u:date:y meta-word
+    ;
+    date = /(\d*)-.*/$1/
+    addMetaWord(date)
+    ;
+    ; Get the first three words of subject ignoring re: separated by underscores
+    ;
+    subject = TITLE
+    subject = /(\s*(RE:|re:|rE:|Re:)\s*)?(.*)/$3/
+    subject_word1 = subject
+    subject_word1 = /\s*([^\s]*).*/$1/
+    subject_word2 = subject
+    subject_word2 = /\s*([^\s]*)\s*([^\s]*).*/$2/
+    subject_word3 = subject
+    subject_word3 = /\s*([^\s]*)\s*([^\s]*)\s*([^\s]*).*/$3/
+    subject = subject_word1
+    unset(subject_word1)
+    subject .= "_"
+    subject .= subject_word2
+    unset(subject_word2)
+    subject .= "_"
+    subject .= subject_word3
+    unset(subject_word3)
+    ;
+    ; Get the first newsgroup listed in the Newsgroup: line, add a meta-word
+    ; u:newsgroups:this-newsgroup. Add a link to cache page for a search
+    ; on this meta word
+    ;
+    newsgroups = DESCRIPTION
+    newsgroups = /(.|\n)*?Newsgroups:([^\n]+)\n(.|\n)*/$2/
+    newsgroups = /\s*((\w|\.)+).*/$1/
+    addMetaWord(newsgroups)
+    link_news = "u:newsgroups:"
+    link_news .= newsgroups
+    link_news .= ",Newsgroup: "
+    link_news .= newsgroups
+    addKeywordLink(link_news)
+    unset(link_news)
+    ;
+    ; Makes a thread meta u:thread:newsgroup-three-words-from-subject.
+    ; Adds a link to cache page to search on this meta word
+    ;
+    thread = newsgroups
+    thread .= ":"
+    thread .= subject
+    addMetaWord(thread)
+    unset(newsgroups)
+    link_thread = "u:thread:"
+    link_thread .= thread
+    link_thread .= ",Current Thread"
+    addKeywordLink(link_thread)
+    unset(subject)
+    unset(thread)
+    unset(link_thread)
+    </pre>
+    <h3>Search Time Tab</h3>
+<p>The Page Options Search Time tab looks like:</p>
+<img src='resources/PageOptionsSearch.png' alt='The Page Options Search form'/>
+<p>The Search Page Elements and Links control group is used to tell
+which elements and links you would like to have presented on the search
+landing and search results pages. The Word Suggest checkbox controls whether
+a dropdown of word suggestions should be presented by Yioop when a user
+starts typing in the Search box. The Subsearch checkbox controls whether the
+links for Image, Video, and News search appear in the top bar of Yioop.
+You can actually configure what these links are in the
+<a href="#sources">Search Sources</a>
+activity. The checkbox here is a global setting for displaying them or
+not. In addition, if this is unchecked then the hourly activity of
+downloading any RSS media sources for the News subsearch will be turned
+off. The Signin checkbox controls whether to display the link to the page
+for users to sign in to Yioop. The Cache checkbox toggles whether a link to
+the cache of a search item should be displayed as part of each search result.
+The Similar checkbox toggles whether a link to similar search items should be
+displayed as part of each search result. The Inlinks checkbox toggles
+whether a link for inlinks to a search item should be displayed as part
+of each search result. Finally, the IP address checkbox toggles
+whether a link for pages with the same ip address should be displayed as part
+of each search result.</p>
+<p>The Search Ranking Factors group of controls:
+    Title Weight, Description Weight, and Link Weight, are used by
+    Yioop to decide how to weigh each portion of a document when it returns
+    query results to you.
+</p>
+<p>When Yioop ranks search results it searches out in its posting
+list until it finds a certain number of qualifying documents. It then
+sorts these by their score, returning usually the top 10 results.
+In a multi-queue-server setting, the name server machine simultaneously asks
+the query of each of the queue server machines and the results
+are aggregated. The Search Results Grouping controls allow you to affect
+this behavior. Minimum Results to Group
+controls the number of results the name server
+wants to have before sorting of results is done. When the name server requests
+documents from each queue server, it asks for
+alpha*(Minimum Results to Group)/(Number of Queue Servers) documents.
+Server Alpha controls the number alpha.
+</p>
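+<p>For example, with hypothetical settings of Minimum Results to Group equal
+to 200 and Server Alpha equal to 1.6 in a four queue server set-up, the name
+server would ask each queue server for 1.6*200/4 = 80 documents.</p>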
+<p>The Save button of course saves any changes you
     make on this form.</p>
-    <p>It should be pointed out that the settings on this form (except the
-    weight fields) only affect future crawls -- they do not affect
-    any crawls that have already occurred or are on going.</p>
+    <h3>Test Options Tab</h3>
+<p>The Page Options Test Options tab looks like:</p>
+<img src='resources/PageOptionsTest.png' alt='The Page Options Test form'/>
+<p>In the Type dropdown one can select a
+<a href="http://en.wikipedia.org/wiki/Internet_media_type">MIME Type</a> used
+to select the page processor Yioop uses to extract text from the data
+you type or paste into the textarea on this page. Test Options lets you
+see how Yioop would process a web page and add summary data to its
+index. After filling in the textarea with a page, clicking Test Process
+Page will show the $summary associative array Yioop would create
+from the page after the appropriate page processor is applied. Beneath it
+shows the $summary array that would result after user-defined page rules
+from the crawl time tab are applied. Yioop stores a serialized form
+of this array in a IndexArchiveBundle for a crawl. Beneath this array
+is an array of terms (or character n-grams) that were extracted
+from the page together with their positions in the document. Finally,
+a list of meta words that the document has are listed. Either extracted
+terms or meta-word could be used to look up this document in a Yioop index.</p>
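+<p>As a rough sketch (the field names below are chosen for illustration and
+are not necessarily the exact keys Yioop uses), the summary array produced
+for a simple HTML page might look something like:</p>
+<pre>
+$summary = array(
+    "TITLE" =&gt; "My Test Page",
+    "DESCRIPTION" =&gt; "First few hundred words extracted from the page ...",
+    "LANG" =&gt; "en-US",
+    "LINKS" =&gt; array("http://www.example.com/" =&gt; "An example link"),
+);
+</pre>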
+
     <h2 id='editor'>Results Editor</h2>
     <p>Sometimes after a large crawl one finds that there are some results
     that appear that one does not want in the crawl or that the
@@ -1903,7 +2247,7 @@ OdpRdfArchiveBundle
 <img src='resources/ResultsEditor.png' alt='The Results Editor form'/>
     <p>Using the filter websites form one can specify a list of hosts which
     should be excluded from the search results. The sites listed in the
-    Sites to Filter text area are required to be hostnames. Using
+    Sites to Filter textarea are required to be hostnames. Using
     a filter, any web page with the same host name as one listed in
     the Sites to Filter will not appear in the search results. So for example,
     the filter settings in the example image above contain the line
@@ -1954,9 +2298,9 @@ OdpRdfArchiveBundle
     http://www.yioop.com/resources/blank.png?{}
     </pre>
     <p>If one selects the media kind to be RSS (really simple syndication,
-    a kind of news feed), then the media sources
-    form has three fields: Name, again a short familiar name for the
-    RSS feed; URL, the url of the RSS feed, and Language, what language
+    a kind of news feed; you can also use Atom feeds as sources), then the
+    media sources form has three fields: Name, again a short familiar name for
+    the RSS feed; URL, the url of the RSS feed; and Language, what language
     the RSS feed is. This last element is used to control whether or
     not a news item will display given the current language settings of
     Yioop. If under the Configure activity, the subsearch checkbox
@@ -1977,7 +2321,7 @@ OdpRdfArchiveBundle
     the query string when a news subsearch is being done. Folder Name
     is also used to make the localization identifier used in translating
    the subsearch's name into different languages. This identifier will
-    have the format db_subsearch_identifer. For example,
+    have the format db_subsearch_identifier. For example,
     db_subsearch_news. Index Source, the second form element, is used
     to specify a crawl or a crawl mix that the given subsearch
     should use in returning results. Results per Page, the last form element,
@@ -1987,7 +2331,7 @@ OdpRdfArchiveBundle
     subsearches and their properties. The actions column at the end of this
    table lets one either localize or delete a given subsearch. Clicking
    localize takes one to the Manage Locales page for the default locale
-    and that parituclar subsearch localization identifier, so that you can
+    and that particular subsearch localization identifier, so that you can
     fill in a value for it. Remembering the name of this identifier,
     one can then in Manage Locales navigate to other locales, and fill
     in translations for them as well, if desired.</p>
@@ -2017,8 +2361,19 @@ OdpRdfArchiveBundle
     misconfigured  or that you no longer want to manage through this Yioop
     instance. To modify a machine that you have already added, you should
     delete it and re-add it using the setting you want. The Machine Information
-    section of the Manage Machines activity consists of boxes for
-    each machine that you have added. Each box lists the queue server,
+    section begins with a dropdown to control how you would like the news
+    updating to operate on the name server. This allows you to control whether
+    or not Yioop attempts to update its RSS (or Atom) search sources on
+    an hourly basis. The three possible values for this dropdown are:
+    Updates Off, in which case no updating will occur; Web Update, which
+    means if someone comes to the Yioop site and it has been longer than an
+    hour since the last update, then, as part of returning the page to the
+    user, Yioop will perform a news update; and News Update, which tells
+    Yioop to start the bin/news_update.php program as a process to handle
+    news updating.
+    </p>
+    <p>
+    Beneath the News Updater dropdown is a set of boxes for each machine you
+    have added to Yioop. Each box lists the queue server,
     if any, and each of the fetchers you requested to be able to manage.
     Next to these there is a link to the log file for that server/fetcher
     and below this there is an On/Off switch for starting and stopping
@@ -2606,19 +2961,6 @@ xmlns:atom="http://www.w3.org/2005/Atom"
     <p>If your processor is cool, only relies on code you wrote, and you
     want to contribute it back to the Yioop, please feel free to
     e-mail it to chris@pollett.org .</p>
-    <h3>Using a Different Database Management System (DBMS)</h3>
-    <p>Yioop currently supports Sqlite2, Sqlite3, and MySql databases.
-    To add support for a different DBMS, you would need to write a new subclass
-    of the DatasourceManager abstract class. The current subclasses can be
-    found in models/datasources. Yioop relies on pretty vanilla SQL;
-    however, it does make use of the fact that some of its tables have
-    AUTOINCREMENT columns. This can be simulated in Oracle and DB2 using
-    the more sophisticated sequences and triggers. For Postgres you
-    can make the column serial. In each case, to get things to work, you
-    will need to edit the models/profile_model.php file and modify
-    the method migrateDatabaseIfNecessary($dbinfo) to say how
-    AUTOINCREMENT columns should be handled.</p>
-
     <h3>Writing an Indexing Plugin</h3>
     <p>An indexing plugin provides a way that an advanced end-user
     can extend the indexing capabilities of Yioop. Bundled with
@@ -2760,10 +3102,7 @@ Please choose an option:
     <p>Another thing to consider when configuring a collection of Yioop
    machines in such a setting is that, by default, under Search Access Set-up,
     subsearch is unchecked. This means the RSS feeds won't be downloaded
-    hourly on such machines. If one unchecks this, they will. This may or
-    may not make sense to do -- it might be advantageous to distribute the
-    downloading of RSS feeds across several machines -- any machine in
-    a Yioop cluster can send media news results in response to a search query.
+    hourly on such machines. If one checks this, they can be.
     </p>
     <h3 id="arc_tool">Examining the contents of WebArchiveBundle's and
     IndexArchiveBundles's</h3>
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 5fe2c43..82ac2df 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -1,14 +1,13 @@
 <h1>Downloads</h1>
 <h2>Yioop Releases</h2>
-<p>The Yioop source code is still at an alpha stage. </p>
+<p>The two most recent versions of Yioop are:</p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&amp;p=yioop&h=b147860d56e941ba2925036589c08c5d380ec71d&amp;
+hb=f7b96d54b1c35ff6dabaee3e832d13b6e816bb35&amp;t=zip"
+    >Version 0.94-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&amp;h=da73fb8ad24ba67201a3cccaa6290d711f505ef3&amp;
 hb=fb79c4c0b11379bee3b8c4c803f9f938a9001c16&amp;t=zip"
     >Version 0.921-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?
-a=archive&amp;p=yioop&amp;h=3ba7c0901b792891b6b279732e5184668b294e44&amp;
-hb=8b105749c471bbfe97df88e84df8f9c239027a01&amp;t=zip"
-    >Version 0.90-ZIP</a></li>
 </ul>
 <h2>Installation</h2>
 <p>The documentation page has information about the
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index 7f91822..7686569 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -32,7 +32,7 @@ can select amongst crawls which exist in a crawl folder as to which crawl you
 want to serve from.</li>
 <li><b>Make it easy to crawl archives.</b> There are many sources of
 raw web data available today such as files that use the Internet Archive's
-arc format, Open Directory Project RDF data, Wikipedia xml dumps, etc. Yioop
-can index these formats directly, allowing one to get an index for these
+arc and warc formats, Open Directory Project RDF data, Wikipedia xml dumps, etc.
+Yioop can index these formats directly, allowing one to get an index for these
 high-value sites without needing to do an exhaustive crawl.</li>
-</ul>
\ No newline at end of file
+</ul>
diff --git a/en-US/pages/install.thtml b/en-US/pages/install.thtml
index aa0b646..7f1a677 100755
--- a/en-US/pages/install.thtml
+++ b/en-US/pages/install.thtml
@@ -11,15 +11,22 @@

 <h2 id="xampp">XAMPP on Windows</h2>
 <ol>
-<li>Download <a
-    href="http://technet.microsoft.com/en-us/sysinternals/bb896649">pstools</a>
-    (which contains psexec).</li>
 <li>Download <a
     href="http://www.apachefriends.org/en/xampp-windows.html">Xampp</a>
 (Note: Yioop! 0.9 or higher works on latest version;
 Yioop! 0.88 or lower works up till Xampp 1.7.7).</li>
 <li>Install xampp.</li>
-<li>Copy PsExec from the pstools zip folder to C:\xampp\php .</li>
+<li> In Xampp 1.8.1 and higher, php curl seems to be enabled by default.
+For earlier versions, edit the file C:\xampp\php\php.ini
+in Notepad. Search on curl. Change the line:
+<pre>
+;extension=php_curl.dll
+
+to
+
+extension=php_curl.dll
+</pre>
+</li>
 <li>Open Control Panel. Go to System =&gt; Advanced system settings =&gt;
 Advanced. Click on Environment Variables. Look under System Variables and
 select Path. Click Edit. Tack onto the end of Variable Values:
@@ -30,27 +37,8 @@ Click OK a bunch times to get rid of windows. Close the Control Panel window.
 Reopen it and go to the same place to make sure the path variable really
 was changed.
 </li>
-<li>Edit the file C:\xampp\php\php.ini in Notepad. Search on curl.
-Change the line:
-<pre>
-;extension=php_curl.dll
-</pre>
-to
-<pre>
-extension=php_curl.dll
-</pre>
-Then go to start of file and search on post_max_size. Change the line
-<pre>
-post_max_size = 8M
-</pre>
-to
-<pre>
-post_max_size = 32M
-</pre>
-Start Apache. The post_max_size change is not strictly necessary,
-but will improve performance.</li>
 <li>Download <a href="http://www.seekquarry.com/viewgit/?a=summary&amp;p=yioop"
->Yioop</a> (You should choose a version &gt; 0.88 or the latest version).
+>Yioop</a> (You should choose a version &ge; 0.94 or the latest version).
 Unzip it into
 <pre>
 C:\xampp\htdocs
@@ -73,19 +61,16 @@ It will ask you to log into Yioop. Login with username root and empty password.
 In Yioop's Configure screen continue filling out your settings:
 <pre>
 Default Language: English
-Debug Display: (all checked)
-Search access: (all checked)
-Database Set-up: (left unchanged)
-Search Auxiliary Links Displayed: (all checked)
-Name Server Set-up
-Server Key: 0
-Name Server Url: http://localhost/yioop/
+
 Crawl Robot Name: TestBot
-Robot Instance: A
-Robot Description: TestBot should be disallowed from everywhere because
-the installer of Yioop did not customize this to his system.
-Please block this ip.
+Robot Description: This bot is for test purposes. It respects robots.txt.
+If you are having problems with it, please feel free to ban it.
 </pre>
+Crawl robot name is what will appear together with a url to a bot.php
+page in web server log files of sites you crawl. The bot.php page will display
+what you write in robot description. This should give contact information
+in case your robot misbehaves. Obviously, you should customize
+the above to what you want to say.
 </li>
 <li>Go to Manage Machines and add a single machine under Add Machine:
 <pre>
@@ -111,40 +96,23 @@ until you see the Total URLs Seen &gt; 1.</li>
 crawls list. Set it as the default crawl. Then you can search using this index.
 </li>
 </ol>
-<p>
-The above set-up is for a non-command line crawl, and it works as described.
-For command line crawls on versions of Yioop prior to Version 0.9 you might
-have the problem that log messages are written to Xampp's PHP error log
-because Yioop uses the PHP error_log function and on Xampp this is where
-it defaults to. This is not an issue in Version 0.9 or above.
-</p>
-

 <h2 id="wamp">Wamp</h2>
-<p>
-These instructions should work for Yioop! Version 0.84 and above.
-WampServer allows you to run a 64 bit version of PHP.
-</p>
 <ol>
-<li>Download <a
-    href="http://technet.microsoft.com/en-us/sysinternals/bb896649">pstools
-    (which contains psexec)</a>.</li>
 <li>Download <a
     href="http://www.wampserver.com/en/">WampServer</a> (Note: Yioop! 0.9 or
 higher works with PHP 5.4)</li>
 <li>Download <a href="http://www.seekquarry.com/viewgit/?a=summary&amp;p=yioop"
->Yioop!</a> (you should choose some version &gt; 0.88 or latest)
+>Yioop!</a> (you should choose some version &ge; 0.94 or latest)
 Unzip it into
 <pre>
 C:\wamp\www
 </pre>
 Rename the downloaded folder yioop (so you should now have
 a folder C:\wamp\www\yioop).</li>
-<li>Edit php.ini to enable multicurl and change the post_max_size. To do
+<li>Edit php.ini to enable multicurl. To do
 this use the Wamp dock tool and navigate to wamp =&gt; php =&gt; extension.
-Turn on curl. Next navigate to wamp =&gt; php =&gt; php.ini .
-Do a find on post_max_size and set its value to 32MB. The post_max_size change
-is not strictly necessary, but will improve performance.</li>
+Turn on curl. Next navigate to wamp =&gt; php =&gt; php.ini .
 </li>
 <li>Wamp has two php.ini files. The one we just edited by doing this is in
 <pre>
@@ -152,45 +120,41 @@ C:\wamp\bin\apache\Apache2.2.21\bin
 </pre>
 You need to also edit the php.ini in
 <pre>
-C:\wamp\bin\php\php5.3.10
+C:\wamp\bin\php\php5.4.3
 </pre>
 Depending on your version of Wamp the PHP version number may be different.
-Open this php.ini in Notepad search on curl then uncomment the line. Similarly,
-edit post_max_size and set it to 32MB.
+Open this php.ini in Notepad, search on curl, then uncomment the line.
+It should be noted that you might want to choose an earlier or later
+version of Wamp than the particular one above, because out of the box
+its php_curl.dll did not work. I had to go to
+<a href="
+http://www.anindya.com/php-5-4-3-and-php-5-3-13-x64-64-bit-for-windows/"
+>Anindya.com</a>, download php_curl-5.4.3-VC9-x64.zip under fixed
+curl extensions, and then move it to C:\wamp\bin\php\php5.4.3\ext to get
+it to work.
 </li>
-<li>Copy PsExec.exe to C:\wamp\bin\php\php5.3.10 .</li>
-<li>Go  to control panel =&gt; system =&gt; advanced system settings =>
+<li>Next go to control panel =&gt; system =&gt; advanced system settings =&gt;
 advanced =&gt; environment variables =&gt; system variables =&gt;path.
 Click edit and add to the path variable:
 <pre>
-;C:\wamp\bin\php\php5.3.10;
+;C:\wamp\bin\php\php5.4.3;
 </pre>
 Exit control panel, then re-enter to double check that path really was added
  to the end.</li>
-<li> Next go to
-wamp =&gt; apache =&gt; restart service. In a browser, go to
-http://localhost/yioop/ . You should see a configure screen
-where you can enter C:/yioop_data for the Work Directory. It
-will ask you to re-login. Use the login: root and no password.
-Now go to Yioop =&gt;
-Configure and input the following settings:
+<li>
+In Yioop's Configure screen continue filling out your settings:
 <pre>
-Search Engine Work Directory: C:/yioop_data
 Default Language: English
-(initially only the above )
-Debug Display: (all checked)
-Search access: (all checked)
-Database Set-up: (left unchanged)
-Search Auxiliary Links Displayed: (all checked)
-Name Server Set-up
-Server Key: 0
-Name Server Url: http://localhost/yioop/
-Caral Robot Name: TestBot
-Robot Instance: A
-Robot Description: TestBot should be disallowed from everywhere because
-the installer of Yioop did not customize this to his system.
-Please block this ip.
+
+Crawl Robot Name: TestBot
+Robot Description: This bot is for test purposes. It respects robots.txt.
+If you are having problems with it, please feel free to ban it.
 </pre>
+Crawl robot name is what will appear together with a url to a bot.php
+page in web server log files of sites you crawl. The bot.php page will display
+what you write in robot description. This should give contact information
+in case your robot misbehaves. Obviously, you should customize
+the above to what you want to say.
 </li>
 <li>Go to Manage Machines. Add a single machine under Add Machine using the
 settings:
@@ -266,26 +230,6 @@ be as you like.
 </li>
 </ul>
 </li>
-<li>
-Modify the php.ini file, this is likely in the file /private/etc/php.ini.
-Change
-<pre>
-post_max_size = 8M
-to
-post_max_size = 32M
-</pre>
-Restart the web server after making this change. This change is not strictly
-necessary, but will improve performance.
-</li>
-<li>
-We are going to configure Yioop so that fetchers and queue_servers
-can be started from the GUI interface. On an OSX machine, Yioop makes
-use of the Unix "at" command. On OSX to enable  "at" jobs, you might need to
-type:
-<pre>
-sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.atrun.plist
-</pre>
-</li>
 <li>For the remainder of this guide, we assume document root for
 the web server is: /Library/WebServer/Documents.
 <a href="http://www.seekquarry.com/viewgit/?a=summary&amp;p=yioop"
@@ -322,19 +266,15 @@ Configure and input the following settings:
 <pre>
 Search Engine Work Directory: /Library/WebServer/Documents/yioop_data
 Default Language: English
-Debug Display: (all checked)
-Search access: (all checked)
-Database Set-up: (left unchanged)
-Search Auxiliary Links Displayed: (all checked)
-Name Server Set-up
-Server Key: 0
-Name Server Url: http://localhost/yioop/
-Caral Robot Name: TestBot
-Robot Instance: A
-Robot Description: TestBot should be disallowed from everywhere because
-the installer of Yioop did not customize this to his system.
-Please block this ip.
+Crawl Robot Name: TestBot
+Robot Description: This bot is for test purposes. It respects robots.txt.
+If you are having problems with it, please feel free to ban it.
 </pre>
+Crawl robot name is what will appear together with a url to a bot.php
+page in web server log files of sites you crawl. The bot.php page will display
+what you write in robot description. This should give contact information
+in case your robot misbehaves. Obviously, you should customize
+the above to what you want to say.
 </li>
 <li>Go to Manage Machines. Add a single machine under Add Machine using the
 settings:
@@ -376,8 +316,9 @@ and /etc/apache2/mods-enabled/php5.load should exist and link
 to the corresponding files in /etc/apache2/mods-available. The configuration
 files for PHP are /etc/php5/apache2/php.ini (for the apache module)
 and /etc/php5/cli/php.ini (for the command-line interpreter).
-You want to make changes to both configurations. Using your favorite
-texteditor, ed, vi, nano, gedit, etc., modify the line:
+You want to make changes to both configurations. To get a feel for the
+kind of changes needed, open each file in a
+text editor (ed, vi, nano, gedit, etc.) and modify the line:
 <pre>
 post_max_size = 8M
 to
@@ -395,12 +336,6 @@ sudo apachectl stop
 sudo apachectl start
 </pre>
 </li>
-<li>We are going to configure Yioop so that fetchers and queue_servers
-can be started from the GUI interface. On a Linux machine, Yioop makes
-use of the Unix "at" command. Under Ubuntu, "at" will typically be enabled,
-however, you might need to give your web server access to schedule
-"at" jobs. To do this, check that the web server user (www-data)
-is not in the file /etc/at.deny .</li>
 <li>The DocumentRoot for web sites (virtual hosts) served by an Ubuntu Linux
 machine is typically specified by files in /etc/apache2/sites-enabled.
 In this example, it was given in a file 000-default and specified to
@@ -421,21 +356,17 @@ will ask you to re-login. Use the login: root and no password.
 Now go to Yioop =&gt;
 Configure and input the following settings:
 <pre>
-Search Engine Work Directory: /var/www/yioop_data
+Search Engine Work Directory: /var/www/yioop_data
 Default Language: English
-Debug Display: (all checked)
-Search access: (all checked)
-Database Set-up: (left unchanged)
-Search Auxiliary Links Displayed: (all checked)
-Name Server Set-up
-Server Key: 0
-Name Server Url: http://localhost/yioop/
-Caral Robot Name: TestBot
-Robot Instance: A
-Robot Description: TestBot should be disallowed from everywhere because
-the installer of Yioop did not customize this to his system.
-Please block this ip.
+Crawl Robot Name: TestBot
+Robot Description: This bot is for test purposes. It respects robots.txt.
+If you are having problems with it, please feel free to ban it.
 </pre>
+Crawl robot name is what will appear together with a url to a bot.php
+page in web server log files of sites you crawl. The bot.php page will display
+what you write in robot description. This should give contact information
+in case your robot misbehaves. Obviously, you should customize
+the above to what you want to say.
 </li>
 <li>Go to Manage Machines. Add a single machine under Add Machine using the
 settings:
@@ -464,7 +395,10 @@ able to search using this index.
 <a href="https://www.virtualbox.org/">VirtualBox</a>. The keyboard settings
 for the particular image on the VirtualBox site are Italian, so you will
 have to tweak them to get an American keyboard or the keyboard you are most
-comfortable with. To get started, log in, launch a terminal window, and su root.
+comfortable with. Also, in this virtual setting the memory available is
+somewhat low, so you might need to tweak values in config/config.php to
+reduce the memory needs of Yioop. To get started, log in, launch a terminal
+window, and su root.
 </p>
 <ol>
 <li>The image we were using doesn't have Apache installed or the nano editor.
@@ -487,20 +421,6 @@ nano welcome.conf
 </pre>
 Then using the editor put #'s at the start of each line and save the result.
 </li>
-<li>Yioop needs to be able to issue shell commands to start and stop
-machines. In particular, it uses the "at daemon" (atd) to do this.
-The web server on Centos runs as user Apache and by default its shell is
-specified as /sbin/nologin. Also, Centos makes use of SELinux and the domain
-under which Apache runs prevents it from issuing at commands as well.
-You probably want to use audit2allow and semanage to configure exactly
-the settings you need to get Yioop! to run. For the purposes of expediency
-(maybe faster to get fired), however, one can type:
-<pre>
-usermod -s /bin/sh apache
-chcon -t unconfined_exec_t /usr/sbin/httpd
-</pre>
-Please do not use the above in a production environment!
-</li>
 <li>Next we install git, php, and the various php extensions we need:
 <pre>
 yum install git
@@ -524,10 +444,10 @@ chmod 777 yioop_data
 </pre>
 </li>
 <li>
-If the web server and atd are not running then start them:
+Restart/start the web server:
 <pre>
+service httpd stop
 service httpd start
-service atd start
 </pre>
 </li>
 <li>Tell Yioop where its work directory is:
@@ -548,21 +468,17 @@ will ask you to re-login. Use the login: root and no password.
 Now go to Yioop =&gt;
 Configure and input the following settings:
 <pre>
-Search Engine Work Directory: /var/www/yioop_data
+Search Engine Work Directory: /var/www/yioop_data
 Default Language: English
-Debug Display: (all checked)
-Search access: (all checked)
-Database Set-up: (left unchanged)
-Search Auxiliary Links Displayed: (all checked)
-Name Server Set-up
-Server Key: 0
-Name Server Url: http://localhost/yioop/
-Caral Robot Name: TestBot
-Robot Instance: A
-Robot Description: TestBot should be disallowed from everywhere because
-the installer of Yioop did not customize this to his system.
-Please block this ip.
+Crawl Robot Name: TestBot
+Robot Description: This bot is for test purposes. It respects robots.txt.
+If you are having problems with it, please feel free to ban it.
 </pre>
+Crawl robot name is what will appear together with a url to a bot.php
+page in web server log files of sites you crawl. The bot.php page will display
+what you write in robot description. This should give contact information
+in case your robot misbehaves. Obviously, you should customize
+the above to what you want to say.
 </li>
 <li>Go to Manage Machines. Add a single machine under Add Machine using the
 settings:
@@ -742,6 +658,8 @@ timestamp 10, each instance would have a WORK_DIR/cache/IndexData10
 folder and these folders would be disjoint from any other
 instance.
 </li>
+<li>Click Toggle Advanced Settings to see the additional configuration
+fields needed for what follows.</li>
 <li>
 Continuing down on the Configure element for each instance, make sure under the
 Search Access fieldset Web, RSS, and API are checked.</li>
@@ -757,7 +675,7 @@ The Crawl Robot Name should also be the same for the two instances, say:
 <pre>
 TestBotFeelFreeToBan
 </pre>
-but we want the robot instance to be different, say 1 and 2.
+but we want the Robot Instance to be different, say 1 and 2.
 </li>
 <li>Go to the Manage Machine element for git/yioop1, which is the name server.
 Only the name server needs to manage machines,