Update website documentation for Version 0.80, a=chris

Chris Pollett [2011-12-07]
Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index d3c84d3..a804bb4 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -18,7 +18,7 @@ site.</p>
 <p>The name Yioop! has the following history:
 I was looking for names that hadn't already been registered. My
 wife is Vietnamese so I thought I might have better luck with
-Vietnamese words since all the English ones all seemed to be taken.
+Vietnamese words since all the English ones seemed to have been taken.
 I started with the word giup, which is the way to spell 'help'
 in Vietnamese if you remove the accents. It was already taken.
 Then I tried yoop, which is my lame way of pronouncing how
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index cbb32b0..71230cc 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,9 +1,9 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.78</h1>
+<h1>Yioop! Documentation v 0.80</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
-        <li><a href="#required">Requirements</a></li>
+        <li><a href="#requirements">Requirements</a></li>
         <li><a href="#installation">Installation and Configuration</a></li>
         <li><a href="#files">Summary of Files and Folders</a></li>
         <li><a href="#interface">The Yioop! Search and User Interface</a></li>
@@ -11,7 +11,9 @@
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
+        <li><a href="#page-options">Options for Pages that are Indexed</a></li>
         <li><a href="#filter">Search Filter</a></li>
+        <li><a href="#machines">GUI for Managing Machines and Servers</a></li>
         <li><a href="#localizing">Localizing Yioop! to a New Language</a></li>
         <li><a href="#embedding">Embedding Yioop! in a Site</a></li>
         <li><a href="#customizing">Customizing Yioop!</a></li>
@@ -33,9 +35,9 @@
     when Yioop! might be the right choice for your search engine needs. In the
     remainder of this document after the introduction, we will discuss how to
     get and install Yioop!, the files and folders used in Yioop!,
-    user, role, and crawl management in the Yioop! system, localization in
-    the Yioop! system, embedding Yioop! in an existing, customizing Yioop!,
-    and the Yioop! command-line tools.
+    user, role, crawl, and machine management in the Yioop! system,
+    localization in the Yioop! system, embedding Yioop! in an existing website,
+    customizing Yioop!, and the Yioop! command-line tools.
     </p>
     <p>Since the mid-1990s a wide variety of search engine technologies
     have been explored. Understanding some of this history is useful
@@ -90,7 +92,11 @@
     the PHP language does have a multi-curl library (implemented in C) which
     uses threading to support many simultaneous page downloads. This is what
     Yioop! uses. Like these early systems Yioop! also supports the ability to
-    distribute the task of downloading web pages to several machines.</p>
+    distribute the task of downloading web pages to several machines.
+    Because managing many machines becomes more difficult as their number
+    grows, Yioop! also provides a web interface for turning on and off the
+    processes related to crawling on remote machines managed by Yioop!</p>
     <p>There are several aspects of a search engine besides
     downloading web pages that benefit from
     a distributed computational model. One of the reasons Google was able
@@ -145,7 +151,7 @@
     getting better. Since the original Google paper, techniques
     to rank pages have been simplified [<a href="#APC2003">APC2003</a>].
     It is also possible to approximate some of the global statistics
-    needed in BM25F using suitably large samples. </p>
+    needed in BM25F using suitably large samples.</p>
     <p>Yioop! tries to exploit
     these advances to use a simplified distributed model which might
     be easier to deploy in a smaller setting. Each node in a Yioop! system
@@ -175,9 +181,12 @@
     the fetcher's programs are run on each fetcher machine and the queue_server
     is run of the coordinating machine. Since a multi-million page crawl might
     take several days Yioop! supports the ability to dynamically change its
-    crawl parameters as a crawl is going on. This allows a user on request
+    crawl parameters as a crawl is going on.  This allows a user on request
     from a web admin to disallow Yioop! from continuing to crawl a site without
-    having to stop the overall crawl.
+    having to stop the overall crawl.  One can also inject new seed sites
+    through a web interface while the crawl is occurring. This can help if
+    someone suggests a site that might otherwise not be found by Yioop!
+    given its original list of seed sites.
     </p>
     <p>Despite its simpler model, Yioop! does a number of things to improve the
     quality of its search results. For each link extracted from a page,
@@ -288,8 +297,8 @@
     collections of sites containing millions of documents.</li>
     <li>On a given machine it uses multi-curl to support many simultaneous
     downloads of pages.</li>
-    <li>It has a web interface to select seed sites for crawls and set what
-    sites crawls should not crawl.</li>
+    <li>It has a web interface to select seed sites for crawls and to set
+    which sites should not be crawled.</li>
     <li>It obeys robots.txt file including the Crawl-delay directive.
     It supports the robots meta tag.</li>
     <li>It supports open web crawls, but through its web interface one can
@@ -297,19 +306,24 @@
     of sites and domains. </li>
     <li>Yioop! supports dynamically changing the allowed and disallowed
     sites while a crawl is in progress.</li>
+    <li>Yioop! supports dynamically injecting new seed sites into an active
+    crawl via a web interface.</li>
     <li>It supports the indexing of many different filetypes including:
     HTML, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps,
-    SVG, XLSX, and XML.</li>
+    SVG, XLSX, and XML. It has a web interface for controlling which of
+    these filetypes (or all of them) you want to index.</li>
     <li>Crawling, indexing, and serving search results can be done on a
     single machine or distributed across several machines.</li>
     <li>It uses a simplified distributed model that is straightforward to
     deploy.</li>
+    <li>The fetcher/queue_server processes on several machines can be
+    managed through the web interface of a main Yioop! instance.</li>
     <li>It determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
     <li>Since version 0.70, Yioop indexes are positional rather than
     bag of word indexes, and a index compression scheme called Modified9
     is used.</li>
-    <li>Yioop! supports a GUI interface which makes
+    <li>Yioop! supports a web interface which makes
     it easy to combine results from several crawl indexes to create unique
     result presentations. These combinations can be done in a conditional
     manner using "if:" meta words.</li>
@@ -318,7 +332,9 @@
     <li>Yioop! supports an indexing plugin architecture to make it
     possible to write one's own indexing modules that do further
     post-processing.</li>
-    <li>Yioop! has a GUI form that allows users to specify meta words
+    <li>Yioop! has a web form that allows a user to control the recrawl
+    frequency for a page during a crawl.</li>
+    <li>Yioop! has a web form that allows users to specify meta words
     to be injected into an index based on whether a downloaded document matches
     a url pattern.</li>
     <li>Yioop! uses a web archive file format which makes it easy to
@@ -341,7 +357,7 @@
     </ul>
     <p><a href="#toc">Return to table of contents</a>.</p>

-    <h2 id="required">Requirements</h2>
+    <h2 id="requirements">Requirements</h2>
     <p>The Yioop! search engine requires: (1) a web server, (2) PHP 5.3 or
     better (Yioop! used only to serve search results from a pre-built index
     has been tested to work in PHP 5.2), (3) Curl libraries for downloading
@@ -380,22 +396,56 @@ extension=php_curl.dll
     sudo apt-get install php5-curl
     sudo apt-get install php5-gd
     </pre>
-    <p>After installing the necessary software, make sure to start/restart your
-    webserver and verify that it is running. </p>
+    <p>In addition to the minimum installation requirements above, if
+    you want to use the Manage Machines feature in Yioop!, you might need
+    to do some additional configuration. The <a href="#machines"
+    >Manage Machines</a> activity
+    allows you, through a web interface, to start/stop and look at the
+    log files for each of the queue_servers and fetchers that you want
+    Yioop! to manage. If it is not configured, then these tasks would need
+    to be done via the command line. On OSX and Linux, Manage Machines
+    needs to be able to schedule "at" batch jobs (type man at to find out
+    more about these). On OSX, to enable
+    this ability, you might need to type:</p>
+<pre>
+sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.atrun.plist
+</pre>
+    <p>On a Linux machine, "at" will typically be enabled; however, you
+    might need to give your web server access to schedule "at" jobs. To do
+    this, you should check that the web server user is not in the file
+    /etc/at.deny. On Ubuntu Linux, Apache by default runs as www-data.
+    On OSX it runs as _www, but by default the at.deny file is not set up,
+    so you probably don't need to edit it. If you are using Xampp on either
+    of these platforms, you need to ensure that Apache is not running as
+    nobody. Edit the $XAMPP/etc/httpd.conf file and set the User and Group
+    to a real user.</p>
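+    <p>Concretely (the user name, group, and XAMPP location below are only
+    examples; substitute the ones for your own setup), the check and the
+    Xampp fix might look like:</p>
+<pre>
+# see whether your web server's user is denied "at" access
+grep www-data /etc/at.deny
+
+# in $XAMPP/etc/httpd.conf replace the default
+#   User nobody
+# with a real user and group, for example:
+#   User myuser
+#   Group staff
+</pre>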
+    <p>To get Manage Machines to work on a PC you need to first install
+    PsTools from Microsoft.<br />
+<a href="http://technet.microsoft.com/en-us/sysinternals/bb896649">
+http://technet.microsoft.com/en-us/sysinternals/bb896649</a>.<br />
+    Depending on how your machine is configured this can be a security risk, so
+    do some research before deciding if you really want to do this. After
+    installing PsTools, you next need to edit your Environment Variables
+    and add the paths to both psexec and php to your PATH variable. You can
+    find the place to set these variables by clicking on the Start Menu,
+    then Control Panel, System and Security, Advanced Systems and Settings.</p>
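+    <p>Alternatively, on Windows Vista or newer, one can append these
+    directories to the PATH from a command prompt using the setx command.
+    The install locations below are just examples; use wherever you
+    actually placed PsTools and PHP:</p>
+<pre>
+setx PATH "%PATH%;C:\PsTools;C:\php"
+</pre>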
+    <p>As a final step, after installing the necessary software,
+    <b>make sure to start/restart your web server and verify that
+    it is running.</b></p>
     <h3>Memory Requirements</h3>
     <p>In addition, to the prerequisite software listed above, Yioop! also
     has certain memory requirements. By default bin/queue_server.php
-    requires 1000MB, bin/fetcher.php requires 850MB, and index.php requires
+    requires 1400MB, bin/fetcher.php requires 850MB, and index.php requires
     500MB. These  values are set near the tops of each of these files in turn
     with a line like:</p>
 <pre>
-ini_set("memory_limit","1000M");
+ini_set("memory_limit","1400M");
 </pre>
     <p>
     If you want to reduce these memory requirements, it is advisable to also
     reduce the values for some variables in the configs/config.php file.
     For instance, one might reduce the values of NUM_DOCS_PER_GENERATION,
-    SEEN_URLS_BEFORE_UPDATE_SCHEDULER, PAGE_RANGE_REQUEST, NUM_URLS_QUEUE_RAM,
+    SEEN_URLS_BEFORE_UPDATE_SCHEDULER, NUM_URLS_QUEUE_RAM,
     MAX_FETCH_SIZE, and URL_FILTER_SIZE. Experimenting with these values
     you should be able to trade-off memory requirements for speed.
     </p>
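+    <p>As a rough sketch of what such tuning might look like (the constant
+    names come from configs/config.php, but the values below are purely
+    illustrative assumptions, not recommendations):</p>
+<pre>
+define('NUM_DOCS_PER_GENERATION', 10000);
+define('SEEN_URLS_BEFORE_UPDATE_SCHEDULER', 1000);
+define('NUM_URLS_QUEUE_RAM', 80000);
+define('MAX_FETCH_SIZE', 1000000);
+define('URL_FILTER_SIZE', 5000000);
+</pre>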
@@ -1042,12 +1092,19 @@ activities in turn.
     fetchers, and because the on screen display refreshes only every 20 seconds
     or so.
     </p>
-    <h3>Prerequisites for Crawling</h3>
+    <h3 id="prereqs">Prerequisites for Crawling</h3>
     <p>Before you can start a new crawl, you need to run the queue_server.php
     script on the machine that is going to act as the queue_server and
     you need to run the fetcher.php script either on the same machine
     or on at least one other machine with Yioop! installed and which has
-    been configured with the queue_server url. To do this open a
+    been configured with the queue_server url. This can be done either via the
+    command-line or through a web interface. As described in the
+    <a href="#requirements">Requirements</a> section, you might need to do some
+    additional initial setup if you want to take the web interface approach.
+    In this section, we describe how to start the queue_server.php and
+    fetcher.php scripts via the command line; the <a href="#machines"
+    >GUI for Managing Machines and Servers</a> section describes how to do
+    it via a web interface. To begin, open a
     command shell and cd into the bin subfolder of the Yioop! folder. To
     start a queue_server type:</p>
     <pre>
@@ -1109,9 +1166,14 @@ php fetcher.php stop</pre>
     only be scheduled to at most one fetcher at a time). The downside of this
     is that your internet connection might not be used to its fullest ability
     to download pages. Thus, it can make sense rather than increasing
-    NUM_MULTI_CURL_PAGES, to install multiple copies of Yioop! on a machine,
-    and run the fetcher program in each to maximize download speeds for a
-    machine. The most general crawl configuration for Yioop! is thus
+    NUM_MULTI_CURL_PAGES, to run multiple copies of the Yioop! fetcher on a
+    machine. To do this, one can either install the Yioop! software multiple
+    times or give an instance number when one starts a fetcher. For example:</p>
+<pre>
+php fetcher.php start 5
+</pre>
+    would start instance 5 of the fetcher program. The most general crawl
+    configuration for Yioop! is thus
     typically a single queue_server and multiple machines each running multiple
     copies of the fetcher software.
     </p>
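+    <p>For instance, to run three fetcher instances on one machine and
+    later stop only the second, one might type the following (this sketch
+    assumes the stop command accepts the same optional instance number
+    as start):</p>
+<pre>
+php fetcher.php start 1
+php fetcher.php start 2
+php fetcher.php start 3
+php fetcher.php stop 2
+</pre>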
@@ -1135,7 +1197,9 @@ php fetcher.php stop</pre>
     a currently processing crawl there will be an options link under its stop
     button. Both of these links lead to similar pages, however, for an active
     crawl fewer parameters can be changed. So we will only describe the first
-    link. In the case of clicking the Option
+    link. We do mention here, though, that on the active crawl options page
+    it is possible to inject new seed urls into the crawl as it progresses.
+    In the case of clicking the Option
     link next to the start button, the user should be taken to an
     activity screen which looks like:</p>
 <img src='resources/WebCrawlOptions.png' alt='Web Crawl Options Form'/>
@@ -1344,6 +1408,34 @@ OdpRdfArchiveBundle
     be clicked.
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='page-options'>Options for Pages that are Indexed</h2>
+    <p>Several properties about how web pages are indexed can be controlled
+    by clicking on Page Options. This leads to a form which looks like:</p>
+<img src='resources/PageOptions.png' alt='The Page Options form'/>
+    <p>The Byte Range to Download drop-down controls how many bytes out of
+    any given web page should be downloaded. Smaller numbers reduce the
+    requirements on disk space needed for a crawl; bigger numbers would
+    tend to improve the search results. The next drop-down,
+    Allow Page Recrawl After, controls how many days Yioop! keeps
+    track of the URLs that it has downloaded. For instance, if one
+    sets this drop-down to 7, then after seven days Yioop! will clear the
+    Bloom Filter files used to store which urls have been downloaded, and it
+    would be allowed to recrawl these urls again if they appear in links.
+    It should be noted that all of the information gathered before the seven
+    days will still be in the index; it is just that Yioop! will be able to
+    recrawl pages that it had previously crawled. Besides letting Yioop!
+    get a fresher version of pages it already has, this also has the benefit
+    of speeding up longer crawls, as Yioop! doesn't need to check as many
+    Bloom filter files. In particular, it might just use one and keep it in
+    memory. The Page File Types to Crawl checkboxes allow you to decide
+    which file extensions you want Yioop! to download during a crawl. Finally,
+    the Title Weight, Description Weight, and Link Weight fields are used by
+    Yioop! to decide how to weight each portion of a document when it returns
+    query results to you. The Save button of course saves any changes you
+    make on this form.</p>
+    <p>It should be pointed out that the settings on this form (except the
+    weight fields) only affect future crawls -- they do not affect
+    any crawls that have already occurred or are ongoing.</p>
     <h2 id='filter'>Search Filter</h2>
     <p>The disallowed sites crawl option allows a user to specify they
     don't want Yioop! to crawl a given web site. After a crawl is done
@@ -1365,6 +1457,34 @@ OdpRdfArchiveBundle
     http://www.cs.sjsu.edu/faculty/pollett/ would not appear in search
     results.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='machines'>GUI for Managing Machines and Servers</h2>
+    <p>Rather than use the command-line as described in the
+    <a href="#prereqs">Prerequisites for Crawling</a> section, it is possible
+    to start/stop and view the log files of queue servers and fetchers
+    through the Manage Machines activity. In order to do this, the additional
+    requirements for this activity mentioned in the
+    <a href="#requirements">Requirements</a> section must have been met.
+    The Manage Machines activity looks like:</p>
+<img src='resources/ManageMachines.png' alt='The Manage Machines form'/>
+    <p>The Add machine form at the top of the page allows one to add a new
+    machine to be controlled by this Yioop! instance. The Machine
+    Name field lets you give this machine an easy-to-remember name,
+    the Machine URL should be the URL of the installed Yioop! instance on
+    that machine, the Has Queue Server checkbox is used to say whether
+    that machine will be running a queue server or not, and the
+    Number of Fetchers drop-down allows you to say how many fetcher instances
+    you want to be able to manage for that machine. The Delete Machine
+    form allows you to remove a machine that you either misconfigured
+    or that you no longer want to manage through this Yioop! instance.
+    To modify a machine that you have already added, you should delete it
+    and re-add it using the settings you want. The Machine Information
+    section of the Manage Machines activity consists of boxes for
+    each machine that you have added. Each box lists the queue server,
+    if any, and each of the fetchers you requested to be able to manage.
+    Next to these there is a link to the log file for that server/fetcher
+    and below this there is an On/Off switch for starting and stopping
+    the server/fetcher. This switch is green if the server/fetcher is running
+    and red otherwise.</p>
     <h2 id='localizing'>Localizing Yioop! to a New Language</h2>
     <p>The Manage Locales activity can be used to configure Yioop
     for use with different languages and for different regions. The
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 2ab6f64..a5cd683 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,10 +2,11 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=925c5be127ca7609d8beb73cfe33907394af01e5&hb=8448028247a123cb55e4f3a4395129aeffa7fc8f&t=zip"
+    >Version 0.80-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=db58568d4957782dc85f875be3592b2b951e53a3&hb=d28a3af2b3574c17fb8425d340984fb02fcfb4a5&t=zip" >Version 0.78-ZIP</a></li>
 </li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=d0058b709eac3907ca302d3060712fafb5915822&hb=16a6d216f159af3d4c3413bf69021a6910ecae09&t=zip"
-    >Version 0.76-ZIP</a></li>
+
 </li>
 </ul>
 <h2>Git Repository / Contributing</h2>
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index bbb977d..7c050cb 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -9,8 +9,8 @@ results for a set of urls or domains.
 <p>Yioop! was designed with the following goals in mind:</p>
 <ul>
 <li><b>Make it easier to obtain personal crawls of the web.</b> Only a web
-server such as Apache and command line access to a default build of PHP 5.3
-or better is needed. Configuration can be done using a GUI interface.</li>
+server such as Apache and PHP 5.3 or better is needed. Configuration can be
+done using a GUI.</li>
 <li><b>Support distributed crawling of the web, if desired.</b> To download
 many web pages quickly, it is useful to have more than one machine when crawling
 the web. If you have several machines at home, simply install the software