Update SeekQuarry static pages for Version 0.84 of Yioop, a=chris

Chris Pollett [2012-03-19]
Update SeekQuarry static pages for Version 0.84 of Yioop, a=chris
Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index a804bb4..0d433ea 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -32,8 +32,8 @@ with localization: Mary Pollett, Jonathan Ben-David,
 Thanh Bui, Sujata Dongre, Animesh Dutta,
  Youn Kim, Akshat Kukreti, Vijeth Patil, Chao-Hsin Shih,
 and Sugi Widjaja. Thanks to
-Ravi Dhillon and Tanmayee Potluri for creating patches for Yioop! issues.
-Several of my master's students have done projects
+Ravi Dhillon, Tanmayee Potluri, and Shawn Tice for creating patches for Yioop!
+issues. Several of my master's students have done projects
 related to Yioop!: Amith Chandranna, Priya Gangaraju, and Vijaya Pamidi.
 Amith's code related to an Online version of the HITs algorithm
 is not currently in the main branch of Yioop!, but it is
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index fbba680..a53fff3 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.82</h1>
+<h1>Yioop! Documentation v 0.84</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -115,7 +115,7 @@
     applied to the current page ranks estimates of a set of sites. This
     operation is reasonably easy to distribute to many machines. Computing how
     relevant a word is to a document is another
-    task that benefit from multi-round, distributed computation. When a document
+    task that benefits from multi-round, distributed computation. When a document
     is processed by indexers on multiple machines, words are extracted and a
     stemming algorithm such as [<a href="#P1980">P1980</a>] or a character
     n-gramming technique might be employed (a stemmer would extract the word
@@ -164,14 +164,14 @@
     be easier to deploy in a smaller setting. Each node in a Yioop! system
 is assumed to have a web server running. One of the Yioop! nodes'
 web apps is configured to act as a
-    coordinator for crawls. It is called the <b>name server</b>. In addition,
-    to this one might have several processes called <b>queue servers</b> that
-    perform scheduling and indexing jobs, as well as <b>fetcher</b> processes
-    which are responsible for downloading pages. Through the name server's
-    web app, users can send messages to the queue_servers and fetchers.
-    This interface writes message
+    coordinator for crawls. It is called the <b>name server</b>. In addition
+    to the name server, one might have several processes called
+    <b>queue servers</b> that perform scheduling and indexing jobs, as well as
+    <b>fetcher</b> processes which are responsible for downloading pages.
+    Through the name server's web app, users can send messages to the
+    queue_servers and fetchers. This interface writes message
     files that queue_servers periodically look for. Fetcher processes
-    periodically ping the name server to find the name of current crawl
+    periodically ping the name server to find the name of the current crawl
     as well as a list of queue servers. Fetcher programs then periodically
     make requests in a round-robin fashion to the queue servers for messages
     and schedules. A schedule is data to process and a message has control
@@ -207,17 +207,24 @@
     queue server setting. To further increase query throughput,
     the number of queries that can be handled at a given time, Yioop! installations
     can also be configured as "mirrors" which keep an exact copy of the
-    data stored in the site being mirrored. When query request comes into a
+    data stored in the site being mirrored. When a query request comes into a
     Yioop! node, either it or any of its mirrors might handle it.
     </p>
-    <p>Since a  multi-million page crawl might
-    take several days Yioop! supports the ability to dynamically change its
-    crawl parameters as a crawl is going on.  This allows a user on request
-    from a web admin to disallow Yioop! from continuing to crawl a site without
-    having to stop the overall crawl.  One can also through a web
+    <p>Since a multi-million page crawl involves downloading from the
+    web both rapidly and over several days, Yioop! supports the ability to
+    dynamically change its crawl parameters as a crawl is going on.  This
+    allows a user, on request from a web admin, to disallow Yioop! from
+    continuing to crawl a site or to restrict the number of urls/hour
+    crawled from a site without having to stop the overall crawl. One can also,
     through a web interface, inject new seed sites while the crawl is occurring.
     This can help if someone suggests to you a site that might otherwise not
-    be found by Yioop! given its original list of seed sites.
+    be found by Yioop! given its original list of seed sites. Crawling
+    at high speed can cause a website to become congested and
+    unresponsive. As of Version 0.84, if Yioop! detects a site is
+    becoming congested, it can automatically slow down the crawling of that site.
+    Finally, crawling at high speed can cause your domain name
+    server (the server that maps www.yioop.com to 173.13.143.74) to become slow.
+    To reduce the effect of this, Yioop! supports domain name caching.
     </p>
     <p>Despite its simpler one-round model, Yioop! does a number of things to
     improve the quality of its search results. For each link extracted from a
@@ -331,16 +338,21 @@
     downloads of pages.</li>
     <li>It has a web interface to select seed sites for crawls and to set which
     sites should not be crawled.</li>
-    <li>It obeys robots.txt file including the Crawl-delay directive.
-    It supports the robots meta tag.</li>
+    <li>It obeys robots.txt files, including the Crawl-delay and Sitemap
+    directives. It supports the robots meta tag.</li>
+    <li>Yioop! supports crawl quotas for web sites, i.e., one can control
+    the number of urls/hour downloaded from a site.</li>
+    <li>Yioop! can detect website congestion and slow down its crawling of
+    a site that has become congested.</li>
     <li>It supports open web crawls, but through its web interface one can
     configure it also to crawl only specific sites, domains, or collections
     of sites and domains. </li>
-    <li>Yioop! supports dynamically changing the allowed and disallowed
+    <li>It supports dynamically changing the allowed and disallowed
     sites while a crawl is in progress.</li>
-    <li>Yioop! supports dynamically injecting new seeds site via a web
+    <li>It supports dynamically injecting new seed sites via a web
     interface into the active crawl.</li>
-    <li>It supports the indexing of many different filetypes including:
+    <li>It has its own DNS caching mechanism.</li>
+    <li>Yioop! supports the indexing of many different filetypes including:
     HTML, BMP, DOC, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps,
     SVG, XLSX, and XML. It has a web interface for controlling which amongst
     these filetypes (or all of them) you want to index.</li>
@@ -401,14 +413,14 @@
     better (Yioop! used only to serve search results from a pre-built index
     has been tested to work in PHP 5.2), (3) Curl libraries for downloading
     web pages. To be a little more specific Yioop! has been tested with
-    Apache 2.2 and I've been told Version 0.82 works with lighttpd.
+    Apache 2.2 and I've been told Version 0.82 or newer works with lighttpd.
     It should work with other webservers, although it might take some
     finessing. For PHP,
     you need a build of PHP that incorporates multi-byte string (mb_ prefixed)
     functions, Curl, Sqlite (or at least PDO with Sqlite driver),
     the GD graphics library and the command-line interface. If you are using
     Mac OSX Snow Leopard or Lion, the version of Apache2 and PHP that come
-    with it suffice. For Windows, Mac, and Linux another easy way to get the
+    with it suffice. For Windows, Mac, and Linux, another easy way to get the
     required software is to download an Apache/PHP/MySql suite such as
     <a href="http://www.apachefriends.org/en/xampp.html">XAMPP</a>. On Windows
     machines, find the php.ini file under the php folder in your XAMPP
@@ -421,8 +433,16 @@ to
 extension=php_curl.dll
 </pre>
 <p>
-
+You will also want to increase the value of post_max_size from:
 </p>
+<pre>
+post_max_size = 8M
+to
+post_max_size = 32M
+</pre>
+<p>If you are using WAMP, similar changes
+as with XAMPP must be made, but be aware that WAMP has two php.ini
+files and both of these must be changed.</p>
 <p>
     If you are using the Ubuntu-variant of Linux, the following lines would
     get the software you need:
@@ -436,6 +456,8 @@ extension=php_curl.dll
     sudo apt-get install php5-curl
     sudo apt-get install php5-gd
     </pre>
+    <p>For both Mac and Linux, you need to alter the post_max_size
+    variable in your php.ini file as in the Windows case above.</p>
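+    <p>As a rough sketch of how one might locate the right file (the exact
+    php.ini location varies by system, and the command-line and web server
+    copies of PHP may load different php.ini files), the following commands
+    show which php.ini the command-line PHP loaded and the current value of
+    post_max_size:</p>
+    <pre>
+    php --ini
+    php -r "echo ini_get('post_max_size');"
+    </pre>
+    <p>For the PHP used by your web server, a phpinfo() page reports which
+    php.ini it loaded. In all cases, restart the web server after editing
+    php.ini so the new value takes effect.</p>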
     <p>In addition to the minimum installation requirements above, if
     you want to use the Manage Machines feature in Yioop!, you might need
     to do some additional configuration. The <a href="#machines"
@@ -457,7 +479,7 @@ sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.atrun.plist
     this, you should check that the web server user is not in the file
     /etc/at.deny . On Ubuntu Linux, Apache by default runs as www-data.
     On OSX it runs as _www, but by default the at.deny file is not set up
-    so you probably don't need to edit it. If you are using Xampp on either
+    so you probably don't need to edit it. If you are using XAMPP on either
     of these platforms you need to ensure that Apache is not running as
     nobody. Edit the $XAMPP/etc/httpd.conf file and set the User and Group
     to a real user.</p>
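     <p>For instance (a hypothetical sketch; the user and group names here are
     just placeholders for an account that actually exists on your machine),
     the relevant httpd.conf lines might be changed to:</p>
     <pre>
     User chris
     Group staff
     </pre>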
@@ -508,7 +530,16 @@ page looks like:
 </p>
 <img src='resources/ConfigureScreenForm1.png' alt='The work directory form'/>
 <p>
-For this step you must connect via localhost. Notice under the text field there
+For this step, as a security precaution, you must connect via localhost. If you
+are in a web hosting environment (for example, if you are using cPanel
+to set up Yioop!) where it is difficult to connect using localhost, you can
+add a file, configs/local_configs.php, with the following content:</p>
+<pre>
+&lt;?php
+define('NO_LOCAL_CHECK', true);
+?&gt;
+</pre>
+<p> Returning to our installation discussion, notice under the text field there
 is a heading "Component Check" and there is red text under it, this section is
 used to indicate any requirements that Yioop! has that might not be met yet on
 your machine. In the case above, the web server needs permissions on the
@@ -870,10 +901,17 @@ The main search form for Yioop! looks like:
 <img src='resources/SearchScreen.png' alt='The Search form'/>
 <p>The HTML for this form is in views/search_views.php and the icon is stored
 in resources/yioop.png. You may want to modify these to incorporate Yioop!
-search into your site. The Yioop! logo on any screen in the Yioop!
+search into your site. For more general ways to modify the look of these pages,
+consult the <a href="#framework">Building a site using Yioop! documentation</a>.
+The Yioop! logo on any screen in the Yioop!
 interface is clickable and returns the user to the main search screen.
 One performs a search by typing a query into the search form field and
-clicking the Search button. A typical search results might look like:
+clicking the Search button. The [More Statistics] link only shows if, under the
+Admin control panel, you clicked on more statistics for the crawl. This link goes
+to a page showing many global statistics about the web crawl. Beneath
+this link are the Blog and Privacy links (as well as a link back to the
+SeekQuarry site). These two links are to static pages which can be customized
+through the Manage Locale activity. A typical set of search results might look like:
 </p>
 <img src='resources/SearchResults.png' alt='Example Search Results'
 width="70%"/>
@@ -954,14 +992,25 @@ that url and ip address.</li>
 </ul>
 <p>The remaining query types we list in alphabetical order:</p>
 <ul>
-<li><b>date:Y</b>, <b>date:Y-M</b>, <b>date:Y-M-D</b>
+<li><b>code:http_error_code</b> returns the summaries of all documents
+downloaded with that HTTP response code. For example, code:404 would
+return all summaries where the response was a Page Not Found error.</li>
+<li><b>date:Y</b>, <b>date:Y-m</b>, <b>date:Y-m-d</b>, <b>date:Y-m-d-H</b>,
+<b>date:Y-m-d-H-i</b>, <b>date:Y-m-d-H-i-s</b>
 returns summaries of all documents crawled on the given date.
 For example, <i>date:2011-01</i> returns all document crawled in
-January, 2011.</li>
+January, 2011. As one can see, detail goes down to the second level, so
+one can get an idea of how frequently the crawler is hitting a given
+site at a given time.</li>
+<li><b>dns:num_seconds</b> returns summaries of all documents whose DNS
+lookup time was between num_seconds and num_seconds + 0.5 seconds.
+For example, dns:0.5.</li>
 <li><b>filetype:extension</b> returns summaries of all documents found
 with the given extension. So a search: <em>Chris Pollett filetype:pdf</em>
 would return all documents containing the words Chris and Pollett and with
 extension pdf.</li>
+<li><b>host:all</b> returns summaries of all domain level pages (pages
+where the path was /).</li>
 <li><b>index:timestamp</b> or <b>i:timestamp</b> causes the search to
 make use of the IndexArchive with the given timestamp. So a search like:
 <em>Chris Pollett i:1283121141 | Chris Pollett</em>
@@ -981,7 +1030,9 @@ at the top of the query results. If you would like to inject multiple
 keywords then separate the keywords using plus rather than white space.
 For example, <i>if:corvette!fast+car</i>.</li>
 <li><b>info:url</b> returns the summary in the Yioop! index for the given url
-only.
+only. For example, one could type info:http://www.yahoo.com/ or
+info:www.yahoo.com to get the summary for just the main Yahoo! page. This
+is useful for checking if a particular page is in the index.
 </li>
 <li><b>lang:IETF_language_tag</b>  returns summaries of all documents
 whose language can be determined to match the given language tag.
@@ -1000,9 +1051,15 @@ the spaces with plusses, <i>m:cool+mix</i>.</li>
 returns summaries of all documents which were last modified on the given date.
 For example, <i>modified:2010-02</i> returns all document which were last
 modified in February, 2010.</li>
+<li><b>numlinks:some_number</b> returns summaries of all documents
+which had some_number of outgoing links. For example, numlinks:5.</li>
 <li><b>os:operating_system</b>  returns summaries of all documents
 served on servers using the given operating system. For example,
 <i>os:centos</i>, make sure to use lower case.</li>
+<li><b>path:path_component_of_url</b> returns summaries of all documents
+whose path component begins with path_component_of_url. For example,
+path:/phpBB would return all documents whose path started with phpBB,
+path:/robots.txt would return summaries for all robots.txt files.</li>
 <li><b>server:web_server_name</b> returns summaries of all documents
 served on that kind of web server. For example, <i>server:apache</i>.</li>
 <li><b>site:url</b>, <b>site:host</b>, or <b>site:domain</b> returns all of
@@ -1014,6 +1071,12 @@ the summaries of pages found at that url, host, or domain. As an example,
 decreasing specificity. To return all pages listed in a Yioop! index you can
 do <i>site:all</i>.
 </li>
+<li><b>size:num_bytes</b> returns summaries of all documents whose download
+size was between num_bytes and num_bytes + 5000. num_bytes must be a multiple
+of 5000. For example, size:15000.</li>
+<li><b>time:num_seconds</b> returns summaries of all documents whose download
+time excluding DNS lookup time was between num_seconds and num_seconds + 0.5
+seconds. For example, time:1.5.</li>
 <li><b>version:version_number</b> returns summaries of all documents
 served on web servers with the given version number.
 For example, one might have a query <i>server:apache version:2.2.9</i>.</li>
@@ -1319,8 +1382,10 @@ php fetcher.php start 5
     </p>
     <p>
     The format for sites, domains, and urls are the same for each of these
-    textareas, except that the Seed site area can only take urls. In this
-    format, there should be one site, url, or domain per
+    textareas, except that the Seed site area can only take urls, and in
+    the Disallowed Sites/Sites with Quotas area one can give a url
+    followed by # and a number. Otherwise,
+    in this common format, there should be one site, url, or domain per
     line. You should not separate sites and domains with commas or other
     punctuation. White space is ignored. A domain can be specified as:
     </p>
@@ -1336,6 +1401,15 @@ php fetcher.php start 5
     <p>would all fall under this domain. A site can be specified
     as scheme://domain/path. For example, https://www.somewhere.com/foo/ .
     Such a site includes https://www.somewhere.com/foo/anything_more .</p>
+    <p>In the Disallowed Sites/Sites with Quotas textarea, a number after a # sign
+    indicates that at most that many
+    pages should be downloaded from that site in any given hour. For example,
+    </p>
+    <pre>
+    http://www.ucanbuyart.com/#100
+    </pre>
+    <p>indicates that at most 100 pages are to be downloaded from
+    http://www.ucanbuyart.com/ per hour.</p>
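+    <p>As an illustration (the urls here are just placeholders), a single
+    Disallowed Sites/Sites with Quotas textarea mixing an ordinary disallowed
+    site with a quota site might look like:</p>
+    <pre>
+    http://www.example.com/forum/
+    http://www.ucanbuyart.com/#100
+    </pre>
+    <p>Here the first line disallows crawling pages under that site
+    altogether, while the second merely limits crawling of the quota site
+    to 100 pages per hour.</p>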
     <p>When configuring a new instance of Yioop! the file default_crawl.ini
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
     for the Options form. </p>
@@ -1884,7 +1958,7 @@ OdpRdfArchiveBundle
     to these folders? To this one uses Yioop!'s ResourceController class
     which can be invoked by a link like:</p>
     <pre>
-    &lt;img src="?c=resource&a=get&n=myicon.png&f=resources" /&gt;
+    &lt;img src="?c=resource&amp;a=get&amp;n=myicon.png&amp;f=resources" /&gt;
     </pre>
     <p>
     Here c=resource specifies the controller, a=get specifies the activity --
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 5ad78fd..41e330f 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,12 +2,12 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2941365a784374115a3c64af686ce12c17d28cd8&hb=7d142c2546364229e4796f6a77cdda66285b6ed5&t=zip"
+    >Version 0.84-ZIP</a></li>
+</li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=9da95cca6208cc9eb5b8897bd2966a0f4879dab3&hb=59650148a7df20b56d5cdf4ddb626c83ec281dbf&t=zip"
      >Version 0.822-ZIP</a></li>
 </li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=2d8b87c682039c1698b8c88b67e6bf45a6554efd&hb=83621ce1cf93dd62e22738fec4b3d3026dcc077c&t=zip"
-    >Version 0.80-ZIP</a></li>
-</li>
 </ul>
 <h2>Git Repository / Contributing</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index 360aa04..e115b39 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -22,9 +22,9 @@ the other machines.</li>
 that it creates a word index and document ranking as it crawls rather
 than ranking as a separate step. This keeps the processing done by any
 machine as low as possible so you can still use them for what you bought them
-for. Nevertheless, it is reasonably fast: A set-up consisting of two Mac Mini's
-each with 8GB RAM, a queue_server, and six fetchers can be
-reasonably be expected to crawl around 2 million pages/day.
+for. Nevertheless, it is reasonably fast: a test set-up consisting of three
+Mac Minis each with 8GB RAM, a queue_server, and six fetchers was able
+to crawl 100 million pages in 5 weeks.
 </li>
 <li><b>Make it easy to archive crawls.</b> Crawls are stored in timestamped
 folders that can be moved around zipped, etc. Through the admin interface you