update documentation to v0.6 Yioop, a=chris

Chris Pollett [2011-01-01]
Filename
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 64207bd..914229b 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v.5</h1>
+<h1>Yioop! Documentation v 0.6</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -68,7 +68,7 @@
     [<a href="#P1997b">P1997b</a>]. Google [<a href="#BP1998">BP1998</a>]
     circa 1998 in comparison had an index of about 25 million pages.
     These systems used many machines each working on parts of the search
-    engine problem. On each machine there would in addition be several
+    engine problem. On each machine there would, in addition, be several
     search related processes, and for crawling, hundreds of simultaneous
     threads would be active to manage open connections to remote machines.
     Without threading, downloading millions of pages would be very slow.
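The speed-up from simultaneous connections can be seen in a small sketch. This is not Yioop!'s code (Yioop! is PHP and uses curl); it is a hypothetical Python illustration in which each worker spends most of its time waiting on the network, so many workers overlap their waits:

```python
# Hypothetical sketch (not Yioop! code) of why crawlers keep many
# simultaneous connections open: each worker thread mostly waits on
# the network, so N workers give close to an N-fold speed-up on
# download-bound work. fetch() stands in for a real HTTP download.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # Simulate 0.05 seconds of network wait per page.
    time.sleep(0.05)
    return url, 200

urls = ["http://example.com/page%d" % i for i in range(100)]

start = time.time()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start
# With 50 workers, the 100 simulated downloads overlap and finish in
# roughly two rounds (~0.1s) rather than ~5s done serially.
```
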
@@ -89,17 +89,17 @@
     involves repeatedly applying Google's normalized variant of the
     web adjacency matrix to an initial guess of the page ranks. This problem
     naturally decomposes into rounds. Within a round the Google matrix is
-    applied to the current page ranks estimates of a set of sites. This operation
-    is reasonable easy to distribute to many machines. Computing how relevant
-    a word is to a document is another
+    applied to the current page ranks estimates of a set of sites. This
+    operation is reasonably easy to distribute to many machines. Computing how
+    relevant a word is to a document is another
     task that benefits from multi-round, distributed computation. When a document
-    is processed by an indexers on multiple machine, words are extracted and
+    is processed by indexers on multiple machines, words are extracted and
     a stemming algorithm such as [<a href="#P1980">P1980</a>] might be employed
     (a stemmer would extract the word jump from words such as jumps, jumping,
     etc). Next a statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
     is computed to determine the importance of that word in that document
     compared to that word amongst all other documents. To do this calculation
-    one needs to compute global statistics concerning of all documents seen,
+    one needs to compute global statistics concerning all documents seen,
     such as their average-length, how often a term appears in a document, etc.
     If the crawling is distributed it might take one or more merge rounds to
     compute these statistics based on partial computations on many machines.
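The round-based PageRank computation described above can be sketched in a few lines. This is a hypothetical illustration, not Yioop!'s implementation; the damping factor value and the tiny four-page web are made up for the example:

```python
# Hypothetical sketch of round-based PageRank: repeatedly apply the
# normalized (Google) variant of the web adjacency matrix to an
# initial guess of the page ranks. Each loop iteration is one
# "round"; in a distributed setting each round's matrix application
# could be split across many machines.

def pagerank(links, d=0.85, rounds=30):
    """links[i] is the list of pages that page i links to."""
    n = len(links)
    ranks = [1.0 / n] * n              # initial guess: uniform
    for _ in range(rounds):
        new = [(1.0 - d) / n] * n      # teleportation term
        for i, outs in enumerate(links):
            if outs:
                share = d * ranks[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:                      # dangling page: spread evenly
                for j in range(n):
                    new[j] += d * ranks[i] / n
        ranks = new
    return ranks

# Toy web: pages 1, 2, 3 all link to page 0; page 0 links to page 1.
ranks = pagerank([[1], [0], [0], [0]])
```

Since every page's rank mass is fully redistributed each round, the ranks remain a probability distribution, and the most linked-to page ends up ranked highest.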
@@ -182,7 +182,11 @@
     Thus, link text is used in the score of a document. How much weight a word
     from a link gets also depends on the link's rank. So a low-ranked link with
     the word "stupid" to a given site would tend not to show up early in the
-    results for the word "stupid".
+    results for the word "stupid". Grouping is also used to handle
+    deduplication: it might be the case that the pages of many different URLs
+    have essentially the same content. Yioop! creates a hash of the web page
+    content of each downloaded URL. Amongst URLs with the same hash, only the
+    one that is linked to the most will be returned after grouping.
     </p>
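The deduplication-by-grouping idea above can be sketched as follows. This is a hypothetical illustration, not Yioop! code; the page contents and in-link counts are invented for the example:

```python
# Hypothetical sketch of dedup by grouping: hash each downloaded
# page's content, then amongst URLs sharing a hash return only the
# most linked-to one. The pages and in-link counts are made up.
import hashlib
from collections import defaultdict

pages = {
    "http://a.example/page":      "identical content",
    "http://mirror.example/page": "identical content",
    "http://b.example/other":     "different content",
}
inlinks = {
    "http://a.example/page": 12,
    "http://mirror.example/page": 3,
    "http://b.example/other": 7,
}

# Group URLs by a hash of their page content.
groups = defaultdict(list)
for url, content in pages.items():
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    groups[digest].append(url)

# From each duplicate group, keep only the most linked-to URL.
results = [max(urls, key=lambda u: inlinks[u]) for urls in groups.values()]
```

Here the mirror page is dropped because the original copy has more in-links.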
     <p>
     There are several open source crawlers which can scale to crawls in the
@@ -205,7 +209,9 @@
 file format for storing web page and web summary data. They
     have the advantage of allowing one to store many small files compressed
     as one big file. They also make data from web crawls very portable,
-    making them easy to copy from one location to another.
+    making them easy to copy from one location to another. Like Nutch and
+    Heritrix, Yioop! also has a command line tool for quickly looking at the
+    contents of such archive objects.
     </p>
     <p>
     This concludes the discussion of how Yioop! fits into the current and
@@ -240,8 +246,10 @@
     to be injected into an index based on whether a downloaded document matches
     a url pattern.</li>
     <li>Yioop! uses a web archive file format which makes it easy to
-    copy crawl results amongst different machines.</li>
-    <li>Using this, crawls can be mirrored amongst several machines
+    copy crawl results amongst different machines. It has a command-line
+    tool for inspecting these archives if they need to be examined
+    in a non-search setting.</li>
+    <li>Using web archives, crawls can be mirrored amongst several machines
     to speed-up serving search results. This can be further sped-up
     by using memcache.</li>
     <li>A given Yioop! installation might have several saved crawls and
@@ -262,7 +270,7 @@
     Mac, and Linux another easy way to get the required software is to
    download an Apache/PHP/MySQL suite such as
     <a href="http://www.apachefriends.org/en/xampp.html">XAMPP</a>. On Windows
-    machines, find the the php.ini file under the php folde rin your Xampp
+    machines, find the php.ini file under the php folder in your Xampp
     folder and change the line:</p>
 <pre>
 ;extension=php_curl.dll
@@ -293,11 +301,11 @@ extension=php_curl.dll
     <h3>Memory Requirements</h3>
    <p>In addition to the prerequisite software listed above, Yioop! also
     has certain memory requirements. By default bin/queue_server.php
-    requires 1100MB, bin/fetcher.php requires 550MB, and index.php requires
+    requires 950MB, bin/fetcher.php requires 800MB, and index.php requires
    200MB. These values are set near the tops of each of these files in turn
     with a line like:</p>
 <pre>
-ini_set("memory_limit","550M");
+ini_set("memory_limit","800M");
 </pre>
     <p>
     If you want to reduce these memory requirements, it is advisable to also
@@ -324,12 +332,16 @@ page looks like:
 </p>
 <img src='resources/ConfigureScreenForm1.png' alt='The work directory form'/>
 <p>
-For this step you must connect via localhost. Make sure the web
-server has permissions on the place where this auxiliary
-folder needs to be created. The web server also needs permissions on the
-file bin/config.php to write in the value of the directory you choose.
-On both *nix-like, and Windows machines,
-you should use forward slashes for the folder location. For example,
+For this step you must connect via localhost. Notice that under the text field
+there is a heading "Component Check" with red text beneath it. This section is
+used to indicate any requirements that Yioop! has that might not be met yet on
+your machine. In the case above, the web server needs permissions on the
+file configs/config.php to write in the value of the directory you choose in the
+form for the Work Directory. Another common message asks you to make sure the
+web server has permissions on the place where this auxiliary
+folder needs to be created. When filling out the form on this page, on both
+*nix-like and Windows machines, you should use forward slashes for the folder
+location. For example,
 </p>
 <pre>
 /Applications/XAMPP/xamppfiles/htdocs  #Mac, Linux system similar
@@ -339,11 +351,12 @@ c:/xampp/htdocs/yioop_data   #Windows
 Once you have set the folder,
 you should see a second Profile Settings form beneath the Search Engine
 Work Directory
-form. If you are asked to sign-in before this, and you had not previously
-created accounts in this Work Directory, then the default acocunt has login
-root, and an empty password. Once you see The Profile Settings form
+form. If you are asked to sign-in before this, and you have not previously
+created accounts in this Work Directory, then the default account has login
+root, and an empty password. Once you see it, the Profile Settings form
 allows you to configure the debug settings,
-database settings, queue server and robot settings. It looks like:
+database settings, queue server and robot settings. It will look
+something like:
 </p>
 <img src='resources/ConfigureScreenForm2.png' alt='The configure form'/>
 <p>The <b>Debug Display</b> field set has three check boxes: Error Info, Query
@@ -360,12 +373,12 @@ be checked in a production environment.
 </p>
 <p>The <b>Database Set-up</b> fieldset is used to specify what database management
 system should be used, how it should be connected to, and what user name
-and password should be used for the connection. At present sqlite 2
-(called just sqlite), sqlite3, and mysql databases are supported. The
+and password should be used for the connection. At present sqlite2
+(called just sqlite), sqlite3, and MySQL databases are supported. The
 database is used to store information about what users are allowed to
 use the admin panel and what activities and roles these users have. Unlike
 many database systems, if
-a sqlite variant database is being used then the connection is always
+an sqlite or sqlite3 database is being used then the connection is always
 a file on the current filesystem and there is no notion of login
 and password, so in this case only the name of the database is asked for.
 For sqlite, the database is stored in WORK_DIRECTORY/data. When switching
@@ -375,7 +388,7 @@ create a new database. Yioop! comes with a small sqlite demo database in the
 data directory and this is used to populate the installation database in this
 case. This database has one account root with no password
 which has privileges on all activities. Since different databases associated
-with a Yioop installation might have different user accounts set-up after
+with a Yioop! installation might have different user accounts set-up after
 changing database information you might have to sign in again.
 </p>
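The point that an sqlite connection is just a file path, with no host or login, can be illustrated with a short sketch. This uses Python's standard sqlite3 module, not Yioop!'s PHP database layer, and the table and demo account shown are invented for the example:

```python
# Hypothetical sketch (Python stdlib sqlite3, not Yioop!'s PHP code):
# an sqlite database is just a file on the current filesystem, so
# "connecting" needs only a path -- no host, user name, or password
# as a MySQL connection would require.
import os
import sqlite3
import tempfile

# Stand-in for a path like WORK_DIRECTORY/data/default.db
path = os.path.join(tempfile.mkdtemp(), "default.db")

conn = sqlite3.connect(path)          # the file is created on demand
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('root', '')")  # account with empty password
conn.commit()

row = conn.execute("SELECT name FROM users").fetchone()
conn.close()
```
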
 <p>The <b>Queue Server Set-up</b> fieldset is used to tell Yioop! which machine
@@ -390,7 +403,7 @@ queue server is so they can request a batch of urls to download. There are a
 few different ways this can be set-up:
 </p>
 <ol>
-<li>If the particular, instance of Yioop! is only being used to display
+<li>If the particular instance of Yioop! is only being used to display
 search results from crawls that you have already done, then this
 fieldset can be filled in however you want.</li>
 <li>If you are doing crawling on only one machine, you can put
@@ -403,8 +416,8 @@ the url to the machine that will act as the queue_server.</li>
 <p>In communicating between the fetcher and the server, Yioop! uses
 curl. Curl can be particular about redirects in the case where posted
 data is involved; i.e., if a redirect happens, it does not send posted
-data to the redirected site. So please be careful to include a trailing
-slash if appropriate in your queue server url. Beneath the Queue Server Url
+data to the redirected site. For this reason, Yioop! insists on a trailing
+slash on your queue server url. Beneath the Queue Server Url
 field is a Memcached checkbox. Checking this allows you to specify
 memcache servers that, if specified, will be used to cache in memory search
 query results as well as index pages that have been accessed.</p>
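The idea behind caching query results can be sketched briefly. A real deployment would talk to the memcache servers configured above; this hypothetical stand-in uses an in-process dictionary with a time-to-live, and the class name, TTL value, and sample query are invented:

```python
# Hypothetical sketch of the caching idea behind the Memcached
# option: store computed search results under the query string so a
# repeated query can skip the expensive index lookup. A real setup
# would use memcache servers; this stand-in is an in-process dict
# with a time-to-live.
import time

class QueryCache:
    def __init__(self, ttl=300):
        self.ttl = ttl
        self.store = {}            # query -> (expiry time, results)

    def get(self, query):
        entry = self.store.get(query)
        if entry and entry[0] > time.time():
            return entry[1]        # fresh hit
        return None                # miss or expired

    def set(self, query, results):
        self.store[query] = (time.time() + self.ttl, results)

cache = QueryCache(ttl=300)
if cache.get("open source crawler") is None:       # first lookup: miss
    cache.set("open source crawler", ["result1", "result2"])
hit = cache.get("open source crawler")             # repeat lookup: hit
```
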
@@ -428,7 +441,7 @@ the installation is complete.
 </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='files'>Summary of Files and Folders</h2>
-    <p>The Yioop search engine consists of three main
+    <p>The Yioop! search engine consists of three main
 scripts:</p>
 <dl>
 <dt>bin/fetcher.php</dt><dd>Used to download batches of urls provided
@@ -467,7 +480,10 @@ about who is crawling their sites. Here is a rough guide to what
the Yioop! folder's sub-folders contain:
 <dl>
 <dt>bin</dt><dd>This folder is intended to hold command line scripts
-which are used in conjunction with Yioop!</dd>
+which are used in conjunction with Yioop! In addition to fetcher.php
+and queue_server.php, it contains arc_tool.php which can be used to
+examine the contents of WebArchiveBundles and IndexArchiveBundles from
+the command line.</dd>
 <dt>configs</dt><dd>This folder contains configuration files. You will
 probably not need to edit any of these files directly as you can set the most
common configuration settings from within the admin panel of Yioop! The file
@@ -775,8 +791,16 @@ activities in turn.
     will return to what the Options page looks like in a moment. When
     a crawl is executing, under the start crawl form appears statistics about
     the crawl as well as a Stop Crawl button. Crawling continues until this
-    Stop Crawl button is pressed or until no new sites can be found.
-    Finally, at the bottom of the page is a table listing previously run crawls.
+    Stop Crawl button is pressed or until no new sites can be found. As a
+    crawl occurs, a sequence of IndexShards is written. These keep track
+    of which words appear in which documents for groups of 50,000 or so
+    documents. In addition, an IndexDictionary of which words appear in which
+    shard is written to a separate folder and subfolders. When the Stop button
+    is clicked, the "tiers" of data in this dictionary need to be
+    logarithmically merged. This process can take a couple of minutes, so after
+    clicking stop do not kill the queue_server (if you were going to) until
+    after it says it is waiting for messages again. Finally, at the bottom of
+    the page is a table listing previously run crawls.
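The logarithmic merging of dictionary tiers mentioned above can be sketched schematically. This is a hypothetical illustration of the general scheme, not Yioop!'s IndexDictionary code; the tiny word lists are invented:

```python
# Hypothetical sketch of logarithmic merging: each new shard's
# sorted word list enters at tier 0, and whenever two runs occupy
# the same tier they are merged into the next tier up. Over n shards
# any given word list is merged only O(log n) times.
import heapq

def merge_runs(a, b):
    # Merge two sorted word lists into one sorted list.
    return list(heapq.merge(a, b))

def add_run(tiers, run, level=0):
    while level in tiers:          # collision: merge and carry upward
        run = merge_runs(tiers.pop(level), run)
        level += 1
    tiers[level] = run

tiers = {}
for words in [["cat"], ["dog"], ["ant"], ["bee"]]:
    add_run(tiers, sorted(words))
# After four runs, everything has carried up into a single tier-2 run.
final = tiers[2]
```

This carrying behavior, like binary addition, is why a final merge pass is needed when a crawl stops: the outstanding tiers must be combined before the dictionary can be searched as one unit.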
     Next to each previously run crawl are three links. The first link lets you
     resume this crawl. This will cause Yioop! to look for unprocessed fetcher
     data regarding that crawl, and try to load that into a fresh priority
@@ -1007,7 +1031,8 @@ php fetcher.php stop</pre>
     access to the source code. Thus, you can modify Yioop! to fit in
    with your existing project, or feel free to add new features to
     Yioop! In this section, we look a little bit at some common ways you
-    might try to modify Yioop! If you decide to modify the source code
+    might try to modify Yioop! as well as ways to examine the output of a
+    crawl in a more technical manner. If you decide to modify the source code
     it is recommended you look at the <a
     href="#files">Summary of Files and Folders</a> above again, as well
     as look at the <a href="http://www.seekquarry.com/yioop-docs/">online
@@ -1078,6 +1103,68 @@ php fetcher.php stop</pre>
     will need to edit the models/profile_model.php file and modify
     the method migrateDatabaseIfNecessary($dbinfo) to say how
     AUTOINCREMENT columns should be handled.</p>
+    <h3>Examining the contents of WebArchiveBundles and
+    IndexArchiveBundles</h3>
+    <p>
+    The command-line script bin/arc_tool.php can be used to examine the
+    contents of a WebArchiveBundle or an IndexArchiveBundle. That is, it
+    prints out the web pages or summaries contained therein. It can also
+    be used to give information from the headers of these bundles. It is
+    run from the command line with the following syntax:
+    </p>
+    <pre>
+php arc_tool.php info bundle_name
+    //return info about documents stored in archive.
+php arc_tool.php list bundle_name start num
+    //outputs items start through num from bundle_name
+   </pre>
+   <p>For example,</p>
+   <pre>
+|chris-polletts-macbook-pro:bin:158&gt;php arc_tool.php info /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731
+
+Bundle Name: IndexData1293767731
+Bundle Type: IndexArchiveBundle
+Description: test
+Number of generations: 1
+Number of stored links and documents: 267260
+Number of stored documents: 16491
+Crawl order was: Page Importance
+Seed sites:
+   http://www.ucanbuyart.com/
+   http://www.ucanbuyart.com/fine_art_galleries.html
+   http://www.ucanbuyart.com/indexucba.html
+Sites allowed to crawl:
+   domain:ucanbuyart.com
+   domain:ucanbuyart.net
+Sites not allowed to be crawled:
+   domain:arxiv.org
+   domain:ask.com
+Meta Words:
+   http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/
+
+|chris-polletts-macbook-pro:bin:159&gt;
+|chris-polletts-macbook-pro:bin:202&gt;php arc_tool.php list /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3
+
+BEGIN ITEM, LENGTH:21098
+[URL]
+http://www.ucanbuyart.com/robots.txt
+[HTTP RESPONSE CODE]
+404
+[MIMETYPE]
+text/html
+[CHARACTER ENCODING]
+ASCII
+[PAGE DATA]
+&lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt;
+
+&lt;html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"&gt;
+
+&lt;head&gt;
+	&lt;base href="http://www.ucanbuyart.com/" /&gt;
+....
+</pre>
+
     <p><a href="#toc">Return to table of contents</a>.</p>

     <h2 id="references">References</h2>
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 8ccf1ec..82c2543 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,13 +2,11 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=f9119e9023b0158cfd8399f9a9d69fd0a80c00b8&hb=16e39c831b51e20ef9a0106b4c7eeb82c279b810&t=zip"
+    >Version 0.6-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=21440ff6f620fe99477546701d01070488b6636d&
 hb=181d421bb7151a62939a18b6a843864d888f015e&t=zip"
     >Version 0.52-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&
-h=a33319a5bc6ed58c11af462e8645397fe2c76f27&
-hb=62925b2e560ee4460ecbd9369534544b102b2a34&t=zip"
-    >Version 0.42-ZIP</a></li>
 </ul>
 <h2>Git Repository</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to