Update documentation for v.4, a=cpollett

Chris Pollett [2010-09-03]
Filename
en-US/pages/documentation.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index db01809..eca48eb 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation</h1>
+<h1>Yioop! Documentation v.4</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -58,7 +58,8 @@
     to create full text indexes for text columns. A faster more robust approach
     is to use a stand-alone full text index server such as <a
     href="http://www.sphinxsearch.com/">Sphinx</a>. However, for these
-    approaches to work the text you are indexing needs to be in a database.
+    approaches to work, the text you are indexing needs to be in a database
+    column or columns.
     </p>
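+    <p>For example (a sketch only; the table, column, and connection
+    details below are made up for illustration), one could create and
+    query such a MySQL full text index from PHP like so:
+    </p>
+<pre>
+// assume a MyISAM table documents(id, body) in database search_db
+$db = new mysqli("localhost", "user", "password", "search_db");
+$db->query("ALTER TABLE documents ADD FULLTEXT INDEX (body)");
+$result = $db->query("SELECT id FROM documents
+    WHERE MATCH(body) AGAINST('open source search engine')");
+</pre>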
     <p>
     By 1997 commercial sites like Inktomi and AltaVista already had
@@ -86,12 +87,13 @@
     to produce high quality results was that it was able to accurately
     rank the importance of web pages. The computation of this page rank
     involves repeatedly applying Google's normalized variant of the
-    web adjacency matrix to an initial guess of the page ranks. So the problem
-    naturally decomposes into rounds and within a round applying the matrix
-    to the current page ranks estimates of a set of sites can be distributed
-    to many machines. Computing how relevant a word is to a document is another
+    web adjacency matrix to an initial guess of the page ranks. This problem
+    naturally decomposes into rounds. Within a round, the Google matrix is
+    applied to the current page rank estimates of a set of sites. This
+    operation is reasonably easy to distribute to many machines.
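+    As a rough illustration (a sketch, not Yioop!'s actual code), one
+    round of this computation might look like the following PHP, where
+    $ranks maps page ids to current rank estimates, $out_links maps page
+    ids to arrays of linked-to page ids, and $damping is the usual damping
+    factor; all three names are hypothetical:
+    </p>
+<pre>
+function page_rank_round($ranks, $out_links, $damping = 0.85)
+{
+    $num_pages = count($ranks);
+    // every page starts with the teleportation share of rank
+    $new_ranks = array_fill_keys(array_keys($ranks),
+        (1 - $damping) / $num_pages);
+    // each page splits its current rank evenly among its out-links
+    // (dangling pages are ignored in this rough sketch)
+    foreach ($out_links as $page => $links) {
+        $share = $damping * $ranks[$page] / max(count($links), 1);
+        foreach ($links as $target) {
+            $new_ranks[$target] += $share;
+        }
+    }
+    return $new_ranks;
+}
+</pre>
+    <p>
+    Computing how relevant a word is to a document is another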
     task that benefits from distributed computation. When a document is
-    processed by an indexer words are extracted and stemming algorithm such as
+    processed by an indexer, words are extracted and a stemming algorithm such as
     [<a href="#P1980">P1980</a>] might be employed (a stemmer would extract
     the word jump from words such as jumps, jumping, etc). Next a statistic
     such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>] is computed to determine
@@ -116,7 +118,7 @@
     has begun to be developed [<a
     href="#KSV2010">KSV2010</a>]. This framework shows the map reduce model
     is capable of solving quite general cloud computing problems -- more
-    than is needed just to deploy a search engine.
+    than is needed just to deploy a search engine.
     </p>
     <p>Infrastructure such as this is not trivial for a small-scale business
     or individual to deploy. On the other hand, most small businesses and
@@ -130,7 +132,7 @@
     getting better. Since the original Google paper, techniques
     to rank pages have been simplified [<a href="#APC2003">APC2003</a>].
     It is also possible to approximate some of the global statistics
-    needed in BM25F using suitably large samples.</p>
+    needed in BM25F using suitably large samples. </p>
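+    <p>For instance, the inverse document frequency statistic used by
+    BM25-style formulas could be estimated from a random sample of
+    documents rather than the whole collection. The sketch below is
+    illustrative only, not Yioop!'s actual code:
+    </p>
+<pre>
+function sample_idf($term, $sample_docs, $collection_size)
+{
+    // $sample_docs is a random sample of document texts
+    $hits = 0;
+    foreach ($sample_docs as $doc) {
+        if (stripos($doc, $term) !== false) {
+            $hits++;
+        }
+    }
+    // scale the sample document frequency up to the whole collection
+    $est_df = max(1, $hits * $collection_size / count($sample_docs));
+    return log($collection_size / $est_df);
+}
+</pre>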
     <p>Yioop! tries to exploit
     these advances to use a simplified distributed model which might
     be easier to deploy in a smaller setting. Each node in a Yioop! system
@@ -148,7 +150,7 @@
     periodically POSTs the result of its computation back to the coordinating
     computer's web server. The data is then written to a set of received
     files. The queue_server as part of its loop looks for received files
-    and integrates their results into the index so far. A side effect of this
+    and integrates their results into the index so far. A side-effect of this
     computation model is that indexing needs to happen as the crawl proceeds.
     So as soon as the crawl is over one can do text searches on the crawl.
     Deploying this  computation model is relatively simple: The web server
@@ -157,7 +159,23 @@
     the desired location under the web server's document folder, each
     fetcher is configured to know who the queue_server is, and finally,
     the fetcher's programs are run on each fetcher machine and the queue_server
-    is run of the coordinating machine.
+    is run on the coordinating machine.
+    </p>
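+    <p>To give a feel for this model, the fetcher side of the POST step
+    might be sketched as below. This is illustrative only; the url and
+    field names are hypothetical, not Yioop!'s actual protocol:
+    </p>
+<pre>
+// assume $data holds the page summaries and mini-inverted index
+// the fetcher has built since its last POST
+$ch = curl_init("http://queue.server.example/yioop/fetch_controller.php");
+curl_setopt($ch, CURLOPT_POST, true);
+curl_setopt($ch, CURLOPT_POSTFIELDS, array("results" => serialize($data)));
+curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+$response = curl_exec($ch);
+curl_close($ch);
+</pre>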
+    <p>Despite its simpler model, Yioop! does a number of things to improve the
+    quality of its search results. For each link extracted from a page,
+    Yioop! creates a micropage which it adds to its index. This includes
+    relevancy calculations for each word in the link as well as an
+    [<a href="#APC2003">APC2003</a>]-based ranking of how important the
+    link was. Yioop! supports a number of iterators which can be thought of
+    as implementing a stripped-down relational algebra geared towards
+    word-document indexes (this is much the same idea as Pig). One of these
+    iterators allows one to perform grouping of document results. In the search
+    results displayed, grouping by url allows all links and documents associated
+    with a url to be grouped as one object. The score of this group is the sum
+    of the scores of its members. Thus, link text is used in the score of a
+    document. How much weight a word from a link gets also depends on the
+    link's rank. So a site pointed to by a low-ranked link containing the word
+    "stupid" would tend not to show up early in the results for the word
+    "stupid".
     </p>
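+    <p>As a rough illustration of the grouping operation (a sketch, not
+    Yioop!'s actual iterator classes), grouping results by url and summing
+    their scores might look like:
+    </p>
+<pre>
+function group_by_url($results)
+{
+    // $results: array of arrays, each with 'url' and 'score' fields
+    $groups = array();
+    foreach ($results as $result) {
+        $url = $result['url'];
+        if (!isset($groups[$url])) {
+            $groups[$url] = array('url' => $url, 'score' => 0);
+        }
+        $groups[$url]['score'] += $result['score'];
+    }
+    return array_values($groups);
+}
+</pre>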
     <p>
     There are several open source crawlers which can scale to crawls in the
@@ -204,6 +222,8 @@
     single machine or distributed across several machines.</li>
     <li>It uses a simplified distributed model that is straightforward to
     deploy.</li>
+    <li>It determines search results using a number of iterators which
+    can be combined as in a simplified relational algebra.</li>
     <li>Indexing occurs as crawling happens, so when a crawl is stopped,
     it is ready to be used to handle search queries immediately.</li>
     <li>Yioop! uses a web archive file format which makes it easy to
@@ -288,7 +308,9 @@ page looks like:
 <p>
 For this step you must connect via localhost. Make sure the web
 server has permissions on the place where this auxiliary
-folder needs to be created. On both *nix-like, and Windows machines,
+folder needs to be created. The web server also needs permissions on the
+file bin/config.php so that it can write in the value of the directory you
+choose. On both *nix-like and Windows machines,
 you should use forward slashes for the folder location. For example,
 </p>
 <pre>
@@ -299,7 +321,10 @@ c:/xampp/htdocs/yioop_data   #Windows
 Once you have set the folder,
 you should see a second Profile Settings form beneath the Search Engine
 Work Directory
-form. This second form allows you to configure the debug settings,
+form. If you are asked to sign in before this, and you had not previously
+created accounts in this Work Directory, then the default account has login
+root and an empty password. The Profile Settings form
+allows you to configure the debug settings,
 database settings, queue server and robot settings. It looks like:
 </p>
 <img src='resources/ConfigureScreenForm2.png' alt='The configure form'/>
@@ -446,7 +471,12 @@ installation. Whenever the WORK_DIRECTORY is changed it is this database
 which is initially copied into the WORK_DIRECTORY to serve as the database
 of allowed users for the Yioop! system.</dd>
 <dt>lib</dt><dd>This folder is short for library. It contains all the common
-classes for things like indexing, storing data to files, parsing urls, etc.</dd>
+classes for things like indexing, storing data to files, parsing urls, etc.
+lib contains two main subfolders: processors and index_bundle_iterators.
+The processors folder contains processors to extract page summaries for
+a variety of different mimetypes. The index_bundle_iterators folder contains
+a variety of iterators useful for iterating over lists of documents
+which might be returned during a query to the search engine.</dd>
 <dt>locale</dt><dd>This folder contains the default locale data which comes
 with the Yioop! system. A locale encapsulates data associated with a
 language and region. A locale is specified by an
@@ -561,7 +591,79 @@ The main search form for Yioop! looks like:
 <p>The HTML for this form is in views/search_views.php and the icon is stored
 in resources/yioop.png. You may want to modify these to incorporate Yioop!
 search into your site. The Yioop! logo on any screen in the Yioop!
-interface is clickable and returns the user to the main search screen. In
+interface is clickable and returns the user to the main search screen.
+One performs a search by typing a query into the search form field and
+clicking the Search button. Typical search results might look like:
+</p>
+<img src='resources/SearchResults.png' alt='Example Search Results'
+width="70%"/>
+<p>For each result back from the query, the title is a link to the page
+that matches the query term. This is followed by a brief summary of
+that page with the query words bolded. Then the document rank, relevancy,
+and overall scores are listed. Each of these scores is a grouped statistic:
+several "micro index entries" are combined to create it. So even though
+a given "micro index entry" might have a document rank between 1 and 10, their
+sum could be a larger value. After these scores there are three links:
+Cached, Similar, and InLinks. Clicking on Cached will display Yioop's downloaded
+copy of the page in question. It will list the time of download and highlight
+the query terms. It should be noted that cached copies of web pages are
+stored on the fetcher which originally downloaded the page. The IndexArchive
+associated with a crawl is stored on the queue server and can be moved
+around to any location by simply moving the folder. However, if an archive
+is moved off the network on which the fetcher lives, then the lookup of a
+cached page might fail. Clicking on Similar causes Yioop! to locate the five
+words with the highest relevancy scores for that document and then to perform
+a search on those words. Finally, clicking on InLinks will take you to a page
+consisting of all the links that Yioop! found to the document in question.
+</p>
+<p>A basic query to the Yioop! search form is typically a sequence of
+words separated by whitespace. This will cause Yioop! to compute a
+"conjunctive query": it will look up only those documents which contain all of
+the terms listed (a sketch of how such a query might be processed is given
+after the list below). Yioop! also supports a variety of other search box
+commands and query types:</p>
+<ul>
+<li>Putting the query in quotes, for example "Chris Pollett", will cause
+Yioop! to perform an exact match search. Yioop! in this case would only
+return documents that have the string "Chris Pollett" rather than just
+the words Chris and Pollett possibly not next to each other in the document.
+Also, using the quote syntax, you can perform searches such as
+"Chris * Homepage" which would return documents which have the word Chris
+followed by some text followed by the word Homepage.
+</li>
+<li>Separating query terms with a vertical bar | results in a disjunctive
+query. So a search on: <em>Chris | Pollett</em> would return pages that have
+either the word Chris or the word Pollett or both.</li>
+<li>If the query has at least one word not prefixed by -, then adding
+a `-' in front of a word in a query means to search for results not containing
+that term. So a search on: <em>of -the</em> would return results containing
+the word "of" but not containing the word "the".</li>
+<li>Searches of the forms: <b>related:url</b>, <b>cache:url</b>,
+<b>link:url</b> are equivalent to having clicked on the Similar, Cached,
+or InLinks links, respectively, on a summary with that url.</li>
+<li><b>site:url</b> or <b>site:host</b> returns all of the summaries of
+pages found at that url or on that host.
+</li>
+<li><b>info:url</b> returns the summary in the Yioop! index for the given url.
+</li>
+<li><b>filetype:extension</b> returns summaries of all documents found
+with the given extension. So a search: <em>Chris Pollett filetype:pdf</em>
+would return all documents containing the words Chris and Pollett and with
+extension pdf.</li>
+<li><b>index:timestamp</b> or <b>i:timestamp</b> causes the search to
+make use of the IndexArchive with the given timestamp. So a search like:
+<em>Chris Pollett i:1283121141 | Chris Pollett</em>
+takes results from the index with timestamp 1283121141 for
+Chris Pollett and unions them with results for Chris Pollett in the default
+index.</li>
+<li><b>weight:some_number</b> or <b>w:some_number</b> has the effect of
+multiplying all scores for this portion of a query by some_number. For example,
+<em>Chris Pollett | Chris Pollett site:wikipedia.org w:5</em>
+would multiply scores of results satisfying Chris Pollett and on
+wikipedia.org by 5 and union these with those satisfying Chris Pollett.
+</li>
+</ul>
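+<p>As promised above, here is a sketch of how a conjunctive query might be
+processed (illustrative only, not Yioop!'s actual implementation): the sorted
+posting lists of two terms are intersected to find the documents containing
+both terms.</p>
+<pre>
+function intersect_postings($list_a, $list_b)
+{
+    // $list_a, $list_b: sorted arrays of document ids for two terms
+    $matches = array();
+    $i = 0;
+    $j = 0;
+    while ($i < count($list_a) && $j < count($list_b)) {
+        if ($list_a[$i] == $list_b[$j]) {
+            $matches[] = $list_a[$i];
+            $i++;
+            $j++;
+        } else if ($list_a[$i] < $list_b[$j]) {
+            $i++;
+        } else {
+            $j++;
+        }
+    }
+    return $matches;
+}
+</pre>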
+<p>In
 the corner of the page with the main search form is a Settings-Signin element:
 </p>
 <img src='resources/SettingsSignin.png' alt='Settings Sign-in Element'/>