Updated documentation for Version 0.76,a=chris

Chris Pollett [2011-10-01 20:Oct:st]

Filename
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml

diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 0cd8a23..6386b03 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.74</h1>
+<h1>Yioop! Documentation v 0.76</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -13,6 +13,7 @@
         <li><a href="#mixes">Mixing Crawl Indexes</a></li>
         <li><a href="#filter">Search Filter</a></li>
         <li><a href="#localizing">Localizing Yioop! to a New Language</a></li>
+        <li><a href="#embedding">Embedding Yioop! in a Site</a></li>
         <li><a href="#customizing">Customizing Yioop!</a></li>
         <li><a href="#commandline">Yioop! Command-line Tools</a></li>
         <li><a href="#references">References</a></li>
@@ -21,18 +22,20 @@
     <h2 id="intro">Introduction</h2>
     <p>The Yioop! search engine is designed to allow users
     to produce indexes of a web-site or a collection of
-    web-sites whose total number of pages are in the millions. In contrast,
-    a search-engine like Google maintains an index of tens of billions
-    of pages. Nevertheless, since you, the user, have control over the exact
-    sites which are being indexed with Yioop! you have much better control
-    over the kinds of results that a search will return. In this section
-    we will discuss some of the different search engine technologies which
-    exist today, how Yioop! fits into this eco-system, and when Yioop!
-    might be the right choice for your search engine needs. In the remainder
-    of this document after the introduction, we will discuss how to get
-    and install Yioop!, the files and folders used in Yioop!,
+    web-sites whose total number of pages are in the tens of millions.
+    In contrast, a search-engine like Google maintains an index of tens of
+    billions of pages. Nevertheless, since you, the user, have control over the
+    exact sites which are being indexed with Yioop! you have much better control
+    over the kinds of results that a search will return. Yioop! provides
+    a traditional web interface to do queries, an RSS API, and a function API.
+    In this section we will discuss some of the different search engine
+    technologies which exist today, how Yioop! fits into this eco-system, and
+    when Yioop! might be the right choice for your search engine needs. In the
+    remainder of this document after the introduction, we will discuss how to
+    get and install Yioop!, the files and folders used in Yioop!,
     user, role, and crawl management in the Yioop! system, localization in
-    the Yioop! system, and finally hacking Yioop!
+    the Yioop! system, embedding Yioop! in an existing site, customizing Yioop!,
+    and the Yioop! command-line tools.
     </p>
     <p>Since the mid-1990s a wide variety of search engine technologies
     have been explored. Understanding some of this history is useful
@@ -61,7 +64,12 @@
     is to use a stand-alone full text index server such as <a
     href="http://www.sphinxsearch.com/">Sphinx</a>. However, for these
     approaches to work the text you are indexing needs to be in a database
-    column or columns.
+    column or columns. Nevertheless, these approaches illustrate another
+    common thread in the development of search systems: search as appliance,
+    where you have a separate search server and access it either through
+    a web-based API or through function calls. Yioop! has both a search
+    function API and a web API that returns
+    <a href="http://www.opensearch.org">Open Search RSS results</a>.
     </p>
     <p>
     By 1997 commercial sites like Inktomi and AltaVista already had
@@ -325,7 +333,8 @@
     RDF files, it also supports re-indexing of data from WebArchives created
     since version 0.66.</li>
     <li>Besides standard output of a web page with ten links it is possible
-    to get query results in Open Search RSS format.</li>
+    to get query results in Open Search RSS format and also to query
+    Yioop! data via a function API.</li>
     </ul>
     <p><a href="#toc">Return to table of contents</a>.</p>

@@ -426,8 +435,8 @@ Work Directory
 form. If you are asked to sign-in before this, and you have not previously
 created accounts in this Work Directory, then the default account has login
 root, and an empty password. Once you see it, The Profile Settings form
-allows you to configure the debug,
-database, search, queue server, and robot settings. It will look
+allows you to configure the debug, search access,
+database, queue server, and robot settings. It will look
 something like:
 </p>
 <img src='resources/ConfigureScreenForm2.png' alt='The configure form'/>
@@ -443,6 +452,17 @@ systems library classes if the browser is navigated to
 http://YIOOP_INSTALLATION/tests/. None of these debug settings should
 be checked in a production environment.
 </p>
+<p>The <b>Search Access</b> fieldset has three check boxes:
+Web, RSS, and API. These control whether a user can use the
+web interface to get query results, whether RSS responses to queries
+are permitted, and whether the function-based search API is
+available. Using the Web Search interface
+and formatting a query url to get an RSS response are
+described in the <a href="#interface">Yioop! Search and User Interface
+section</a>. The Yioop! Search Function API is described in the
+section <a href="#embedding">Embedding Yioop!</a>; you can also look
+in the examples folder at the file search_api.php to see an example
+of how to use it.</p>
 <p>The <b>Database Set-up</b> fieldset is used to specify what database management
 system should be used, how it should be connected to, and what user name
 and password should be used for the connection. At present sqlite2
@@ -603,6 +623,8 @@ browser</dd>
 installation. Whenever the WORK_DIRECTORY is changed it is this database
 which is initially copied into the WORK_DIRECTORY to serve as the database
 of allowed users for the Yioop! system.</dd>
+<dt>examples</dt><dd>This folder contains a file search_api.php
+whose code gives an example of how to use the Yioop! search function API.</dd>
 <dt>lib</dt><dd>This folder is short for library. It contains all the common
 classes for things like indexing, storing data to files, parsing urls, etc.
 lib contains six subfolders: <i>archive_bundle_iterators</i>,
@@ -1418,22 +1440,25 @@ OdpRdfArchiveBundle
     look like: 'fr' => 'FrStemmer' .
     </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-
-    <h2 id='customizing'>Customizing Yioop!</h2>
-    <p>One advantage of an open-source project is that you have complete
-    access to the source code. Thus, you can modify Yioop! to fit in
-    with your existing project or add new feel free to add new features to
-    Yioop! In this section, we look a little bit at some common ways you
-    might try to modify Yioop! as well as ways to examine the output of a
-    crawl in a more technical manner. If you decide to modify the source code
-    it is recommended you look at the <a
-    href="#files">Summary of Files and Folders</a> above again, as well
-    as look at the <a href="http://www.seekquarry.com/yioop-docs/">online
-    Yioop! documentation</a>.</p>
-    <h3>Adding a Search Field to an Existing Project</h3>
+    <h2 id='embedding'>Embedding Yioop! in an Existing Site</h2>
     <p>One use-case for Yioop! is to use it to serve search result for your
-    site. In which case, you want to have a form that goes to Yioop!
-    on some page of your site. A very minimal code snippet for such a
+    site. There are three common ways to do this: (1)
+    On your site have a web-form or links with your installation of Yioop!
+    as their target and let Yioop! format the results. (2) Use the
+    same kind of form or links, but request an OpenSearch RSS Response from
+    Yioop! and then format and display the results yourself within
+    your site. (3) Your site makes function calls to the Yioop! Search
+    API and gets either PHP arrays or a string back and then does what it
+    wants with the results. For access methods (1) and (2) it is possible to
+    have Yioop! on a different machine so that it doesn't consume your main
+    web-site's machine resources. As we mentioned in the configuration section
+    it is possible to disable each of these access paths from within the Admin
+    portion of the web-site. This might be useful for instance if you are using
+    access methods (2) or (3) and don't want users to be able to access the
+    Yioop! search results via its built in web form. We will now spend a moment
+    to look at each of these access methods in more detail.</p>
+    <h3>Accessing Yioop! via an Existing Web Form</h3>
+    <p>A very minimal code snippet for such a
     form would be:</p>
     <pre>
 &lt;form method="get" action='YIOOP_LOCATION'&gt;
@@ -1443,16 +1468,114 @@ OdpRdfArchiveBundle
 &lt;button type="submit"&gt;Search&lt;/button&gt;
 &lt;/form&gt;
     </pre>
-    <p>In the above form, you should change YIOOP_LOCATION to where you have
-    installed Yioop!, TIMESTAMP_OF_CRAWL_YOU_WANT should be the Unix timestamp of
-    which appeasrs in the name of the IndexArchive folder that you want Yioop! to
-    serve results from, LOCALE_TAG should be the locale you want results
-    displayed in, for example, en-US for American English. In addition, to
-    embedding this form on some page on your site, you would
+    <p>In the above form, you should change YIOOP_LOCATION to the web location
+    of your instance of Yioop!; TIMESTAMP_OF_CRAWL_YOU_WANT should be the Unix
+    timestamp that appears in the name of the IndexArchive folder that you want
+    Yioop! to serve results from; and LOCALE_TAG should be the locale you want
+    results displayed in, for example, en-US for American English. In addition
+    to embedding this form on some page on your site, you would
     probably want to change the resources/yioop.png image to something more
     representative of your site. You might also want to edit the file
     views/search_view.php to give a link back to your site from the
     search results.</p>
+    <p>If you had a form such as above, clicking Search would take you
+    to the URL:</p>
+<pre>
+    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;q=QUERY
+</pre>
+    <p>where QUERY was what was typed in the search form. Yioop! supports two
+    other kinds of queries: Related sites queries and cache look-up queries.
+    The related query format is:</p>
+<pre>
+    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;a=related&amp;arg=URL
+</pre>
+    <p>where URL is the url that you are looking up related URLs for. To do a
+    look up of the Yioop! cache of a web page the url format is:</p>
+<pre>
+    YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&amp;l=LOCALE_TAG&amp;q=QUERY&amp;a=cache&amp;arg=URL
+</pre>
+    <p>Here the terms listed in QUERY will be styled in different colors in the
+    web page that is returned; URL is the url of the web page you want to look
+    up in the cache.
+    </p>
+    <h3>Accessing Yioop! and getting an OpenSearch RSS Response</h3>
+    <p>The same basic urls as above can return RSS results simply by appending
+    &amp;f=rss to the end of them. This of course only makes sense for
+    usual and related url queries -- cache queries return web-pages not
+    a list of search results. An example of a portion of an RSS result might
+    look like:</p>
+<pre>
+&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
+&lt;rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
+xmlns:atom="http://www.w3.org/2005/Atom"
+&gt;
+    &lt;channel&gt;
+        &lt;title&gt;PHP Search Engine - Yioop! : art&lt;/title&gt;
+        &lt;language&gt;en-US&lt;/language&gt;
+        &lt;link&gt;http://localhost/git/yioop/?f=rss&amp;amp;q=art&amp;amp;its=1317152828&lt;/link&gt;
+        &lt;description&gt;Search results for: art&lt;/description&gt;
+        &lt;opensearch:totalResults&gt;1105&lt;/opensearch:totalResults&gt;
+        &lt;opensearch:startIndex&gt;0&lt;/opensearch:startIndex&gt;
+        &lt;opensearch:itemsPerPage&gt;10&lt;/opensearch:itemsPerPage&gt;
+        &lt;atom:link rel="search" type="application/opensearchdescription+xml"
+            href="http://localhost/git/yioop/yioopbar.xml"/&gt;
+        &lt;opensearch:Query role="request" searchTerms="art"/&gt;
+
+                &lt;item&gt;
+                &lt;title&gt; An Online Fine Art Gallery U Can Buy Art  -
+                Buy Fine Art Online&lt;/title&gt;
+
+                &lt;link&gt;http://www.ucanbuyart.com/&lt;/link&gt;
+                &lt;description&gt; UCanBuyArt.com is an online art gallery
+                and dealer designed... art gallery and dealer designed for art
+                sales of high quality and original... art sales of high quality
+                and original art from renowned artists. Art&lt;/description&gt;
+                &lt;/item&gt;
+                ...
+                ...
+    &lt;/channel&gt;
+
+&lt;/rss&gt;
+</pre>
+    <p>Notice the opensearch: tags tell us the totalResults, startIndex and
+    itemsPerPage. The opensearch:Query tag tells us what the search terms
+    were.</p>
+    <h3>Accessing Yioop! via the Function API</h3>
+    <p>The last way we will consider to get search results out of Yioop! is
+    via its function API. The Yioop! Function API consists of the following
+    three methods in controllers/search_controller.php :
+    </p>
+    <pre>
+    public function queryRequest($query, $results_per_page, $limit = 0)
+
+    public function relatedRequest($url, $results_per_page, $limit = 0,
+        $crawl_time = 0)
+
+    public function cacheRequest($url, $highlight=true, $terms ="",
+        $crawl_time = 0)
+    </pre>
+    <p>These methods handle basic queries, related queries, and cached
+    web page requests respectively. The arguments of the first two
+    are reasonably self-explanatory. The $highlight and $terms arguments
+    to cacheRequest are to specify whether or not you want syntax highlighting
+    of any of the words in the returned cached web-page. If wanted then
+    $terms should be a space separated list of terms.</p>
+    <p>An example script showing what needs to be set-up before invoking
+    these methods as well as how to extract results from what is returned
+    can be found in the file examples/search_api.php .</p>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='customizing'>Customizing Yioop!</h2>
+    <p>One advantage of an open-source project is that you have complete
+    access to the source code. Thus, you can modify Yioop! to fit in
+    with your existing project, or feel free to add new features to
+    Yioop! In this section, we look a little bit at some common ways you
+    might try to modify Yioop! as well as ways to examine the output of a
+    crawl in a more technical manner. If you decide to modify the source code
+    it is recommended you look at the <a
+    href="#files">Summary of Files and Folders</a> above again, as well
+    as look at the <a href="http://www.seekquarry.com/yioop-docs/">online
+    Yioop! documentation</a>.</p>
+
     <h3>Handling new File Types</h3>
     <p>One relatively easy enhancement to Yioop! would be to enhance
     the way it processes an existing file type or to get it to process
@@ -1576,17 +1699,37 @@ OdpRdfArchiveBundle
     The command-line script bin/arc_tool.php can be use to examine the
     contents of a WebArchiveBundle or an IndexArchiveBundle. i.e., it gives
     a print out of the web pages or summaries contained therein. It can also
-    be used to give information from the headers of these bundles. It is
-    run from the command-line with the syntaxes:
+    be used to give information from the headers of these bundles. Finally,
+    it can be used to re-index an IndexArchiveBundle's dictionary based
+    on the contents of the partial dictionaries in each of the bundle's
+    posting_doc_shards. arc_tool is run from the command-line with the syntaxes:
     </p>
     <pre>
-php arc_tool.php info bundle_name
-    //return info about documents stored in archive.
-php arc_tool.php list bundle_name start num
-    //outputs items start through num from bundle_name
+php arc_tool.php list //returns a list
+//of all the archives in the Yioop! crawl directory.
+
+php arc_tool.php info bundle_name //return info about
+//documents stored in archive.
+
+php arc_tool.php show bundle_name start num //outputs
+//items start through num from bundle_name
+
+php arc_tool.php reindex bundle_name
+//reindex the word dictionary in bundle_name
    </pre>
-   <p>For example,</p>
+   <p>The bundle name can be a full path name, a relative path from
+   the current directory, or it can be just the bundle directory's file
+   name in which case WORK_DIRECTORY/cache will be assumed to be the
+   bundle's location. The following are some examples of using arc tool.
+   They are not all from the same session:</p>
    <pre>
+|chris-polletts-macbook-pro:bin:108&gt;php arc_tool.php list
+Archive1191586964
+IndexData1191586964
+chris-polletts-macbook-pro:bin:109&gt;
+
+...
+
 |chris-polletts-macbook-pro:bin:158&gt;php arc_tool.php info /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731

 Bundle Name: IndexData1293767731
@@ -1610,7 +1753,7 @@ Meta Words:
    http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/

 |chris-polletts-macbook-pro:bin:159&gt;
-|chris-polletts-macbook-pro:bin:202&gt;php arc_tool.php list /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3
+|chris-polletts-macbook-pro:bin:202&gt;php arc_tool.php show /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3

 BEGIN ITEM, LENGTH:21098
 [URL]
@@ -1630,6 +1773,17 @@ ASCII
 &lt;head&gt;
     &lt;base href="http://www.ucanbuyart.com/" /&gt;
    &lt;/pre&gt;
+...
+
+|chris-polletts-macbook-pro:bin:117&gt;php arc_tool.php reindex IndexData1317414152
+
+Shard 0
+[Sat, 01 Oct 2011 11:05:17 -0700] Adding shard data to dictionary files...
+[Sat, 01 Oct 2011 11:05:28 -0700] Merging tiers of dictionary
+
+Final Merge Tiers
+
+Reindex complete!!
 </pre>
     <h3>Querying an Index from the command-line</h3>
 <p>    The command-line script bin/query_tool.php can be use to query
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 568fd1d..80adfd4 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,12 +2,11 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=d0058b709eac3907ca302d3060712fafb5915822&hb=16a6d216f159af3d4c3413bf69021a6910ecae09&t=zip"    >Version 0.76-ZIP</a></li>
+</li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=5e8353236fed2ffcf87f8671baa1f4e5d54381b9&hb=fe23effb2f16949a73d85c13b6ebe2039d1b4387&t=zip"
     >Version 0.741-ZIP</a></li>
 </li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=c4aa1557604578a2b7c9b801c71a831a20242ffb&hb=6fd42f91a0de1c542f89556accb7ff44713efe28&t=zip"
-    >Version 0.721-ZIP</a></li>
-</li>
 </ul>
 <h2>Git Repository</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would to
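As a rough illustration of the web and OpenSearch RSS access paths documented in the diff above, the following Python sketch builds the three kinds of request URL (basic, related, and cache) and pulls the opensearch fields out of a response. It is not part of this commit: the helper names `build_query_url` and `parse_rss` are hypothetical, the sample feed is a trimmed copy of the example in the documentation, and the parser assumes a well-formed RSS 2.0 response using the OpenSearch 1.1 namespace shown there.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:opensearch attribute in the documented feed.
OPENSEARCH = "{http://a9.com/-/spec/opensearch/1.1/}"


def build_query_url(base, timestamp, locale, query=None,
                    activity=None, arg=None, rss=False):
    """Assemble a request URL from the documented parameters.

    its = crawl timestamp, l = locale tag, q = query terms,
    a/arg = activity ("related" or "cache") and the URL it applies to,
    f=rss = ask for an OpenSearch RSS response instead of a web page.
    """
    params = [("its", timestamp), ("l", locale)]
    if query is not None:
        params.append(("q", query))
    if activity is not None:
        params.extend([("a", activity), ("arg", arg)])
    if rss:
        params.append(("f", "rss"))
    return base + "?" + urlencode(params)


def parse_rss(xml_bytes):
    """Extract the opensearch metadata and the result items from a feed."""
    channel = ET.fromstring(xml_bytes).find("channel")
    query = channel.find(OPENSEARCH + "Query")
    return {
        "total_results": int(channel.findtext(OPENSEARCH + "totalResults")),
        "start_index": int(channel.findtext(OPENSEARCH + "startIndex")),
        "items_per_page": int(channel.findtext(OPENSEARCH + "itemsPerPage")),
        "search_terms": query.get("searchTerms"),
        "items": [
            {
                # Collapse the line breaks and runs of spaces seen in titles.
                "title": " ".join(item.findtext("title", "").split()),
                "link": item.findtext("link", "").strip(),
            }
            for item in channel.findall("item")
        ],
    }


# Trimmed copy of the example response from the documentation.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>PHP Search Engine - Yioop! : art</title>
        <language>en-US</language>
        <link>http://localhost/git/yioop/?f=rss&amp;q=art&amp;its=1317152828</link>
        <description>Search results for: art</description>
        <opensearch:totalResults>1105</opensearch:totalResults>
        <opensearch:startIndex>0</opensearch:startIndex>
        <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
        <opensearch:Query role="request" searchTerms="art"/>
        <item>
            <title> An Online Fine Art Gallery U Can Buy Art  -
            Buy Fine Art Online</title>
            <link>http://www.ucanbuyart.com/</link>
            <description> UCanBuyArt.com is an online art gallery</description>
        </item>
    </channel>
</rss>
"""

if __name__ == "__main__":
    print(build_query_url("http://localhost/git/yioop/", "1317152828",
                          "en-US", query="art", rss=True))
    print(parse_rss(SAMPLE_RSS))
```

In a real deployment the bytes passed to `parse_rss` would come from fetching the URL that `build_query_url` returns; that step is left out here so the sketch runs without a Yioop! installation. Appending `f=rss` to a cache request is not meaningful, since cache queries return web pages rather than a list of results.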
