Preparing docs for version 0.66, a=chris

Chris Pollett [2011-01-28]
Filename
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/home.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 1d001a0..15b6be9 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.6</h1>
+<h1>Yioop! Documentation v 0.66</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -93,7 +93,7 @@
     operation is reasonably easy to distribute to many machines. Computing how
     relevant a word is to a document is another
     task that benefits from multi-round, distributed computation. When a document
-    is processed by indexers on multiple machine, words are extracted and
+    is processed by indexers on multiple machines, words are extracted and a
     stemming algorithm such as [<a href="#P1980">P1980</a>] might be employed
     (a stemmer would extract the word jump from words such as jumps, jumping,
     etc). Next a statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
@@ -205,13 +205,49 @@
     a web crawler developed at the <a
     href="http://www.archive.org/">Internet Archive</a>. It was designed to do
     archival quality crawls of the web. Its ARC file format
-    inspired the use of WebArchive objects in Yioop! WebArchive's are Yioop!'s
-    file format for storing web page, web summary data. They
-    have the advantage of allowing one to store many small files compressed
-    as one big file. They also make data from web crawls very portable,
-    making them easy to copy from one location to another. Like Nutch and
-    Heritrix, Yioop! also has a command line tool for quickly looking at the
-    contents of such archive objects.
+    inspired the use of WebArchive objects in Yioop!. WebArchives are Yioop!'s
+    container file format for storing web pages, web summary data, url lists,
+    and other kinds of data used by Yioop!. A WebArchive is essentially a
+    linked list of compressed, serialized PHP objects, with the last element
+    in the list being a header object that records information such as the
+    version number and the total count of objects stored. The compression
+    format can be chosen to suit the kind of objects being stored, and the
+    header can also be used to store auxiliary data structures in the list
+    if desired. One nice aspect of serialized PHP objects versus serialized
+    Java objects is that they are human-readable text strings. The main
+    purpose of WebArchives is to allow one to store
+    many small files compressed as one big file. They also make data from web
+    crawls very portable, making them easy to copy from one location to another.
+    Like Nutch and Heritrix, Yioop! also has a command line tool for quickly
+    looking at the contents of such archive objects.
+    </p>
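+    <p>To make the container idea concrete, the following short PHP sketch
+    shows one way many small objects could be stored in a single file as
+    length-prefixed, compressed, serialized records, and then read back in
+    order. The function names (archive_append, archive_read_all) and the
+    record layout are illustrative only; they are not Yioop!'s actual
+    WebArchive implementation:</p>
+    <pre>
+&lt;?php
+// Illustrative sketch only -- not Yioop!'s actual WebArchive code.
+// Append one object to an archive file as a length-prefixed record.
+function archive_append($filename, $object)
+{
+    $record = gzcompress(serialize($object));
+    $fh = fopen($filename, "ab");
+    fwrite($fh, pack("N", strlen($record)) . $record);
+    fclose($fh);
+}
+
+// Read every object back out of the archive file, in order.
+function archive_read_all($filename)
+{
+    $objects = array();
+    $fh = fopen($filename, "rb");
+    while (!feof($fh)) {
+        $len_bytes = fread($fh, 4);
+        if (strlen($len_bytes) != 4) {
+            break;
+        }
+        $len = current(unpack("N", $len_bytes));
+        $objects[] = unserialize(gzuncompress(fread($fh, $len)));
+    }
+    fclose($fh);
+    return $objects;
+}
+?&gt;
+    </pre>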
+    <p>The <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC
+    format</a> is one example of an archival file format for web data. Besides
+    its use at the Internet Archive, ARC and its successor, the
+    <a href="http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml"
+    >WARC format</a>, are often used by TREC conferences to store test data
+    sets such as
+    <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
+    <a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">ClueWeb Dataset</a>.
+    In addition, it is used by <a href="http://grub.org/">grub.org</a>, a
+    distributed, open-source search engine project written in C#.
+    Another important format for archiving web pages is the XML format used by
+    <a href="http://www.wikipedia.org/">Wikipedia</a> for archiving MediaWiki
+    wikis. Wikipedia offers <a
+    href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Creative
+    Commons licensed downloads</a>
+    of its site in this format. The <a href="http://www.dmoz.org/">Open
+    Directory Project</a> makes available its <a
+    href="http://www.dmoz.org/rdf.html">ODP data set</a> in an RDF-like format
+    licensed under the Open Directory License. Thus, we see that there are many
+    large-scale, useful data sets that can be easily licensed. Raw data dumps
+    do not contain indexes of the data, though. This makes sense because
+    indexing technology is constantly improving and it is always possible to
+    re-index old data. Yioop! supports importing and indexing data from ARC,
+    MediaWiki XML dumps, and Open Directory RDF; it also supports re-indexing
+    of old Yioop! data files created after version 0.66. This means that, using
+    Yioop!, you can have searchable access to many data sets as well as the
+    ability to maintain your data going forward.
     </p>
     <p>
     This concludes the discussion of how Yioop! fits into the current and
@@ -255,6 +291,9 @@
     <li>A given Yioop! installation might have several saved crawls and
     it is very quick to switch between any of them and immediately start
     doing text searches.</li>
+    <li>Yioop! supports importing data from ARC, MediaWiki XML, and ODP
+    RDF files; it also supports re-indexing of data from WebArchives created
+    since version 0.66.</li>
     </ul>
     <p><a href="#toc">Return to table of contents</a>.</p>

@@ -354,8 +393,8 @@ Work Directory
 form. If you are asked to sign in before this, and you have not previously
 created accounts in this Work Directory, then the default account has login
 root, and an empty password. Once you see it, the Profile Settings form
-allows you to configure the debug settings,
-database settings, queue server and robot settings. It will look
+allows you to configure the debug,
+database, search, queue server, and robot settings. It will look
 something like:
 </p>
 <img src='resources/ConfigureScreenForm2.png' alt='The configure form'/>
@@ -391,6 +430,19 @@ which has privileges on all activities. Since different databases associated
 with a Yioop! installation might have different user accounts set up, after
 changing database information you might have to sign in again.
 </p>
+<p>The <b>Search Auxiliary Links Displayed</b> fieldset is used to specify
+which links you would like to have presented on the search landing and
+search results pages. The Signin checkbox controls whether to display the
+link to the page where users can sign in to Yioop! The Cache checkbox toggles
+whether a link to the cached copy of a search item should be displayed as part
+of each search result. The Similar checkbox toggles
+whether a link to similar search items should be displayed as part
+of each search result. The Inlinks checkbox toggles
+whether a link to the inlinks of a search item should be displayed as part
+of each search result. Finally, the IP address checkbox toggles
+whether a link to pages with the same IP address should be displayed as part
+of each search result.</p>
+
 <p>The <b>Queue Server Set-up</b> fieldset is used to tell Yioop! which machine
 is going to act as a queue server during a crawl and what secret string
 to use to make sure that communication is being done between
@@ -480,10 +532,10 @@ about who is crawling their sites. Here is a rough guide to what
 the Yioop! folder's sub-folders contain:
 <dl>
 <dt>bin</dt><dd>This folder is intended to hold command line scripts
-which are used in conjunction with Yioop! In addition, to fetcher.php
-and queue_server.php, it contains arc_tool.php which can be used to
-examine the contents of WebArchiveBundle's and IndexArchiveBundle's from
-the command line.</dd>
+which are used in conjunction with Yioop! In addition to the fetcher.php
+and queue_server.php scripts already mentioned, it contains arc_tool.php,
+which can be used to examine the contents of WebArchiveBundles and
+IndexArchiveBundles from the command line.</dd>
 <dt>configs</dt><dd>This folder contains configuration files. You will
 probably not need to edit any of these files directly as you can set the most
 common configuration settings from within the admin panel of Yioop! The file
@@ -594,7 +646,7 @@ ArchiveUNIX_TIMESTAMP, IndexDataUNIX_TIMESTAMP, and QueueBundleUNIX_TIMESTAMP.
 ArchiveUNIX_TIMESTAMP folders hold complete caches of web pages that have been
 crawled. These folders will appear on machines which are running fetcher.php.
 IndexDataUNIX_TIMESTAMP folders hold a word document index as well as summaries
-of pages crawled.  A folder of this type is needed by the web app
+of pages crawled. A folder of this type is needed by the web app
 portion of Yioop! to serve search results. These folders can be moved
 to whichever machine you want to
 serve results from. QueueBundleUNIX_TIMESTAMP folders are used to maintain
@@ -615,7 +667,7 @@ to say what it has just crawled, the web app writes data into these
 folders to be processed later by the queue_server. The UNIX_TIMESTAMP
 is used to keep track of which crawl the data is destined for. IndexData
 folders contain mini-inverted indexes (word document records) which are
-to be added to the global inverted index for that crawl.
+to be added to the global inverted index (called the dictionary) for that crawl.
 RobotData folders contain information that came from robots.txt files.
 Finally, ScheduleData folders have data about found urls that could
 eventually be scheduled to crawl. Within each of these three kinds of folders
@@ -655,8 +707,10 @@ around to any location by simply moving the folder. However, if an archive
 is moved off the network on which the fetcher lives, then the lookup of a
 cached page might fail. Clicking on Similar causes Yioop! to locate the five
 words with the highest relevancy scores for that document and then to perform
-a search on those words. Finally, clicking on InLinks will take you to a page
+a search on those words. Clicking on InLinks will take you to a page
 consisting of all the links that Yioop! found to the document in question.
+Finally, clicking on an IP address link returns all documents that were
+crawled from that IP address.
 </p>
 <p>A basic query to the Yioop! search form is typically a sequence of
 words separated by whitespace. This will cause Yioop! to compute a
@@ -676,12 +730,13 @@ followed by some text followed by the word Homepage.
 query. So a search on: <em>Chris | Pollett</em> would return pages that have
 either the word Chris or the word Pollett or both.</li>
 <li>If the query has at least one word not prefixed by -, then adding
-a `-' in front of a word in a query mean search for results not containing
+a `-' in front of a word in a query means search for results not containing
 that term. So a search on: <em>of -the</em> would return results containing
 the word "of" but not containing the word "the".</li>
 <li>Searches of the forms: <b>related:url</b>, <b>cache:url</b>,
-<b>link:url</b> are equivalent to having clicked on the Similar, Cached,
-or InLinks links, respectively, on a summary with that url.</li>
+<b>link:url</b>, <b>ip:ip_address</b> are equivalent to having clicked on the
+Similar, Cached, InLinks, or IP address links, respectively, on a summary with
+that url and ip address.</li>
 <li><b>site:url</b> or <b>site:host</b> returns all of the summaries of
 pages found at that url or on that host.
 </li>
@@ -691,6 +746,25 @@ pages found at that url or on that host.
 with the given extension. So a search: <em>Chris Pollett filetype:pdf</em>
 would return all documents containing the words Chris and Pollett and with
 extension pdf.</li>
+<li><b>server:web_server_name</b> returns summaries of all documents
+served on that kind of web server. For example, <i>server:apache</i>.</li>
+<li><b>version:version_number</b> returns summaries of all documents
+served on web servers with the given version number.
+For example, one might have a query <i>server:apache version:2.2.9</i>.</li>
+<li><b>os:operating_system</b> returns summaries of all documents
+served on servers using the given operating system. For example,
+<i>os:centos</i>; make sure to use lower case.</li>
+<li><b>lang:IETF_language_tag</b> returns summaries of all documents
+whose language can be determined to match the given language tag.
+For example, <i>lang:en-US</i>.</li>
+<li><b>date:Y</b>, <b>date:Y-M</b>, <b>date:Y-M-D</b>
+return summaries of all documents crawled on the given date.
+For example, <i>date:2011-01</i> returns all documents crawled in
+January, 2011.</li>
+<li><b>modified:Y</b>, <b>modified:Y-M</b>, <b>modified:Y-M-D</b>
+return summaries of all documents which were last modified on the given date.
+For example, <i>modified:2010-02</i> returns all documents which were last
+modified in February, 2010.</li>
 <li><b>index:timestamp</b> or <b>i:timestamp</b> causes the search to
 make use of the IndexArchive with the given timestamp. So a search like:
 <em>Chris Pollett i:1283121141 | Chris Pollett</em>
@@ -869,18 +943,31 @@ php fetcher.php stop</pre>
     <h3>Specifying Crawl Options</h3>
     <p>As we pointed out above, next to the Start Crawl button is an Options
     link. Clicking on this link should display the following activity:</p>
-<img src='resources/CrawlOptions.png' alt='Crawl Options Form'/>
-    <p>The Back link in the corner returns one to the previous activity.
-    The first form field, "Get Crawl Options From", allows one to read in
-    crawl options either from the default_crawl.ini file or from the crawl
-    options used in a previous crawl. The rest of the form allows the user to
-    change the existing crawl options. The second form field is labeled Crawl
-    Order. This can be set to either Bread First or Page Importance. It
-    specifies the order in which pages will be crawled. In breadth first
-    crawling, roughly all the seeds sites are visited first, followed by sites
-    linked directly from seed sites, followed by sites linked directly from
-    sites linked directly from seed sites, etc. Page Importance is our
-    modification of [<a href="#APC2003">APC2003</a>]. In this
+<img src='resources/WebCrawlOptions.png' alt='Web Crawl Options Form'/>
+    <p>The Back link in the corner returns one to the previous activity.</p>
+    <p>There are two kinds of crawls that can be performed by Yioop!:
+    a crawl of sites on the web or a crawl of data that has been
+    previously stored in a supported archive format, such as data that was
+    crawled by versions 0.66 and above of Yioop!, an
+    <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">Internet
+    Archive arc file</a>, a
+    <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"
+    >MediaWiki xml dump</a>, or an
+    <a href="http://rdf.dmoz.org/"
+    >Open Directory Project RDF file</a>. We will first concentrate on
+    new web crawls and then return to archive crawls later.</p>
+    <h4>Web Crawl Options</h4>
+    <p>
+    On the web crawl tab, the first form field, "Get Crawl Options From",
+    allows one to read in crawl options either from the default_crawl.ini file
+    or from the crawl options used in a previous crawl. The rest of the form
+    allows the user to change the existing crawl options. The second form field
+    is labeled Crawl Order. This can be set to either Breadth First or Page
+    Importance. It specifies the order in which pages will be crawled. In
+    breadth first crawling, roughly all the seed sites are visited first,
+    followed by sites linked directly from seed sites, followed by sites linked
+    directly from sites linked directly from seed sites, etc. Page Importance is
+    our modification of [<a href="#APC2003">APC2003</a>]. In this
     order, each seed site starts with a certain quantity of money.
     When a site is crawled it distributes its money equally amongst sites
     it links to. When picking sites to crawl next, one chooses those that
@@ -944,6 +1031,56 @@ php fetcher.php stop</pre>
     with the document. Meta-words are useful to create shorthands for
     searches on certain kinds of sites, like dictionary sites and wikis.
     </p>
+    <h4>Archive Crawl Options</h4>
+    <p>We now consider how to do crawls of previously obtained archives.
+    From the initial crawl options screen, clicking on the Archive Crawl
+    tab gives one the following form:</p>
+<img src='resources/ArchiveCrawlOptions.png' alt='Archive Crawl Options Form'/>
+    <p>The drop down lists all previously done crawls that are available for
+    recrawl. These include both previously done Yioop! crawls and crawls
+    of other file formats such as arc, MediaWiki XML, and ODP RDF, which
+    have been appropriately prepared in the PROFILE_DIR/cache folder.
+    You might want to re-crawl an existing Yioop! crawl if you want to add
+    new meta-words or if you are migrating a crawl from an older version
+    of Yioop! for which the index isn't readable by your newer version of
+    Yioop! You might want to do an archive crawl of other file formats
+    if you want Yioop! to be able to provide search results for their content.
+    Once you have selected the archive you want to crawl, you can add
+    meta-words as discussed in the previous section and then save your options
+    and go back to the Create Crawl screen to start your crawl. As with
+    a Web Crawl, for an archive crawl you need both the queue_server
+    running and at least one fetcher running to perform a crawl. To re-crawl
+    an archive that was made with several fetchers, each of the fetchers
+    that was used in the creation process should be running.</p>
+    <p>To get Yioop! to detect arc, MediaWiki, and ODP RDF files, you need
+    to create a PROFILE_DIR/cache/IndexData(timestamp) folder on the queue
+    server machine containing the single file arc_description.txt. This
+    text file's contents should just be the name you would like for your
+    data. In the Archive Crawl drop down this name will appear with the
+    prefix ARCFILE:: and you can then select it as the source to crawl.
+    To actually crawl anything, though, for each fetcher machine that you would
+    like to take part in the archive crawl, you should make a folder
+    PROFILE_DIR/cache/Archive(same_timestamp). In this folder you should
+    have a text file arc_type.txt saying what kind of archive bundle this is.
+    If you want to archive crawl arc files, it would have the single line:</p>
+    <pre>
+ArcArchiveBundle
+    </pre>
+    <p>For Media Wiki xml, the line would be:</p>
+    <pre>
+MediaWikiArchiveBundle
+    </pre>
+    <p>And for Open Directory RDF, the line would be:</p>
+    <pre>
+OdpRdfArchiveBundle
+    </pre>
+    <p>Then in this folder (not a subdirectory thereof) you would also put
+    instances of the files in question that you would like to archive crawl.
+    So for arc files, these would be files of extension .arc.gz; for MediaWiki,
+    files of extension .xml.bz2; and for ODP-RDF, files of extension
+    .rdf.u8.gz.
+    </p>
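+    <p>For example, to prepare an archive crawl of arc files, the folder
+    layout might look like the following sketch. The timestamp 1234567890,
+    the archive name, and the .arc.gz file names below are placeholders;
+    use your own values:</p>
+    <pre>
+On the queue server machine:
+    PROFILE_DIR/cache/IndexData1234567890/
+        arc_description.txt   (contains a name such as: My Arc Data)
+
+On each fetcher machine:
+    PROFILE_DIR/cache/Archive1234567890/
+        arc_type.txt          (contains the single line: ArcArchiveBundle)
+        part01.arc.gz
+        part02.arc.gz
+    </pre>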
+
     <p><a href="#toc">Return to table of contents</a>.</p>

     <h2 id='mixes'>Mixing Crawl Indexes</h2>
@@ -1024,6 +1161,32 @@ php fetcher.php stop</pre>
     prefix db_ (such as the names of activities) are stored in the database.
     So you cannot find these ids in the source code. The tooltip trick
     mentioned above does not work for database string ids.</p>
+
+    <h3>Adding a stemmer for your language</h3>
+    <p>Depending on the language you are localizing to, it may make sense
+    to write a stemmer for words that will be inserted into the index.
+    A stemmer takes inflected or sometimes derived words and reduces
+    them to their stem. For instance, jumps and jumping would be reduced to
+    jump in English. As Yioop! crawls, it attempts to detect the language of
+    a given web page it is processing. If a stemmer exists for this language
+    it will call the stemmer's stem($word) method on each word it extracts
+    from the document before inserting information about it into the index.
+    Similarly, if an end-user is entering a simple conjunctive search query
+    and a stemmer exists for his language settings, then the query terms will
+    be stemmed before being looked up in the index. Currently, Yioop! comes
+    with only an English language stemmer that uses the Porter Stemming
+    Algorithm [<a href="#P1980">P1980</a>]. This stemmer is located in the
+    file lib/stemmers/en_stemmer.php . The [<a href="#P1980">P1980</a>] link
+    points to a site that has source code for stemmers for many other languages
+    (unfortunately, not written in PHP). It would not be hard to port these
+    to PHP and then add a file to the lib/stemmers folder. For instance, one
+    could add a file fr_stemmer.php containing a class FrStemmer with method
+    stem($word) if one wanted to add a stemmer for French. To get Yioop! to use
+    your stemmer you would then edit the file lib/phrase_parser.php . At the
+    start of this file there is a static associative array $STEMMERS. You
+    would add an entry for your stemmer to this array, for French this would
+    look like: 'fr' => 'FrStemmer' .
+    </p>
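+    <p>As a sketch of what such a plug-in might look like, the skeleton
+    below shows a hypothetical fr_stemmer.php. The stemming rule in the body
+    is a placeholder rather than a real French stemming algorithm; only the
+    file location, the class name, and the stem($word) method follow the
+    conventions described above:</p>
+    <pre>
+&lt;?php
+// lib/stemmers/fr_stemmer.php -- illustrative skeleton only
+class FrStemmer
+{
+    // stem() is shown as a static method here for simplicity; match
+    // however en_stemmer.php declares its stem method.
+    static function stem($word)
+    {
+        // placeholder rule: strip a trailing plural "s" from longer words;
+        // a real implementation would apply the suffix-stripping rules of
+        // a published French stemming algorithm
+        if (strlen($word) &gt; 3) {
+            if (substr($word, -1) == "s") {
+                $word = substr($word, 0, -1);
+            }
+        }
+        return $word;
+    }
+}
+?&gt;
+    </pre>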
     <p><a href="#toc">Return to table of contents</a>.</p>

     <h2 id='hacking'>Hacking Yioop!</h2>
@@ -1161,7 +1324,7 @@ ASCII
 &lt;html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"&gt;

 &lt;head&gt;
-	&lt;base href="http://www.ucanbuyart.com/" /&gt;
+    &lt;base href="http://www.ucanbuyart.com/" /&gt;
    &lt;/pre&gt;
 ....

diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 1f983eb..129b47f 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,11 +2,10 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=847d8bcce4f259660ecd28557e662d73da8056aa&hb=848255db69f196262b004856e5985251658884ee&t=zip"
+    >Version 0.66-ZIP</a></li>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=b26f8183abd26f1fa72f54edd3953e196dfa4e78&hb=9ff688c92eed7f7090cf5eb4c710abed32255932&t=zip"
     >Version 0.62-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=21440ff6f620fe99477546701d01070488b6636d&
-hb=181d421bb7151a62939a18b6a843864d888f015e&t=zip"
-    >Version 0.52-ZIP</a></li>
 </ul>
 <h2>Git Repository</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to
diff --git a/en-US/pages/home.thtml b/en-US/pages/home.thtml
index d798529..081e94d 100755
--- a/en-US/pages/home.thtml
+++ b/en-US/pages/home.thtml
@@ -9,7 +9,7 @@ results for a set of urls or domains.
 <p>Yioop! was designed with the following goals in mind:</p>
 <ul>
 <li><b>To lower the barrier of entry for people wanting to obtain personal
-crawls of the web.</b> At present, it requires only a WebServer such as Apache
+crawls of the web.</b> At present, it requires only a web server such as Apache
 and command line access to a default build of PHP 5.3 or better. Configuration
 can be done using a GUI interface.</li>
 <li><b>To allow for distributed crawling of the web.</b> To get a snapshot of
@@ -30,4 +30,9 @@ days.</li>
 folders, which can be moved around, zipped, etc. Through the admin interface you
 can select, amongst the crawls which exist in a crawl folder, which crawl you
 want to serve from.</li>
+<li><b>To make it easy to crawl archives.</b> There are many sources of
+raw web data available today, such as files that use the Internet Archive's
+arc format, Open Directory Project RDF data, Wikipedia xml dumps, etc. Yioop!
+can index these formats directly, allowing one to get an index for these
+high-value content sites without needing to do an exhaustive crawl.</li>
 </ul>