Updating of documents prior to v0.5, a=chris

Chris Pollett [2010-11-14]
Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml
en-US/pages/welcome.thtml
diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index 5f5ca46..0e9e273 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -29,5 +29,7 @@ combined the two to get Yioop!</p>
 <p>
 Several people helped
 with localization: Mary Pollett, Thanh Bui, Youn Kim, Sugi Widjaja,
-Chao-Hsin Shih, Sujata Dongre, and Jonathan Ben-David.
+Chao-Hsin Shih, Sujata Dongre, and Jonathan Ben-David. Thanks to
+Ravi Dhillon for finding and helping with the fixes for Issue 15
+and Commit 632e46.
 </p>
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 0f2a9e8..fae23ae 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v.42</h1>
+<h1>Yioop! Documentation v.5</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -10,6 +10,7 @@
         <li><a href="#passwords">Managing Accounts</a></li>
         <li><a href="#userroles">Managing Users and Roles</a></li>
         <li><a href="#crawls">Managing Crawls</a></li>
+        <li><a href="#mixes">Mixing Crawl Indexes</a></li>
         <li><a href="#localizing">Localizing Yioop! to a New Language</a></li>
         <li><a href="#hacking">Hacking Yioop!</a></li>
         <li><a href="#references">References</a></li>
@@ -47,9 +48,8 @@
     a million page index. This edges towards the limits of the capabilities
     of database systems although techniques like table sharding can help to
     some degree. The Yioop! engine uses a database to manage some things
-    like users and roles, but uses it own web archive format and indexing
-    technologies to handle crawl data. These will be described later in more
-    detail.</p>
+    like users and roles, but uses its own web archive format and indexing
+    technologies to handle crawl data.</p>
     <p>When the site that is being indexed consists of dynamic pages rather than
     the largely static page situation considered above, and those dynamic
     pages get most of their text content from a table column or columns,
@@ -92,16 +92,19 @@
     applied to the current page rank estimates of a set of sites. This operation
     is reasonably easy to distribute to many machines. Computing how relevant
     a word is to a document is another
-    task that benefits from distributed computation. When a document is
-    processed by an indexer, words are extracted and stemming algorithm such as
-    [<a href="#P1980">P1980</a>] might be employed (a stemmer would extract
-    the word jump from words such as jumps, jumping, etc). Next a statistic
-    such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>] is computed to determine
-    the importance of that word in that document compared to that word amongst
-    all other documents. To do this calculation one needs to compute global
-    statistics concerning of all documents seen, such as their average-length,
-    how often a term appears in a document, etc. Each of these computations
-    benefits from  distributed computation. Infrastructure such as the Google
+    task that benefits from multi-round, distributed computation. When a document
+    is processed by indexers on multiple machines, words are extracted and a
+    stemming algorithm such as [<a href="#P1980">P1980</a>] might be employed
+    (a stemmer would extract the word jump from words such as jumps, jumping,
+    etc). Next a statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
+    is computed to determine the importance of that word in that document
+    compared to that word amongst all other documents. To do this calculation
+    one needs to compute global statistics concerning all documents seen,
+    such as their average length, how often a term appears in a document, etc.
+    If the crawling is distributed, it might take one or more merge rounds to
+    compute these statistics based on partial computations on many machines.
+    Hence, each of these computations benefits from allowing distributed
+    computation to be multi-round. Infrastructure such as the Google
     File System [<a href="#GGL2003">GGL2003</a>], the MapReduce model [<a
     href="#DG2004">DG2004</a>],
     and the Sawzall language [<a href="#PDGQ2006">PDGQ2006</a>] were built to
@@ -122,10 +125,7 @@
     </p>
     <p>Infrastructure such as this is not trivial for a small-scale business
     or individual to deploy. On the other hand, most small businesses and
-    homes do have available <a href="http://askville.amazon.com/
-    average-number-computers-household-Televisions-family/
-    AnswerViewer.do?requestId=217711">several machines</a>
-    not all of whose computational
+    homes do have available several machines not all of whose computational
     abilities are being fully exploited. So the capability to do
     distributed crawling and indexing in this setting exists. Further,
     high-speed internet for homes and small businesses is steadily
@@ -145,14 +145,15 @@
     to the coordinating computer's web server asking for messages and schedules.
     A schedule is data to process and a message has control information about
     what kind of processing should be done. The queue_server is responsible
-    for generating schedule files, but unlike the map reduce model, schedules
+    for generating schedule files, but unlike the map-reduce model, schedules
     might be sent to any fetcher. As a fetcher processes a schedule, it
     periodically POSTs the result of its computation back to the coordinating
     computer's web server. The data is then written to a set of received
     files. The queue_server as part of its loop looks for received files
-    and integrates their results into the index so far. A side-effect of this
-    computation model is that indexing needs to happen as the crawl proceeds.
-    So as soon as the crawl is over one can do text searches on the crawl.
+    and merges their results into the index so far. So the model is in a
+    sense one round: URLs are sent to the fetchers, summaries of downloaded
+    pages are sent back to the queue server and merged into the index.
+    As soon as the crawl is over one can do text searches on the crawl.
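+    To make this cycle concrete, one round of it, as seen from a fetcher,
+    might look roughly like the following sketch. The coordinator url, the
+    query parameter, and the posted field names here are made-up placeholders
+    rather than Yioop!'s actual controller and variable names:
+    <pre>
+    // sketch only: endpoint and field names are illustrative, not Yioop!'s own
+    $coordinator = "http://192.168.0.2/coordinator.php";
+
+    // ask the coordinating web server for a schedule of urls to download
+    $ch = curl_init($coordinator . "?request=schedule");
+    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+    $schedule = unserialize(curl_exec($ch));
+    curl_close($ch);
+
+    $summaries = array();
+    foreach ($schedule["urls"] as $url) {
+        // download each page and keep a short summary of it
+        $ch = curl_init($url);
+        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+        $page = curl_exec($ch);
+        curl_close($ch);
+        $summaries[] = array("url" => $url,
+            "summary" => substr(strip_tags($page), 0, 500));
+    }
+
+    // POST the summaries back; the queue_server merges them into the index
+    $ch = curl_init($coordinator);
+    curl_setopt($ch, CURLOPT_POST, true);
+    curl_setopt($ch, CURLOPT_POSTFIELDS, array("results" => serialize($summaries)));
+    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+    curl_exec($ch);
+    curl_close($ch);
+    </pre>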
     Deploying this computation model is relatively simple: The web server
     software needs to be installed on each machine, the Yioop! software (which
     has the fetcher, queue_server, and web app components) is copied to
@@ -168,19 +169,25 @@
     [<a href="#APC2003">APC2003</a>]-based ranking of how important the
     link was. Yioop! supports a number of iterators which can be thought of
     as implementing a stripped-down relational algebra geared towards
-    word-document indexes (this is much the same idea as Pig). One of these
-    operators allows one to perform grouping  of document results. In the search
-    results displayed, grouping by url allows all links and documents associated
-    with a url to be grouped as one object. Scoring of this group is a sum of
-    all these scores. Thus, link text is used in the score of a document. How
-    much weight a word from a link gets also depends on the link's rank. So
-    a low-ranked link with the word "stupid" to a given site would tend not
-    to show up early in the results for the word "stupid".
+    word-document indexes (this is much the same idea as Pig). One of these
+    operators allows one to build results from unions of stored crawls. This
+    makes it possible to do many smaller topic-specific crawls and combine them
+    with one's own weighting scheme into a larger crawl. This approach is not
+    unlike topic-sensitive page ranking approaches [<a href="#H2002">H2002</a>].
+    Yioop! comes with a GUI facility to make the creation of these crawl mixes
+    easy. Another useful operator that Yioop! supports allows one to perform
+    groupings of document results. In the search results displayed,
+    grouping by url allows all links and documents associated with a url to be
+    grouped as one object. Scoring of this group is a sum of all these scores.
+    Thus, link text is used in the score of a document. How much weight a word
+    from a link gets also depends on the link's rank. So a low-ranked link with
+    the word "stupid" to a given site would tend not to show up early in the
+    results for the word "stupid".
     </p>
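     <p>To give a feel for how such a crawl mix is scored, the sketch below
     merges ranked results from several stored crawls using per-crawl weights
     and groups the scores by url. The function and the array shapes are
     invented for illustration; they are not Yioop!'s actual iterator
     classes:</p>
     <pre>
     // illustration only: not Yioop!'s actual iterator implementation
     function mix_results($results_per_crawl, $weights)
     {
         $scores = array();
         foreach ($results_per_crawl as $crawl_name => $results) {
             foreach ($results as $result) {
                 $url = $result["url"];
                 if (!isset($scores[$url])) {
                     $scores[$url] = 0;
                 }
                 // a url group's score is a weighted sum of all of its scores
                 $scores[$url] += $weights[$crawl_name] * $result["score"];
             }
         }
         arsort($scores); // highest combined score first
         return $scores;
     }

     $mixed = mix_results(
         array(
             "art_crawl" => array(array("url" => "http://www.ucanbuyart.com/",
                 "score" => 2.5)),
             "news_crawl" => array(array("url" => "http://www.ucanbuyart.com/",
                 "score" => 1.0))
         ),
         array("art_crawl" => 0.7, "news_crawl" => 0.3)
     );
     </pre>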
     <p>
     There are several open source crawlers which can scale to crawls in the
     millions to hundreds of millions of pages. Most of these are written in
-    Java, C, C++, not PHP. Three important ones are <a
+    Java, C, C++, or C#, not PHP. Three important ones are <a
     href="http://nutch.apache.org/">Nutch</a>/
     <a href="http://lucene.apache.org/">Lucene</a>/ <a
     href="http://lucene.apache.org/solr/">Solr</a>
@@ -195,7 +202,7 @@
     href="http://www.archive.org/">Internet Archive</a>. It was designed to do
     archival quality crawls of the web. Its ARC file format
     inspired the use of WebArchive objects in Yioop! WebArchive's are Yioop!'s
-    file format for storing web page, web summary, and index data. They
+    file format for storing web page and web summary data. They
     have the advantage of allowing one to store many small files compressed
     as one big file. They also make data from web crawls very portable,
     making them easy to copy from one location to another.
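     The essential idea, many compressed records appended to one file and
     located by their byte offsets, can be sketched as follows; the exact
     on-disk layout below is made up for illustration and is not the actual
     WebArchive format:
     <pre>
     // illustration of the idea only; not the actual WebArchive layout
     function archive_append($archive_file, $record)
     {
         $data = gzcompress(serialize($record));
         $fh = fopen($archive_file, "c+b"); // create if needed, do not truncate
         fseek($fh, 0, SEEK_END);
         $offset = ftell($fh);              // where this record starts
         fwrite($fh, pack("N", strlen($data)) . $data);
         fclose($fh);
         return $offset;                    // an index can map url => offset
     }

     function archive_get($archive_file, $offset)
     {
         $fh = fopen($archive_file, "rb");
         fseek($fh, $offset);
         $len = current(unpack("N", fread($fh, 4)));
         $record = unserialize(gzuncompress(fread($fh, $len)));
         fclose($fh);
         return $record;
     }

     $where = archive_append("demo.war", array("url" => "http://www.example.com/",
         "summary" => "An example page summary"));
     print_r(archive_get("demo.war", $where));
     </pre>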
@@ -207,9 +214,10 @@
     this introduction:
     </p>
     <ul>
-    <li>Yioop! is an open source distributed crawler written in PHP.</li>
+    <li>Yioop! is an open source distributed crawler and search engine
+    written in PHP.</li>
     <li>It is capable of crawling and indexing small sites to sites or
-    collections of sites in the millions.</li>
+    collections of sites containing millions of documents.</li>
     <li>On a given machine it uses multi-curl to support many simultaneous
     downloads of pages.</li>
     <li>It has a web interface to select seed sites for crawls and set what
@@ -224,12 +232,18 @@
     deploy.</li>
     <li>It determines search results using a number of iterators which
     can be combined like a simplified relational algebra.</li>
+    <li>Yioop! supports a union operator and a GUI which make it easy
+    to combine results from several crawl indexes.</li>
     <li>Indexing occurs as crawling happens, so when a crawl is stopped,
     it is ready to be used to handle search queries immediately.</li>
+    <li>Yioop! has a GUI form that allows users to specify meta-words
+    to be injected into an index based on whether a downloaded document matches
+    a url pattern.</li>
     <li>Yioop! uses a web archive file format which makes it easy to
     copy crawl results amongst different machines.</li>
     <li>Using this, crawls can be mirrored amongst several machines
-    to speed-up serving search results.</li>
+    to speed up serving search results. This can be further sped up
+    by using memcache.</li>
     <li>A given Yioop! installation might have several saved crawls and
     it is very quick to switch between any of them and immediately start
     doing text searches.</li>
@@ -241,7 +255,8 @@
     better, (3) Curl libraries for downloading web pages. To be a little more
     specific Yioop! has been tested with Apache 2.2;
     however, it should work with other webservers. For PHP, you need a build of
-    PHP that incorporates Curl, Sqlite, GD graphics library and the
+    PHP that incorporates multi-byte string (mb_ prefixed) functions,
+    Curl, Sqlite, the GD graphics library and the
     command-line interface. If you are using Mac OSX Snow Leopard,
     the version of Apache2 and PHP that come with it suffice. For Windows,
     Mac, and Linux another easy way to get the required software is to
@@ -256,6 +271,9 @@ to

 extension=php_curl.dll
 </pre>
+<p>
+
+</p>
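+<p>
+    Depending on the PHP build, the other required extensions mentioned above
+    (multi-byte string, Sqlite, and GD support) may need to be enabled in the
+    same way. The exact dll names vary from build to build, but they are
+    typically along the lines of:
+</p>
+<pre>
+extension=php_mbstring.dll
+extension=php_sqlite.dll
+extension=php_gd2.dll
+</pre>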
 <p>
     If you
     are using the Ubuntu variant of Linux, the following lines would get the
@@ -271,20 +289,20 @@ extension=php_curl.dll
     sudo apt-get install php5-gd
     </pre>
     <p>After installing the necessary software, make sure to start/restart your
-    webserver and verify that it is running.</p>
+    webserver and verify that it is running. </p>
     <h3>Memory Requirements</h3>
     <p>In addition to the prerequisite software listed above, Yioop! also
-    has certain memory requirements. bin/queue_server.php requires 900MB,
-    bin/fetcher.php requires 550MB, and index.php requires 200MB. These
-    values are set near the tops of each of these files in turn with a line
-    like:</p>
+    has certain memory requirements. By default bin/queue_server.php
+    requires 1100MB, bin/fetcher.php requires 550MB, and index.php requires
+    200MB. These values are set near the tops of each of these files in turn
+    with a line like:</p>
 <pre>
 ini_set("memory_limit","550M");
 </pre>
     <p>
     If you want to reduce these memory requirements, it is advisable to also
-    reduce the values for some variables in the configs/config.php file.
-    For instance, one might reduce the values of
+    reduce the values for some variables in the configs/config.php file.
+    For instance, one might reduce the values of NUM_DOCS_PER_GENERATION,
     SEEN_URLS_BEFORE_UPDATE_SCHEDULER, PAGE_RANGE_REQUEST, NUM_URLS_QUEUE_RAM,
     MAX_FETCH_SIZE, and URL_FILTER_SIZE. Experimenting with these values
     you should be able to trade off memory requirements for speed.
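     For example, the kind of edits involved might look like the lines below.
     The numbers are placeholders meant only to show the form of the change,
     not recommended values, and the define() syntax assumes that is how the
     constants are set in configs/config.php:
 <pre>
 // near the top of bin/fetcher.php (placeholder value, not a recommendation)
 ini_set("memory_limit","400M");

 // in configs/config.php (placeholder values, not Yioop!'s defaults)
 define('NUM_URLS_QUEUE_RAM', 100000);
 define('SEEN_URLS_BEFORE_UPDATE_SCHEDULER', 5000);
 define('PAGE_RANGE_REQUEST', 50000);
 </pre>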
@@ -874,10 +892,54 @@ php fetcher.php stop</pre>
     Such a site includes https://www.somewhere.com/foo/anything_more .</p>
     <p>When configuring a new instance of Yioop! the file default_crawl.ini
     is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings
-    for the Options form. If you want to save the options of crawls please
-    feel free to copy the crawl.ini you want to save to another name.</p>
+    for the Options form. </p>
+    <p>The last part of the Edit Crawl Options form allows you to create
+    user-defined "meta-words". In Yioop! terminology, a meta-word is a word
+    which wasn't in a downloaded document, but which is added to the
+    inverted-index as if it had been in the document. The addition of
+    user-defined meta-words is specified by giving a pattern matching rule
+    based on the url. For instance, in the figure above, the word column has
+    buyart and the url pattern column has:
+    <pre>
+    http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/
+    </pre>
+    When a url matches the pattern, a word corresponding to the meta-word
+    is added to the inverted index for that document. So when the page
+    <pre>
+    http://www.ucanbuyart.com/artistproducts/baitken/0/6/
+    </pre>
+    is crawled, the word u:buyart:artistproducts:baitken:0:6 will be associated
+    with the document. Meta-words are useful to create shorthands for
+    searches on certain kinds of sites, like dictionary sites and wikis.
+    </p>
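+    <p>The rule just described can be pictured with the following small
+    sketch. It is only an illustration of the pattern matching involved,
+    not Yioop!'s internal code:</p>
+    <pre>
+    $word = "buyart";
+    $pattern = "@^http://www\.ucanbuyart\.com/(.+)/(.+)/(.+)/(.+)/@";
+    $url = "http://www.ucanbuyart.com/artistproducts/baitken/0/6/";
+
+    if (preg_match($pattern, $url, $matches)) {
+        array_shift($matches); // drop the whole-url match, keep the captured groups
+        $meta_word = "u:" . $word . ":" . implode(":", $matches);
+        echo $meta_word;       // u:buyart:artistproducts:baitken:0:6
+    }
+    </pre>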
+    <p><a href="#toc">Return to table of contents</a>.</p>
+
+    <h2 id='mixes'>Mixing Crawl Indexes</h2>
+    <p>Once you have performed a few crawls with Yioop!, you can use the Mix
+    Crawls activity to create mixtures of your crawls. The main Mix Crawls
+    activity looks like:</p>
+    <img src='resources/ManageMixes.png' alt='The Manage Mixes form'/>
+    <p>The first form allows you to name and create a new crawl mixture.
+    Clicking "Create" sends you to a second page where you can provide
+    information about how the mixture should be built. Beneath the Create mix
+    form is a table listing all the previously created crawl mixes. The
+    first column has the name of the mix, the second column says how the
+    mix is built out of component crawls, and the actions column allows you
+    to edit the mix, set it as the default index for Yioop! search results, or
+    delete the mix. When you create a new mix it also shows up on the Settings
+    page. Creating a new mix or editing an existing mix sends you to a second
+    page:</p>
+    <img src='resources/EditMix.png' alt='The Edit Mixes form'/>
+    <p>Using the "Back" link on this page will take you to the prior screen.
+    The first text field on the edit page lets you rename your mix if you so
+    desire. Beneath this is a table listing the current components of this
+    crawl mix. You can use this table to edit the weightings of crawl
+    components. You can also use it to delete existing components of the mix.
+    To add new components to a crawl mix use the drop-down beneath the
+    table. For changes on this page to take effect, the "Save" button beneath
+    this drop-down must be clicked.
+    </p>
     <p><a href="#toc">Return to table of contents</a>.</p>
-

     <h2 id='localizing'>Localizing Yioop! to a New Language</h2>
     <p>The Manage Locales activity can be used to configure Yioop
@@ -1035,6 +1097,11 @@ Algorithms and Models for the Web-Graph. pp. 168–180. 2004. </dd>
     >The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>.
 In: Seventh International World-Wide Web Conference
 (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.</dd>
+<dt id='BCC2010'>[BCC2010]</dt><dd>S. Büttcher, C. L. A. Clarke,
+and G. V. Cormack.
+<a href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307"
+>Information Retrieval: Implementing and Evaluating Search Engines</a>.
+MIT Press. 2010.</dd>
 <dt id="DG2004">[DG2004]</dt><dd>Jeffrey Dean and Sanjay Ghemawat.
 <a href="http://labs.google.com/papers/mapreduce-osdi04.pdf"
 >MapReduce: Simplified Data Processing on Large Clusters</a>.
@@ -1044,6 +1111,11 @@ Shun-Tak Leung.
 <a href="http://labs.google.com/papers/mapreduce-osdi04.pdf">The
 Google File System</a>. 19th ACM Symposium on Operating Systems Principles.
 2003.</dd>
+<dt id='H2002'>[H2002]</dt><dd>T. Haveliwala.
+<a href="
+http://infolab.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf">
+Topic-Sensitive PageRank</a>. Proceedings of the Eleventh International
+World Wide Web Conference (Honolulu, Hawaii). 2002.</dd>
 <dt id="KSV2010">[KSV2010]</dt><dd>Howard Karloff, Siddharth Suri, and
 Sergei Vassilvitskii.
 <a href="http://www.siam.org/proceedings/soda/2010/SODA10_076_karloffh.pdf"
@@ -1061,6 +1133,10 @@ Morgan and Claypool Publishers. 2010.</dd>
 <dt id="LM2006">[LM2006]</dt><dd>Amy N. Langville and Carl D. Meyer. <a
     href="http://press.princeton.edu/titles/8216.html">Google's
PageRank and Beyond</a>. Princeton University Press. 2006.</dd>
+<dt id="MRS2008">[MRS2008]</dt><dd>C. D. Manning, P. Raghavan and H. Schütze.
+<a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html"
+>Introduction to Information Retrieval</a>.
+Cambridge University Press. 2008.</dd>
 <dt id="MKSR2004">[MKSR2004]</dt><dd>G. Mohr, M. Kimpton, M. Stack,
 and I.Ranitovic. <a href="http://iwaw.europarchive.org/04/Mohr.pdf"
 >Introduction to Heritrix, an archival quality web crawler</a>.
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 6157d60..0f75211 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -3,12 +3,13 @@
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
 <li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&
+h=29f8ebaac4c3c657db4e837312ba5bf07be02ff1&
+hb=1fe31141c46cef1570c75c885ed54854d3b01a72&t=zip"
+    >Version 0.5pre-ZIP</a></li>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&
 h=a33319a5bc6ed58c11af462e8645397fe2c76f27&
 hb=62925b2e560ee4460ecbd9369534544b102b2a34&t=zip"
     >Version 0.42-ZIP</a></li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&
-h=26510cc6501a231526f558f4e352362c076f2b00&t=zip"
-    >Version 0.3-ZIP</a></li>
 </ul>
 <h2>Git Repository</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to
diff --git a/en-US/pages/welcome.thtml b/en-US/pages/welcome.thtml
index f5a8829..e914606 100755
--- a/en-US/pages/welcome.thtml
+++ b/en-US/pages/welcome.thtml
@@ -1,24 +1,33 @@
-<h1>Welcome!</h1>
+<h1>Welcome to the SeekQuarry Open-Source Search Engine Site!</h1>
 <p>SeekQuarry is the parent site for <a href="http://www.yioop.com/">Yioop!</a>.
-Yioop! is a <a href="http://gplv3.fsf.org/">GPLv3</a> open source search engine written in PHP.
-Yioop! can be configured  as either a general purpose
-search engine for the whole web or it can be configured to provide search results for
-a set of urls or domains.
+Yioop! is a <a href="http://gplv3.fsf.org/">GPLv3</a>, open-source, PHP search
+engine. Yioop! can be configured either as a general-purpose search engine
+for the whole web or to provide search results for a set of urls or domains.
 </p>
 <h2>Goals</h2>
 <p>Yioop! was designed with the following goals in mind:</p>
 <ul>
-<li><b>To lower the barrier of entry for people wanting to obtain personal crawls of the web.</b> At present, it requires
-only a WebServer such as Apache and command line access to a default build of PHP 5.3 or better. Configuration can be
-done using a GUI interface.</li>
-<li><b>To allow for distributed crawling of the web.</b> To get a snapshot of many web pages quickly, it is useful to have more than
-one machine when crawling the web. If you have several machines at home, simply install the software
-on all the machines you would like to use in a web crawl. In the configuration interface give the URL of the machine
-you would like to serve search results from. Start the queue server on that machine and start fetchers on each of the other
-machines.</li>
-<li><b>To be reasonably fast and online.</b> The Yioop engine is "online" in the sense that it creates a word index and
-document ranking as it crawls rather than ranking as a separate step. The point is to keep the processing done by any machine as low as possible so you can still use them for what you bought them for. Nevertheless, it is reasonably fast: three Lenova Q100 fetchers and
-a 2006 MacMini queue server can crawl and index a million pages every couple days.</li>
-<li><b>To make it easy to archive crawls.</b> Crawls are stored in timestamped folders, which can be moved around zipped, etc. Through the admin
-interface you can select amongst crawls which exist in a crawl folder as to which crawl you want to serve from.</li>
+<li><b>To lower the barrier of entry for people wanting to obtain personal
+crawls of the web.</b> At present, it requires only a web server such as Apache
+and command-line access to a default build of PHP 5.3 or better. Configuration
+can be done using a GUI.</li>
+<li><b>To allow for distributed crawling of the web.</b> To get a snapshot of
+many web pages quickly, it is useful to have more than one machine when crawling
+the web. If you have several machines at home, simply install the software
+on all the machines you would like to use in a web crawl. In the configuration
+interface give the URL of the machine you would like to serve search results
+from. Start the queue server on that machine and start fetchers on each of the
+other machines.</li>
+<li><b>To be reasonably fast and online.</b> The Yioop engine is "online" in the
+sense that it creates a word index and document ranking as it crawls rather
+than ranking as a separate step. The point is to keep the processing done by
+any machine as low as possible so you can still use your machines for what you
+bought them for. Nevertheless, it is reasonably fast: four Lenovo Q100 fetchers and
+a 2006 MacMini queue server can crawl and index a million pages every couple
+days.</li>
+<li><b>To make it easy to archive crawls.</b> Crawls are stored in timestamped
+folders, which can be moved around, zipped, etc. Through the admin interface
+you can select which of the crawls in a crawl folder you want to serve search
+results from.</li>
 </ul>