Update seekquarry for Version 0.78, a=chris

Chris Pollett [2011-10-28]

Filename
en-US/pages/about.thtml
en-US/pages/documentation.thtml
en-US/pages/downloads.thtml

diff --git a/en-US/pages/about.thtml b/en-US/pages/about.thtml
index db7e43b..d3c84d3 100755
--- a/en-US/pages/about.thtml
+++ b/en-US/pages/about.thtml
@@ -29,10 +29,11 @@ combined the two to get Yioop!</p>
 <p>
 Several people helped
 with localization: Mary Pollett, Jonathan Ben-David,
-Thanh Bui, Sujata Dongre, Youn Kim, Chao-Hsin Shih,
+Thanh Bui, Sujata Dongre, Animesh Dutta,
+ Youn Kim, Akshat Kukreti, Vijeth Patil, Chao-Hsin Shih,
 and Sugi Widjaja. Thanks to
-Ravi Dhillon for finding and helping with the fixes for Issue 15
-and Commit 632e46. Several of my master's students have done projects
+Ravi Dhillon and Tanmayee Potluri for creating patches for Yioop! issues.
+Several of my master's students have done projects
 related to Yioop!: Amith Chandranna, Priya Gangaraju, and Vijaya Pamidi.
 Amith's code related to an Online version of the HITs algorithm
 is not currently in the main branch of Yioop!, but it is
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 6386b03..cbb32b0 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -1,5 +1,5 @@
 <div class="docs">
-<h1>Yioop! Documentation v 0.76</h1>
+<h1>Yioop! Documentation v 0.78</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
         <li><a href="#intro">Introduction</a></li>
@@ -18,7 +18,7 @@
         <li><a href="#commandline">Yioop! Command-line Tools</a></li>
         <li><a href="#references">References</a></li>
     </ul>
-
+
     <h2 id="intro">Introduction</h2>
     <p>The Yioop! search engine is designed to allow users
     to produce indexes of a web-site or a collection of
@@ -104,11 +104,14 @@
     relevant a word is to a document is another
     task that benefits from multi-round, distributed computation. When a document
     is processed by indexers on multiple machines, words are extracted and a
-    stemming algorithm such as [<a href="#P1980">P1980</a>] might be employed
-    (a stemmer would extract the word jump from words such as jumps, jumping,
-    etc). Next a statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
-    is computed to determine the importance of that word in that document
-    compared to that word amongst all other documents. To do this calculation
+    stemming algorithm such as [<a href="#P1980">P1980</a>] or a character
+    n-gramming technique might be employed (a stemmer would extract the word
+    jump from words such as jumps, jumping, etc; converting jumping to 3-grams
+    would make terms of length 3, i.e., jum, ump, mpi, pin, ing). Next a
+    statistic such as BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]
+    (or at least the non-query time part of it) is computed to determine the
+    importance of that word in that document compared to that word amongst
+    all other documents. To do this calculation
     one needs to compute global statistics concerning all documents seen,
     such as their average-length, how often a term appears in a document, etc.
     If the crawling is distributed it might take one or more merge rounds to
@@ -1258,8 +1261,8 @@ php fetcher.php stop</pre>
     <p>To get Yioop to detect arc, MediaWiki, and ODP RDF files you need
     to create a PROFILE_DIR/cache/IndexData(timestamp) folder on the queue
     server machine containing the single file arc_description.txt. This
-    text files contexts should just be the name you would like for your
-    data. In the Archive Crawl drop down this name will appear with the
+    text file's contents should just be the name you would like for your
+    data. In the Archive Crawl drop-down this name will appear with the
     prefix ARCFILE:: and you can then select it as the source to crawl.
     To actually crawl anything though for each fetcher machine that you would
     like to take part in the archive crawl, you should make a folder
@@ -1414,7 +1417,8 @@ OdpRdfArchiveBundle
     So you cannot find these ids in the source code. The tooltip trick
     mentioned above does not work for database string ids.</p>

-    <h3>Adding a stemmer for your language</h3>
+    <h3>Adding a stemmer or supporting character
+    n-gramming for your language</h3>
     <p>Depending on the language you are localizing to, it may make sense
     to write a stemmer for words that will be inserted into the index.
     A stemmer takes inflected or sometimes derived words and reduces
@@ -1439,6 +1443,25 @@ OdpRdfArchiveBundle
     would add an entry for your stemmer to this array, for French this would
     look like: 'fr' => 'FrStemmer' .
     </p>
+    <p>In addition to supporting the ability to add stemmers, Yioop also
+    supports another technique which can be used in lieu of a stemmer
+    called character n-grams. When used this technique segments text into
+    sequences of n characters which are then stored in Yioop! as a term.
+    For instance if n were 3 then the word "thunder" would be split
+    into "thu", "hun", "und", "nde", and "der" and each of these would be
+    associated with the document that contained the word thunder.
+    N-grams are useful for languages like Chinese and Japanese in which
+    words in the text are often not separated with spaces. It is also
+    useful for languages like German which can have long compound words.
+    The drawback of n-grams is that they tend to make the index larger.
+    A list of languages which will be n-grammed by Yioop! can be
+    found in lib/phrase_parser.php in the global variable $CHARGRAMS.
+    This is an associative array of language tag => n to use for
+    char gramming. It should be noted if you decide to add a more
+    memory efficient stemmer for a language, then you should remove
+    the entry for the language tag from $CHARGRAMS. If you add a
+    language to Yioop! and want to use char gramming merely add an
+    additional entry to this array.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='embedding'>Embedding Yioop! in an Existing Site</h2>
     <p>One use-case for Yioop! is to use it to serve search results for your
@@ -1833,8 +1856,8 @@ Score: 4.03
 <p>The index the results are returned from is the default index; however,
 all of the Yioop! meta words should work so you can do queries like
 "my_query i:timestamp_of_index_want". Query results depend on the
-kind of language stemmer being used, so French results might be better
-if one specifies fr-FR then if one relies on the default en-US.</p>
+kind of language stemmer/char-gramming being used, so French results might be
+better if one specifies fr-FR than if one relies on the default en-US.</p>
     <h2 id="references">References</h2>
     <dl>
 <dt id="APC2003">[APC2003]</dt>
diff --git a/en-US/pages/downloads.thtml b/en-US/pages/downloads.thtml
index 80adfd4..2ab6f64 100755
--- a/en-US/pages/downloads.thtml
+++ b/en-US/pages/downloads.thtml
@@ -2,18 +2,23 @@
 <h2>Yioop! Releases</h2>
 <p>The Yioop! source code is still at an alpha stage. </p>
 <ul>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=d0058b709eac3907ca302d3060712fafb5915822&hb=16a6d216f159af3d4c3413bf69021a6910ecae09&t=zip"    >Version 0.76-ZIP</a></li>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=db58568d4957782dc85f875be3592b2b951e53a3&hb=d28a3af2b3574c17fb8425d340984fb02fcfb4a5&t=zip" >Version 0.78-ZIP</a></li>
 </li>
-<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=5e8353236fed2ffcf87f8671baa1f4e5d54381b9&hb=fe23effb2f16949a73d85c13b6ebe2039d1b4387&t=zip"
-    >Version 0.741-ZIP</a></li>
+<li><a href="http://www.seekquarry.com/viewgit/?a=archive&p=yioop&h=d0058b709eac3907ca302d3060712fafb5915822&hb=16a6d216f159af3d4c3413bf69021a6910ecae09&t=zip"
+    >Version 0.76-ZIP</a></li>
 </li>
 </ul>
-<h2>Git Repository</h2>
+<h2>Git Repository / Contributing</h2>
 <p>The Yioop! git repository allows anonymous read-only access. If you would like to
-contribute to Yioop, just do a clone of the most recent code, make your changes,
-do a pull, and make a patch. For example, to clone the repository
-assuming you have git, type:</p>
+contribute to Yioop!, just do a clone of the most recent code,
+make your changes, do a pull, and make a patch. For example, to clone the
+repository  assuming you have git, type:</p>
 <p><b>git clone https://seekquarry.com/git/yioop.git</b></p>
 <p>
-Create/update an issue in the <a href="/mantis/">Yioop issue tracker</a>
-describing what your patch solves and upload the patch.</p>
+Create/update an issue in the <a href="/mantis/">Yioop! issue tracker</a>
+describing what your patch solves and upload the patch. To contribute
+localizations, you can use the GUI interface in your own
+copy of Yioop! to enter in your localizations. Next locate in the locale
+folder of your Yioop! work directory, the locale tag of the
+language you added translations for. Within this folder is a configure.ini
+file, just make an issue in the issue tracker and upload this file there.</p>
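
Note on the character n-gram paragraph added to documentation.thtml in this commit: the segmentation it describes can be sketched in a few lines of Python. This is only an illustration, not Yioop!'s PHP code from lib/phrase_parser.php. The function names, the handling of words shorter than n, and the entries in the table below are all assumptions made for the example; only the idea of a language tag => n mapping comes from the text.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of word, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(word) <= n:
        # Assumption: a word no longer than n is kept whole as a single term.
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Hypothetical stand-in for a "language tag => n" table such as $CHARGRAMS.
# The tags and values here are invented for illustration.
CHARGRAMS = {"zh-CN": 2, "de": 5}


def terms_for(word, lang):
    """Char-gram word if lang has an entry in the table, else keep it whole."""
    n = CHARGRAMS.get(lang)
    return char_ngrams(word, n) if n else [word]


if __name__ == "__main__":
    print(char_ngrams("thunder", 3))
    print(char_ngrams("jumping", 3))
    print(terms_for("thunder", "en-US"))
```

Each extra character in a word adds one more term, which is why the paragraph warns that n-grams tend to make the index larger than stemming does.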
