Chris Pollett [2013-04-23]
more ranking changes, a=chris
Filename
en-US/pages/documentation.thtml
en-US/pages/ranking.thtml
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index edec05a..2ad6bf0 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -2,7 +2,7 @@
 <h1>Yioop Documentation v 0.94</h1>
     <h2 id='toc'>Table of Contents</h2>
     <ul>
-        <li><a href="#quick">Preface: Quick Start Guides</a></li>
+        <li><a href="#quick">Getting Started</a></li>
         <li><a href="#intro">Introduction</a></li>
         <li><a href="#features">Feature List</a></li>
         <li><a href="#requirements">Requirements</a></li>
@@ -26,12 +26,15 @@
         <li><a href="#commandline">Yioop Command-line Tools</a></li>
         <li><a href="#references">References</a></li>
     </ul>
-    <h2 id="quick">Preface: Quick Start Guides</h2>
+    <h2 id="quick">Getting Started</h2>
     <p>This document serves as a detailed reference for the
-    Yioop search engine. If you want to get started using Yioop now,
-    but perhaps in less detail, you might want to first read the
+    Yioop search engine. If you want to get started using Yioop now,
+    you probably want to first read the
     <a href="?c=main&p=install">Installation
-    Guides</a> page.
+    Guides</a> page. If you cannot find your particular machine configuration
+    there, you can check the Yioop <a href="#requirements">Requirements</a>
+    section followed by the more general <a
+    href="#installation">Installation and Configuration</a> instructions.
     </p>
     <h2 id="intro">Introduction</h2>
     <p>The Yioop search engine is designed to allow users
@@ -317,7 +320,8 @@
     http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml">WARC
     format</a> are often used by TREC conferences to store test data sets such
     as <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
-    <a href="http://lemurproject.org/clueweb09/">ClueWeb Dataset</a>.
+    <a href="http://lemurproject.org/clueweb09/">ClueWeb 2009</a> /
+    <a href="http://lemurproject.org/clueweb12/">ClueWeb 2012</a> Datasets.
    In addition, it was used by grub.org (hopefully, only on a
    temporary hiatus), a distributed, open-source search engine project in C#.
     Another important format for archiving web pages is the XML format used by
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index d9963fc..046a713 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -425,7 +425,8 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     now point to locations at the end of the summary of the IndexShard.
     These offsets thus provide information about when a document was indexed
     during the crawl process. The maximum number of links per document
-    is usually 50 for normal documents and 300 for sitemaps. Emperically,
+    is usually 50 for normal documents and 300 for
+    <a href="http://www.sitemaps.org/">sitemaps</a>. Emperically,
    it has been observed that a typical index shard has offsets for around
    24 times as many link summary maps as document summary maps. So
     roughly, if a newly added summary or link has index <i>DOC_INDEX</i>
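To make the preceding hunk's bookkeeping concrete, here is a small PHP
sketch, hypothetical rather than Yioop's actual code: the shard-capacity
constant, the function names, and the 10 - log10 mapping are all
assumptions for illustration. It shows how a crawl-order position, and
from it a Doc-Rank-style score, could be derived from a shard number and
an item's index within that shard.

    <?php
    // Illustrative sketch only: the constant, the names, and the
    // 10 - log10 mapping are assumptions, not Yioop's actual code.
    const NUM_DOCS_PER_SHARD = 50000; // assumed shard capacity

    // Items indexed later in the crawl live in later shards/offsets,
    // so they get a larger crawl-order position ...
    function crawlOrderPosition(int $shardNum, int $docIndex): int
    {
        return $shardNum * NUM_DOCS_PER_SHARD + $docIndex;
    }

    // ... and, under a simple decreasing mapping, a lower
    // Doc-Rank-style score.
    function approximateDocRank(int $shardNum, int $docIndex): float
    {
        return 10 - log10(crawlOrderPosition($shardNum, $docIndex) + 1);
    }

    echo approximateDocRank(0, 10), "\n";    // early in the crawl: ~8.96
    echo approximateDocRank(9, 49999), "\n"; // later in the crawl: ~4.30

The details of Yioop's actual mapping are in the surrounding text of
ranking.thtml; the point of the sketch is only that Doc Rank decreases
the later in the crawl an item was indexed.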
@@ -566,10 +567,18 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
    a URL is crawl-delayed, it is inserted at the earliest position in the
    slot sufficiently far from any previous url for that host to ensure that
    the crawl-delay condition is met (sketched after this list).</li>
+    <li>If a Scheduler's queue is full, yet after going through all
+    of the urls in the queue it cannot find any to write to a schedule,
+    it goes into a reset mode: it dumps its current urls back to
+    schedule files, starts with a fresh queue (but preserving robots.txt
+    info), and starts reading in schedule files. This can happen if too many
+    urls of crawl-delayed sites start clogging a queue.</li>
     </ul>
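As a rough illustration of the crawl-delay rule in the first bullet
above, here is a hypothetical sketch of picking the earliest slot for a
crawl-delayed url. The slot granularity and the array layouts are
assumptions, not Yioop's actual Scheduler code.

    <?php
    // Hypothetical sketch of crawl-delayed slot selection; the slot
    // granularity and array layouts are assumptions, not Yioop's code.
    const SECONDS_PER_SLOT = 1; // assume each schedule slot spans 1 second

    /**
     * Returns the earliest free slot at least $crawlDelay seconds after
     * the last slot already used for $host.
     *
     * @param array  $takenSlots   slot number => url already in that slot
     * @param array  $lastHostSlot host => last slot used for that host
     * @param string $host         host of the url being scheduled
     * @param int    $crawlDelay   robots.txt Crawl-delay for $host (seconds)
     */
    function earliestSlot(array $takenSlots, array $lastHostSlot,
        string $host, int $crawlDelay): int
    {
        $gap = (int)ceil($crawlDelay / SECONDS_PER_SLOT);
        $slot = isset($lastHostSlot[$host]) ? $lastHostSlot[$host] + $gap : 0;
        while (isset($takenSlots[$slot])) { // skip occupied slots; moving
            $slot++;                        // forward only widens the gap
        }
        return $slot;
    }

This also suggests why the reset mode in the second bullet is needed: if
most queued urls belong to a few crawl-delayed hosts, candidate slots near
the front stay blocked, and dumping the urls back to schedule files is a
way to unclog the queue.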
-<p>The actual giving of a page's cash to its urls is done in the Fetcher. This
-is actually done in a different manner than in the OPIC paper. It is further
-handled different for sitemap pages versus all other web pages. For a
+<p>The actual giving of a page's cash to its urls is done in the Fetcher.
+We discuss it in the section on the queue server because it directly
+affects the order of queue processing. Cash in Yioop's algorithm
+is distributed in a different manner than in the OPIC paper. It is further
+handled differently for sitemap pages versus all other web pages. For a
 sitemap page with `n` links, let</p>
 <p class="center">`\gamma = sum_(j=1)^n 1/j^2`.</p>
 Let `C` denote the cash that the sitemap has to distribute. Then the `i`th
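The hunk cuts off mid-sentence here. Given the normalizer `\gamma`, the
completion that makes the `n` shares sum to `C` (an assumption about the
elided text, not a quotation of it) is that the `i`th link receives
`(C/\gamma) 1/i^2`. A minimal sketch under that assumption:

    <?php
    // Sketch assuming the i-th sitemap link receives (C/gamma) * 1/i^2,
    // chosen so that the n shares sum exactly to C. That completion of
    // the cut-off sentence is an assumption, not quoted from the source.
    function sitemapCashShares(float $cash, int $numLinks): array
    {
        $gamma = 0.0;
        for ($j = 1; $j <= $numLinks; $j++) {
            $gamma += 1.0 / ($j * $j);
        }
        $shares = [];
        for ($i = 1; $i <= $numLinks; $i++) {
            $shares[$i] = ($cash / $gamma) / ($i * $i);
        }
        return $shares; // earlier sitemap links get quadratically more cash
    }

    print_r(sitemapCashShares(1.0, 4));
    // 1 => ~0.703, 2 => ~0.176, 3 => ~0.078, 4 => ~0.044

Under this scheme the cash is heavily front-loaded onto a sitemap's first
few links, rather than split evenly as plain OPIC would do.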
@@ -603,14 +612,29 @@ to its links will sum to C. If no links go out of the CLD, then cash
 will be lost. In the case where someone is deliberately doing a crawl
 of only one site, then this lost cash will get replaced during normalization,
 and the above scheme essentially reduces to usual OPIC.</p>
+<p>We conclude this section by mentioning that the Scheduler only
+affects when a URL is written to a schedule, which will then be
+used by a fetcher. It is entirely possible that two fetchers get consecutive
+schedules from the same Scheduler and return data to the Indexers
+not in the order in which they were scheduled. In that case, the results
+would be indexed out of order, and their Doc Ranks would not reflect the
+order in which they were scheduled. The scheduling and indexing process is
+only approximately correct; we rely on query time manipulations to
+try to improve the accuracy.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='search'>Search Time Ranking Factors</h2>
-calculateControlWords (SearchController)
-
-guessSemantics (PhraseModel)
+<p>We are at last in a position to describe how Yioop calculates
+the three scores Doc Rank, Relevance, and Proximity at query time. When
+a query comes into Yioop, it goes through the following stages before an
+actual lookup is performed against an index.
+</p>
+<ol>
+<li>Control words are calculated.</li>
+<li>An attempt is made to guess the semantics of the query.</li>
+<li>Stemming or character n-gramming is done on the query and acronyms
+and abbreviations are rewritten.</li>
+</ol>
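A schematic of these three stages in PHP. The function names echo
calculateControlWords (SearchController) and guessSemantics (PhraseModel)
from the notes this commit replaces; the bodies are illustrative stubs,
not Yioop's implementation.

    <?php
    // Illustrative stubs for the three pre-lookup stages above; the
    // bodies are placeholders, not Yioop's implementation.
    function calculateControlWords(string $query): string
    {
        return $query; // stage 1: compute control words (stub)
    }

    function guessSemantics(string $query): string
    {
        return $query; // stage 2: guess the semantics of the query (stub)
    }

    function stemOrCharNgramAndRewrite(string $query): string
    {
        // stage 3: stem (or char n-gram) the query terms and rewrite
        // acronyms and abbreviations; stubbed here as a lowercase pass
        return strtolower($query);
    }

    function preprocessQuery(string $query): string
    {
        return stemOrCharNgramAndRewrite(
            guessSemantics(calculateControlWords($query)));
    }

    echo preprocessQuery("Yioop Ranking"), "\n"; // prints: yioop ranking

Only after these stages is the processed query looked up against an index.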

-stemming word gramming
-special characters and acronyms
 Network Versus non network queries
 Grouping (links and documents) deduplication
 Conjunctive queries