more ranking changes, a=chris
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index edec05a..2ad6bf0 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -2,7 +2,7 @@
<h1>Yioop Documentation v 0.94</h1>
<h2 id='toc'>Table of Contents</h2>
<ul>
- <li><a href="#quick">Preface: Quick Start Guides</a></li>
+ <li><a href="#quick">Getting Started</a></li>
<li><a href="#intro">Introduction</a></li>
<li><a href="#features">Feature List</a></li>
<li><a href="#requirements">Requirements</a></li>
@@ -26,12 +26,15 @@
<li><a href="#commandline">Yioop Command-line Tools</a></li>
<li><a href="#references">References</a></li>
</ul>
- <h2 id="quick">Preface: Quick Start Guides</h2>
+ <h2 id="quick">Getting Started</h2>
<p>This document serves as a detailed reference for the
- Yioop search engine. If you want to get started using Yioop now,
- but perhaps in less detail, you might want to first read the
+ Yioop search engine. If you want to get started using Yioop now,
+ you probably want to first read the
<a href="?c=main&p=install">Installation
- Guides</a> page.
+ Guides</a> page. If you cannot find your particular machine configuration
+ there, you can check the Yioop <a href="#requirements">Requirements</a>
+ section followed by the more general <a
+ href="#installation">Installation and Configuration</a> instructions.
</p>
<h2 id="intro">Introduction</h2>
<p>The Yioop search engine is designed to allow users
@@ -317,7 +320,8 @@
http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml">WARC
format</a> are often used by TREC conferences to store test data sets such
as <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
- <a href="http://lemurproject.org/clueweb09/">ClueWeb Dataset</a>.
+ <a href="http://lemurproject.org/clueweb09/">ClueWeb 2009</a> /
+ <a href="http://lemurproject.org/clueweb12/">ClueWeb 2012</a> Datasets.
In addition, it was used by grub.org (hopefully, only on a
temporary hiatus), a distributed, open-source, search engine project in C#.
Another important format for archiving web pages is the XML format used by
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index d9963fc..046a713 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -425,7 +425,8 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
now point to locations at the end of the summary of the IndexShard.
These offsets thus provide information about when a document was indexed
during the crawl process. The maximum number of links per document
- is usually 50 for normal documents and 300 for sitemaps. Emperically,
+ is usually 50 for normal documents and 300 for
+ <a href="http://www.sitemaps.org/">sitemaps</a>. Empirically,
it has been observed that a typical index shard has offsets for around
24 times as many links summary maps as document summary maps. So
roughly, if a newly added summary or link has index <i>DOC_INDEX</i>
@@ -566,10 +567,18 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
a URL is crawl-delayed it is inserted at the earliest position in the
 slot sufficiently far from any previous URL for that host to ensure that
the crawl-delay condition is met.</li>
+ <li>If a Scheduler's queue is full, yet after going through all
+ of the URLs in the queue it cannot find any to write to a schedule,
+ it goes into a reset mode: it dumps its current URLs back to
+ schedule files, starts with a fresh queue (preserving robots.txt
+ info), and begins reading in schedule files again. This can happen
+ if too many URLs from crawl-delayed sites clog the queue.</li>
</ul>
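The crawl-delayed insertion rule described in the list above can be sketched as follows. This is an illustrative toy, with assumed slot bookkeeping, not Yioop's actual Scheduler code:

```python
# Toy sketch of the crawl-delay insertion rule: place a crawl-delayed
# URL at the earliest free slot that is at least `delay_slots` after
# the host's previous slot. The slot representation here is an
# assumption for illustration, not Yioop's actual data layout.

def earliest_slot(last_slot_for_host, delay_slots, occupied):
    """Return the first unoccupied slot index that is at least
    delay_slots after the host's previous slot."""
    slot = last_slot_for_host + delay_slots
    while slot in occupied:
        slot += 1
    return slot

# The host was last scheduled at slot 10 with a 5-slot delay; slots 15
# and 16 are already taken, so the URL lands in slot 17.
print(earliest_slot(10, 5, {15, 16}))  # 17
```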
-<p>The actual giving of a page's cash to its urls is done in the Fetcher. This
-is actually done in a different manner than in the OPIC paper. It is further
-handled different for sitemap pages versus all other web pages. For a
+<p>The actual distribution of a page's cash to its URLs is done in the
+Fetcher. We discuss it in this section on the queue server because it
+directly affects the order of queue processing. Cash in Yioop is
+distributed in a different manner than in the OPIC paper, and it is
+handled differently for sitemap pages versus all other web pages. For a
sitemap page with `n` links, let<p>
<p class="center">`\gamma = sum_(j=1)^n 1/j^2`.</p>
Let `C` denote the cash that the sitemap has to distribute. Then the `i`th
@@ -603,14 +612,29 @@ to its links will sum to C. If no links go out of the CLD, then cash
will be lost. In the case where someone is deliberately doing a crawl
of only one site, then this lost cash will get replaced during normalization,
and the above scheme essentially reduces to usual OPIC.</p>
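The sitemap cash split above can be checked with a short sketch. The normalization constant `gamma = sum_(j=1)^n 1/j^2` suggests the `i`th link's share is proportional to `1/i^2`; that inference, and the function name, are illustrative assumptions rather than Yioop's actual code:

```python
# Sketch of the sitemap cash split: with gamma = sum_{j=1}^n 1/j^2,
# give the i-th link a share of C * (1/gamma) * (1/i^2), so earlier
# links get more cash and the shares sum back to C. The 1/i^2
# weighting is inferred from the gamma normalization above.

def sitemap_cash_shares(cash, num_links):
    gamma = sum(1 / j ** 2 for j in range(1, num_links + 1))
    return [cash * (1 / gamma) * (1 / i ** 2)
            for i in range(1, num_links + 1)]

shares = sitemap_cash_shares(1.0, 4)
# Shares decrease with link position and sum to the original cash.
```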
+<p>We conclude this section by noting that the Scheduler only
+affects when a URL is written to a schedule, which will then be
+used by a fetcher. It is entirely possible for two fetchers to get
+consecutive schedules from the same Scheduler, yet return data to the
+Indexers out of the order in which they were scheduled. In that case,
+the pages would be indexed out of order and their Doc Ranks would not
+reflect the order in which they were scheduled. The scheduling and
+indexing process is thus only approximately correct; we rely on
+query-time manipulations to try to improve the accuracy.</p>
<p><a href="#toc">Return to table of contents</a>.</p>
<h2 id='search'>Search Time Ranking Factors</h2>
-calculateControlWords (SearchController)
-
-guessSemantics (PhraseModel)
+<p>We are at last in a position to describe how Yioop calculates
+the three scores Doc Rank, Relevance, and Proximity at query time. When
+a query comes into Yioop, it goes through the following stages before an
+actual lookup is performed against an index.
+</p>
+<ol>
+<li>Control words are calculated.</li>
+<li>An attempt is made to guess the semantics of the query.</li>
+<li>Stemming or character n-gramming is done on the query, and acronyms
+and abbreviations are rewritten.</li>
+</ol>
-stemming word gramming
-special characters and acronyms
Network Versus non network queries
Grouping (links and documents) deduplication
Conjunctive queries
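The three query pre-processing stages listed earlier can be sketched as a simple pipeline. Every helper below is an invented stand-in for illustration, not Yioop's actual SearchController or PhraseModel code:

```python
# Toy pipeline for the three query pre-processing stages: control-word
# extraction, semantic guessing, then stemming/acronym rewriting.
# All helper names and heuristics are invented stand-ins.

def calculate_control_words(terms):
    # Stage 1: split off control words (here, assumed to be any
    # term ending in ":") from ordinary search terms.
    controls = [t for t in terms if t.endswith(":")]
    return controls, [t for t in terms if not t.endswith(":")]

def guess_semantics(terms):
    # Stage 2: guess query intent, e.g. treat a bare host name
    # as a site query.
    return ["site:" + t if t.startswith("www.") else t for t in terms]

def stem_terms(terms):
    # Stage 3: crude stemming stand-in; a real engine would use a
    # Porter-style stemmer or character n-gramming and also expand
    # acronyms and abbreviations.
    return [t[:-1] if t.endswith("s") else t for t in terms]

controls, rest = calculate_control_words(["raw:", "running", "www.example.com"])
processed = stem_terms(guess_semantics(rest))
```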