More work on describing ranking in Yioop, a=chris

Chris Pollett [2013-04-22]
More work on describing ranking in Yioop, a=chris
Filename
en-US/pages/ranking.thtml
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 55c09a4..3d0b004 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -440,12 +440,33 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     </p>
     <p>To make this a score out of 10, we can use logarithms:</p>
     <p>`mbox(DOC_RANK) = 10 - log_(10)(mbox(RANK)).`</p>
-    <p>So this gives us a DOC_RANK for one link or summary item stored
+    <p>This gives us a Doc Rank for one link or summary item stored
     in a Yioop index. However, as we will see, this does not give us the
-    complete value of DOC_RANK when computed at query time.</p>
-    <p>Index shards are important for determining relevance and proximity
-    scores as well. An index shard stores the number of doc seen,
-    number of links seen, the sum of the lengths of all summaries, the
+    complete value of Doc Rank for an item when computed at query time.
+    There are also some things to note about this formula:</p>
+    <ol>
+    <li>Unlike PageRank [<a href="#BP1998">BP1998</a>], it is not
+    the logarithm of a probability, but rather the logarithm
+    of a rank. A log probability would preserve information about the relative
+    importance of two pages; i.e., it could say something about how
+    far apart the number 1 page is from the number 2
+    page. Doc Rank as measured so far does not do that.</li>
+    <li>The Doc Rank is a positive number and less than 10 provided the index
+    of the given queue server has fewer than 10 billion items. Since
+    to index 10 billion items using Yioop you would probably want
+    multiple queue servers, Doc Ranks likely remain positive even for larger
+    indexes (see the worked example after this list).</li>
+    <li>Doc Rank is computed by different queue servers independently of each
+    other for the same index. So it is possible for two summaries to
+    have the same Doc Rank in the same index if they are stored on different
+    queue servers.</li>
+    <li>For Doc Ranks to be comparable with each other for the same index on
+    different queue servers, it is assumed that queue servers
+    are indexing at roughly the same speed.</li>
+    </ol>
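+    <p>As a concrete instance of the formula above, an item with
+    `mbox(RANK) = 1000` gets `mbox(DOC_RANK) = 10 - log_(10)(1000) = 7`,
+    while an item ranked `10^9` gets a Doc Rank of only 1.</p>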
+    <p>Besides Doc Rank, index shards are important for determining relevance
+    and proximity scores as well. An index shard stores the number of summaries
+    seen, the number of links seen, the sum of the lengths of all summaries, the
     sum of the lengths of all links. From these we can derive average
     summary lengths and average link lengths. From a posting, the
     number of occurrences of a term in a document can be calculated.
@@ -454,9 +475,71 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     obtained for the particular shard the summary occurs in as a proxy
     for their value throughout all shards. The fact that a posting
     contains a position list of the location of a term within a
-    document will be use when we calculate poximity scores.</p>
+    document will be used when we calculate proximity scores.</p>
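+    <p>To make this bookkeeping concrete, the statistics an index shard
+    keeps, and the averages derived from them, can be modeled roughly as
+    follows. This Python sketch is only illustrative; the names are
+    hypothetical and do not correspond to Yioop's actual index shard
+    code.</p>
+<pre>
+from dataclasses import dataclass
+
+@dataclass
+class ShardStats:
+    # Illustrative model of per-shard statistics (hypothetical names).
+    num_summaries: int    # number of summaries seen
+    num_links: int        # number of links seen
+    sum_summary_len: int  # sum of the lengths of all summaries
+    sum_link_len: int     # sum of the lengths of all links
+
+    def avg_summary_len(self):
+        # average summary length across the shard
+        return self.sum_summary_len / self.num_summaries if self.num_summaries else 0.0
+
+    def avg_link_len(self):
+        # average link length across the shard
+        return self.sum_link_len / self.num_links if self.num_links else 0.0
+</pre>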
     <p>We next turn to the role of a Queue Server's Scheduler in
-    the computation of a page's Doc Rank.</p>
+    the computation of a page's Doc Rank. One easy way for a Scheduler
+    to determine what to crawl next, which Yioop supports, is to
+    use a simple queue. This would yield roughly a breadth-first traversal of
+    the web starting from the seed sites. Since high quality pages are often
+    only a small number of hops from any page on the web, there is some
+    evidence [<a href="#NW2001">NW2001</a>] that this lazy strategy is not too
+    bad for crawling roughly according to document importance. However, there
+    are better strategies. When Page Importance is chosen as
+    the Crawl Order for a Yioop crawl, the Scheduler on each queue server works
+    harder to make schedules so that the next pages to crawl are always the
+    most important pages not yet seen.</p>
+    <p>One well-known algorithm for doing
+    this kind of scheduling is called OPIC (Online Page Importance Computation)
+    [<a href="#APC2003">APC2003</a>].
+    The idea of OPIC is that at the start of a crawl one divides up an initial
+    dollar of cash equally among the starting seed sites. One then picks
+    a site with the highest cash value to crawl next. If this site had `alpha`
+    cash value, then when we crawl it and extract links, we divide up the
+    cash and give it equally to each link. So if there were `n` links,
+    each link would receive `alpha/n` cash from the site. Some of these
+    sites might already have been in the queue, in which case we add to
+    their cash total. For URLs not in the queue, we add them to the queue
+    with initial value `alpha/n`. Each site has two scores: its current
+    cash on hand, and the total earnings the site has ever received. When
+    a page is crawled, its cash on hand is reset to 0. We always choose
+    the next page to crawl from amongst the pages with the most cash
+    (there might be ties). OPIC can be used to get an estimate of the
+    importance of a page by taking its total earnings and dividing them by
+    the total earnings received by all pages in the course of a crawl.
+    </p>
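+    <p>The cash bookkeeping of basic OPIC, as just described, can be
+    sketched in a few lines of Python. The class and method names below
+    are purely illustrative and are not Yioop's actual Scheduler code;
+    as discussed below, Yioop's modified version does not keep the
+    historical earnings totals.</p>
+<pre>
+class OpicScheduler:
+    def __init__(self, seed_urls):
+        # Divide one initial dollar of cash equally among the seed sites.
+        self.cash = {url: 1.0 / len(seed_urls) for url in seed_urls}
+        self.total_earnings = {url: 0.0 for url in seed_urls}
+
+    def next_url(self):
+        # Pick a URL with the most cash on hand (ties broken arbitrarily).
+        return max(self.cash, key=self.cash.get)
+
+    def record_crawl(self, url, out_links):
+        # Credit the page's total earnings, reset its cash to 0, and split
+        # the cash equally among its out-links, queueing any unseen links.
+        alpha = self.cash.pop(url, 0.0)
+        self.total_earnings[url] = self.total_earnings.get(url, 0.0) + alpha
+        if not out_links:
+            return  # this cash "goes missing"; renormalization handles it
+        share = alpha / len(out_links)
+        for link in out_links:
+            self.cash[link] = self.cash.get(link, 0.0) + share
+
+    def importance(self, url):
+        # OPIC importance estimate: a page's earnings over all earnings so far.
+        total = sum(self.total_earnings.values())
+        return self.total_earnings.get(url, 0.0) / total if total else 0.0
+</pre>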
+    <p>In experiments in the original paper, OPIC was shown to crawl in a better
+    approximation to PageRank order than breadth-first search. Bidoki
+    and Yazdani [<a href="#BY2008">BY2008</a>] have more recently proposed
+    a new page importance measure, DistanceRank. They
+    also confirm that OPIC does better than breadth-first search, but show that
+    the computationally more expensive PartialPageRank and Partial DistanceRank
+    perform even better. Yioop uses a modified version
+    of OPIC to choose which page to crawl next.</p>
+    <p>To save a fair bit of crawling
+    overhead, Yioop does not keep historical totals of all the
+    earnings each crawled page has received; the cash-based approach is only used for
+    scheduling. Here are some of the changes and issues addressed in the
+    OPIC-based algorithm employed by Yioop:
+    </p>
+    <ul>
+    <li>A Scheduler must ensure robots.txt files are crawled before
+    any other page on the host. To do this, robots.txt files are inserted
+    into the queue before any page from that site. Until the robots.txt
+    file for a host is crawled, it receives cash whenever a page on
+    that host receives cash.</li>
+    <li>The cash that a robots.txt file receives is divided amongst
+    any sitemap links on that page. Together with the last point,
+    this means cash totals no longer sum to one.</li>
+    <li>Cash might go missing for several reasons: (a) An image page,
+    or any other page without outgoing links, might be downloaded. (b)
+    A page might receive cash
+    and later the Scheduler receives robots.txt information saying it cannot
+    be crawled. (c) Round-off errors due to floating point precision.
+    For these reasons, the Scheduler periodically renormalizes
+    the total amount of cash (a sketch of this renormalization follows
+    this list).</li>
+    <li>A robots.txt file or a slow host might cause the scheduler to
+    crawl-delay all the pages on the host.</li>
+    </ul>
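+    <p>A minimal sketch of the renormalization step mentioned above,
+    reusing the hypothetical in-memory cash map from the earlier OPIC
+    sketch (again, not Yioop's actual code), might look like:</p>
+<pre>
+def renormalize(cash, target_total=1.0):
+    # Rescale cash on hand so the queue's total returns to target_total,
+    # compensating for cash lost to link-less pages, robots.txt exclusions,
+    # and floating point round-off.
+    current = sum(cash.values())
+    if current > 0:
+        factor = target_total / current
+        for url in cash:
+            cash[url] *= factor
+</pre>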
 How data is split amongst Fetchers, Queue Servers and Name Servers
 Web Versus Archive Crawl
 The order in which something is crawled. Opic or Breadth-first
@@ -498,6 +581,12 @@ How related queries work
 In: Proceedings of the 12th international conference on World Wide Web.
 pp. 280-290. 2003.
 </dd>
+<dt id='BP1998'>[BP1998]</dt>
+<dd>Brin, S. and Page, L.
+<a href="http://infolab.stanford.edu/~backrub/google.html"
+    >The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>.
+In: Seventh International World-Wide Web Conference
+(WWW 1998). April 14-18, 1998. Brisbane, Australia.</dd>
 <dt id="CCB2009">[CCB2009]</dt>
 <dd>Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher.
 <a href="http://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf"
@@ -508,11 +597,28 @@ and Development in Information Retrieval. pp.758--759. 2009.
 </dd>

 <dt id="LLWL2009">[LLWL2009]</dt>
-<dd>H.-T. Lee, D. Leonard, X. Wang, D. Loguinov.
+<dd>H.-T. Lee, D. Leonard, X. Wang, D. Loguinov.
 <a href="http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf"
 >IRLbot: Scaling to 6 Billion Pages and Beyond</a>.
 ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
 </dd>
+
+<dt id="BY2008">[BY2008]</dt>
+<dd>A. M. Z. Bidoki and Nasser Yazdani.
+<a href="http://goanna.cs.rmit.edu.au/~aht/tiger/DistanceRank.pdf"
+>DistanceRank: An intelligent ranking algorithm for web pages</a>.
+Information Processing and Management. Vol. 44. Iss. 2. pp. 877--892.
+March, 2008.
+</dd>
+
+<dt id="NW2001">[NW2001]</dt>
+<dd>Marc Najork and Janet L. Wiener.
+<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.9301
+&rep=rep1&type=pdf"
+>Breadth-First Search Crawling Yields High-Quality Pages</a>.
+In: Proceedings of the 10th international conference on World Wide Web.
+pp. 114--118. 2001.
+</dd>
 <dt id="VLZ2012">[VLZ2012]</dt>
 <dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
 <a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"