Yet more work on describing ranking in Yioop, a=chris
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 3d0b004..ee2d315 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -518,7 +518,7 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
<p>To save a fair bit of crawling
overhead, Yioop does not keep for each site crawled historical totals of all
earnings a page has received; the cash-based approach is only used for
- scheduling. Here are some of the changes and issues addressed in the
+ scheduling. Here are some of the issues addressed in the
OPIC-based algorithm employed by Yioop:
</p>
<ul>
@@ -527,18 +527,45 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
into the queue before any page from that site. Until the robots.txt
file for a page is crawled, it receives cash whenever a page on
that host receives cash.</li>
- <li>The cash that a robots.txt file receives is be divided amongst
- any sitemap links on that page. Together with the last point,
- this means cash totals no longer sum to one.</li>
+ <li>A fraction `alpha` of the cash that a robots.txt file receives is
+ divided amongst any sitemap links on that page. Not all of the cash is
+ shared, to prevent sitemaps from "swamping" the queue. Currently,
+ `alpha` is set to 0.25. Together with the last bullet point, the fact
+ that we do share some cash means cash totals no longer sum to one
+ (see the first sketch after this list).</li>
<li>Cash might go missing for several reasons: (a) An image page, or
any other page, might be downloaded with no outgoing links. (b)
A page might receive cash
- and later the scheduler receives robots.txt information saying it cannot
+ and later the Scheduler receives robots.txt information saying it cannot
be crawled. (c) Round-off errors due to floating point precision.
For these reasons, the Scheduler periodically renormalizes
the total amount of cash.</li>
- <li>A robots.txt file or a slow host might cause the scheduler to
- crawl-delay all the pages on the host.</li>
+ <li>A robots.txt file or a slow host might cause the Scheduler to
+ crawl-delay all the pages on the host. These pages might receive
+ sufficient cash to be scheduled earlier, but will not be, because a
+ minimum time gap must be maintained between requests to that host.</li>
+ <li>When a schedule is made containing URLs from a
+ crawl-delayed host, further URLs from that host cannot be scheduled
+ until the Fetcher processing that schedule completes it. If a Scheduler
+ receives a "to crawl" URL from a crawl-delayed host, and there are
+ already MAX_WAITING_HOSTS many crawl-delayed hosts in the queue,
+ then Yioop discards the URL.
+ </li>
+ <li>The Scheduler has a maximum in-memory queue size based on
+ NUM_URLS_QUEUE_RAM (320,000 URLs in a 2GB memory configuration). It
+ will wait to read in new "to crawl" schedule files from Fetchers
+ if reading a file would mean going over this count. For a typical
+ web crawl, this means the "to crawl" files build up on disk much like
+ a breadth-first queue (see the second sketch after this list).
+ </li>
+ <li>To make a schedule, the Scheduler processes the queue
+ from highest priority to lowest. The up to 5000 URLs in the schedule
+ are split into slots of 100, where each slot of 100 is required to
+ take the Fetcher at least MINIMUM_FETCH_LOOP_TIME (5 seconds). URLs
+ are inserted into the schedule at the earliest available position. If
+ a URL is crawl-delayed, it is inserted at the earliest position in a
+ slot sufficiently far from any previous URL for that host to ensure
+ that the crawl-delay condition is met (see the last sketch after this
+ list).</li>
</ul>
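+ <p>To make the cash mechanics above concrete, here is a minimal sketch,
+ in Python rather than Yioop's PHP, of how a downloaded page's cash might
+ be split amongst its links and how the periodic renormalization could
+ work. Apart from the value of `alpha`, the names and interfaces here
+ are illustrative, not actual Yioop identifiers.</p>
+ <pre>
+ ALPHA = 0.25  # fraction of a robots.txt file's cash shared with sitemaps
+
+ def distribute_cash(cash, out_links, is_robots_txt, sitemap_links, queue):
+     """Split a downloaded page's cash amongst the URLs it links to.
+     Here queue is simply a dict mapping url to accumulated cash."""
+     if is_robots_txt and sitemap_links:
+         # Only the fraction ALPHA goes to sitemap links, so sitemaps
+         # cannot swamp the queue; the remainder is not redistributed.
+         share = ALPHA * cash / len(sitemap_links)
+         for url in sitemap_links:
+             queue[url] = queue.get(url, 0.0) + share
+     elif out_links:
+         share = cash / len(out_links)
+         for url in out_links:
+             queue[url] = queue.get(url, 0.0) + share
+     # A page with no outgoing links (an image, say) lets its cash
+     # vanish: one reason cash totals stop summing to one.
+
+ def renormalize(queue):
+     """Rescale all cash so the queue again sums to one."""
+     total = sum(queue.values())
+     if total > 0:
+         for url in queue:
+             queue[url] /= total
+ </pre>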
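+ <p>The second sketch, again in Python with illustrative names, shows
+ one way the queue limits in the bullet points above could work: a
+ schedule file is left on disk if reading it would exceed the in-memory
+ cap, and a crawl-delayed URL is discarded when too many crawl-delayed
+ hosts are already waiting. The value of NUM_URLS_QUEUE_RAM comes from
+ the text above; the value for MAX_WAITING_HOSTS is made up for the
+ example.</p>
+ <pre>
+ NUM_URLS_QUEUE_RAM = 320000  # in-memory queue cap (2GB configuration)
+ MAX_WAITING_HOSTS = 250      # illustrative value only
+
+ def admit_schedule_file(queue, waiting_hosts, new_urls):
+     """Try to read a "to crawl" schedule file into the queue.
+     new_urls is a list of (url, cash, host, crawl_delayed) tuples."""
+     if len(queue) + len(new_urls) > NUM_URLS_QUEUE_RAM:
+         return False  # leave the file on disk for a later pass
+     for url, cash, host, crawl_delayed in new_urls:
+         if crawl_delayed and host not in waiting_hosts:
+             if len(waiting_hosts) >= MAX_WAITING_HOSTS:
+                 continue  # too many waiting hosts: discard this URL
+             waiting_hosts.add(host)
+         queue[url] = queue.get(url, 0.0) + cash
+     return True
+ </pre>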
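+ <p>Finally, a sketch of the slot-based schedule construction from the
+ last bullet point. The constant values (5000 URLs, slots of 100, 5
+ seconds) come from the description above, but apart from
+ MINIMUM_FETCH_LOOP_TIME the names, and the code itself, are an
+ illustration rather than Yioop's implementation.</p>
+ <pre>
+ import math
+
+ MAX_URLS_PER_SCHEDULE = 5000
+ SLOT_SIZE = 100
+ MINIMUM_FETCH_LOOP_TIME = 5  # seconds a Fetcher must spend per slot
+ NUM_SLOTS = MAX_URLS_PER_SCHEDULE // SLOT_SIZE
+
+ def insert_url(slots, last_slot_for_host, url, host, crawl_delay):
+     """Place url in the earliest slot that respects its crawl delay.
+     slots is a list of NUM_SLOTS lists, each holding up to SLOT_SIZE
+     urls; last_slot_for_host maps a host to its last used slot index."""
+     # Each slot takes at least MINIMUM_FETCH_LOOP_TIME seconds, so
+     # keeping a host's urls this many slots apart enforces the delay.
+     gap = math.ceil(crawl_delay / MINIMUM_FETCH_LOOP_TIME)
+     start = 0
+     if crawl_delay and host in last_slot_for_host:
+         start = last_slot_for_host[host] + gap
+     for i in range(start, NUM_SLOTS):
+         if len(slots[i]) != SLOT_SIZE:  # slot i still has room
+             slots[i].append(url)
+             last_slot_for_host[host] = i
+             return True
+     return False  # no legal position; url stays in the queue
+ </pre>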
How data is split amongst Fetchers, Queue Servers and Name Servers
Web Versus Archive Crawl