Yet more work on describing ranking in Yioop, a=chris
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 3d0b004..ee2d315 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -518,7 +518,7 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
<p>To save a fair bit of crawling
overhead, Yioop does not keep for each site crawled historical totals of all
earnings a page has received; the cash-based approach is only used for
- scheduling. Here are some of the changes and issues addressed in the
+ scheduling. Here are some of the issues addressed in the
OPIC-based algorithm employed by Yioop:
</p>
<ul>
@@ -527,18 +527,45 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
into the queue before any page from that site. Until the robots.txt
file for a page is crawled, it receives cash whenever a page on
that host receives cash.</li>
- <li>The cash that a robots.txt file receives is be divided amongst
- any sitemap links on that page. Together with the last point,
- this means cash totals no longer sum to one.</li>
+ <li>A fraction `alpha` of the cash that a robots.txt file receives is
+ divided amongst any sitemap links on that page. Not all of the cash is
+ shared, to prevent sitemaps from "swamping" the queue. Currently,
+ `alpha` is set to 0.25. Together with the last bullet point, the fact
+ that we do share some cash means cash totals no longer sum to one
+ (see the first sketch after this list).</li>
<li>Cash might go missing for several reasons: (a) An image page, or
any other page, might be downloaded with no outgoing links. (b)
A page might receive cash
- and later the scheduler receives robots.txt information saying it cannot
+ and later the Scheduler receives robots.txt information saying it cannot
be crawled. (c) Round-off errors due to floating point precision.
For these reasons, the Scheduler periodically renormalizes
the total amount of cash.</li>
- <li>A robots.txt file or a slow host might cause the scheduler to
- crawl-delay all the pages on the host.</li>
+ <li>A robots.txt file or a slow host might cause the Scheduler to
+ crawl-delay all the pages on the host. These pages might receive
+ sufficient cash to be scheduled earlier, but will not be, because a
+ minimum time gap must be maintained between requests to that host.</li>
+ <li>When a schedule is made containing URLs from a
+ crawl-delayed host, further URLs from that host cannot be scheduled
+ until the Fetcher processing that schedule completes it. If a Scheduler
+ receives a "to crawl" URL from a crawl-delayed host, and there are
+ already MAX_WAITING_HOSTS many crawl-delayed hosts in the queue,
+ then Yioop discards the URL.
+ </li>
+ <li>The Scheduler has a maximum in-memory queue size based on
+ NUM_URLS_QUEUE_RAM (320,000 URLs in a 2GB memory configuration). It
+ will wait to read in new "to crawl" schedule files from Fetchers
+ if reading a file would mean going over this count. For a typical
+ web crawl, this means the "to crawl" files build up on disk much like
+ a breadth-first queue (see the second sketch after this list).
+ </li>
+ <li>To make a schedule, the Scheduler processes the queue
+ from highest priority to lowest. The up to 5000 URLs in the schedule
+ are split into slots of 100, where each slot of 100 is required to
+ take the Fetcher at least MINIMUM_FETCH_LOOP_TIME (5 seconds). URLs
+ are inserted into the schedule at the earliest available position. If
+ a URL is crawl-delayed, it is inserted at the earliest position in a
+ slot sufficiently far from any previous URL for that host to ensure
+ that the crawl-delay condition is met (see the last sketch after this
+ list).</li>
</ul>
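+ <p>To make the cash mechanics above concrete, here is a minimal sketch,
+ in Python rather than Yioop's PHP, of how a downloaded page's cash might
+ be split amongst its links and how the periodic renormalization could
+ work. Apart from the value of `alpha`, the names and interfaces here
+ are illustrative, not actual Yioop identifiers.</p>
+ <pre>
+ ALPHA = 0.25  # fraction of a robots.txt file's cash shared with sitemaps
+
+ def distribute_cash(cash, out_links, is_robots_txt, sitemap_links, queue):
+     """Split a downloaded page's cash amongst the URLs it links to.
+     Here queue is simply a dict mapping url to accumulated cash."""
+     if is_robots_txt and sitemap_links:
+         # Only the fraction ALPHA goes to sitemap links, so sitemaps
+         # cannot swamp the queue; the remainder is not redistributed.
+         share = ALPHA * cash / len(sitemap_links)
+         for url in sitemap_links:
+             queue[url] = queue.get(url, 0.0) + share
+     elif out_links:
+         share = cash / len(out_links)
+         for url in out_links:
+             queue[url] = queue.get(url, 0.0) + share
+     # A page with no outgoing links (an image, say) lets its cash
+     # vanish: one reason cash totals stop summing to one.
+
+ def renormalize(queue):
+     """Rescale all cash so the queue again sums to one."""
+     total = sum(queue.values())
+     if total > 0:
+         for url in queue:
+             queue[url] /= total
+ </pre>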
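+ <p>The second sketch, again in Python with illustrative names, shows
+ one way the queue limits in the bullet points above could work: a
+ schedule file is left on disk if reading it would exceed the in-memory
+ cap, and a crawl-delayed URL is discarded when too many crawl-delayed
+ hosts are already waiting. The value of NUM_URLS_QUEUE_RAM comes from
+ the text above; the value for MAX_WAITING_HOSTS is made up for the
+ example.</p>
+ <pre>
+ NUM_URLS_QUEUE_RAM = 320000  # in-memory queue cap (2GB configuration)
+ MAX_WAITING_HOSTS = 250      # illustrative value only
+
+ def admit_schedule_file(queue, waiting_hosts, new_urls):
+     """Try to read a "to crawl" schedule file into the queue.
+     new_urls is a list of (url, cash, host, crawl_delayed) tuples."""
+     if len(queue) + len(new_urls) > NUM_URLS_QUEUE_RAM:
+         return False  # leave the file on disk for a later pass
+     for url, cash, host, crawl_delayed in new_urls:
+         if crawl_delayed and host not in waiting_hosts:
+             if len(waiting_hosts) >= MAX_WAITING_HOSTS:
+                 continue  # too many waiting hosts: discard this URL
+             waiting_hosts.add(host)
+         queue[url] = queue.get(url, 0.0) + cash
+     return True
+ </pre>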
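+ <p>Finally, a sketch of the slot-based schedule construction from the
+ last bullet point. The constant values (5000 URLs, slots of 100, 5
+ seconds) come from the description above, but apart from
+ MINIMUM_FETCH_LOOP_TIME the names, and the code itself, are an
+ illustration rather than Yioop's implementation.</p>
+ <pre>
+ import math
+
+ MAX_URLS_PER_SCHEDULE = 5000
+ SLOT_SIZE = 100
+ MINIMUM_FETCH_LOOP_TIME = 5  # seconds a Fetcher must spend per slot
+ NUM_SLOTS = MAX_URLS_PER_SCHEDULE // SLOT_SIZE
+
+ def insert_url(slots, last_slot_for_host, url, host, crawl_delay):
+     """Place url in the earliest slot that respects its crawl delay.
+     slots is a list of NUM_SLOTS lists, each holding up to SLOT_SIZE
+     urls; last_slot_for_host maps a host to its last used slot index."""
+     # Each slot takes at least MINIMUM_FETCH_LOOP_TIME seconds, so
+     # keeping a host's urls this many slots apart enforces the delay.
+     gap = math.ceil(crawl_delay / MINIMUM_FETCH_LOOP_TIME)
+     start = 0
+     if crawl_delay and host in last_slot_for_host:
+         start = last_slot_for_host[host] + gap
+     for i in range(start, NUM_SLOTS):
+         if len(slots[i]) != SLOT_SIZE:  # slot i still has room
+             slots[i].append(url)
+             last_slot_for_host[host] = i
+             return True
+     return False  # no legal position; url stays in the queue
+ </pre>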
How data is split amongst Fetchers, Queue Servers and Name Servers
Web Versus Archive Crawl