More on ranking, a=chris

Chris Pollett [2013-04-23]
Filename
en-US/pages/ranking.thtml
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index ee2d315..d9963fc 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -561,27 +561,48 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     <li>To make a schedule, the Scheduler starts processing the queue
     from highest priority to lowest. The up to 5000 urls in the schedule
     are split into slots of 100, where each slot of 100 will be required by the
-    fetcher to take at MINIMUM_FETCH_LOOP_TIME (5 seconds). Urls
+    fetcher to take at least MINIMUM_FETCH_LOOP_TIME (5 seconds). Urls
     are inserted into the schedule at the earliest available position. If
     a URL is crawl-delayed, it is inserted at the earliest position in the
     slot sufficiently far from any previous url for that host to ensure that
     the crawl-delay condition is met (a sketch of this insertion appears
     after the list).</li>
     </ul>
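+<p>As a rough illustration of the slot insertion just described, here is a
+short Python sketch (not Yioop's actual PHP code; the function insert_url
+and the 50-slot schedule size are illustrative assumptions). It places a url
+at the earliest free position whose fetch time is far enough after the
+previous url for the same host to satisfy any crawl delay:</p>
+<pre>
+MINIMUM_FETCH_LOOP_TIME = 5  # seconds a fetcher spends on a slot of 100 urls
+SLOT_SIZE = 100
+NUM_SLOTS = 50               # 50 slots of 100 urls = 5000 urls per schedule
+
+def insert_url(schedule, url, host, crawl_delay=0):
+    """Put url in the earliest open position; if crawl-delayed, keep it at
+    least crawl_delay seconds after the last url scheduled for that host."""
+    last_time = max((slot * MINIMUM_FETCH_LOOP_TIME
+                     for slot in range(NUM_SLOTS) for pos in range(SLOT_SIZE)
+                     if schedule[slot][pos] and schedule[slot][pos][1] == host),
+                    default=None)
+    for slot in range(NUM_SLOTS):
+        slot_time = slot * MINIMUM_FETCH_LOOP_TIME  # when this slot is fetched
+        if (crawl_delay and last_time is not None
+                and slot_time - last_time < crawl_delay):
+            continue  # too soon after the previous url for this host
+        for pos in range(SLOT_SIZE):
+            if schedule[slot][pos] is None:
+                schedule[slot][pos] = (url, host)
+                return True
+    return False  # schedule is full; the url waits for a later schedule
+
+schedule = [[None] * SLOT_SIZE for _ in range(NUM_SLOTS)]
+insert_url(schedule, "http://example.com/", "example.com")
+insert_url(schedule, "http://example.com/a", "example.com", crawl_delay=10)
+</pre>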
-How data is split amongst Fetchers, Queue Servers and Name Servers
-Web Versus Archive Crawl
-The order in which something is crawled. Opic or Breadth-first
-Company level domains
-robots.txt crawl delay.
-Queue size in ram. Schedule on disk.
-Page Range Request
-Mimetype
-Summary Extraction. Title description link extraction
-(what are important elements on page for html.
-Page Rules
-Statistics come from mini inverted indexes, not whole crawl.
-Stemming or char gramming
-n-gram word filter
-special characters and acronyms
+<p>The actual distribution of a page's cash to its urls is done in the
+Fetcher. This is done in a different manner than in the OPIC paper, and it
+is handled differently for sitemap pages than for all other web pages. For a
+sitemap page with `n` links, let</p>
+<p class="center">`\gamma = sum_(j=1)^n 1/j^2`.</p>
+<p>Let `C` denote the cash that the sitemap has to distribute. Then the `i`th
+link on the sitemap page receives cash</p>
+<p class="center">`C_i = C/(gamma cdot i^2)`.</p>
+<p>One can verify that `sum_(i=1)^n C_i = (C/gamma) sum_(i=1)^n 1/i^2 = C`,
+so no cash is lost. This weighting tends to favor links early in the sitemap
+and to prevent the crawling of sitemap links from clustering together too
+much.</p>
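+<p>As a concrete check of the sitemap weighting above, here is a short
+Python sketch (not Yioop's actual PHP code; the function name sitemap_cash
+is just for illustration):</p>
+<pre>
+def sitemap_cash(total_cash, num_links):
+    """Cash given to each of num_links sitemap links: the ith link
+    gets total_cash / (gamma * i^2), where gamma = sum of 1/j^2."""
+    gamma = sum(1.0 / j ** 2 for j in range(1, num_links + 1))
+    return [total_cash / (gamma * i ** 2) for i in range(1, num_links + 1)]
+
+shares = sitemap_cash(1.0, 4)
+print(shares)       # roughly [0.70, 0.18, 0.08, 0.04] -- early links favored
+print(sum(shares))  # 1.0 (up to rounding), so no cash is lost
+</pre>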
+<p>For a non-sitemap page, we split the cash by making use of the notion of
+a company level domain (cld). This is a slight simplification of the notion
+of a pay level domain (pld) defined in [<a href="#LLWL2009">LLWL2009</a>].
+For a host of the form something.2chars.2chars or
+blah.something.2chars.2chars, the company level domain is
+something.2chars.2chars. For example, for www.yahoo.co.uk, the company level
+domain is yahoo.co.uk. For any other url, stuff.2ndlevel.tld, the company
+level domain is 2ndlevel.tld. For example, for www.yahoo.com, the company
+level domain is yahoo.com.</p>
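+<p>A minimal Python sketch of this rule (the helper name cld is assumed;
+Yioop's own code differs), reproducing the two examples above:</p>
+<pre>
+def cld(host):
+    """Company level domain: keep the last three labels when the final two
+    labels are both two characters long, otherwise keep the last two."""
+    parts = host.split('.')
+    if len(parts) >= 3 and len(parts[-1]) == 2 and len(parts[-2]) == 2:
+        return '.'.join(parts[-3:])
+    return '.'.join(parts[-2:])
+
+print(cld("www.yahoo.co.uk"))  # yahoo.co.uk
+print(cld("www.yahoo.com"))    # yahoo.com
+</pre>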
+<p>To distribute cash to links on a page, we first compute the company level
+domain for the hostname of the page's url, then for each link we compute the
+company level domain of its host. Let `n` denote the number of links on the
+page and let `s` denote the number of links with the same company level
+domain as the page. If the cld of a link is the same as that of the page,
+and the page has cash `C`, then the link will receive cash:
+</p>
+<p class='center'>`frac{C}{2n}`</p>
+<p>Notice this is half what it would get under usual OPIC. On the
+other hand, links to a different cld will receive cash:</p>
+<p class='center'>`frac{C - s times C/(2n)}{n-s}`</p>
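+<p>Combining the two payouts, here is a Python sketch (the function name
+link_cash is assumed; the clds are passed in precomputed, for instance by a
+helper like the cld sketch above) showing that the amounts sum back to `C`
+whenever at least one link leaves the page's cld:</p>
+<pre>
+def link_cash(page_cash, page_cld, link_clds):
+    """Split page_cash among links: same-cld links get C/(2n); the rest is
+    shared evenly by the n - s links pointing outside the page's cld."""
+    n = len(link_clds)
+    s = sum(1 for c in link_clds if c == page_cld)
+    same = page_cash / (2 * n)
+    other = (page_cash - s * same) / (n - s) if s < n else 0.0
+    return [same if c == page_cld else other for c in link_clds]
+
+cash = link_cash(1.0, "yahoo.com", ["yahoo.com", "bbc.co.uk", "bbc.co.uk"])
+print(cash)       # [0.1666..., 0.4166..., 0.4166...]
+print(sum(cash))  # 1.0 -- no cash lost since some links leave yahoo.com
+</pre>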
+<p>The idea is to avoid link farms with a lot of internal links. As long
+as there is at least one link to a different cld, the payout of a page
+to its links will sum to `C`. If no links go outside the page's cld, then
+some cash will be lost. If someone is deliberately crawling only one site,
+this lost cash will get replaced during normalization, and the above scheme
+essentially reduces to usual OPIC.</p>
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id='search'>Search Time Ranking Factors</h2>
 calculateControlWords (SearchController)