add ranking page, a=chris

Chris Pollett [2013-04-21]
Filename
en-US/pages/ranking.thtml
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
new file mode 100644
index 0000000..55c09a4
--- /dev/null
+++ b/en-US/pages/ranking.thtml
@@ -0,0 +1,529 @@
+<div class="docs">
+<h1>Yioop Ranking Mechanisms</h1>
+    <h2 id='toc'>Table of Contents</h2>
+    <ul>
+        <li><a href="#intro">Introduction</a></li>
+        <li><a href="#crawl">Crawl Time Ranking Factors</a>
+            <ul>
+            <li><a href="#crawl-processes">Crawl Processes</a></li>
+            <li><a href="#fetchers">Fetchers and their Effect on Search
+                Ranking</a></li>
+            <li><a href="#queue-servers">Queue Servers and their Effect on
+                Search Ranking</a></li>
+            </ul>
+        </li>
+        <li><a href="#search">Search Time Ranking Factors</a></li>
+        <li><a href="#references">References</a></li>
+    </ul>
+    <h2 id='intro'>Introduction</h2>
+    <p>
+    A typical query to Yioop is a collection of terms without the use
+    of the OR operator, '|', or the exact match operator (double
+    quotes around a phrase). On such a query, called a <b>conjunctive
+    query</b>, Yioop tries to return documents which contain all of the query
+    terms.
+    Yioop further tries to return these documents in descending order of score.
+    Most users only look at the first ten of the results returned. This article
+    tries to explain the different factors which influence whether a page that
+    has all the terms will make it into the top ten. To keep things simple
+    we will assume that the query is being performed on a single Yioop
+    index rather than a crawl mix of several indexes. We will also ignore
+    how news feed search items get incorporated into results.
+    </p>
+    <p>At its heart, Yioop currently relies on three main scores
+    for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox).
+    Proximity scores are only used if the query has two or more terms.
+    We will describe later how these three scores are calculated.
+    For now, one can think of Doc Rank as roughly indicating how important
+    the document as a whole is, Relevance as measuring how important the
+    search terms are to the document, and Proximity as measuring how close
+    the search terms appear to each other in the document.
+    </p>
+    <p>
+    On a given query, Yioop does not scan its posting lists in full to find
+    every document that satisfies the query. Instead, it scans until it finds
+    a fixed number of documents, say `n`, satisfying the query. It then
+    computes the three scores for each of these `n` documents. For a document
+    `d` from these `n` documents, it determines the rank of `d` with respect to
+    the Doc Rank score, the rank of `d` with respect to the Relevance score,
+    and the rank of `d` with respect
+    to the Proximity score. It finally computes a score for each  of these
+    `n` documents using these three rankings and the so-called
+    <b>reciprocal rank fusion  (RRF)</b>:</p>
+<p class="center">
+`\R\R\F(d) := 200(frac{1}{59 + mbox(Rank)_(DR)(d)} +
+frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
+</p><p>
+    This formula essentially comes from Cormack et al.
+    [<a href="#CCB2009">CCB2009</a>]. They do not
+    use the factor `200` and use `60` rather than `59`. `\R\R\F(d)` is known
+    to do a decent job of combining scores, although there are some
+    recent techniques such as LambdaRank [<a href="#VLZ2012">VLZ2012</a>],
+    which do significantly better at the
+    expense of being harder to compute. To return results,
+    Yioop computes the top ten of
+    these `n` documents with respect to `\R\R\F(d)` and returns these
+    documents.</p>
+    <p> To get a feeling for how the `\R\R\F(d)` formula works, consider some
+    particular example situations:
+    If a document ranked 1 with respect to each score, then
+    `\R\R\F(d) = 200(3/(59+1)) = 10`.  If a document
+    ranked n for each score, then `\R\R\F(d) = 200(3/(59+n)) = 600/(59 + n)`.
+    As `n -> infty` this goes to 0. A value `n = 200` is often used with
+    Yioop. For this `n`, `600/(59 + n) approx 2.32`.
+    If a document
+    ranked 1 on one of the three scores, but ranked `n` on the other two,
+    `\R\R\F(d) = 200/60 + 400/(59 +n) approx 3.33 + 400/(59 + n)`. The last
+    term again goes to 0 as `n` gets larger, giving a maximum
+    score of `3.33`. For the `n=200` case, one gets a score of `4.88`.
+    So because the three component scores are converted to ranks,
+    and then reciprocal rank fusion is used, one cannot solely use a good score
+    on one of the three components to get a good score overall.</p>
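+    <p>As a small illustration of this calculation (a sketch, not Yioop's
+    actual code), the fusion score for a document with known ranks could be
+    computed as follows:</p>
+    <pre>
+    function rrf($rank_dr, $rank_rel, $rank_prox)
+    {
+        // 59 and the scale factor 200 are the constants discussed above
+        return 200 * (1 / (59 + $rank_dr) + 1 / (59 + $rank_rel) +
+            1 / (59 + $rank_prox));
+    }
+    echo rrf(1, 1, 1);     // 10, ranked first on all three scores
+    echo rrf(1, 200, 200); // about 4.88, first on one score, 200th on the others
+    </pre>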
+    <p>An underlying assumption used by Yioop is that the first `n` matching
+    documents in Yioop's posting lists contain the 10 most important documents
+    with respect to our scoring function. For this assumption to be valid our
+    posting list must be roughly sorted according to score. For Yioop though,
+    the first `n` documents will in fact most likely be the first `n` documents
+    that Yioop indexed. This does not contradict the assumption
+    provided we index documents according to their importance. To do
+    this, Yioop tries to index according to Doc Rank and assumes the effects
+    of relevance and proximity are not too drastic. That is, they
+    might be able to move the 100th document into the top 10, but not, say,
+    the 1000th document into the top 10.</p>
+    <p>To see how it is
+    possible to roughly index according to document importance, we next
+    examine how data is acquired during a Yioop web crawl (the process
+    for an archive crawl is somewhat different). This is not only important for
+    determining the Doc
+    Rank of a page, but the text extraction that occurs after the page is
+    downloaded also affects the Relevance and Proximity scores. Once we
+    are done describing these crawl/indexing time factors affecting scores,
+    we will then consider search time factors which affect the scoring
+    of documents.</p>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='crawl'>Crawl Time Ranking Factors</h2>
+    <h3 id='crawl-processes'>Crawl Processes</h3>
+    <p>A Yioop Crawl has three types of processes:</p>
+    <ol>
+    <li>A Name server, which acts as an overall coordinator for the crawl,
+    and which is responsible for starting and stopping the crawl</li>
+    <li>One or more Queue Servers, each of which maintains a priority queue of
+    what to download next.</li>
+    <li>One or more Fetchers, which actually download pages, and do initial
+    page processing.</li>
+    </ol>
+    <p>A crawl is started through the Yioop Web app on
+    the Name Server. For each url in the list of starting urls (Seed Sites),
+    its hostname is computed and hashed, and,
+    based on this hash, that url is sent to a given queue server -- all
+    urls with the same hostname will be handled by the same queue server.
+    Fetchers periodically check the Name Server to see if there is an
+    active crawl, and if so, what its timestamp is. If there is an
+    active crawl, a Fetcher would then pick a Queue Server and request
+    a schedule of urls to download. By default, this can be as many as
+    DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.</p>
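+    <p>The assignment of urls to queue servers might be sketched as below
+    (the hash function shown is only illustrative -- it is not necessarily
+    the one Yioop uses):</p>
+    <pre>
+    function queueServerIndex($url, $num_queue_servers)
+    {
+        $host = parse_url($url, PHP_URL_HOST);
+        // urls with the same hostname always hash to the same queue server
+        return abs(crc32($host)) % $num_queue_servers;
+    }
+    </pre>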
+    <h3 id='fetchers'>Fetchers and their Effect on Search Ranking</h3>
+    <p>After receiving a schedule of urls, the fetcher downloads pages in
+    batches of a hundred pages at a time. When the fetcher requests a URL for
+    download it sends a range request header asking for the first
+    PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Some servers do not
+    know how many bytes they will send before sending; for instance, they
+    might operate in "chunked" mode. So after receiving the page, the fetcher
+    discards any data after the first PAGE_RANGE_REQUEST many bytes -- this
+    data won't be indexed. Constants that we mention, such as
+    PAGE_RANGE_REQUEST, can be found in configs/config.php.
+    For each page in the batch of a hundred urls downloaded, the
+    fetcher proceeds through a sequence of processing steps to:</p>
+    <ol>
+    <li>Determine page mimetype and choose a page processor.</li>
+    <li>Use the page processor to extract a summary for the document.</li>
+    <li>Apply any indexing plugins for the page processor to generate
+    auxiliary summaries and/or modify the extracted summary.</li>
+    <li>Calculate a hash from the downloaded page minus tags and
+    non-word characters to be used for deduplication (see the sketch
+    after this list).</li>
+    <li>Prune the number of links extracted from the document down to
+    MAX_LINKS_PER_PAGE (defaults to 50).</li>
+    <li>Apply any user-defined page rules to the summary extracted.</li>
+    <li>Store a full cache of the page to disk, and add the location of this
+    cache to the summary. Full cache pages are stored
+    in folders in WORK_DIRECTORY/cache/FETCHER_PREFIX-ArchiveCRAWL_TIMESTAMP.
+    These folders contain gzipped text files, web archives, each made up of
+    the concatenation of up to NUM_DOCS_PER_GENERATION many cache pages.
+    The class representing this whole structure is called a
+    WebArchiveBundle (lib/web_archive_bundle.php). The class for a
+    single file is called a WebArchive (lib/web_archive.php).</li>
+    <li>Keep summaries in fetcher memory until they are shipped
+    off to the appropriate queue server in a process
+    we'll describe later.</li>
+    </ol>
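+    <p>As an example of Step 4, the deduplication hash might be computed
+    along the following lines (a simplification of the actual
+    computation):</p>
+    <pre>
+    function dedupHash($page)
+    {
+        $text = strip_tags($page);                 // remove tags
+        $text = preg_replace('/\W+/u', '', $text); // drop non-word characters
+        return md5($text);
+    }
+    </pre>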
+    <p>
+    After these steps, the fetcher checks the name server to see
+    if any crawl parameters
+    have changed or if the crawl has stopped before proceeding to download
+    the next batch of a hundred urls. It proceeds in this fashion until it
+    has downloaded and processed four to five hundred urls. It then
+    builds a "mini-inverted index" of the documents it has downloaded and
+    sends the inverted index, the summaries, any discovered urls, and any
+    robots.txt data it has downloaded back
+    to the queue server. It also sends back information about which of the
+    hosts the queue server is responsible for are generating more than
+    DOWNLOAD_ERROR_THRESHOLD (10) HTTP errors in a given schedule.
+    These hosts will automatically be crawl-delayed by
+    the queue server. Sending all of this data
+    allows the fetcher to clear some of its memory and continue
+    processing its batch of 5000 urls until it has downloaded all of them.
+    At this point, the fetcher picks another queue server and requests
+    a schedule of urls to download from it and so on.
+    </p>
+    <p>
+    Page rules, which can greatly affect the summary extracted for a page,
+    are described in more detail in the <a
+    href="?c=main&p=documentation#page-options">Page Options Section</a>
+    of the Yioop documentation. Before describing how the
+    "mini-inverted index" processing step is done, let's examine
+    Steps 1, 2, and 5 above in a little more detail, as they are very
+    important in determining what actually is indexed. Based usually on the
+    HTTP headers, a
+    <a href="http://en.wikipedia.org/wiki/Internet_media_type">mimetype</a>
+    for each page is found. The mimetype determines which summary extraction
+    processor (in Yioop terminology, a page processor) is applied to the page.
+    As an example of the key role that the page processor plays in what
+    eventually ends up in a Yioop index, we list what the HTML page processor
+    extracts from a page and how it does this extraction:
+    </p>
+    <dl>
+    <dt>Language</dt><dd>Document language is used to determine
+    how to make terms from the words in a document. For example, if the
+    language is English, Yioop uses the English stemmer on a
+    document. So the word "jumping" in the document will get indexed as
+    "jump". On the other hand, if the language was determined to be Italian
+    then a different stemmer would be used and "jumping" would remain
+    "jumping". The HTML processor determines the language by first looking
+    for a lang attribute on the &lt;html&gt; tag in the document. If
+    none is found, it checks whether the frequency of characters is close
+    enough to English to guess that the document is English. If this fails,
+    it leaves the value blank.</dd>
+    <dt>Title</dt><dd>When search results are displayed, the extracted
+    document title is used as the link text. Words in the title also
+    are given a higher value when Yioop calculates its relevance statistic.
+    The HTML processor uses the contents of the &lt;title&gt; tag
+    as its default title. If this tag is not present or is empty,
+    Yioop then concatenates the contents of the &lt;h1&gt; to &lt;h6&gt;
+    tags in the document. The HTML processor keeps only the
+    first hundred (HtmlProcessor::MAX_TITLE_LEN) characters of the title
+    (see the sketch following this list).
+    </dd>
+    <dt>Description</dt><dd>The description is used when search results
+    are displayed to generate the snippets beneath the result link.
+    Besides the title, it contains the remainder of the words on the page
+    that are used to identify a document. To obtain a description, the HTML
+    processor
+    first takes the value of the content attribute of any &lt;meta&gt; tag
+    whose name attribute case-insensitively matches "description".
+    To this it concatenates the non-tag
+    contents of the first four &lt;p&gt; and &lt;div&gt; tags,
+    followed by the content of &lt;td&gt;, &lt;li&gt;,
+    &lt;dt&gt;, &lt;dd&gt;, and &lt;a&gt; tags until it reaches
+    a maximum of HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters.
+    These items are added from the one
+    with the most characters to the one with the least.</dd>
+    <dt>Links</dt><dd>Links are used by Yioop to obtain new pages
+    to download. They are also treated by Yioop as "mini-documents".
+    The url of such a mini-document is the target of the
+    link, and the link text is used as its description. As we will see
+    during searching, these mini-documents get combined with the
+    summary of the site linked to. The HTML processor extracts
+    links from &lt;a&gt;, &lt;frame&gt;, &lt;iframe&gt;, and &lt;img&gt;
+    tags. It extracts up to 300 links per document. When it extracts
+    links it canonicalizes relative links. If a &lt;base&gt; tag was present
+    it uses it as part of the canonicalization process. Link text is
+    extracted from &lt;a&gt; tag contents and from alt attributes of
+    &lt;img&gt;'s. In addition, rel attributes are examined for robot
+    directives such as nofollow.</dd>
+    <dt>Robot Metas</dt><dd>This is used to keep track of
+    any robot directives that occurred in meta tags in the document.
+    These directives are things such as NOFOLLOW, NOINDEX, NOARCHIVE, and
+    NOSNIPPET. These can affect what links are extracted from the page,
+    whether the page is indexed, whether cached versions of the page
+    will be displayable from the Yioop interface, and whether snippets
+    can appear beneath the link on a search result page. The HTML
+    processor does a case insensitive match on &lt;meta&gt; tags
+    that contain the string "robot" (so it will treat such tags that contain
+    robot and robots the same). It then extracts the directives from
+    the content attribute of such a tag.</dd>
+    </dl>
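+    <p>To make the title extraction rules above concrete, here is a rough
+    sketch of that logic (the actual HtmlProcessor code differs in its
+    details):</p>
+    <pre>
+    function extractTitle(DOMDocument $dom)
+    {
+        $nodes = $dom->getElementsByTagName("title");
+        $title = ($nodes->length > 0) ? trim($nodes->item(0)->textContent) : "";
+        if ($title == "") { // fall back to concatenating h1, ..., h6 contents
+            foreach (array("h1", "h2", "h3", "h4", "h5", "h6") as $tag) {
+                foreach ($dom->getElementsByTagName($tag) as $node) {
+                    $title .= " " . trim($node->textContent);
+                }
+            }
+        }
+        return substr($title, 0, 100); // HtmlProcessor::MAX_TITLE_LEN
+    }
+    </pre>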
+    <p>
+    The page processors for other mimetypes extract similar fields but
+    look at different components of their respective document types.
+    </p>
+    <p>After the page processor is done with a page, non-robot, non-sitemap
+    pages then pass through a pruneLinks method. This culls the up to 300
+    links that might have been extracted down to at most 50. To do this, for
+    each link,
+    the link text is gzipped and the length of the resulting string is
+    determined. The 50 unique links of longest length are then kept. The idea
+    is that we want to keep links whose text carry the most information.
+    Gzipping is a crude way to eliminate text with lots of redundancies.
+    The length then measures how much useful text is left. Having more
+    useful text means that the link is more likely to be helpful to find
+    the document.</p>
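+    <p>A minimal sketch of this culling idea (simplified from the actual
+    pruneLinks code, and assuming $links maps each url to its link
+    text):</p>
+    <pre>
+    function pruneLinks($links, $max_keep = 50)
+    {
+        $scores = array();
+        foreach ($links as $url => $text) {
+            // gzipped length crudely measures non-redundant text
+            $scores[$url] = strlen(gzcompress($text));
+        }
+        arsort($scores); // longest compressed link text first
+        return array_slice(array_keys($scores), 0, $max_keep);
+    }
+    </pre>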
+    <p>
+    Now that we have finished discussing Steps 1, 2, and 5, let's describe
+    what happens when building a mini-inverted index. For the four to five
+    hundred summaries that we have at the start of the mini-inverted index
+    step, we make associative arrays of the form:
+    </p>
+    <pre>
+    term_id_1 =&gt; ...
+    term_id_2 =&gt; ...
+    ...
+    term_id_i =&gt;
+            ((summary_map_1, (positions in summary 1 that term i appeared) ),
+             (summary_map_2, (positions in summary 2 that term i appeared) ),
+              ...)
+    ...
+    </pre>
+    <p>Term IDs are 8 byte strings consisting of the XOR of the two halves
+    of the 16 byte md5 hash of the term. Summary map numbers are
+    offsets into a table which can be used to look up a summary. These
+    numbers are in increasing order of when the page was put into the
+    mini-inverted index. To calculate the position of a term, a string is made
+    from terms extracted from the url followed by the summary title
+    followed by the summary description. One counts
+    the number of terms from the start of this string. For example, suppose
+    we had two summaries:</p>
+    <pre>
+    Summary 1:
+    URL: http://test.yioop.com/
+    Title: Fox Story
+    Description: The quick brown fox jumped over the lazy dog.
+
+    Summary 2:
+    URL: http://test.yioop2.com/
+    Title: Troll Story
+    Description: Once there was a lazy troll, P&amp;A, who lived on my
+        discussion board.
+    </pre>
+    <p>The mini-inverted index might look like:</p>
+    <pre>
+    (
+        [test] => ( (1, (0)), (2, (0)) )
+        [yioop] =>  ( (1, (1)) )
+        [yioop2] =>  ( (2, (1)) )
+        [fox] => ( (1, (2, 7)) )
+        [stori] => ( (1, (3)), (2, (3)) )
+        [the] => ( (1, (4, 10)) )
+        [quick] => ( (1, (5)) )
+        [brown] => ( (1, (6)) )
+        [jump] => ( (1, (8)) )
+        [over] => ( (1, (9)) )
+        [lazi] => ( (1, (11)), (2, (8)) )
+        [dog] => ( (1, (12)) )
+        [troll] => ( (2, (2, 9)) )
+        [onc] => ( (2, (4)) )
+        [there] => ( (2, (5)) )
+        [wa] => ( (2, (6)) )
+        [a] => ( (2, (7)) )
+        [p_and_a] => ( (2, (10)) )
+        [who] => ( (2, (11)) )
+        [live] => ( (2, (12)) )
+        [on] => ( (2, (13)) )
+        [my] => ( (2, (14)) )
+        [discuss] => ( (2, (15)) )
+        [board] => ( (2, (16)) )
+    )
+    </pre>
+    <p>The list associated with a term is called a <b>posting list</b>
+    and an entry in this list is called  a <b>posting</b>. Notice terms
+    are stemmed when put into the mini-inverted index.
+    Also, observe that acronyms, abbreviations, emails, and urls, such as
+    P&amp;A, will be manipulated before being put into the index. For
+    some Asian languages, such as Chinese, where spaces might not be placed
+    between words, char-gramming is done instead. If two character
+    char-gramming is used, the string:
+    您要不要吃? becomes 您要 要不 不要 要吃 吃? A user query 要不要 will,
+    before look-up, be converted to the conjunctive query 要不 不要 and so
+    would match a document containing 您要不要吃? Yioop can also be
+    <a href="?c=main&p=documentation#token_tool">configured to make use of a
+    Bloom filter</a> containing n-word grams for a language. This is typically
+    done for n-word grams coming from Wikipedia page titles. So, for example,
+    if a document had "Rolling Stones" beginning at position 7, this
+    would be recognized as an n-word gram in such a Bloom filter and
+    three terms would be extracted: [roll stone] at position 7, [roll] at
+    position 7, and [stone] at position 8. In this way, a query for just
+    roll will match this document, as will one for just stone. On the other
+    hand, a query for rolling stones will also match and will make use of
+    the position list for [roll stone], so only documents with these two
+    terms adjacent would be returned.
+    </p>
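+    <p>The term IDs described earlier are easy to compute. A sketch of the
+    calculation:</p>
+    <pre>
+    function termId($term)
+    {
+        $hash = md5($term, true); // 16 byte raw md5 digest
+        // XOR the two 8 byte halves together to get an 8 byte id
+        return substr($hash, 0, 8) ^ substr($hash, 8, 8);
+    }
+    </pre>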
+    <p>It should be recalled that links are treated as their own little
+    documents and so appear as separate documents when making the
+    mini-inverted index. The url of a link is what it points to, not the page
+    it is on. So the hostname of the machine that it points to might not be a
+    hostname handled by the queue server from which the schedule was
+    downloaded. In reality, the fetcher partitions link documents according
+    to the queue server that will handle each link, and builds separate
+    mini-inverted indexes for each queue server. After building mini-inverted
+    indexes, it sends to the queue server from which the schedule was
+    downloaded the inverted index data, summary data, host error data,
+    robots.txt data, and discovered links data that was destined for it. It
+    keeps in memory all the other inverted index data destined for other
+    machines. It will send this data to the appropriate queue servers later
+    -- the next time it downloads and processes data for those servers. To
+    make sure this scales, the fetcher checks its memory usage; if memory is
+    getting low, it might send some of this data to other queue servers
+    early.</p>
+
+    <h3 id='queue-servers'>Queue Servers and their Effect on Search Ranking</h3>
+
+    <p>It is back on a queue server that the building blocks for
+    the Doc Rank, Relevance and Proximity scores are assembled. To see
+    how this happens we continue to follow the flow of the data through
+    the web crawl process.
+    </p>
+    <p>
+    To communicate with a queue server, a fetcher posts data to the web app
+    of the queue server. The web app writes mini-inverted index and summary
+    data into a file in the WORK_DIRECTORY/schedules/IndexDataCRAWL_TIMESTAMP
+    folder. Similarly, robots.txt data from a batch of 400-500 pages
+    is written to WORK_DIRECTORY/schedules/RobotDataCRAWL_TIMESTAMP, and
+    "to crawl" urls are written to
+    WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP. The Queue Server
+    periodically checks these folders for new files to process. It is often
+    the case that files can be written to these folders faster than the
+    Queue Server can process them.
+    </p>
+    <p>A queue server consists of two separate sub-processes:</p>
+    <dl>
+    <dt>An Indexer</dt><dd>The indexer is responsible for reading IndexData
+    files and building a Yioop index.</dd>
+    <dt>A Scheduler</dt><dd>The scheduler maintains a priority queue of
+    what urls to download next. It is responsible for reading
+    ScheduleData files to update its priority queue, and it is
+    responsible for making sure that urls
+    forbidden by RobotData files do not enter the queue.</dd>
+    </dl>
+    <p>When the Indexer processes a schedules/IndexData file, it saves the
+    data in an IndexArchiveBundle (lib/index_archive_bundle.php). These
+    objects are serialized to folders with names of the form
+    WORK_DIRECTORY/cache/IndexDataCRAWL_TIMESTAMP. IndexArchiveBundles have
+    the following components:</p>
+    <dl>
+    <dt>summaries</dt><dd>This is a WebArchiveBundle folder containing
+    the summaries of pages read from fetcher-sent IndexData files.</dd>
+    <dt>posting_doc_shards</dt><dd>This contains a sequence of
+    inverted index files, shardNUM, called IndexShards. shardX holds the
+    postings lists for the Xth block of NUM_DOCS_PER_GENERATION many
+    summaries. NUM_DOCS_PER_GENERATION defaults to 50000 if the Queue Server
+    is on a machine with at least 1GB of memory. shardX also has postings for
+    the
+    link documents that were acquired while acquiring these summaries.</dd>
+    <dt>generation.txt</dt><dd>Contains a serialized PHP object which
+    records which shard is the active one -- the X such that shardX will
+    receive
+    newly acquired posting list data.</dd>
+    <dt>dictionary</dt><dd>The dictionary contains a sequence of subfolders
+    used to hold, for each term in a Yioop index, the offsets and lengths in
+    each IndexShard where the posting list for that term is stored.</dd>
+    </dl>
+    <p>Of these components, the posting_doc_shards are the most important
+    with regard to page scoring. When a schedules/IndexData file is read,
+    the mini-inverted index in it is appended to the active IndexShard.
+    To do this append, all the summary map offsets need to be adjusted so
+    they now point to locations at the end of the summary map of the
+    IndexShard. These offsets thus provide information about when a document
+    was indexed during the crawl process. The maximum number of links per
+    document is usually 50 for normal documents and 300 for sitemaps.
+    Empirically, it has been observed that a typical index shard has offsets
+    for around 24 times as many link summary maps as document summary maps.
+    So roughly, if a newly added summary or link has index <i>DOC_INDEX</i>
+    in the active shard, and the active shard is the GENERATIONth shard,
+    the newly added object will have
+    </p>
+    <p>
+    `mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
+            mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+    `\qquad  = (mbox(DOC_INDEX) + 1) + 25 times
+            mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+    </p>
+    <p>To make this a score out of 10, we can use logarithms:</p>
+    <p>`mbox(DOC_RANK) = 10 - log_(10)(mbox(RANK)).`</p>
+    <p>So this gives us a DOC_RANK for one link or summary item stored
+    in a Yioop index. However, as we will see, this does not give us the
+    complete value of DOC_RANK when computed at query time.</p>
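+    <p>As a concrete check of these formulas, the following sketch uses the
+    default NUM_DOCS_PER_GENERATION of 50000 and the empirical factor of 25
+    from above:</p>
+    <pre>
+    function docRank($doc_index, $generation, $num_docs_per_gen = 50000)
+    {
+        $rank = ($doc_index + 1) + 25 * $num_docs_per_gen * $generation;
+        return 10 - log10($rank);
+    }
+    echo docRank(9, 0); // 9, the tenth object in the first shard
+    echo docRank(9, 1); // about 3.9, objects in later shards rank much lower
+    </pre>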
+    <p>Index shards are important for determining relevance and proximity
+    scores as well. An index shard stores the number of docs seen, the
+    number of links seen, the sum of the lengths of all summaries, and the
+    sum of the lengths of all links. From these we can derive average
+    summary lengths and average link lengths. From a posting, the
+    number of occurrences of a term in a document can be calculated.
+    These will all be useful statistics when we compute relevance.
+    As we will see, when we compute relevance, we use the average values
+    obtained for the particular shard the summary occurs in as a proxy
+    for their values throughout all shards. The fact that a posting
+    contains a position list of the locations of a term within a
+    document will be used when we calculate proximity scores.</p>
+    <p>We next turn to the role of a Queue Server's Scheduler in
+    the computation of a page's Doc Rank.</p>
+    <p>Topics still to be covered in this section:</p>
+    <ul>
+    <li>How data is split amongst Fetchers, Queue Servers, and Name
+    Servers</li>
+    <li>Web versus archive crawl</li>
+    <li>The order in which something is crawled: OPIC or breadth-first</li>
+    <li>Company level domains</li>
+    <li>robots.txt crawl delay</li>
+    <li>Queue size in RAM; schedules on disk</li>
+    <li>Page range requests</li>
+    <li>Mimetype</li>
+    <li>Summary extraction: title, description, and link extraction
+    (what are the important elements on a page for HTML)</li>
+    <li>Page rules</li>
+    <li>Statistics come from mini-inverted indexes, not the whole crawl</li>
+    <li>Stemming or char-gramming</li>
+    <li>n-gram word filter</li>
+    <li>Special characters and acronyms</li>
+    </ul>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id='search'>Search Time Ranking Factors</h2>
+    <p>Topics still to be covered in this section:</p>
+    <ul>
+    <li>calculateControlWords (SearchController)</li>
+    <li>guessSemantics (PhraseModel)</li>
+    <li>Stemming and word-gramming</li>
+    <li>Special characters and acronyms</li>
+    <li>Network versus non-network queries</li>
+    <li>Grouping (links and documents) and deduplication</li>
+    <li>Conjunctive queries</li>
+    <li>Scores: BM25F, proximity, document rank</li>
+    <li>News articles, images, videos</li>
+    <li>How related queries work</li>
+    </ul>
+    <p><a href="#toc">Return to table of contents</a>.</p>
+    <h2 id="references">References</h2>
+    <dl>
+<dt id="APC2003">[APC2003]</dt>
+<dd>Serge Abiteboul and Mihai Preda and Gregory Cobena.
+<a href="http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf"
+>Adaptive on-line page importance computation</a>.
+In: Proceedings of the 12th international conference on World Wide Web.
+pp. 280-290. 2003.
+</dd>
+<dt id="CCB2009">[CCB2009]</dt>
+<dd>Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher.
+<a href="http://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf"
+>Reciprocal Rank Fusion outperforms Condorcet and
+individual Rank Learning Methods</a>. In:
+Proceedings of the 32nd Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. pp. 758-759. 2009.
+</dd>
+
+<dt id="LLWL2009">[LLWL2009]</dt>
+<dd>H.-T. Lee, D. Leonard, X. Wang, D. Loguinov.
+<a href="http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf"
+>IRLbot: Scaling to 6 Billion Pages and Beyond</a>.
+ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
+</dd>
+<dt id="VLZ2012">[VLZ2012]</dt>
+<dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
+<a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"
+>Learning to rank by aggregating expert preferences</a>.
+21st ACM International Conference on Information and Knowledge Management.
+pp. 843-851. 2012.
+</dd>
+</dl>
+
+    <p><a href="#toc">Return to table of contents</a>.</p>
+</div>
+<script type="text/javascript"
+   src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_HTMLorMML"></script>
+