Yioop Search Engine Ranking Mechanisms

Introduction

A typical query to Yioop is a collection of terms without the use of the OR operator, '|', or the use of the exact match operator, double quotes. On such a query, called a conjunctive query, Yioop tries to return documents which contain all of the query terms. If the sequence of words is particularly common, Yioop will try to return results which contain that sequence of words in the same order. Yioop further tries to return these documents in descending order of score. Most users only look at the first ten of the results returned. This article tries to explain the different factors which influence whether a page that has all the terms will make it into the top ten. To keep things simple we will assume that the query is being performed on a single Yioop index rather than a crawl mix of several indexes. We will also ignore how news feed items get incorporated into results.
At its heart, Yioop relies on three main scores for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox). Proximity scores are only used if the query has two or more terms. We will describe later how these three scores are calculated. For now, one can think of Doc Rank as roughly indicating how important the document as a whole is, Relevance as measuring how important the search terms are to the document, and Proximity as measuring how close the search terms appear to each other on the document. In addition to these three basic scores, when performing a crawl a user might select a classifier to be used for ranking purposes. After our initial discussion, we will say how we incorporate classifier scores.
On a given query, Yioop does not scan its posting lists in their entirety to find every document that satisfies the query. Instead, it scans until it finds a fixed number of documents, say `n`, satisfying the query or until a timeout is exceeded. In the case of a timeout, `n` is just the number of documents found by the timeout. It then computes the three scores for each of these `n` documents. For a document `d` from these `n` documents, it determines the rank of `d` with respect to the Doc Rank score, the rank of `d` with respect to the Relevance score, and the rank of `d` with respect to the Proximity score. It finally computes a score for each of these `n` documents using these three rankings and reciprocal rank fusion (RRF):
`mbox(RRF)(d) := 200(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
This formula essentially comes from Cormack et al. [ CCB2009 ]. They do not use the factor 200 and use 60 rather than 59. `mbox(RRF)(d)` is known to do a decent job of combining scores, although there are some recent techniques such as LambdaRank [ VLZ2012 ], which do significantly better at the expense of being harder to compute. To return results, Yioop computes the top ten of these n documents with respect to `mbox(RRF)(d)` and returns these documents.
It is relatively straightforward to extend the RRF(d) formula to handle scores coming from classifiers: One just adds additional reciprocal terms for each classifier score. For example, if `CL_1,...,CL_n` were the scores from the classifiers being used for ranking, then the formula would become:
`mbox(RRF)(d) := frac{200}{n+3}(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} +` `\qquad\qquad\qquad frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)} + sum_{i=1}^nfrac{1}{59 + mbox(Rank)_(mbox(CL)_i)(d)}).`
To get a feeling for how the `mbox(RRF)(d)` formula works, let's return to the non-classifiers case and consider some particular example situations: If a document ranked 1 with respect to each score, then `mbox(RRF)(d) = 200(3/(59+1)) = 10`. If a document ranked n for each score, then `mbox(RRF)(d) = 200(3/(59+n)) = 600/(59 + n)`. As `n -> infty`, this goes to `0`. A value `n = 200` is often used with Yioop. For this `n`, `600/(59 + n) approx 2.32`. If a document ranked 1 on one of the three scores, but ranked `n` on the other two, `mbox(RRF)(d) = 200/60 + 400/(59 +n) approx 3.33 + 400/(59 + n)`. The last term again goes to 0 as `n` gets larger, giving a maximum score of `3.33`. For the `n=200` case, one gets a score of `4.88`. So because the three component scores are converted to ranks, and then reciprocal rank fusion is used, one cannot solely use a good score on one of the three components to get a good score overall.
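To make the arithmetic above concrete, here is a minimal PHP sketch, not Yioop's actual implementation, of the non-classifier RRF computation; the function name and the example ranks are illustrative:
 <?php
 /**
  * Reciprocal rank fusion score as described above (non-classifier case).
  * $ranks holds the document's rank under Doc Rank, Relevance, and
  * Proximity. Illustrative sketch only; the name is ours, not Yioop's.
  */
 function reciprocalRankFusion(array $ranks)
 {
     $score = 0.0;
     foreach ($ranks as $rank) {
         $score += 1.0 / (59 + $rank);
     }
     return 200 * $score;
 }
 // Ranked first on all three scores: the maximum score of 10.
 echo reciprocalRankFusion([1, 1, 1]) . "\n";       // 10
 // Ranked 200th on all three scores (n = 200).
 echo reciprocalRankFusion([200, 200, 200]) . "\n"; // about 2.32
 // Ranked first on one score, 200th on the other two.
 echo reciprocalRankFusion([1, 200, 200]) . "\n";   // about 4.88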
An underlying assumption used by Yioop is that the first `n` matching documents in Yioop's posting lists contain the 10 most important documents with respect to our scoring function. For this assumption to be valid our posting lists must be roughly sorted according to score. For Yioop though, the first `n` documents will in fact most likely be the first `n` documents that Yioop indexed. This does not contradict the assumption provided we index documents roughly in order of their importance. To do this Yioop tries to index according to Doc Rank and assumes the effects of relevance and proximity are not too drastic. That is, they might be able to move the 100th document into the top 10, but not, say, the 1000th document into the top 10.
To see how it is possible to roughly index according to document importance, we next examine how data is acquired during a Yioop web crawl (the process for an archive crawl is somewhat different). This is not only important for determining the Doc Rank of a page, but the text extraction that occurs after the page is downloaded also affects the Relevance and Proximity scores. Once we are done describing these crawl/indexing time factors affecting scores, we will then consider search time factors which affect the scoring of documents and the actual formulas for Doc Rank, Relevance, and Proximity.

Crawl Time Ranking Factors

Crawl Processes

To understand how crawl and indexing time factors affect search ranking, let's begin by first fixing in our minds how a crawl works in Yioop. A Yioop crawl has three types of processes that play a role in this:
  1. A Name server, which acts as an overall coordinator for the crawl, and which is responsible for starting and stopping the crawl.
  2. One or more Queue Servers, each of which maintains a priority queue of what to download next.
  3. One or more Fetchers, which actually download pages, and do initial page processing.
A crawl is started through the Yioop Web app on the Name Server. For each url in the list of starting urls (Seed Sites), its hostname is computed, a hash of the hostname is computed, and based on this hash, that url is sent to a given queue server -- all urls with the same hostname will be handled by the same queue server. Fetchers periodically check the Name Server to see if there is an active crawl, and if so, what its timestamp is. If there is an active crawl, a Fetcher would then pick a Queue Server and request a schedule of urls to download. By default, this can be as many as DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.
Let's examine the fetcher's role in determining what terms get indexed, and hence, what documents can be retrieved using those terms. After receiving a schedule of urls, the fetcher downloads the corresponding pages in batches of a hundred at a time. When the fetcher requests a URL for download it sends a range request header asking for the first PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Only the data in these bytes has any chance of becoming terms which are indexed. The reason for choosing a fixed, relatively small size is so that one can index a large number of documents even with a relatively small amount of disk space. Some servers do not know how many bytes they will send before sending; they might operate in "chunked" mode. So after receiving the page, the fetcher discards any data after the first PAGE_RANGE_REQUEST many bytes -- this data won't be indexed. Constants that we mention such as PAGE_RANGE_REQUEST can be found in configs/config.php. This particular constant can actually be set from the admin panel under Page Options - Crawl Time. For each page in the batch of a hundred urls downloaded, the fetcher proceeds through a sequence of processing steps to:
  1. Determine page mimetype and choose a page processor (the source code for page processors is in src/library/processors).
  2. Use the page processor to extract a summary for the document.
  3. Apply any indexing plugins for the page processor to generate auxiliary summaries and/or modify the extracted summary.
  4. Run classifiers on the summary and add any class labels and rank scores.
  5. Calculate a hash from the downloaded page minus tags and non-word characters to be used for deduplication.
  6. Prune the number of links extracted from the document down to MAX_LINKS_PER_PAGE (defaults to 50).
  7. Apply any user-defined page rules to the summary extracted.
  8. Store a full cache of the page to disk and add the location of the full cache to the summary. Full cache pages are stored in folders in WORK_DIRECTORY/cache/FETCHER_PREFIX-ArchiveCRAWL_TIMESTAMP. These folders contain gzipped text files, web archives, each made up of the concatenation of up to NUM_DOCS_PER_GENERATION many cache pages. The class representing this whole structure is called a WebArchiveBundle (src/library/WebArchiveBundle.php). The class for a single file is called a WebArchive (src/library/WebArchive.php).
  9. Keep summaries in fetcher memory until they are shipped off to the appropriate queue server in a process we'll describe later.
After these steps, the fetcher checks the name server to see if any crawl parameters have changed or if the crawl has stopped before proceeding to download the next batch of a hundred urls. It proceeds in this fashion until it has downloaded and processed four to five hundred urls. It then builds a "mini-inverted index" of the documents it has downloaded and sends the inverted index, the summaries, any discovered urls, and any robots.txt data it has downloaded back to the queue server. It also sends back information on which hosts, among those the queue server is responsible for, are generating more than DOWNLOAD_ERROR_THRESHOLD (10) HTTP errors in a given schedule. These hosts will automatically be crawl-delayed by the queue server. Sending all of this data allows the fetcher to clear some of its memory and continue processing its batch of 5000 urls until it has downloaded all of them. At this point, the fetcher picks another queue server and requests a schedule of urls to download from it and so on.
Page rules, which can greatly affect the summary extracted for a page, are described in more detail in the Page Options Section of the Yioop documentation. Before describing how the "mini-inverted index" processing step is done, let's examine Steps 1, 2, and 6 above in a little more detail as they are very important in determining what actually is indexed. Based usually on the HTTP headers, a mimetype for each page is found. The mimetype determines which summary extraction processor, in Yioop terminology a page processor, is applied to the page. As an example of the key role that the page processor plays in what eventually ends up in a Yioop index, we list what the HTML page processor extracts from a page and how it does this extraction:
Language
Document language is used to determine how to make terms from the words in a document. For example, if the language is English, Yioop uses the English stemmer on a document. So the word "jumping" in the document will get indexed as "jump". On the other hand, if the language was determined to be Italian then a different stemmer would be used and "jumping" would remain "jumping". The HTML processor determines the language by first looking for a lang attribute on the <html> tag in the document. If none is found it checks if the frequency of characters is close enough to English to guess the document is English. If this fails it leaves the value blank.
Title
When search results are displayed, the extracted document title is used as the link text. Words in the title also are given a higher value when Yioop calculates its relevance statistic. The HTML processor uses the contents of the <title> tag as its default title. If this tag is not present or is empty, Yioop then concatenates the contents of the <h1> to <h6> tags in the document. The HTML processor keeps only the first hundred (HtmlProcessor::MAX_TITLE_LEN) characters of the title.
Description
The description is used when search results are displayed to generate the snippets beneath the result link. Besides the title, it contains the remaining words from the page that are used to identify a document. The HTML processor can obtain a description using one of two algorithms that can be set in page options. When using the basic summarizer, it first takes the value of the content attribute of any <meta> tag whose name attribute is some case-insensitive variant of "description". To this it concatenates the non-tag contents of the first four <p> and <div> tags, followed by the content of <td>, <li>, <dt>, <dd>, and <a> tags, until it reaches a maximum of HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters. These items are added from the one with the most characters to the one with the least. The HTML processor can also obtain a description using a centroid summarizer (a simplified sketch of this summarizer is given after this list of extracted fields). Here it removes all tags from the document and splits the document into sentences. Ignoring common words (stop words), an average sentence vector is calculated. The components of this vector are terms and the value for a component represents the likelihood that a sentence in this document has that term. Then the distance between each sentence and this centroid is calculated and the closest sentences are added to the summary one by one until HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters have been reached.
Links
Links are used by Yioop to obtain new pages to download. They are also treated by Yioop as "mini-documents". The url of such a mini-document is the target website of the link, the link text is used as a description. As we will see during searching, these mini-documents get combined with the summary of the site linked to. The HTML processor extracts links from <a>, <frame>, <iframe>, and <img> tags. It extracts up to 300 links per document. When it extracts links it canonicalizes relative links. If a <base> tag was present, it uses it as part of the canonicalization process. Link text is extracted from <a> tag contents and from alt attributes of <img>'s. In addition, rel attributes are examined for robot directives such as nofollow.
Robot Metas
This is used to keep track of any robot directives that occurred in meta tags in the document. These directives are things such as NOFOLLOW, NOINDEX, NOARCHIVE, and NOSNIPPET. These can affect what links are extracted from the page, whether the page is indexed, whether cached versions of the page will be displayable from the Yioop interface, and whether snippets can appear beneath the link on a search result page. The HTML processor does a case insensitive match on <meta> tags that contain the string "robot" (so it will treat such tags that contain robot and robots the same). It then extracts the directives from the content attribute of such a tag.
The page processors for other mimetypes extract similar fields but look at different components of their respective document types.
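Since the centroid summarizer description above is fairly dense, here is a simplified PHP sketch of the idea. This is not Yioop's code: the stop word list, the sentence splitting, and the distance measure below are illustrative stand-ins; only the 2000 character cap mirrors HtmlProcessor::MAX_DESCRIPTION_LEN.
 <?php
 /** Simplified centroid summarizer sketch: pick the sentences closest to
  *  the average sentence vector until a length cap is reached. */
 function centroidSummary($html, $maxLen = 2000, $stopWords = ["the", "a", "of"])
 {
     $text = strip_tags($html);
     $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
     $vectors = [];  // binary term vector for each sentence
     $centroid = []; // likelihood a sentence contains each term
     foreach ($sentences as $i => $sentence) {
         $terms = array_diff(preg_split('/\W+/u', mb_strtolower($sentence), -1,
             PREG_SPLIT_NO_EMPTY), $stopWords);
         foreach (array_unique($terms) as $term) {
             $vectors[$i][$term] = 1;
             $centroid[$term] = ($centroid[$term] ?? 0) + 1;
         }
     }
     $numSentences = max(1, count($sentences));
     foreach ($centroid as $term => $count) {
         $centroid[$term] = $count / $numSentences;
     }
     // Score each sentence by its distance to the centroid (L1 here)
     $scores = [];
     foreach ($sentences as $i => $sentence) {
         $distance = 0;
         foreach ($centroid as $term => $weight) {
             $distance += abs(($vectors[$i][$term] ?? 0) - $weight);
         }
         $scores[$i] = $distance;
     }
     asort($scores); // closest sentences first
     $summary = "";
     foreach ($scores as $i => $distance) {
         if (strlen($summary) + strlen($sentences[$i]) > $maxLen) { break; }
         $summary .= $sentences[$i] . " ";
     }
     return trim($summary);
 }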
After the page processor is done with a page, pages which aren't robots.txt pages and also aren't sitemap pages then pass through a pruneLinks method. This culls the up to 300 links that might have been extracted down to 50. To do this, for each link, the link text is gzipped and the length of the resulting string is determined. The 50 unique links with the longest compressed length are then kept. The idea is that we want to keep links whose text carries the most information. Gzipping is a crude way to eliminate text with lots of redundancies. The compressed length then measures how much useful text is left. Having more useful text means that the link is more likely to be helpful in finding the document.
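The gzip-length heuristic is simple to express. Below is a minimal PHP sketch of the idea; the function name and the url => link text input format are ours, not Yioop's:
 <?php
 /** Keep the $max links whose link text compresses to the longest strings,
  *  a rough proxy for how much non-redundant information the text carries. */
 function pruneLinks(array $links, $max = 50)
 {
     // $links maps url => link text
     $informationLength = [];
     foreach ($links as $url => $text) {
         $informationLength[$url] = strlen(gzcompress($text));
     }
     arsort($informationLength); // longest compressed length first
     $kept = array_slice(array_keys($informationLength), 0, $max);
     return array_intersect_key($links, array_flip($kept));
 }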
Now that we have finished discussing Steps 1, 2, and 6, let's describe what happens when building a mini-inverted index. For the four to five hundred summaries that we have at the start of the mini-inverted index step, we make associative arrays of the form:
 term_id_1 => ...
 term_id_2 => ...
 ...
 term_id_i =>
         ((summary_map_1, (positions in summary 1 that term i appeared) ),
          (summary_map_2, (positions in summary 2 that term i appeared) ),
           ...)
 ...
Term IDs are 20 byte strings. Terms might represent a single word or might represent phrases. The first 8 bytes of a term ID is the first 8 bytes of the md5 hash of the first word in the word or phrase. The next byte is used to indicate whether the term is a word or a phrase. If it is a word, the remaining bytes are used to encode what kind of page the word occurs on (media:text, media:image, ... safe:true, safe:false, and some classifier labels if relevant). If it is a phrase, the remaining bytes encode various length hashes of the remaining words in the phrase. Summary map numbers are offsets into a table which can be used to look up a summary. These numbers are in increasing order of when the page was put into the mini-inverted index. To calculate the position of a term, the summary is viewed as a single string consisting of words extracted from the url, concatenated with the summary title, concatenated with the summary description. One counts the number of words from the start of this string. Phrases start at the position of their first word. Let's consider the case where we only have words and no phrases and we are ignoring meta word info such as media: and safe:. Then suppose we had two summaries:
 Summary 1:
 URL: http://test.yioop.com/
 Title: Fox Story
 Description: The quick brown fox jumped over the lazy dog.
 Summary 2:
 URL: http://test.yioop2.com/
 Title: Troll Story
 Description: Once there was a lazy troll, P&A, who lived on my
     discussion board.
The mini-inverted index might look like:
 (
     [test] => ( (1, (0)), (2, (0)) )
     [yioop] =>  ( (1, (1)) )
     [yioop2] =>  ( (2, (1)) )
     [fox] => ( (1, (2, 7)) )
     [stori] => ( (1, (3)), (2, (3)) )
     [the] => ( (1, (4, 10)) )
     [quick] => ( (1, (5)) )
     [brown] => ( (1, (6)) )
     [jump] => ( (1, (8)) )
     [over] => ( (1, (9)) )
     [lazi] => ( (1, (11)), (2, (8)) )
     [dog] => ( (1, (12)) )
     [troll] => ( (2, (2, 9)) )
     [onc] => ( (2, (4)) )
     [there] => ( (2, (5)) )
     [wa] => ( (2, (6)) )
     [a] => ( (2, (7))) )
     [p_and_a] => ( (2, (10)) )
     [who] => ( (2, (11)) )
     [live] => ( (2, (12)) )
     [on] => ( (2, (13)) )
     [my] => ( (2, (14)) )
     [discuss] => ( (2, (15)) )
     [board] => ( (2, (16)) )
 )
The list associated with a term is called a posting list and an entry in this list is called a posting. Notice terms are stemmed when put into the mini-inverted index. Also, observe acronyms, abbreviations, emails, and urls, such as P&A, will be manipulated before being put into the index. For some languages such as Japanese where spaces might not be placed between words, char-gramming is done instead. If two character char-gramming is used, the string: 源氏物語 (Tale of Genji) becomes 源氏 氏物 物語. A user query 源氏物 will, before look-up, be converted to the conjunctive query 源氏 氏物 and so would match a document containing 源氏物語.
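Ignoring term IDs, meta words, phrases, and stemming, the structure above can be built in a few lines of PHP. The following sketch (ours, not the fetcher's actual code) assumes summaries have already been stemmed and tokenized into position-ordered term lists:
 <?php
 /** Build a mini-inverted index: term => list of (summary map number,
  *  positions where the term occurs). Input maps summary map numbers to
  *  arrays of already stemmed terms, in the order url words, then title
  *  words, then description words. */
 function buildMiniInvertedIndex(array $tokenizedSummaries)
 {
     $index = [];
     foreach ($tokenizedSummaries as $summaryMapNumber => $terms) {
         $positions = [];
         foreach ($terms as $position => $term) {
             $positions[$term][] = $position;
         }
         foreach ($positions as $term => $positionList) {
             $index[$term][] = [$summaryMapNumber, $positionList];
         }
     }
     return $index;
 }
 $index = buildMiniInvertedIndex([
     1 => ["test", "yioop", "fox", "stori", "the", "quick", "brown", "fox",
         "jump", "over", "the", "lazi", "dog"],
     2 => ["test", "yioop2", "troll", "stori", "onc", "there", "wa", "a",
         "lazi", "troll", "p_and_a", "who", "live", "on", "my", "discuss",
         "board"],
 ]);
 // $index["fox"] is ( (1, (2, 7)) ); $index["lazi"] is ( (1, (11)), (2, (8)) )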
The effect of the meta word portion of a term ID in the single word term case is to split the space of documents containing a word like "dog" into disjoint subsets. This can be used to speed up queries like "dog media:image" or "dog media:video". The media tag for a page can only be one of media:text, media:image, or media:video; it can't be more than one. A query of just "dog" will actually be calculated as a disjoint union over the fixed, finitely many single word term IDs which begin with the same 8 byte hash as "dog". A query of "dog media:image" will look up all term IDs with the same "dog" hash and "media:image" hash portions of the term ID. These term IDs will correspond to disjoint sets of documents which are processed in order of doc offset.
In the worst case, doing a conjunctive query takes time proportional to the length of the shortest posting list involved. To try to get a better guarantee on the runtime of queries, Yioop uses term IDs for phrases to speed up multi-word queries. On a query like "earthquake soccer", Yioop uses these term IDs to see how many documents have this exact phrase. If this is greater than a threshold (10), Yioop just does an exact phrase look up using these term IDs. If the number of query words is greater than three, Yioop always uses this mechanism to do look up. If the threshold is not met, Yioop checks if the threshold is met by all but the last word, or by all but the first word. If so, it does the simpler conjunctive query of that phrase plus the remaining single word.
Yioop does not store phrase term IDs for every phrase it has ever found on some document in its index. Instead, it follows the basic approach of [ PTSHVC2011 ]. The main difference is that it stores data directly in its inverted index rather than their two ID approach. To get the idea of this approach, consider the stemmed document:
 jack be nimbl jack be quick jack jump the candlestick
The words that immediately follow each occurrence of "jack be" (nimbl, quick) in this document are not all the same. Phrases with this property are called maximal. The whole document "jack be nimbl jack be quick jack jump the candlestick" is also maximal and there is no prefix of it larger than "jack be" which is maximal. We would call this string conditionally maximal for "jack be". When processing a document, Yioop builds a suffix tree for it in linear time using Ukkonen's algorithm [ U1995 ]. It uses this tree to quickly build a list of maximal phrases of up to 12 words and any prefixes for which they are conditionally maximal. Only such maximal phrases will be given term IDs and stored in the index. The term ID for such a phrase begins with the 8 byte hash of the prefix for which it is maximal. This is followed by hashes of various lengths for the remaining terms. The format used is specified in the documentation of utility.php's crawlHashPath function. To do an exact lookup of a phrase like "jack be nimbl", it suffices to look up phrase term IDs whose first 8 bytes are the hash of "jack", "jack be", or "jack be nimbl". Yioop only uses phrase term IDs for lookup of documents, not for calculations like proximity, where it uses the actual words that make up the phrase to get a score.
It should be recalled that links are treated as their own little documents and so will be treated as separate documents when making the mini-inverted index. The url of a link is what it points to, not the page it is on. So the hostname of the machine that it points to might not be a hostname handled by the queue server from which the schedule was downloaded. In reality, the fetcher partitions link documents according to the queue server that will handle each link, and builds separate mini-inverted indexes for each queue server. After building mini-inverted indexes, it sends to the queue server from which the schedule was downloaded the inverted index data, summary data, host error data, robots.txt data, and discovered links data destined for it. It keeps in memory all the other inverted index data destined for other machines. It will send this data to the appropriate queue servers later -- the next time it downloads and processes data for those servers. To make sure this scales, the fetcher checks its memory usage; if free memory is getting low, it might send some of this data to other queue servers early.
It is back on a queue server that the building blocks for the Doc Rank, Relevance and Proximity scores are assembled. To see how this happens we continue to follow the flow of the data through the web crawl process.
To communicate with a queue server, a fetcher posts data to the web app of the queue server. The web app writes mini-inverted index and summary data into a file in the WORK_DIRECTORY/schedules/IndexDataCRAWL_TIMESTAMP folder. Similarly, robots.txt data from a batch of 400-500 pages is written to WORK_DIRECTORY/schedules/RobotDataCRAWL_TIMESTAMP, and "to crawl" urls are written to WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP. The Queue Server periodically checks these folders for new files to process. It is often the case that files can be written to these folders faster than the Queue Server can process them.
A queue server consists of two separate sub-processes:
An Indexer
The indexer is responsible for reading Index Data files and building a Yioop index.
A Scheduler
The scheduler maintains a priority queue of what urls to download next. It is responsible for reading ScheduleData files to update its priority queue and for making sure that urls forbidden by RobotData files do not enter the queue.
When the Indexer processes a schedule IndexData file, it saves the data in an IndexArchiveBundle (lib/index_archive_bundle). These objects are serialized to folders with names of the form: WORK_DIRECTORY/cache/IndexDataCRAWL_TIMESTAMP . IndexArchiveBundle's have the following components:
summaries
This is a WebArchiveBundle folder containing the summaries of pages read from fetcher-sent IndexData files.
posting_doc_shards
This contains a sequence of inverted index files, shardNUM, called IndexShards. shardX holds the posting lists for the Xth block of NUM_DOCS_PER_GENERATION many summaries. NUM_DOCS_PER_GENERATION defaults to 40000 if the queue server is on a machine with at least 2Gb of memory. shardX also has postings for the link documents that were acquired while acquiring these summaries.
generation.txt
Contains a serialized PHP object which says what is the active shard -- the X such that shardX will receive newly acquired posting list data.
dictionary
The dictionary contains a sequence of subfolders used to hold, for each term in a Yioop index, the offset and length in each IndexShard where the posting list for that term is stored.
Of these components, posting_doc_shards are the most important with regard to page scoring. When a schedules/IndexData file is read, the mini-inverted index in it is appended to the active IndexShard. To do this append, all the summary map offsets need to be adjusted so they now point to locations at the end of the summary map of the IndexShard to which data is being appended. These offsets thus provide information about when a document was indexed during the crawl process. The maximum number of links per document is usually 50 for normal documents and 300 for sitemaps. Empirically, it has been observed that a typical index shard has around 24 times as many link summary map entries as document summary map entries. So roughly, if a newly added summary or link, d, has index DOC_INDEX(d) in the active shard, and the active shard is the GENERATION(d) shard, the newly added object will have
\begin{eqnarray} \mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) + (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\ &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\ &=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times \\ &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d) \end{eqnarray}
To make this a score out of `10`, we can use logarithms:
`mbox(DR)(d) = 10 - log_(10)(mbox(RANK)(d)).`
Here `mbox(DR)(d)` is the Doc Rank for one link or summary item stored in a Yioop index. However, as we will see, this does not give us the complete value of Doc Rank for an item when computed at query time. There are also some things to note about this formula (a small numeric sketch of it follows these notes):
  1. Unlike PageRank [ BP1998 ], it is not some kind of logarithm of a probability; it is the logarithm of a rank. A log probability would preserve information about the relative importance of two pages. I.e., it could say something about how far apart in importance the number 1 page is from the number 2 page. Doc Rank as measured so far does not do that.
  2. The Doc Rank is a positive number less than 10 provided the index of the given queue server has fewer than 10 billion items. Since to index 10 billion items using Yioop you would probably want multiple queue servers, Doc Ranks likely remain positive for larger indexes.
  3. If we imagined that Yioop indexed the web as a balanced tree starting from some seed node, where RANK(i) labels the node i of the tree enumerated level-wise, then `log_(25)(mbox(RANK)(d)) = log_(10)(mbox(RANK)(d))/log_(10)(25)` would be an estimate of the depth of a node in this tree. So Doc Rank can be viewed as an estimate of how far we are away from the root, with 10 being at the root.
  4. Doc Rank is computed by different queue servers independent of each other for the same index. So it is possible for two summaries to have the same Doc Rank in the same index, provided they are stored on different queue servers.
  5. For Doc Ranks to be comparable with each other for the same index on different queue servers, it is assumed that queue servers are indexing at roughly the same speed.
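As a quick sanity check of the RANK and DR formulas above, here is a small PHP sketch (the function name is ours) using the default NUM_DOCS_PER_GENERATION of 40000 and the empirically observed AVG_LINKS_PER_PAGE of 24:
 <?php
 /** Crawl-time Doc Rank of an item from its position in the index, as in
  *  the formula above; AVG_LINKS_PER_PAGE + 1 = 25. */
 function docRank($docIndex, $generation, $numDocsPerGeneration = 40000)
 {
     $rank = ($docIndex + 1) + 25 * $numDocsPerGeneration * $generation;
     return 10 - log10($rank);
 }
 echo docRank(0, 0) . "\n";  // first item indexed: DR = 10
 echo docRank(99, 0) . "\n"; // 100th item of generation 0: 10 - log10(100) = 8
 echo docRank(0, 3) . "\n";  // first item of generation 3: about 3.52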
Besides Doc Rank, index shards are important for determining relevance and proximity scores as well. An index shard stores the number of summaries seen, the number of links seen, the sum of the lengths of all summaries, and the sum of the lengths of all links. From these statistics, we can derive average summary lengths and average link lengths. From a posting, the number of occurrences of a term in a document can be calculated. These will all be useful statistics when we compute relevance. As we will see, when we compute relevance, we use the average values obtained for the particular shard the summary occurs in as a proxy for their value throughout all shards. The fact that a posting contains a position list of the locations of a term within a document will be used when we calculate proximity scores.
We next turn to the role of a queue server's Scheduler process in the computation of a page's Doc Rank. One easy way, which is supported by Yioop, for a Scheduler to determine what to crawl next is to use a simple queue. This would yield roughly a breadth-first traversal of the web starting from the seed sites. Since high quality pages are often a small number of hops from any page on the web, there is some evidence [ NW2001 ] that this lazy strategy is not too bad for crawling according to document importance. However, there are better strategies. When Page Importance is chosen in the Crawl Order dropdown for a Yioop crawl, the Scheduler on each queue server works harder to make schedules so that the next pages to crawl are always the most important pages not yet seen.
One well-known algorithm for doing this kind of scheduling is called OPIC (Online Page Importance Computation) [ APC2003 ]. The idea of OPIC is that at the start of a crawl one divides up an initial dollar of cash equally among the starting seed sites. One then picks a site with highest cash value to crawl next. If this site had `alpha` cash value, then when we crawl it and extract links, we divide up the cash and give it equally to each link. So if there were `n` links, each link would receive from the site `alpha/n` cash. Some of these sites might already have been in the queue in which case we add to their cash total. For URLs not in the queue, we add them to the queue with initial value `alpha/n`. Each site has two scores: Its current cash on hand, and the total earnings the site has ever received. When a page is crawled, its cash on hand is reset to `0`. We always choose as the next page to crawl from amongst the pages with the most cash (there might be ties). OPIC can be used to get an estimate of the importance of a page, by taking its total earnings and dividing it by the total earnings received by all pages in the course of a crawl.
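A toy PHP sketch of the plain OPIC bookkeeping just described (before Yioop's modifications, which follow below) might look like this; the class and method names are ours, not Yioop's:
 <?php
 /** Toy version of plain OPIC: each queued url has cash on hand and total
  *  earnings; the next url crawled is one with the most cash, and its cash
  *  is split equally among its out-links. */
 class ToyOpicQueue
 {
     public $cash = [];
     public $earnings = [];
     public function __construct(array $seedUrls)
     {
         foreach ($seedUrls as $url) {
             // one dollar of cash split equally among the seed sites
             $this->cash[$url] = 1.0 / count($seedUrls);
             $this->earnings[$url] = $this->cash[$url];
         }
     }
     public function nextUrl()
     {
         arsort($this->cash); // a url with the most cash comes first
         return array_key_first($this->cash);
     }
     public function crawled($url, array $outLinks)
     {
         $share = count($outLinks) ? $this->cash[$url] / count($outLinks) : 0;
         $this->cash[$url] = 0.0; // cash on hand resets when a page is crawled
         foreach ($outLinks as $link) {
             $this->cash[$link] = ($this->cash[$link] ?? 0.0) + $share;
             $this->earnings[$link] = ($this->earnings[$link] ?? 0.0) + $share;
         }
     }
 }
 // Importance estimate of a url: $q->earnings[$url] / array_sum($q->earnings)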
In the experiments conducted in the original paper, OPIC was shown to crawl in a better approximation to page rank order than breadth-first search. Bidoki and Yazdani [ BY2008 ] have more recently proposed a new page importance measure, DistanceRank; they also confirm that OPIC does better than breadth-first, but show that the computationally more expensive Partial PageRank and Partial DistanceRank perform even better. Yioop uses a modified version of OPIC to choose which page to crawl next.
To save a fair bit of crawling overhead, Yioop does not keep for each site crawled historical totals of all earnings a page has received. The cash-based approach is only used for scheduling. Here are some of the issues addressed in the OPIC-based algorithm employed by Yioop:
  1. A Scheduler must ensure robots.txt files are crawled before any other page on the host. To do this, robots.txt files are inserted into the queue before any page from that site. Until the robots.txt file for a page is crawled, the robots.txt file receives cash whenever a page on that host receives cash.
  2. A fraction `alpha` of the cash that a robots.txt file receives is divided amongst any sitemap links on that page. Not all of the cash is given. This is to prevent sitemaps from "swamping" the queue. Currently, `alpha` is set to 0.25. Nevertheless, the fact that we do share some cash here, together with the previous point about robots.txt files receiving cash, means cash totals no longer sum to one.
  3. Cash might go missing for several reasons: (a) An image page, or any other page with no outgoing links, might be downloaded. (b) A page might receive cash and later the Scheduler receives robots.txt information saying it cannot be crawled. (c) Round-off errors due to floating point precision. For these reasons, the Scheduler periodically renormalizes the total amount of cash.
  4. A robots.txt file or a slow host might cause the Scheduler to crawl-delay all the pages on the host. These pages might receive sufficient cash to be scheduled earlier, but won't be, because there must be a minimum time gap between requests to that host.
  5. When a schedule is made with a crawl-delayed host, URLs from that host cannot be scheduled until the fetcher that was processing them completes its schedule. If a Scheduler receives a "to crawl" url from a crawl-delayed host, and there are already MAX_WAITING_HOSTS many crawl-delayed hosts in the queue, then Yioop discards the url.
  6. The Scheduler has a maximum, in-memory queue size based on NUM_URLS_QUEUE_RAM (320,000 urls in a 2Gb memory configuration). It will wait on reading new "to crawl" schedule files from fetchers if reading in the file would mean going over this count. For a typical web crawl this means the "to crawl" files build up much like a breadth-first queue on disk.
  7. To make a schedule, the Scheduler starts processing the queue from highest priority to lowest. The up to 5000 urls in the schedule are split into slots of 100, where each slot of 100 will be required by the fetcher to take a MINIMUM_FETCH_LOOP_TIME (5 seconds). Urls are inserted into the schedule at the earliest available position. If a URL is crawl-delayed, it is inserted at the earliest position in a slot sufficiently far from any previous url for that host to ensure that the crawl-delay condition is met.
  8. If a Scheduler's queue is full, yet after going through all of the url's in the queue it cannot find any to write to a schedule, it goes into a reset mode. It dumps its current urls back to schedule files, starts with a fresh queue (but preserving robots.txt info) and starts reading in schedule files. This can happen if too many urls of crawl-delayed sites start clogging a queue.
The actual giving of a page's cash to its urls is done in the Fetcher. We discuss it in the section on the queue server because it directly affects the order of queue processing. Cash distribution in Yioop's algorithm is done in a different manner than in the OPIC paper. It is further handled differently for sitemap pages versus all other web pages. For a sitemap page with `n` links, let
`gamma = sum_(j=1)^n 1/j^2`.
Let `C` denote the cash that the sitemap has to distribute. Then the `i`th link on the sitemap page receives cash
`C_i = C/(gamma cdot i^2)`.
One can verify that `sum_(i=1)^n C_i = C`. This weighting tends to favor links early in the sitemap and prevent crawling of sitemap links from clustering together too much. For a non-sitemap page, we split the cash by making use of the notion of a company level domain (cld). This is a slight simplification of the notion of a pay level domain (pld) defined in [ LLWL2009 ]. For a host of the form something.2chars.2chars or blah.something.2chars.2chars, the company level domain is something.2chars.2chars. For example, for www.yahoo.co.uk, the company level domain is yahoo.co.uk. For any other url, stuff.2ndlevel.tld, the company level domain is 2ndlevel.tld. For example, for www.yahoo.com, the company level domain is yahoo.com. To distribute cash to links on a page, we first compute the company level domain for the hostname of the url of the page, then for each link we compute its company level domain. Let `n` denote the number of links on the page and let `s` denote the number of links with the same company level domain as the page. If the cld of a link is the same as that of the page, and the page had cash `C`, then the link will receive cash:
`frac{C}{2n}`
Notice this is half what it would get under usual OPIC. On the other hand, links to a different cld will receive cash:
`frac{C - s times C/(2n)}{n-s}`
The idea is to avoid link farms with a lot of internal links. As long as there is at least one link to a different cld, the payout of a page to its links will sum to `C`. If no links go out of the cld, then cash will be lost. In the case where someone is deliberately doing a crawl of only one site, this lost cash will get replaced during normalization, and the above scheme essentially reduces to usual OPIC.
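To make the split concrete, here is a PHP sketch of the cash distribution for a non-sitemap page. The company level domain extraction below is a simplification, and the function names are ours, not Yioop's:
 <?php
 /** Split a page's cash among its links, halving the share of links that
  *  stay within the page's company level domain (cld), as described above. */
 function companyLevelDomain($url)
 {
     $host = parse_url($url, PHP_URL_HOST);
     $parts = explode(".", $host);
     $count = count($parts);
     // something.2chars.2chars style hosts keep the last three labels
     if ($count >= 3 && strlen($parts[$count - 1]) == 2 &&
         strlen($parts[$count - 2]) == 2) {
         return implode(".", array_slice($parts, -3));
     }
     return implode(".", array_slice($parts, -2)); // 2ndlevel.tld
 }
 function distributeCash($pageUrl, array $linkUrls, $cash)
 {
     $pageCld = companyLevelDomain($pageUrl);
     $n = count($linkUrls);
     $sameCld = [];
     foreach ($linkUrls as $link) {
         $sameCld[$link] = (companyLevelDomain($link) == $pageCld);
     }
     $s = count(array_filter($sameCld));
     $out = [];
     foreach ($linkUrls as $link) {
         if ($sameCld[$link]) {
             $out[$link] = $cash / (2 * $n); // internal link: half share
         } else if ($n > $s) {
             $out[$link] = ($cash - $s * $cash / (2 * $n)) / ($n - $s);
         }
     }
     return $out; // sums to $cash whenever at least one link leaves the cld
 }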
We conclude this section by mentioning that the Scheduler only affects when a URL is written to a schedule which will then be used by a fetcher. It is entirely possible that two fetchers get consecutive schedules from the same Scheduler and return data to the Indexers not in the order in which the urls were scheduled. In that case, the urls would be indexed out of order and their Doc Ranks would not be in the order in which they were scheduled. The scheduling and indexing process is only approximately correct; we rely on query time manipulations to try to improve the accuracy.

Search Time Ranking Factors

We are at last in a position to describe how Yioop calculates the three scores Doc Rank, Relevance, and Proximity at query time. When a query comes into Yioop it goes through the following stages before an actual look up is performed against an index.
  1. Control words are calculated. Control words are terms like m: or i: which can be used to select a mix or index to use. They are also commands like raw:, which says what level of grouping to use, or no: commands, which say not to use a standard processing technique. For example, no:guess (affects whether the next processing step is done), no:network, etc. For the remainder, we will assume the query does not contain control words.
  2. An attempt is made to guess the semantics of the query. This matches keywords in the query and rewrites them to other query terms. For example, a query term which is in the form of a domain name, will be rewritten to the form of a meta word, site:domain. So the query will return only pages from the domain. Currently, this processing is in a nascent stage. As another example, if you do a search only on "D", it will rewrite the search to be "letter D".
  3. Stemming or character `n`-gramming is done on the query and acronyms and abbreviations are rewritten. This is the same kind of operation that we did after generating summaries to extract terms.
After going through the above steps, Yioop builds an iterator object from the resulting terms to iterate over summaries and link entries that contain all of the terms. The source code for Yioop's iterators can be found in src/library/index_bundle_iterators. As described in the section Fetchers and their Effect on Search Ranking, some or all of these terms might be whole phrases to reduce the need for computing expensive conjunctive queries. In the single queue server setting, one iterator would be built for each term and these iterators would be added to an intersect iterator that returns documents on which all the terms appear. This intersect iterator has a timer associated with it to prevent it from running too long in the case of a conjunctive query of terms with long posting lists but a small intersection. These iterators are then fed into a grouping iterator, which groups links and summaries that refer to the same document url. Recall that after downloading pages on the fetcher, we calculated a hash from the downloaded page minus tags. Documents with the same hash are also grouped together by the group iterator. The value `n=200` posting list entries that Yioop scans out on a query, referred to in the introduction, is actually the number of results the group iterator requests before grouping. This number can be controlled from the Yioop admin pages under Page Options > Search Time > Minimum Results to Group. The number `200` was chosen because on a single machine it was found to give decent results without the queries taking too long.
In the multiple queue server setting, when the query comes in to the name server, a network iterator is built. This iterator poses the query to each queue server being administered by the name server. If `n=200`, the name server multiplies this value by the value Page Options > Search Time > Server Alpha, which we'll denote `alpha`. This defaults to 1.6, so the total is 320. It then divides this by the number of queue servers. So if there were 4 queue servers, one would have 80. It then requests the first 80 results for the query from each queue server. The queue servers don't do grouping, but just send the results of their intersect iterators to the name server, which does the grouping.
In both the networked and non-networked case, after the grouping phase Doc Rank, Relevance, and Proximity scores for each of the grouped results will have been determined. We then combine these three scores into a single score using the reciprocal rank fusion technique described in the introduction. Results are then sorted in descending order of score and output. What we have left to describe is how the scores are calculated in the various iterators mentioned above.
To fix an example to describe this process, suppose we have a group `G'` of items `i_j'`, either pages or links that all refer to the same url. A page in this group means that at some point we downloaded the url and extracted a summary. It is possible for there to be multiple pages in a group because we might re-crawl a page. If we have another group `G' '` of items `i_k' '` of this kind such that the hash of the most recent page matches that of `G'`, then the two groups are merged. While we are grouping, we are computing a temporary overall score for a group. The temporary score is used to determine which page's (or link's if no pages are present) summaries in a group should be used as the source of url, title, and snippets. Let `G` be the group one gets performing this process after all groups with the same hash as `G'` have been merged. We now describe how the individual items in `G` have their score computed, and finally, how these scores are combined.
The Doc Rank of an item `d`, `mbox(DR)(d)`, is calculated according to the formula mentioned in the queue servers subsection: \begin{eqnarray} \mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) + (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\ &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\ &=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times \\ &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\ \mbox{DR}(d) &=& 10 - \log_{10}(\mbox{RANK}(d)) \end{eqnarray} To compute the relevance of an item, we use a variant of BM25F [ ZCTSR2004 ]. Suppose a query `q` is a set of terms `t`. View an item `d` as a bag of terms, let `f_(t,d)` denote the frequency of the term `t` in `d`, let `N` denote the total number of items in the whole index, let `N_t` denote the number of items containing `t` in the whole index (not just the group), let `l_d` denote the length of `d`, where length is the number of terms including repeats it contains, and let `l_{avg}` denote the average length of an item in the index. The basic BM25 formula is: \begin{eqnarray} \mbox{Score}_{\mbox{BM25}}(q, d) &=& \sum_{t \in q} \mbox{IDF}(t) \cdot \mbox{TF}_{\mbox{BM25}}(t,d), \mbox{ where }\\ \mbox{IDF}(t) &=& \log(\frac{N}{N_t})\mbox{, and}\\ \mbox{TF}_{\mbox{BM25}}(t,d) &=& \frac{f_{t,d}\cdot(k_1 +1)}{f_{t,d} + k_1\cdot ((1-b) + b\cdot(l_d / l_{avg}) )} \end{eqnarray} `mbox(IDF)(t)`, the inverse document frequency of `t`, in the above can be thought of as a measure of how much signal is provided by knowing that the term `t` appears in the document. For example, its value is zero if `t` is in every document; whereas, the rarer the term is, the larger the value of `mbox(IDF)(t)`. `mbox(TF)_(mbox(BM25))` represents a normalized term frequency for `t`. Here `k_1 = 1.2` and `b=0.75` are tuned parameters which are set to values commonly used in the literature. `mbox(TF)_(mbox(BM25))` is normalized to prevent bias toward longer documents. Also, if one spams a document, filling it with many copies of the term `t`, we approach the limiting situation `lim_(f_(t,d) -> infty) mbox(TF)_(mbox(BM25))(t,d) = k_1 +1`, which as one can see prevents the document score from being made arbitrarily large.
Yioop computes a variant of BM25F, not BM25. This formula also needs values for things like `l_(avg)`, `N`, and `N_t`. To keep the computation simple, at the loss of some accuracy, when Yioop needs these values it uses the statistics of the particular index shard containing `d` as a stand-in. BM25F is essentially the same as BM25 except that it separates a document into components, computes the BM25 score of the document with respect to each component, and then takes a weighted sum of these scores. In the case of Yioop, if the item is a page the two components are an ad hoc title and a description. Recall when making our position lists for a term in a document that we concatenated url keywords, followed by the title, followed by the summary description. So the first terms in the resulting string will tend to come from the title. We take the first AD_HOC_TITLE_LEN many terms from a document to be in the ad hoc title. We calculate an ad hoc title BM25 score for a query term being in the ad hoc title of an item. We multiply this by 2 and then compute a BM25 score of the term being in the rest of the summary. We add the two results. I.e.,
`mbox(Rel)(q, d) = 2 times mbox(Score)_(mbox(BM25-Title))(q, d) + mbox(Score)_(mbox(BM25-Description))(q, d)`
This score would be the relevance for a single page item `d` with respect to `q`. For link items we don't separate into title and description, but we can weight the BM25 score differently than for a page (currently, though, the link weight is set to 1 by default). These three weights: title weight, description weight, and link weight can be set in Page Options > Search Time > Search Rank Factors.
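A compact PHP sketch of the relevance computation above (BM25 per component, then the 2-to-1 weighted sum) could look like the following; the statistics array, the frequency inputs, and the function names are simplified stand-ins for what Yioop derives from its index shards:
 <?php
 /** BM25 term score using the IDF and TF formulas defined above. */
 function bm25TermScore($freq, $docLen, $avgLen, $numDocs, $numDocsWithTerm,
     $k1 = 1.2, $b = 0.75)
 {
     $idf = log($numDocs / $numDocsWithTerm);
     $tf = ($freq * ($k1 + 1)) /
         ($freq + $k1 * ((1 - $b) + $b * ($docLen / $avgLen)));
     return $idf * $tf;
 }
 /** BM25F-style relevance: 2 x (ad hoc title score) + description score.
  *  $titleFreqs and $descFreqs map query terms to frequencies in the two
  *  components; $stats holds shard-derived statistics. */
 function relevance(array $queryTerms, array $titleFreqs, array $descFreqs,
     array $stats)
 {
     $score = 0.0;
     foreach ($queryTerms as $term) {
         $numDocsWith = $stats['num_docs_with'][$term] ?? 1;
         $score += 2 * bm25TermScore($titleFreqs[$term] ?? 0,
             $stats['title_len'], $stats['avg_title_len'],
             $stats['num_docs'], $numDocsWith);
         $score += bm25TermScore($descFreqs[$term] ?? 0,
             $stats['desc_len'], $stats['avg_desc_len'],
             $stats['num_docs'], $numDocsWith);
     }
     return $score;
 }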
To compute the proximity score of an item `d` with respect to a query `q` with more than one term, we use the notion of a span. A span is an interval `[u_i, v_i]` of positions within `d` which contains all the terms (including repeats) in `q` such that no smaller interval contains all the terms (including repeats). Given `d` we can calculate a proximity score as a sum of the inverse of the sizes of the spans:
`mbox(Prox)(d) = sum(frac(1)(v_i - u_i + 1))`.
This formula comes from Clarke et al. [ CCT2000 ] except that they use covers, rather than spans, where covers ignore repeats. For a page item, Yioop calculates separate proximity scores with respect to its ad hoc title and the rest of its summary. It then adds them with the same weights as were used for the BM25F relevance score. Similarly, link item proximities also have a weight factor multiplied against them.
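The span enumeration can be done with a single two-pointer scan over the item's terms. The following PHP sketch is ours, not Yioop's posting-list based code; it works directly on the document as a token list and treats the query as a multiset, so repeated query terms must be covered the corresponding number of times:
 <?php
 /** Proximity score: sum over minimal intervals (spans) of document
  *  positions containing all query terms of 1/(span length). */
 function proximity(array $queryTerms, array $docTerms)
 {
     $needed = array_count_values($queryTerms);
     $have = [];
     $missing = count($needed);
     $score = 0.0;
     $left = 0;
     for ($right = 0; $right < count($docTerms); $right++) {
         $term = $docTerms[$right];
         if (!isset($needed[$term])) { continue; }
         $have[$term] = ($have[$term] ?? 0) + 1;
         if ($have[$term] == $needed[$term]) { $missing--; }
         while ($missing == 0) { // shrink from the left to a minimal span
             $leftTerm = $docTerms[$left];
             if (isset($needed[$leftTerm])) {
                 if ($have[$leftTerm] == $needed[$leftTerm]) {
                     // [$left, $right] is a span; score it and move on
                     $score += 1.0 / ($right - $left + 1);
                     $missing++;
                 }
                 $have[$leftTerm]--;
             }
             $left++;
         }
     }
     return $score;
 }
 // e.g. proximity(["quick", "fox"],
 //     ["the", "quick", "brown", "fox", "jump", "over", "the", "lazi", "dog"])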
Now that we have described how to compute Doc Rank, Relevance, and Proximity for each item in a group, we describe how to get these three values for the whole group. First, for proximity we take the max over all the proximity scores in a group. The idea is that since we are going out typically 200 results before grouping, each group has a relatively small number of items in it. Of these there will typically be at most one or two page items, and the rest will be link items. We aren't doing document length normalization for proximity scores and it might not make sense to do so for links data where the whole link text is relatively short. Thus, the maximum score in the group is likely to be that of a page item, and on clicking the result link it will be these spans that the user sees. Let `[u]` denote all the items that would be grouped with url `u` in the grouping process, and let `q` be a query. Let `Res(q)` denote results in the index satisfying query `q`, that is, having all the terms in the query. Then the proximity of `[u]` with respect to `q` is: \begin{eqnarray} \mbox{Prox}(q, [u]) &=& \mbox{max}_{i \in [u], i \in Res(q)}(\mbox{Prox}(q,i)). \end{eqnarray} For Doc Rank and Relevance, we split a group into subgroups based on the host name of where a link came from. So links from http://www.yahoo.com/something1 and http://www.yahoo.com/something2 to a url `u` would have the same hostname http://www.yahoo.com/. A link from http://www.google.com/something1 would have hostname http://www.google.com/. We will also use a weighting `wt(i)` which has value `2` if `i` is a page item and the url of `i` is a hostname, and `1` otherwise. Let `mbox(Host)(i)` denote the hostname of a page item `i`, and let `mbox(Host)(i)` denote the hostname of the page `i` came from in the case of a link item. Let
`H([u]) = {h | h = mbox(Host)(i) mbox( for some ) i in [u]}`.
Let `[u]_h` be the items in `[u]` with hostname `h`. Let `([u]_h)_j` denote the `j`th element of `[u]_h` listed out in order of Doc Rank, except that the first page item found is listed as element `0`. It seems reasonable that if a particular host tells us the site `u` is great multiple times, the likelihood that we would have our minds swayed diminishes with each repetition. This motivates our formulas for Doc Rank and Relevance, which we give now: \begin{eqnarray} \mbox{Rel}(q, [u]) &=& \sum_{h \in H([u])} \sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{Rel}(q, ([u]_h)_j).\\ \mbox{DR}(q, [u]) &=& \sum_{h \in H([u])} \sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{DR}(q, ([u]_h)_j). \end{eqnarray} Now that we have described how Doc Rank, Relevance, and Proximity are calculated for groups, we have almost completed our description of the Yioop scoring mechanism in the conjunctive query case. After performing pre-processing steps on the query, Yioop retrieves the first `n` results from its index. Here `n` defaults to 200. It then groups the results and uses the formulas above to calculate the three scores for Doc Rank, Relevance, and Proximity. It then uses reciprocal rank fusion to combine these three scores into a single score, sorts the results by this score, and returns to the user the top 10 of these results.
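To make the grouping formulas concrete, here is a PHP sketch (the item array layout and the names are ours, not Yioop's) that computes a group's Relevance or Doc Rank total; Proximity, as described above, would simply be a max over the items:
 <?php
 /** Combine item scores into a group score for a url: items are bucketed
  *  by the host they came from, and within a host each successive item
  *  counts half as much as the previous one. Each item is an array with
  *  keys 'host', 'is_page', 'url_is_host', 'doc_rank', and 'score', where
  *  'score' already holds the item's Relevance (or Doc Rank). */
 function groupScore(array $items)
 {
     $byHost = [];
     foreach ($items as $item) {
         $byHost[$item['host']][] = $item;
     }
     $total = 0.0;
     foreach ($byHost as $host => $hostItems) {
         // order by Doc Rank, then move the first page item found to front
         usort($hostItems, function ($a, $b) {
             return $b['doc_rank'] <=> $a['doc_rank'];
         });
         foreach ($hostItems as $i => $item) {
             if ($item['is_page']) {
                 array_splice($hostItems, $i, 1);
                 array_unshift($hostItems, $item);
                 break;
             }
         }
         foreach ($hostItems as $j => $item) {
             // wt is 2 for a page item whose url is just a hostname, else 1
             $wt = ($item['is_page'] && $item['url_is_host']) ? 2 : 1;
             $total += $wt * $item['score'] / pow(2, $j);
         }
     }
     return $total;
 }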

Final Reordering

The top 10 results produced in the last section are what is presented in a basic configuration of Yioop. It is possible to configure Yioop to make use of a thesaurus to reorder these 10 results before final presentation. When this is done, these 10 results are retrieved as described above. Yioop then does part of speech tagging on the original query. This is done with a simplified Brill tagger [ B1992 ]. Using the tagged version of the query, it looks up each term for its part of speech in a thesaurus for the current language. This is currently only implemented for English and the thesaurus used is WordNet. Possible synonyms for a term in WordNet often have example sentences. If so, cosine or intersection rank scores of these sentences versus the original query are computed and the highest scoring synonym is selected. If there are no example sentences, then the first synonym is selected. To calculate a cosine score we view the original query and the sentence as binary vectors where the coordinates of the vectors are labeled by terms. So, for example, the "the" coordinate of the original query would be 1 if the original query contained the word "the". The dot product of these two vectors, divided by the product of their lengths, then gives the cosine of the angle between them, the cosine score. This scoring is done for each term. Then for each term in the original query, the query is modified by swapping it for its synonym. The number of documents for the modified query as a whole phrase is looked up in the index dictionary for the current index. The three phrases which occur most often in the dictionary are then selected. For each of the top 10 documents for the query, the sum of the cosine similarities of these three phrases with a document's summary is computed to get a thesaurus score. The ten documents are then sorted by this score and displayed.
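The cosine score between the original query and a synonym's example sentence, both viewed as binary term vectors, can be computed as below; this sketch is ours and its tokenization is a simplification:
 <?php
 /** Cosine similarity of two texts viewed as binary term vectors. */
 function binaryCosine($text1, $text2)
 {
     $terms1 = array_unique(preg_split('/\W+/u', mb_strtolower($text1), -1,
         PREG_SPLIT_NO_EMPTY));
     $terms2 = array_unique(preg_split('/\W+/u', mb_strtolower($text2), -1,
         PREG_SPLIT_NO_EMPTY));
     if (!$terms1 || !$terms2) { return 0.0; }
     $dotProduct = count(array_intersect($terms1, $terms2));
     return $dotProduct / (sqrt(count($terms1)) * sqrt(count($terms2)));
 }
 // e.g. binaryCosine("seal hunting", "the seal is an aquatic mammal")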

References

[APC2003]
Serge Abiteboul and Mihai Preda and Gregory Cobena. Adaptive on-line page importance computation. In: Proceedings of the 12th international conference on World Wide Web. pp. 280-290. 2003.
[BY2008]
A. M. Z. Bidoki and Nasser Yazdani. DistanceRank: An intelligent ranking algorithm for web pages. Information Processing and Management. Vol. 44. Iss. 2. pp. 877--892. March, 2008.
[B1992]
Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the third conference on Applied natural language processing (ANLC '92). Association for Computational Linguistics. Stroudsburg, PA, USA. pp. 152--155.
[BP1998]
Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh International World-Wide Web Conference (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.
[CCT2000]
Charles L. A. Clarke and Gordon V. Cormack and Elizabeth A. Tudhope. Relevance Ranking for One to Three Term Queries. In: Information Processing Management. Vol. 36. Iss. 2. pp.291--311. 2000.
[CCB2009]
Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher. Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp.758--759. 2009.
[LLWL2009]
H.-T. Lee, D. Leonard, X. Wang, D. Loguinov. IRLbot: Scaling to 6 Billion Pages and Beyond. ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
[NW2001]
Marc Najork and Janet L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. Proceedings of the 10th international conference on World Wide Web. pp 114--118. 2001.
[PTSHVC2011]
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. Inverted indexes for phrases and strings. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 555--564. 2011.
[U1995]
Ukkonen, E. On-line construction of suffix trees. Algorithmica. Vol. 14. Iss 3. pp. 249--260. 1995.
[VLZ2012]
Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel. Learning to rank by aggregating expert preferences. 21st ACM International Conference on Information and Knowledge Management. pp. 843-851. 2012.
[ZCTSR2004]
Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of the 13th Annual Text Retrieval Conference. 2004.