BUFFER_SIZE
The maximum number of candidate documents to consider at once in order to find the best candidate.
The primary interface for building and using classifiers. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.
A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. In order to do this, a classifier goes through a training phase on a labeled training set where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined in order to yield a pseudo-probability that the document belongs to the class.
A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.
During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.
Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features. In addition to keeping track of the mapping between terms and feature indices, a Features instance keeps term and label statistics (such as how often a term occurs in documents with a particular label) used to weight features within a document and to select informative features. Finally, subclasses of the Features class weight features in different ways, presenting more or less of everything that's known about the frequency or informativeness of a feature to classification algorithms.
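The term-to-index mapping described above can be illustrated with a toy sketch (the class and method names here are hypothetical, not the library's actual Features API, and the term/label statistics a real Features instance keeps are omitted):

```php
<?php
// Toy sketch of mapping terms to integer feature indices, as in the
// "Pythagorean" => 1, "theorem" => 2 example above. The class and
// method names are hypothetical; the real Features class also tracks
// term and label statistics, which are omitted here.
class TermIndex
{
    private $indices = [];
    private $next = 1;

    // Return a stable index for $term, assigning the next free index
    // the first time the term is seen.
    public function indexOf(string $term): int
    {
        if (!isset($this->indices[$term])) {
            $this->indices[$term] = $this->next++;
        }
        return $this->indices[$term];
    }
}
```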
Once a sufficiently useful training set has been built, a FeatureSelection instance is used to choose the most informative features, and copy these into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms, and to classify web pages. Finalization is just the process of training a logistic regression classification algorithm on the full training set. This results in a set of feature weights that can be used to efficiently assign a pseudo-probability to the proposition that a new web page is a positive instance of the class that the classifier has been trained to recognize. Training logistic regression on a large training set can take a long time, so this phase is carried out asynchronously, by a daemon launched in response to the finalization request.
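The way learned weights yield a pseudo-probability can be sketched as a generic logistic-regression scorer over a sparse feature vector (the variable layout and bias handling are assumptions, not the library's actual representation):

```php
<?php
// Generic logistic-regression scoring sketch: combine learned feature
// weights with a document's sparse feature vector (index => value),
// then squash the linear score through the sigmoid to obtain a
// pseudo-probability in (0, 1).
function logisticScore(array $weights, array $features, float $bias = 0.0): float
{
    $z = $bias;
    foreach ($features as $index => $value) {
        $z += ($weights[$index] ?? 0.0) * $value;
    }
    return 1.0 / (1.0 + exp(-$z));
}
```

With all-zero weights the score is 0.5, i.e. maximal uncertainty; positive weighted sums push the score toward 1.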
Because the full Features instance, buffer, and training set are only needed during the labeling and finalization phases, and because they can get very large and take up a lot of space in memory, this class separates its large instance members into separate files when serializing to disk. When a classifier is first loaded into memory from disk it brings along only its summary statistics, since these are all that are needed to, for example, display a list of classifiers. In order to actually add new documents to the training set, finalize, or classify, the classifier must first be explicitly told to load the relevant data structures from disk; this is accomplished by methods like prepareToLabel and prepareToClassify. These methods load in the relevant serialized structures, and mark the associated data members for storage back to disk when (or if) the classifier is serialized again.
$options : array
Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are:
float density.lambda: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).
float density.beta: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).
int label_fs.max: Use the `label_fs.max' most informative features to train the Naive Bayes classifiers used during labeling to compute disagreement for a document.
float threshold: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= `threshold' are classified as positive instances.
string final_algo: Algorithm to use for finalization; 'lr' for logistic regression, or 'nb' for Naive Bayes; default 'lr'.
int final_fs.max: Use the `final_fs.max' most informative features to train the final classifier.
$buffer : array
The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.
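The three-field structure described above might be initialized like so (a sketch; only the three documented fields are assumed, and the contents of each are filled in elsewhere):

```php
<?php
// Sketch of the documented candidate-buffer structure, initialized empty.
$buffer = [
    'docs' => [],      // candidate page summaries (first element is active)
    'densities' => [], // density computed for each candidate in the pool
    'stats' => [],     // term/document statistics for the current pool
];
```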
$final_features : object
The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.
$final_algorithm : object
The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.
__construct(string $label, array $options = array())
Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.
string | $label | class label applied to positive instances of the class this classifier is trained to recognize |
array | $options | optional associative array of options that will override the default options |
__sleep() : array
Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.
names of properties to store when serializing this instance
prepareToLabel()
Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).
prepareToFinalize()
Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.
labelDocument(string $key, integer $label, boolean $is_active = true) : boolean
Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).
When updating an existing document, we will either need to swap the label in the training set and update the statistics stored by the Features instance (since now the features are associated with a different label), or drop the document from the training set and (again) update the statistics stored by the Features instance. In either case the negative and positive counts must be updated as well.
When working with a new document, we need to remove it from the candidate buffer, and if the label is non-zero then we also need to add the document to the training set. That involves tokenizing the document, passing the tokens through the full_features instance, and storing the resulting feature vector, plus the new label in the docs attribute. The positive and negative counts must be updated as well.
Finally, if this operation is occurring during active labeling (when the user is providing labels one at a time), that information needs to be passed along to dropBufferDoc, which can avoid doing some work in the non-active case.
string | $key | key used to select the document from the docs array |
integer | $label | new label (-1, 1, or 0) |
boolean | $is_active | whether this operation is being carried out during active labeling |
true if the training set was modified, and false otherwise
addAllDocuments(object $mix_iterator, integer $label, integer $limit = INF) : integer
Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.
Returns the total number of newly-labeled documents.
object | $mix_iterator | crawl mix iterator to draw documents from |
integer | $label | label to apply to every document; -1 or 1, but NOT 0 |
integer | $limit | optional upper bound on the number of documents to add; defaults to no limit |
total number of newly-labeled documents
initBuffer(object $mix_iterator, integer $buffer_size = null) : integer
Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.
object | $mix_iterator | crawl mix iterator to draw documents from |
integer | $buffer_size | optional buffer size to use; defaults to the runtime parameter |
final buffer size
refreshBuffer(object $mix_iterator, integer $buffer_size = null) : integer
Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.
Returns the final buffer size, which may be less than that requested if the iterator doesn't return enough documents.
object | $mix_iterator | crawl mix iterator to draw documents from |
integer | $buffer_size | optional buffer size to use; defaults to the runtime parameter |
final buffer size
computeBufferDensities()
Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.
The density of a document is approximated by its average overlap with every other document in the candidate buffer, where the overlap between two documents is itself approximated using the exponential of the negative KL-divergence between them. The KL-divergence is smoothed to deal with features (terms) that occur in one distribution (document) but not the other, and then multiplied by a negative constant and exponentiated in order to convert it to a kind of linear overlap score.
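A standalone sketch of this computation (not the library's code; `$lambda` and `$beta` correspond to the `density.lambda` and `density.beta` options, documents are represented as term => frequency arrays, and the helper names are hypothetical):

```php
<?php
// Smoothed KL-divergence between two term-frequency distributions.
// Laplace-style smoothing with $lambda keeps terms that are absent
// from one document from producing an infinite divergence.
function smoothedKlDivergence(array $p, array $q, float $lambda): float
{
    $vocab = array_unique(array_merge(array_keys($p), array_keys($q)));
    $v = count($vocab);
    $pDenom = array_sum($p) + $lambda * $v;
    $qDenom = array_sum($q) + $lambda * $v;
    $kl = 0.0;
    foreach ($vocab as $term) {
        $pi = (($p[$term] ?? 0) + $lambda) / $pDenom;
        $qi = (($q[$term] ?? 0) + $lambda) / $qDenom;
        $kl += $pi * log($pi / $qi);
    }
    return $kl;
}

// Density of each document: average overlap exp(-beta * KL) with every
// other document in the pool, requiring O(N^2) divergence computations.
function bufferDensities(array $docs, float $lambda, float $beta): array
{
    $n = count($docs);
    $densities = [];
    foreach ($docs as $i => $di) {
        $overlap = 0.0;
        foreach ($docs as $j => $dj) {
            if ($i === $j) {
                continue;
            }
            $overlap += exp(-$beta * smoothedKlDivergence($di, $dj, $lambda));
        }
        $densities[$i] = $n > 1 ? $overlap / ($n - 1) : 0.0;
    }
    return $densities;
}
```

Two identical documents have zero divergence, hence overlap 1, so near-duplicates in the pool receive high densities.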
findNextDocumentToLabel() : array
Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.
two-element array containing first the best candidate, and second the disagreement score, obtained by dividing the disagreement for the document by the maximum disagreement possible for the committee size
train(boolean $update_accuracy = false)
Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.
boolean | $update_accuracy | optional parameter specifying whether or not to update the accuracy estimate after training completes; defaults to false |
updateAccuracy(object $X = null, array $y = null)
Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.
object | $X | optional sparse matrix representing the already-mapped training set to use; if not provided, the current training set is mapped using the label_features property |
array | $y | optional array of document labels corresponding to the training set; if not provided the current training set labels are used |
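The rotation of training and testing fifths can be sketched generically; here training and scoring a fresh classifier is abstracted into a callable, since the Naive Bayes details live elsewhere (the function and parameter names are hypothetical):

```php
<?php
// Five-fold cross-validation sketch: hold out every fifth example
// (offset by the fold number) for testing, train on the rest, and
// average the per-fold accuracies returned by $trainAndTest.
function crossValidatedAccuracy(array $X, array $y, callable $trainAndTest): float
{
    $folds = 5;
    $total = 0.0;
    for ($f = 0; $f < $folds; $f++) {
        $trainX = $trainY = $testX = $testY = [];
        foreach ($X as $i => $row) {  // $X is assumed to be int-indexed
            if ($i % $folds == $f) {
                $testX[] = $row;
                $testY[] = $y[$i];
            } else {
                $trainX[] = $row;
                $trainY[] = $y[$i];
            }
        }
        $total += $trainAndTest($trainX, $trainY, $testX, $testY);
    }
    return $total / $folds;
}
```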
finalize()
Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.
classify(array $page) : float
Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.
array | $page | page summary array for the page to be classified |
pseudo-probability that the page is a positive instance of the target class
addBufferDoc(array $page, boolean $is_active = true)
Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.
array | $page | page summary for the document to add to the buffer |
boolean | $is_active | whether this operation is part of active training, in which case some extra statistics must be maintained |
dropBufferDoc(boolean $is_active = true)
Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.
boolean | $is_active | whether this operation is part of active training, in which case some extra statistics must be maintained |
loadProperties()
Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.
labelPage(array $summary, array $classifiers, array& $active_classifiers, array& $active_rankers)
Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.
As an example, suppose that a classifier with class label `label' has determined that a document is a positive example with pseudo-probability 0.87 and threshold 0.5. The following meta words are added to the summary: class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus.
array | $summary | page summary to classify, passed by reference |
array | $classifiers | list of Classifier instances, each prepared for classifying (via the prepareToClassify method) |
array& | $active_classifiers | |
array& | $active_rankers | |
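The meta-word scheme in the example above can be sketched as a standalone helper (the function name is hypothetical; only the documented `class:<label>...` word forms are assumed):

```php
<?php
// Sketch of the documented meta-word scheme: given a class label, a
// classification score, and the classifier's threshold, produce the
// meta words that would be added to a page summary.
function classificationMetaWords(string $label, float $score, float $threshold): array
{
    $words = ["class:{$label}"];
    // The score as a percentage, rounded down to a multiple of ten.
    $floor = 10 * (int)floor($score * 10);
    $words[] = "class:{$label}:{$floor}";
    // One "<n>plus" word for each multiple of ten from the floored
    // score down to the threshold.
    for ($n = $floor; $n >= (int)($threshold * 100); $n -= 10) {
        $words[] = "class:{$label}:{$n}plus";
    }
    return $words;
}
```

For the documented example (score 0.87, threshold 0.5) this yields class:label, class:label:80, and the 80plus through 50plus words.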
getClassifierList() : array
Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.
associative array of class labels mapped to their corresponding classifier instances
getClassifier(string $label) : object
Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.
string | $label | classifier's class label |
classifier instance with the relevant class label, or null if no such classifier exists on disk
loadClassifiersData(array $labels) : array
Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.
array | $labels | flat array of class labels for which to load data |
associative array mapping class labels to arrays of data necessary for initializing the associated classifier
newClassifierFromData(array $data) : object
The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.
array | $data | associative array mapping property names to their serialized and compressed data |
Classifier instance built from the passed-in data
setClassifier(object $classifier)
Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.
The classifier directory and all of its contents are made world-writable so that they can be manipulated without hassle from the command line.
object | $classifier | Classifier instance to store to disk |
klDivergenceToMean(array $ps) : float
Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.
array | $ps | probabilities describing several discrete two-element probability distributions |
KL-divergence to the mean for the collection of distributions
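A standalone sketch of this computation, under the assumption that the divergences of the members from their mean distribution are averaged (the exact normalization is not specified here):

```php
<?php
// KL-divergence to the mean for two-element distributions, each given
// by its positive-class probability p (the negative-class probability
// is 1 - p). Averaging over members is an assumption of this sketch.
function klDivergenceToMean(array $ps): float
{
    $mean = array_sum($ps) / count($ps);
    $kl = 0.0;
    foreach ($ps as $p) {
        // A zero component contributes nothing (0 * log 0 -> 0).
        if ($p > 0) {
            $kl += $p * log($p / $mean);
        }
        if ($p < 1) {
            $kl += (1 - $p) * log((1 - $p) / (1 - $mean));
        }
    }
    return $kl / count($ps);
}
```

Perfect agreement among committee members gives a divergence of 0; maximal disagreement between two members, as in `klDivergenceToMean([0.0, 1.0])`, gives log 2.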