\seekquarry\yioop\library\classifiers\Classifier

The primary interface for building and using classifiers. An instance of this class represents a single classifier in memory, but the class also provides static methods to manage classifiers on disk.

A single classifier is a tool for determining the likelihood that a document is a positive instance of a particular class. To do this, a classifier goes through a training phase on a labeled training set, where it learns weights for document features (terms, for our purposes). To classify a new document, the learned weights for all terms in the document are combined to yield a pseudo-probability that the document belongs to the class.

A classifier is composed of a candidate buffer, a training set, a set of features, and a classification algorithm. In addition to the set of all features, there is a restricted set of features used for training and classification. There are also two classification algorithms: a Naive Bayes algorithm used during labeling, and a logistic regression algorithm used to train the final classifier. In general, a fresh classifier will first go through a labeling phase where a collection of labeled training documents is built up out of existing crawl indexes, and then a finalization phase where the logistic regression algorithm will be trained on the training set established in the first phase. After finalization, the classifier may be used to classify new web pages during a crawl.

During the labeling phase, the classifier fills a buffer of candidate pages from the user-selected index (optionally restricted by a query), and tries to pick the best one to present to the user to be labeled (here `best' means the one that, once labeled, is most likely to improve classification accuracy). Each labeled document is removed from the buffer, converted to a feature vector (described next), and added to the training set. The expanded training set is then used to train an intermediate Naive Bayes classification algorithm that is in turn used to more accurately identify good candidates for the next round of labeling. This phase continues until the user gets tired of labeling documents, or is happy with the estimated classification accuracy.

Instead of passing around terms everywhere, each document that goes into the training set is first mapped through a Features instance that maps terms to feature indices (e.g. "Pythagorean" => 1, "theorem" => 2, etc.). These feature indices are used internally by the classification algorithms, and by the algorithms that try to pick out the most informative features. In addition to keeping track of the mapping between terms and feature indices, a Features instance keeps term and label statistics (such as how often a term occurs in documents with a particular label) used to weight features within a document and to select informative features. Finally, subclasses of the Features class weight features in different ways, presenting more or less of everything that's known about the frequency or informativeness of a feature to classification algorithms.
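
The term-to-index mapping described above can be sketched as follows. This is an illustrative Python stand-in, not Yioop's PHP Features class: the `FeatureMap` name, its methods, and the shape of the statistics it keeps are all assumptions made for the example.

```python
from collections import defaultdict


class FeatureMap:
    """Hypothetical stand-in for a Features instance (not Yioop's class)."""

    def __init__(self):
        self.index = {}                     # term -> feature index (1-based)
        self.doc_counts = defaultdict(int)  # (feature index, label) -> documents

    def add_document(self, term_counts, label):
        """Map a {term: frequency} dict to a {feature index: frequency} dict."""
        vector = {}
        for term, freq in term_counts.items():
            if term not in self.index:
                # New terms get the next unused index.
                self.index[term] = len(self.index) + 1
            feature = self.index[term]
            vector[feature] = freq
            # Track how many documents with this label contain the term.
            self.doc_counts[(feature, label)] += 1
        return vector


features = FeatureMap()
v1 = features.add_document({"Pythagorean": 1, "theorem": 2}, label=1)
v2 = features.add_document({"theorem": 1, "proof": 1}, label=-1)
```

Here "theorem" keeps index 2 in both documents, and the per-label counts record that it appeared once in a positive and once in a negative document.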

Once a sufficiently useful training set has been built, a FeatureSelection instance is used to choose the most informative features and copy them into a reduced Features instance that has a much smaller vocabulary, and thus a much smaller memory footprint. For efficiency, this is the Features instance used to train classification algorithms and to classify web pages.

Finalization is just the process of training a logistic regression classification algorithm on the full training set. This results in a set of feature weights that can be used to efficiently assign a pseudo-probability to the proposition that a new web page is a positive instance of the class that the classifier has been trained to recognize. Training logistic regression on a large training set can take a long time, so this phase is carried out asynchronously by a daemon launched in response to the finalization request.

Because the full Features instance, buffer, and training set are only needed during the labeling and finalization phases, and because they can get very large and take up a lot of space in memory, this class separates its large instance members into separate files when serializing to disk. When a classifier is first loaded into memory from disk it brings along only its summary statistics, since these are all that are needed to, for example, display a list of classifiers. In order to actually add new documents to the training set, finalize, or classify, the classifier must first be explicitly told to load the relevant data structures from disk; this is accomplished by methods like prepareToLabel and prepareToClassify. These methods load in the relevant serialized structures, and mark the associated data members for storage back to disk when (or if) the classifier is serialized again.

Summary

Methods (all public): __construct(), __sleep(), prepareToLabel(), prepareToFinalize(), prepareToClassify(), labelDocument(), addAllDocuments(), initBuffer(), refreshBuffer(), computeBufferDensities(), findNextDocumentToLabel(), train(), updateAccuracy(), finalize(), classify(), addBufferDoc(), dropBufferDoc(), moveBufferDocToFront(), tokenizeDescription(), loadProperties(), storeLoadedProperties(), labelPage(), getClassifierList(), getClassifier(), loadClassifiersData(), newClassifierFromData(), setClassifier(), deleteClassifier(), cleanLabel(), getCrawlMixName(), makeKey(), klDivergenceToMean()

Properties (all public): $options, $class_label, $timestamp, $lang, $fresh, $finalized, $positive, $negative, $total, $accuracy, $buffer, $docs, $full_features, $label_features, $label_algorithm, $final_features, $final_algorithm, $loaded_properties

Constants: BUFFER_SIZE, COMMITTEE_SIZE, MAX_DISAGREEMENT, DENSITY_LAMBDA, DENSITY_BETA, THRESHOLD, UNFINALIZED, FINALIZING, FINALIZED

The class has no protected or private methods or properties.

Constants

BUFFER_SIZE

The maximum number of candidate documents to consider at once in order to find the best candidate.

COMMITTEE_SIZE

The number of Naive Bayes instances to use to calculate disagreement during candidate selection.

MAX_DISAGREEMENT

The maximum disagreement score between candidates. This number depends on committee size, and is used to provide a slightly more user-friendly estimate of how much disagreement a document causes (between 0 and 1).

DENSITY_LAMBDA

Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).

DENSITY_BETA

Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).

THRESHOLD

Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= THRESHOLD are classified as positive instances.

UNFINALIZED

Indicates that a classifier needs to be finalized before it can be used.

FINALIZING

Indicates that a classifier is currently being finalized (this may take a while).

FINALIZED

Indicates that a classifier has been finalized, and is ready to be used for classification.

Properties

$options

$options : array

Default per-classifier options, which may be overridden when constructing a new classifier. The supported options are:

float density.lambda: Lambda parameter used in the computation of a candidate document's density (smoothing for 0-frequency terms).

float density.beta: Beta parameter used in the computation of a candidate document's density (sharpness of the KL-divergence).

int label_fs.max: Use the `label_fs.max' most informative features to train the Naive Bayes classifiers used during labeling to compute disagreement for a document.

float threshold: Threshold used to convert a pseudo-probability to a hard classification decision. Documents with pseudo-probability >= `threshold' are classified as positive instances.

string final_algo: Algorithm to use for finalization; 'lr' for logistic regression, or 'nb' for Naive Bayes; default 'lr'.

int final_fs.max: Use the `final_fs.max' most informative features to train the final classifier.

Type

array

$class_label

$class_label : string

The label applied to positive instances of the class learned by this classifier (e.g., `spam').

Type

string

$timestamp

$timestamp : integer

Creation time as a UNIX timestamp.

Type

integer

$lang

$lang : string

Language of documents in the training set (also how new documents will be treated).

Type

string

$fresh

$fresh : boolean

Whether or not this classifier has had any training examples added to it, and consequently whether or not its Naive Bayes classification algorithm has ever been trained.

Type

boolean

$finalized

$finalized : integer

Finalization status, as determined by one of the three finalization constants.

Type

integer

$positive

$positive : integer

The number of positive examples in the training set.

Type

integer

$negative

$negative : integer

The number of negative examples in the training set.

Type

integer

$total

$total : integer

The total number of examples in the training set (sum of positive and negative).

Type

integer

$accuracy

$accuracy : float

The estimated classification accuracy. This member may be null if the accuracy has not yet been estimated, or out of date if examples have been added to the training set since the last accuracy update, but no new estimate has been computed.

Type

float

$buffer

$buffer : array

The current pool of candidates for labeling. The first element in the buffer is always the active document, and as active documents are labeled and removed, the pool is refreshed with new candidates (if there are more pages to be drawn from the active index). The buffer is represented as an associative array with three fields: 'docs', the candidate page summaries; 'densities', an array of densities computed for the documents in the candidate pool; and 'stats', statistics about the terms and documents in the current pool.

Type

array

$docs

$docs : array

The training set, broken up into two fields of an associative array: 'features', an array of document feature vectors; and 'labels', the labels assigned to each document.

Type

array

$full_features

$full_features : object

The Features subclass instance used to manage the full set of features seen across all documents in the training set.

Type

object

$label_features

$label_features : object

The Features subclass instance used to manage the reduced set of features used only by Naive Bayes classification algorithms during the labeling phase.

Type

object

$label_algorithm

$label_algorithm : object

The NaiveBayes classification algorithm used during training to tentatively classify documents presented to the user for labeling.

Type

object

$final_features

$final_features : object

The Features subclass instance used to map documents at classification time to the feature vectors expected by classification algorithms. This will generally be a reduced feature set, just like that used during labeling, but potentially larger than the set used by Naive Bayes.

Type

object

$final_algorithm

$final_algorithm : object

The finalized classification algorithm that will be used to classify new web pages. Will usually be logistic regression, but may be Naive Bayes, if set by the options. During labeling, this field is a reference to the Naive Bayes classification algorithm (so that that algorithm will be used by the `classify' method), but it won't be saved to disk as such.

Type

object

$loaded_properties

$loaded_properties : array

The names of properties set by one of the prepareTo* methods; these properties will be saved back to disk during serialization, while all other properties not listed by the __sleep method will be discarded.

Type

array

Methods

__construct()

__construct(string  $label, array  $options = array()) 

Initializes a new classifier with a class label, and options to override the defaults. The timestamp associated with the classifier is taken from the time of construction.

Parameters

string $label

class label applied to positive instances of the class this classifier is trained to recognize

array $options

optional associative array of options that will override the default options
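
As a rough illustration of how per-classifier options might override the defaults, here is a minimal Python sketch (not Yioop's PHP constructor). The option keys come from the $options property description; the only default value this reference states is 'lr' for final_algo, so every numeric default below is a made-up placeholder, and `build_options` is a hypothetical helper.

```python
# Placeholder defaults: only final_algo = "lr" is stated in this reference;
# every numeric value below is invented purely for illustration.
DEFAULT_OPTIONS = {
    "density.lambda": 0.5,
    "density.beta": 3.0,
    "label_fs.max": 30,
    "threshold": 0.5,
    "final_algo": "lr",
    "final_fs.max": 200,
}


def build_options(overrides=None):
    """Return a copy of the defaults with any caller-supplied overrides applied."""
    options = dict(DEFAULT_OPTIONS)
    options.update(overrides or {})
    return options


options = build_options({"final_algo": "nb", "threshold": 0.7})
```

Keys that are not overridden keep their default values, and the shared defaults are never modified.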

__sleep()

__sleep() : array

Magic method that determines which member data will be stored when serializing this class. Only lightweight summary data are stored with the serialized version of this class. The heavier-weight properties are stored in individual, compressed files.

Returns

array —

names of properties to store when serializing this instance

prepareToLabel()

prepareToLabel() 

Prepare this classifier instance for labeling. This operation requires all of the heavyweight member data save the final features and algorithm. Note that these properties are set to references to the Naive Bayes features and algorithm, so that Naive Bayes will be used to tentatively classify documents during labeling (purely to give the user some feedback on how the training set is performing).

prepareToFinalize()

prepareToFinalize() 

Prepare to train a final classification algorithm on the full training set. This operation requires the full training set and features, but not the candidate buffer used during labeling. Note that any existing final features and classification algorithm are simply zeroed out; they are only loaded from disk so that they will be written back after finalization completes.

prepareToClassify()

prepareToClassify() 

Prepare to classify new web pages. This operation requires only the final features and classification algorithm, which are expected to be defined after the finalization phase.

labelDocument()

labelDocument(string  $key, integer  $label, boolean  $is_active = true) : boolean

Updates the buffer and training set to reflect the label given to a new document. The label may be -1, 1, or 0, where the first two correspond to a negative or positive example, and the last to a skip. The handling for a skip is necessarily different from that for a positive or negative label, and matters are further complicated by the possibility that we may be changing a label for a document that's already in the training set, rather than adding a new document. This function returns true if the new label resulted in a change to the training set, and false otherwise (i.e., if the user simply skipped labeling the candidate document).

When updating an existing document, we will either need to swap the label in the training set and update the statistics stored by the Features instance (since now the features are associated with a different label), or drop the document from the training set and (again) update the statistics stored by the Features instance. In either case the negative and positive counts must be updated as well.

When working with a new document, we need to remove it from the candidate buffer, and if the label is non-zero then we also need to add the document to the training set. That involves tokenizing the document, passing the tokens through the full_features instance, and storing the resulting feature vector, plus the new label in the docs attribute. The positive and negative counts must be updated as well.

Finally, if this operation is occurring during active labeling (when the user is providing labels one at a time), that information needs to be passed along to dropBufferDoc, which can avoid doing some work in the non-active case.

Parameters

string $key

key used to select the document from the docs array

integer $label

new label (-1, 1, or 0)

boolean $is_active

whether this operation is being carried out during active labeling

Returns

boolean —

true if the training set was modified, and false otherwise
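
The bookkeeping described for labelDocument can be sketched in Python as below. This is a simplified model, not the PHP implementation: the `TrainingSet` class is hypothetical, it tracks only labels and counts (no buffer, feature vectors, or Features statistics), and treating a repeated identical label as "no change" is an assumption.

```python
class TrainingSet:
    """Hypothetical model of the label and count bookkeeping only."""

    def __init__(self):
        self.labels = {}   # document key -> label (1 or -1)
        self.positive = 0
        self.negative = 0

    def _count(self, label, delta):
        if label == 1:
            self.positive += delta
        elif label == -1:
            self.negative += delta

    def label_document(self, key, label):
        """Apply label (-1, 1, or 0 for skip); return True if the set changed."""
        old = self.labels.get(key)
        if old is None:
            if label == 0:
                return False            # new document skipped: nothing stored
            self.labels[key] = label    # new document added to the training set
            self._count(label, +1)
            return True
        if label == old:
            return False                # assumption: same label is a no-op
        self._count(old, -1)
        if label == 0:
            del self.labels[key]        # existing document dropped
        else:
            self.labels[key] = label    # existing document's label swapped
            self._count(label, +1)
        return True


ts = TrainingSet()
results = [
    ts.label_document("doc-a", 1),    # new positive example
    ts.label_document("doc-b", 0),    # skip
    ts.label_document("doc-a", -1),   # swap label
    ts.label_document("doc-a", 0),    # drop from the training set
]
```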

addAllDocuments()

addAllDocuments(object  $mix_iterator, integer  $label, integer  $limit = INF) : integer

Iterates entirely through a crawl mix iterator, adding each document (that hasn't already been labeled) to the training set with a single label. This function works by running through the iterator, filling up the candidate buffer with all unlabeled documents, then repeatedly dropping the first buffer document and adding it to the training set.

Returns the total number of newly-labeled documents.

Parameters

object $mix_iterator

crawl mix iterator to draw documents from

integer $label

label to apply to every document; -1 or 1, but NOT 0

integer $limit

optional upper bound on the number of documents to add; defaults to no limit

Returns

integer —

total number of newly-labeled documents

initBuffer()

initBuffer(object  $mix_iterator, integer  $buffer_size = null) : integer

Drops any existing candidate buffer, re-initializes the buffer structure, then calls refreshBuffer to fill it. Takes an optional buffer size, which can be used to limit the buffer to something other than the number imposed by the runtime parameter. Returns the final buffer size.

Parameters

object $mix_iterator

crawl mix iterator to draw documents from

integer $buffer_size

optional buffer size to use; defaults to the runtime parameter

Returns

integer —

final buffer size

refreshBuffer()

refreshBuffer(object  $mix_iterator, integer  $buffer_size = null) : integer

Adds as many new documents to the candidate buffer as necessary to reach the specified buffer size, which defaults to the runtime parameter.

Returns the final buffer size, which may be less than that requested if the iterator doesn't return enough documents.

Parameters

object $mix_iterator

crawl mix iterator to draw documents from

integer $buffer_size

optional buffer size to use; defaults to the runtime parameter

Returns

integer —

final buffer size

computeBufferDensities()

computeBufferDensities() 

Computes from scratch the buffer densities of the documents in the current candidate pool. This is an expensive operation that requires the computation of the KL-divergence between each ordered pair of documents in the pool, approximately O(N^2) computations, total (where N is the number of documents in the pool). The densities are saved in the buffer data structure.

The density of a document is approximated by its average overlap with every other document in the candidate buffer, where the overlap between two documents is itself approximated using the exponentiated negative KL-divergence between them. The KL-divergence is smoothed to deal with features (terms) that occur in one distribution (document) but not the other, and then multiplied by a negative constant and exponentiated in order to convert it to a kind of linear overlap score.
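
A minimal Python sketch of such a density computation follows. It is not the PHP implementation: the smoothing scheme (mixing each document's distribution with the pool-wide term distribution, weighted by lambda) is an assumption, as are the function names and the default values for `lam` and `beta`, which stand in for the unspecified DENSITY_LAMBDA and DENSITY_BETA.

```python
import math


def normalize(counts):
    """Turn a {term: frequency} dict into a probability distribution."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def buffer_densities(docs, lam=0.5, beta=3.0):
    """docs: at least two non-empty {term: frequency} dicts; 0 < lam < 1."""
    dists = [normalize(doc) for doc in docs]
    pooled = {}
    for doc in docs:
        for term, count in doc.items():
            pooled[term] = pooled.get(term, 0) + count
    pool = normalize(pooled)  # term distribution over the whole buffer

    densities = []
    for i, p in enumerate(dists):
        overlaps = []
        for j, q in enumerate(dists):
            if i == j:
                continue
            # Smoothed KL(p || q): q is mixed with the pool distribution so
            # that terms missing from q never give a zero denominator.
            kl = sum(
                p_t * math.log(p_t / (lam * q.get(term, 0.0) + (1 - lam) * pool[term]))
                for term, p_t in p.items()
            )
            overlaps.append(math.exp(-beta * kl))  # overlap score in (0, 1]
        densities.append(sum(overlaps) / len(overlaps))
    return densities


d = buffer_densities([{"a": 2, "b": 1}, {"a": 2, "b": 1}, {"c": 3}])
```

In this example the two identical documents receive the same density, and both are denser than the outlier that shares no terms with them. The nested loop over ordered pairs is the O(N^2) cost mentioned above.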

findNextDocumentToLabel()

findNextDocumentToLabel() : array

Finds the next best document for labeling amongst the documents in the candidate buffer, moves that candidate to the front of the buffer, and returns it. The best candidate is the one with the maximum product of disagreement and density, where the density has already been calculated for each document in the current pool, and the disagreement is the KL-divergence between the classification scores obtained from a committee of Naive Bayes classifiers, each sampled from the current set of features.

Returns

array —

two-element array containing first the best candidate, and second the disagreement score, obtained by dividing the disagreement for the document by the maximum disagreement possible for the committee size
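
The selection rule can be sketched in a few lines of Python, assuming the per-document disagreement and density values have already been computed. The function name and the `max_disagreement` value used in the example are hypothetical; the actual MAX_DISAGREEMENT constant depends on the committee size and is not given in this reference.

```python
def pick_candidate(disagreements, densities, max_disagreement):
    """Return (index of best candidate, its disagreement scaled to [0, 1])."""
    scores = [d * rho for d, rho in zip(disagreements, densities)]
    # The best candidate maximizes disagreement * density.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, disagreements[best] / max_disagreement


best, relative = pick_candidate(
    disagreements=[0.10, 0.60, 0.40],
    densities=[0.9, 0.2, 0.5],
    max_disagreement=0.8,  # hypothetical value for illustration
)
```

Note that the document with the highest disagreement (index 1) is not chosen here, because its low density pulls its product below that of index 2.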

train()

train(boolean  $update_accuracy = false) 

Trains the Naive Bayes classification algorithm used during labeling on the current training set, and optionally updates the estimated accuracy.

Parameters

boolean $update_accuracy

optional parameter specifying whether or not to update the accuracy estimate after training completes; defaults to false

updateAccuracy()

updateAccuracy(object  $X = null, array  $y = null) 

Estimates current classification accuracy using a Naive Bayes classification algorithm. Accuracy is estimated by splitting the current training set into fifths, reserving four fifths for training, and the remaining fifth for testing. A fresh classifier is trained and tested on these splits, and the total accuracy recorded. Then the splits are rotated so that the previous testing fifth becomes part of the training set, and one of the blocks from the previous training set becomes the testing set. A new classifier is trained and tested on the new splits, and, again, the accuracy recorded. This process is repeated until all blocks have been used for testing, and the average accuracy recorded.

Parameters

object $X

optional sparse matrix representing the already-mapped training set to use; if not provided, the current training set is mapped using the label_features property

array $y

optional array of document labels corresponding to the training set; if not provided the current training set labels are used
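
The five-fold procedure described above can be sketched generically in Python. This is an illustration rather than the PHP implementation: `train` is a hypothetical callback that takes training examples and labels and returns a prediction function, and the folds are taken as contiguous blocks, which is an assumption about how the training set is split.

```python
def cross_validated_accuracy(examples, labels, train, k=5):
    """Average accuracy over k folds; examples and labels are lists, len >= k."""
    n = len(examples)
    accuracies = []
    for fold in range(k):
        # Contiguous block reserved for testing in this round.
        start, stop = fold * n // k, (fold + 1) * n // k
        train_x = examples[:start] + examples[stop:]
        train_y = labels[:start] + labels[stop:]
        predict = train(train_x, train_y)  # fresh classifier for every fold
        correct = sum(predict(examples[i]) == labels[i] for i in range(start, stop))
        accuracies.append(correct / (stop - start))
    return sum(accuracies) / k


def train_always_positive(train_x, train_y):
    """Toy 'classifier' that ignores its training data and predicts 1."""
    return lambda example: 1


labels = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
examples = list(range(10))
accuracy = cross_validated_accuracy(examples, labels, train_always_positive)
```

With this toy classifier the per-fold accuracies are 1, 1, 0.5, 0, and 0, so the averaged estimate is 0.5.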

finalize()

finalize() 

Trains the final classification algorithm on the full training set, using a subset of the full feature set. The final algorithm will usually be logistic regression, but can be set to Naive Bayes with the appropriate runtime option. Once finalization completes, updates the `finalized' attribute.

classify()

classify(array  $page) : float

Classifies a page summary using the current final classification algorithm and features, and returns the classification score. This method is also used during the labeling phase to provide a tentative label for candidates, and in this case the final algorithm is actually a reference to a Naive Bayes instance and final_features is a reference to label_features; neither of these gets saved to disk, however.

Parameters

array $page

page summary array for the page to be classified

Returns

float —

pseudo-probability that the page is a positive instance of the target class
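
For the usual case where the final algorithm is logistic regression, the score is a logistic function of a weighted sum of feature values. The Python sketch below shows that general technique only; the weight layout, the bias term, and the function name are assumptions, and the 0.5 cut-off merely stands in for THRESHOLD, whose value is not given here.

```python
import math


def pseudo_probability(weights, bias, vector):
    """Logistic score for a sparse {feature index: value} document vector."""
    z = bias + sum(weights.get(f, 0.0) * x for f, x in vector.items())
    return 1.0 / (1.0 + math.exp(-z))


weights = {1: 1.5, 2: -0.5}           # hypothetical learned feature weights
score = pseudo_probability(weights, bias=0.0, vector={1: 2.0, 2: 2.0})
is_positive = score >= 0.5            # hard decision at an assumed threshold
```

Features with no learned weight contribute nothing to the sum, so a document made up entirely of unseen terms scores exactly 0.5 when the bias is zero.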

addBufferDoc()

addBufferDoc(array  $page, boolean  $is_active = true) 

Adds a page to the end of the candidate buffer, keeping the associated statistics up to date. During active training, each document in the buffer is tokenized, and the terms weighted by frequency; the term frequencies across documents in the buffer are tracked as well. With no active training, the buffer is simply an array of page summaries.

Parameters

array $page

page summary for the document to add to the buffer

boolean $is_active

whether this operation is part of active training, in which case some extra statistics must be maintained

dropBufferDoc()

dropBufferDoc(boolean  $is_active = true) 

Removes the document at the front of the candidate buffer. During active training the cross-document statistics for terms occurring in the document being removed are maintained.

Parameters

boolean $is_active

whether this operation is part of active training, in which case some extra statistics must be maintained

moveBufferDocToFront()

moveBufferDocToFront(integer  $i) 

Moves a document in the candidate buffer up to the front, in preparation for a label request. The document is specified by its index in the buffer.

Parameters

integer $i

document index within the candidate buffer

tokenizeDescription()

tokenizeDescription(string  $description) : array

Tokenizes a string into a map from terms to within-string frequencies.

Parameters

string $description

string to tokenize

Returns

array —

associative array mapping terms to their within-string frequencies
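
The shape of the returned map can be illustrated with a short Python sketch. The tokenization rule used here (lowercasing and splitting on non-word characters) is an assumption made for the example; the actual method may segment and normalize terms differently depending on the classifier's language.

```python
import re
from collections import Counter


def tokenize_description(description):
    """Map each term in the string to its within-string frequency."""
    terms = re.findall(r"\w+", description.lower())
    return dict(Counter(terms))


frequencies = tokenize_description("The theorem: the Pythagorean theorem.")
```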

loadProperties()

loadProperties() 

Loads class attributes from compressed, serialized files on disk, and stores their names so that they will be saved back to disk later. Each property (if it has been previously set) is stored in its own file under the classifier's data directory, named after the property. The file is compressed using gzip, but without gzip headers, so it can't actually be decompressed by the standard gzip utility. If a file doesn't exist, then the instance property is left untouched. The property names are passed as a variable number of arguments.

storeLoadedProperties()

storeLoadedProperties() 

Stores the data associated with each property name listed in the loaded_properties instance attribute back to disk. The data for each property is stored in its own serialized and compressed file, and made world-writable.

labelPage()

labelPage(array  $summary, array  $classifiers, array  &$active_classifiers, array  &$active_rankers) 

Given a page summary (passed by reference) and a list of classifiers, augments the summary meta words with the class label of each classifier that scores the summary above a threshold. This static method is used by fetchers to classify downloaded pages. In addition to the class label, the pseudo-probability that the document belongs to the class is recorded as well. This is recorded both as the score rounded down to the nearest multiple of ten, and as "<n>plus" for each multiple of ten, n, less than the score and greater than or equal to the threshold.

As an example, suppose that a classifier with class label `label' has determined that a document is a positive example with pseudo-probability 0.87 and threshold 0.5. The following meta words are added to the summary: class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus.

Parameters

array $summary

page summary to classify, passed by reference

array $classifiers

list of Classifier instances, each prepared for classifying (via the prepareToClassify method)

array &$active_classifiers

array &$active_rankers
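
The meta word scheme from the example above can be reproduced with a small Python sketch. The function name is hypothetical, and including the score's own bucket among the "plus" words follows the worked example rather than any stated rule for scores that fall exactly on a multiple of ten.

```python
import math


def class_meta_words(label, score, threshold=0.5):
    """Meta words for a page scored `score` by the classifier for `label`."""
    if score < threshold:
        return []
    # Score rounded down to a multiple of ten; the small epsilon guards
    # against floating-point error in the multiplication.
    bucket = int(score * 10 + 1e-9) * 10
    lowest = math.ceil(threshold * 100 - 1e-9)
    words = [f"class:{label}", f"class:{label}:{bucket}"]
    n = bucket
    while n >= lowest:
        words.append(f"class:{label}:{n}plus")
        n -= 10
    return words


words = class_meta_words("label", 0.87, threshold=0.5)
```

For a score of 0.87 and a threshold of 0.5 this yields class:label, class:label:80, class:label:80plus, class:label:70plus, class:label:60plus, and class:label:50plus, matching the example.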

getClassifierList()

getClassifierList() : array

Returns an array of classifier instances currently stored in the classifiers directory. The array maps class labels to their corresponding classifiers, and each classifier is a minimal instance, containing only summary statistics.

Returns

array —

associative array of class labels mapped to their corresponding classifier instances

getClassifier()

getClassifier(string  $label) : object

Returns the minimal classifier instance corresponding to a class label, or null if no such classifier exists on disk.

Parameters

string $label

classifier's class label

Returns

object —

classifier instance with the relevant class label, or null if no such classifier exists on disk

loadClassifiersData()

loadClassifiersData(array  $labels) : array

Given a list of class labels, returns an array mapping each class label to an array of data necessary for initializing a classifier for that label. This static method is used to prepare a collection of classifiers for distribution to fetchers, so that each fetcher can classify pages as it downloads them. The only extra properties passed along in addition to the base classification data are the final features and final algorithm, both necessary for classifying new documents.

Parameters

array $labels

flat array of class labels for which to load data

Returns

array —

associative array mapping class labels to arrays of data necessary for initializing the associated classifier

newClassifierFromData()

newClassifierFromData(array  $data) : object

The dual of loadClassifiersData, this static method reconstitutes a Classifier instance from an array containing the necessary data. This gets called by each fetcher, using the data that it receives from the name server when establishing a new crawl.

Parameters

array $data

associative array mapping property names to their serialized and compressed data

Returns

object —

Classifier instance built from the passed-in data

setClassifier()

setClassifier(object  $classifier) 

Stores a classifier instance to disk, first separating it out into individual files containing serialized and compressed property data. The basic classifier information, such as class label and summary statistics, is stored uncompressed in a file called `classifier.txt'.

The classifier directory and all of its contents are made world-writable so that they can be manipulated without hassle from the command line.

Parameters

object $classifier

Classifier instance to store to disk

deleteClassifier()

deleteClassifier(string  $label) 

Deletes the directory corresponding to a class label, and all of its contents. In the case that there is no classifier with the passed in label, does nothing.

Parameters

string $label

class label of the classifier to be deleted

cleanLabel()

cleanLabel(string  $label) 

Removes all but alphanumeric characters and underscores from a label, so that it may be easily saved to disk and used in queries as a meta word.

Parameters

string $label

class label to clean
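
A one-line Python equivalent of this cleaning step is sketched below. Restricting "alphanumeric" to ASCII letters and digits is an assumption; the PHP method may treat non-ASCII characters differently.

```python
import re


def clean_label(label):
    """Keep only ASCII letters, digits, and underscores."""
    return re.sub(r"[^A-Za-z0-9_]", "", label)


cleaned = clean_label("not spam!")
```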

getCrawlMixName()

getCrawlMixName(string  $label) : string

Returns a name for the crawl mix associated with a class label.

Parameters

string $label

class label associated with the crawl mix

Returns

string —

name that can be used for the crawl mix associated with $label

makeKey()

makeKey(array  $page) : string

Returns a key that can be used internally to refer to a particular page summary.

Parameters

array $page

page summary to return a key for

Returns

string —

key that uniquely identifies the page summary

klDivergenceToMean()

klDivergenceToMean(array  $ps) : float

Calculates the KL-divergence to the mean for a collection of discrete two-element probability distributions. Each distribution is specified by a single probability, p, since the second probability is just 1 - p. The KL-divergence to the mean is used as a measure of disagreement between members of a committee of classifiers, where each member assigns a classification score to the same document.

Parameters

array $ps

probabilities describing several discrete two-element probability distributions

Returns

float —

KL-divergence to the mean for the collection of distributions
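
As an illustration, the standard definition of KL-divergence to the mean can be written in Python as follows. This sketch assumes that definition (the average, over committee members, of the KL-divergence from each member's distribution to the committee's mean distribution) and natural logarithms; it is not taken from the PHP source.

```python
import math


def kl_divergence_to_mean(ps):
    """ps: probabilities p, each describing the distribution (p, 1 - p)."""
    k = len(ps)
    mean = sum(ps) / k
    total = 0.0
    for p in ps:
        # Sum over both outcomes; terms with zero probability contribute 0.
        for p_i, m_i in ((p, mean), (1.0 - p, 1.0 - mean)):
            if p_i > 0.0:
                total += p_i * math.log(p_i / m_i)
    return total / k


agree = kl_divergence_to_mean([0.7, 0.7, 0.7])
disagree = kl_divergence_to_mean([0.0, 1.0])
```

A committee in full agreement scores zero (up to rounding), while two members with opposite, fully confident scores reach log 2, about 0.693.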