\seekquarry\yioop\executables\ClassifierTool

Class used to encapsulate all the activities of the ClassifierTool.php command line script. This script allows one to automate the building and testing of classifiers, providing an alternative to the web interface when a labeled training set is available.

Summary

Public methods: __construct(), parseOptions(), main(), runTrainAndTest(), runActiveTrainAndTest(), makeFreshClassifier(), deleteClassifier(), loadDataset(), isTestPoint(), testClassifier(), log(), logOptions(), setOptions(), setDefault()

Public properties: $options

Protected methods: none

Protected properties: $classifier_controller, $crawl_model

Private methods: none

Private properties: none

Constants: none

Properties

$options

$options : array

Options to be used by activities and constructed classifiers. These options can be overridden by supplying an appropriate flag on the command line, where nesting is denoted by a period (e.g., cls.chi2.max).

The supported options are:

debug: An integer, the level of debug statements to print. Larger integers specify more detailed debug output; the default value of 0 indicates no debug output.

max_train: An integer, the maximum number of examples to use when training a classifier. The default value of null indicates that all available training examples should be used.

test_interval: An integer, the number of new training examples to be added before a round of testing on ALL test instances is to be executed. With an interval of 5, for example, after adding five new training examples, the classifier would be finalized and used to classify all test instances. The error is reported for each round of testing. The default value of null indicates that testing should only occur after all training examples have been added.

split: An integer, the number of examples from the entire set of labeled examples to use for training. The remainder are used for testing.

cls.use_nb: A boolean, whether or not to use the Naive Bayes classification algorithm instead of the logistic regression one in order to finalize the classifier. The default value is false, indicating that logistic regression should be used.

cls.chi2.max: An integer, the maximum number of features to use when training the classifier. The default is a relatively conservative 200.

Type

array
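The period-separated naming used for $options can be illustrated with a small sketch. The nested layout below is an assumption inferred from names such as cls.chi2.max, and lookupOption() is a hypothetical helper, not a method of this class. Only the defaults stated above are included; split is left out because no default is given for it.

```php
<?php
// Assumed nested layout of the documented defaults.
$options = [
    'debug' => 0,
    'max_train' => null,
    'test_interval' => null,
    'cls' => [
        'use_nb' => false,
        'chi2' => ['max' => 200],
    ],
];

/**
 * Hypothetical helper: resolves a period-separated option name against a
 * nested array, returning $fallback when any component is missing.
 */
function lookupOption(array $options, string $dotted_name, $fallback = null)
{
    $node = $options;
    foreach (explode('.', $dotted_name) as $key) {
        if (!is_array($node) || !array_key_exists($key, $node)) {
            return $fallback;
        }
        $node = $node[$key];
    }
    return $node;
}
```

With this layout, lookupOption($options, 'cls.chi2.max') walks cls, then chi2, then max, and yields the documented default of 200.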

$classifier_controller

$classifier_controller : object

Reference to a classifier controller, used to manipulate crawl mixes in the same way that the controller that handles web requests does.

Type

object

$crawl_model

$crawl_model : object

Reference to a crawl model object, also used to manipulate crawl mixes.

Type

object

Methods

__construct()

__construct() 

Initializes the classifier controller and crawl model used to manage crawl mixes, which in turn are used to iterate over labeled examples.

parseOptions()

parseOptions() : array

Parses the command-line options, returns the required arguments, and updates the member variable $options with any parameters. If any of the required arguments (activity, dataset, or label) are missing, then a message is printed and the program exits. The optional arguments used to set parameters directly modify the class state through the setOptions method.

Returns

array —

the parsed activity, dataset, and label

main()

main() 

Parses the options, and if an appropriate activity exists, calls the activity, passing in the label and dataset to be used; otherwise, prints an error and exits.
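The dispatch step might look roughly like the sketch below. It assumes that an activity name such as TrainAndTest maps to a method named run followed by that name, which matches the method names on this page but is not confirmed by it. ActivityRunner and dispatch() are hypothetical stand-ins.

```php
<?php
// Hypothetical stand-in for the tool class; only the dispatch logic matters.
class ActivityRunner
{
    public function runTrainAndTest(string $label, string $dataset_name): string
    {
        return "TrainAndTest($label, $dataset_name)";
    }

    /**
     * Calls the method for $activity, or returns null when no such
     * activity exists (the real script prints an error and exits instead).
     */
    public function dispatch(string $activity, string $label,
        string $dataset_name): ?string
    {
        $method = "run" . $activity; // assumed naming convention
        if (!method_exists($this, $method)) {
            return null;
        }
        return $this->$method($label, $dataset_name);
    }
}
```

Looking the method up by name means a new activity only needs a new run-prefixed method, with no change to the dispatcher.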

runTrainAndTest()

runTrainAndTest(string  $label, string  $dataset_name) 

Trains a classifier on a data set, testing at the specified intervals.

The testing interval is set by the test_interval parameter. Each time this activity is run, a new classifier is created (replacing an old one with the same label, if necessary), and the classifier is left in place when the activity finishes.

Parameters

string $label

class label of the new classifier

string $dataset_name

name of the dataset to train and test on

runActiveTrainAndTest()

runActiveTrainAndTest(string  $label, string  $dataset_name) 

Like the TrainAndTest activity, but uses active training in order to choose the documents to add to the training set. The method simulates the process that an actual user would go through in order to label documents for addition to the training set, then tests performance at the specified intervals.

Parameters

string $label

class label of the new classifier

string $dataset_name

name of the dataset to train and test on

makeFreshClassifier()

makeFreshClassifier(string  $label) : object

Creates a new classifier for a label, first deleting any existing classifier with the same label.

Parameters

string $label

class label of the new classifier

Returns

object —

created classifier instance

deleteClassifier()

deleteClassifier(string  $label) 

Deletes an existing classifier, specified by its label.

Parameters

string $label

class label of the existing classifier

loadDataset()

loadDataset(string  $dataset_name, string  $class_label) : array

Fetches the summaries for pages in the indexes specified by the passed dataset name. This method looks for existing indexes whose names start with the dataset name prefix and end with the suffix "pos" or "neg" (ignoring case). The pages in these indexes are shuffled into one large array, and augmented with a TRUE_LABEL field that records which set they came from originally. The shuffled array is then split according to the `split' option: all pages up to (but not including) the split index are used for the training set, and the remaining pages are used for the test set.

Parameters

string $dataset_name

prefix of index names to draw examples from

string $class_label

class label of the classifier the examples will be used to train (used to name the crawl mix that iterates over each index)

Returns

array —

training and test datasets in an associative array with keys `train' and `test', where each dataset is wrapped in a PageIterator that implements the CrawlMixIterator interface.
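The label, shuffle, and split steps described for loadDataset() can be sketched on plain arrays. splitLabeledPages() is a hypothetical helper, the TRUE_LABEL values 1 and -1 are assumed for illustration, and the result holds plain arrays rather than PageIterator objects.

```php
<?php
/**
 * Hypothetical helper: tags each page with the set it came from, shuffles
 * the combined list, and splits it at index $split.
 */
function splitLabeledPages(array $pos_pages, array $neg_pages,
    int $split): array
{
    $pages = [];
    foreach ($pos_pages as $page) {
        $page['TRUE_LABEL'] = 1; // assumed value for the "pos" index
        $pages[] = $page;
    }
    foreach ($neg_pages as $page) {
        $page['TRUE_LABEL'] = -1; // assumed value for the "neg" index
        $pages[] = $page;
    }
    shuffle($pages);
    return [
        // pages 0 .. $split - 1 go to training, the rest to testing
        'train' => array_slice($pages, 0, $split),
        'test' => array_slice($pages, $split),
    ];
}
```

Because the shuffle happens before the split, both sets draw from the positive and negative examples at random rather than in index order.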

isTestPoint()

isTestPoint(integer  $i, integer  $total) : boolean

Determines whether to run a classification test after a certain number of documents have been added to the training set. Whether or not to test is determined by the `test_interval' option, which may be either null, an integer, or a string. In the first case, testing only occurs after all training examples have been added; in the second case, testing occurs each time an additional constant number of training examples have been added; and in the final case, testing occurs on a fixed schedule of comma-separated offsets, such as "10,25,50,100".

Parameters

integer $i

the size of the current training set

integer $total

the total number of documents available to be added to the training set

Returns

boolean —

true if the `test_interval' option specifies that a round of testing should occur for the current training offset, and false otherwise
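The three cases can be sketched as a standalone function. isTestPointSketch() is hypothetical and takes the interval as an argument instead of reading it from $options. Treating "a constant number of examples" as a modulus test is an assumption, as is the guard against a non-positive interval.

```php
<?php
/**
 * Hypothetical sketch of the test schedule.
 * $test_interval may be null, an integer, or a comma-separated string.
 */
function isTestPointSketch($test_interval, int $i, int $total): bool
{
    if ($test_interval === null) {
        // test only once every training example has been added
        return $i === $total;
    }
    if (is_int($test_interval)) {
        // test every $test_interval examples (assumed modulus check)
        return $test_interval > 0 && $i % $test_interval === 0;
    }
    // fixed schedule such as "10,25,50,100"
    $offsets = array_map('intval', explode(',', $test_interval));
    return in_array($i, $offsets, true);
}
```

Whether the integer case also forces a final round once $i reaches $total is not stated above, so the sketch does not add one.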

testClassifier()

testClassifier(object  $classifier, array  $data) 

Finalizes the current classifier, uses it to classify all test documents, and logs the classification error. The current classifier is saved to disk after finalizing (though not before), and left in `classify' mode. The iterator over the test dataset is reset for the next round of testing (if any).

Parameters

object $classifier

classifier instance to test

array $data

the array of training and test datasets, constructed by loadDataset, of which only the `test' dataset is used.
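The page does not define how the classification error is computed. One common reading, assumed here, is the fraction of test documents whose predicted label differs from their true label. classificationError() is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: fraction of entries in $actual whose prediction is
 * missing or different. Returns 0.0 for an empty test set.
 */
function classificationError(array $predicted, array $actual): float
{
    $total = count($actual);
    if ($total === 0) {
        return 0.0;
    }
    $wrong = 0;
    foreach ($actual as $key => $label) {
        if (!array_key_exists($key, $predicted)
            || $predicted[$key] !== $label) {
            $wrong++;
        }
    }
    return $wrong / $total;
}
```

For example, one wrong prediction out of four test documents gives an error of 0.25.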

log()

log() 

Writes out logging information according to a detail level. The first argument is an integer (potentially negative) indicating the level of detail for the log message, where larger numbers indicate greater detail. Each message is prefixed with a character according to its level of detail, but if the detail level is greater than the level specified by the `debug' option then nothing is printed. The treatment of the available detail levels is as follows:

-2: Used for errors; always printed; prefix '! '

-1: Used for log of set options; always printed; prefix '# '

0+: Used for normal messages; prefix '> '

The second argument is a printf-style string template specifying the message, and each following (optional) argument is used by the template. A newline is added automatically to each message.
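The level filtering and prefixing can be sketched as a function that returns the line instead of printing it. formatLogMessage() is hypothetical and takes the debug level as its first argument. Levels below -2 are not described above, so the sketch gives them the '> ' prefix.

```php
<?php
/**
 * Hypothetical sketch: builds the log line for a message, or returns null
 * when $level is more detailed than $debug allows.
 */
function formatLogMessage(int $debug, int $level, string $template,
    ...$args): ?string
{
    if ($level > $debug) {
        return null; // too detailed for the current debug setting
    }
    if ($level === -2) {
        $prefix = '! '; // errors
    } elseif ($level === -1) {
        $prefix = '# '; // option listings
    } else {
        $prefix = '> '; // normal messages
    }
    // printf-style substitution, with the newline added automatically
    return $prefix . vsprintf($template, $args) . "\n";
}
```

With the default debug level of 0, levels -2, -1, and 0 all pass the filter, which is consistent with errors and option listings always being printed.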

logOptions()

logOptions(string  $root = null, string  $prefix = '') 

Logs the current options using the log method of this class. This method is used to explicitly state which settings were used for a given run of an activity. The detail level passed to the log method is -1.

Parameters

string $root

folder to write to

string $prefix

prefix (such as Warning) to put at the start of the log message

setOptions()

setOptions(string|array  $opts, string  $converter = null) 

Sets one or more options of the form NAME=VALUE according to a converter such as intval, floatval, and so on. The options may be passed in either as a string (a single option) or as an array of strings, where each string corresponds to an option of the same type (e.g., int).

Parameters

string|array $opts

single option in the format NAME=VALUE, or array of options, each for the same target type (e.g., int)

string $converter

the name of a function that takes a string and casts it to a particular type (e.g., intval, floatval)
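The NAME=VALUE handling can be sketched as a pure function that returns the parsed pairs instead of updating $options. parseOptionStrings() is hypothetical, and skipping strings without an equals sign is an assumption, not documented behavior.

```php
<?php
/**
 * Hypothetical sketch: parses one NAME=VALUE string, or an array of them,
 * applying $converter (e.g. 'intval') to each value when one is given.
 */
function parseOptionStrings($opts, ?string $converter = null): array
{
    $parsed = [];
    foreach ((array) $opts as $opt) {
        // split on the first '=' only, so values may contain '='
        $parts = explode('=', $opt, 2);
        if (count($parts) !== 2) {
            continue; // assumed handling of malformed input
        }
        [$name, $value] = $parts;
        $parsed[$name] = $converter === null ? $value : $converter($value);
    }
    return $parsed;
}
```

Passing the converter by name lets a single code path serve every target type: 'intval' for integer options, 'floatval' for floats, and no converter for plain strings.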

setDefault()

setDefault(string  $name, string  $value) 

Sets a default value for a runtime parameter. This method is used by activities to specify default values that may be overridden by passing the appropriate command-line flag.

Parameters

string $name

name whose final component is the runtime parameter to set

string $value

default value to assign to the parameter
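One way to get "default unless overridden" behavior is to write the value only when the option is not already present. The sketch below assumes that ordering (command-line values stored first, defaults applied afterwards) and a period-separated name. setDefaultSketch() is hypothetical and is not the class's implementation.

```php
<?php
/**
 * Hypothetical sketch: sets $value at the period-separated path $name
 * unless a value is already stored there.
 */
function setDefaultSketch(array &$options, string $name, $value): void
{
    $node = &$options;
    $keys = explode('.', $name);
    $last = array_pop($keys); // final component names the parameter
    foreach ($keys as $key) {
        if (!isset($node[$key]) || !is_array($node[$key])) {
            $node[$key] = []; // create missing intermediate levels
        }
        $node = &$node[$key];
    }
    if (!array_key_exists($last, $node)) {
        $node[$last] = $value;
    }
}
```

Under this assumption, a value supplied on the command line for cls.chi2.max survives a later call that sets a default for the same name.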