\seekquarry\yioop\library\classifiers\LassoRegression

Implements the logistic regression text classification algorithm using lasso regression and a cyclic coordinate descent optimization step.

This algorithm converges slowly on large datasets or with a large number of features, but it provides regularization to combat overfitting, and it outperforms Naive Bayes in tests on the same data set. The algorithm augments a standard cyclic coordinate descent approach by "sleeping" features that don't change significantly during a single step. Each time an optimization step for a feature doesn't change the feature weight by more than some threshold, that feature is forced to sit out the next optimization round. The threshold increases over successive rounds, effectively placing an upper limit on the number of iterations over all features, while simultaneously limiting the number of features updated in each round. This optimization speeds up convergence, but at the cost of some accuracy.
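The sleeping schedule described above can be illustrated with a minimal Python sketch. It is not the library's PHP code: the function name `run_rounds`, the `update` callback, the starting threshold, and the doubling growth factor are all assumptions made for illustration.

```python
def run_rounds(update, num_features, threshold=1e-3, growth=2.0, rounds=5):
    """Cyclic coordinate descent rounds with 'sleeping' features.

    update(j) is a hypothetical callback that performs one optimization
    step for feature j and returns the change in that feature's weight.
    Returns, for each round, the list of features that were updated.
    """
    asleep = set()          # features sitting out the current round
    history = []
    for _ in range(rounds):
        updated, next_asleep = [], set()
        for j in range(num_features):
            if j in asleep:
                continue                # sleeping: skip this round only
            change = update(j)
            updated.append(j)
            if abs(change) <= threshold:
                next_asleep.add(j)      # barely moved: sleep next round
        history.append(updated)
        asleep = next_asleep
        threshold *= growth             # assumed schedule: threshold grows
    return history


# Toy problem: each step moves a weight halfway toward a fixed target.
weights = [0.0, 0.0]
targets = [0.0, 1.0]


def toy_update(j):
    change = 0.5 * (targets[j] - weights[j])
    weights[j] += change
    return change


history = run_rounds(toy_update, num_features=2, rounds=3)
```

In this toy run, feature 0 never moves, so it is updated only in alternate rounds, while feature 1 keeps changing by more than the threshold and is updated every round.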

Summary

Public methods: log(), train(), classify(), computeApproxLikelihood(), score(), estimateLambdaNorm()
Public properties: $debug, $epsilon, $lambda, $beta
Constants: none
Protected methods and properties: none
Private methods and properties: none

Properties

$debug

$debug : integer

Level of detail to be used for logging. Higher values mean more detail.

Type

integer

$epsilon

$epsilon : float

Threshold used to determine convergence.

Type

float

$lambda

$lambda : float

Lambda parameter of the CLG algorithm, which sets the strength of the Laplace-prior (lasso) regularization.

Type

float

$beta

$beta : array

Beta vector of feature weights resulting from the training phase. The dot product of this vector with a feature vector yields the log-odds that the feature vector describes a document belonging to the trained-for class.

Type

array

Methods

log()

log(string  $message) 

Writes a message to the log file, depending on the debug level for this subpackage.

Parameters

string $message

what to write to the log

train()

train(object  $X, array  $y) 

An adaptation by Genkin et al. of the Zhang-Oles 2001 CLG algorithm that uses the Laplace prior for parameter regularization. On completion, the beta vector has been optimized to maximize the likelihood of the data set.

Parameters

object $X

SparseMatrix representing the training dataset

array $y

array of known labels corresponding to the rows of $X
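To make the role of the Laplace prior concrete, here is a hedged Python sketch of a single coordinate update, following the update rule published by Genkin et al. rather than the Yioop PHP source, which may differ in detail. The function name `lasso_step` and its parameters are hypothetical. `numer` and `denom` stand for the numerator and denominator of the approximate likelihood for one feature, `lam` for the lambda parameter, and `trust` for the feature's trust region.

```python
def lasso_step(beta_j, numer, denom, lam, trust):
    """One coordinate update under a Laplace (lasso) prior.

    Returns (step, new_trust): the change to apply to beta_j and the
    trust-region width to use for this feature in the next round.
    """
    if denom == 0.0:
        return 0.0, trust                   # feature has no usable entries
    if beta_j == 0.0:
        step = (numer - lam) / denom        # try the positive direction
        if step <= 0.0:
            step = (numer + lam) / denom    # try the negative direction
            if step >= 0.0:
                step = 0.0                  # neither helps: stay at zero
    else:
        sign = 1.0 if beta_j > 0.0 else -1.0
        step = (numer - lam * sign) / denom
        if sign * (beta_j + step) < 0.0:
            step = -beta_j                  # do not cross zero in one step
    step = max(-trust, min(trust, step))    # stay inside the trust region
    new_trust = max(2.0 * abs(step), trust / 2.0)
    return step, new_trust
```

A weight that starts at zero only moves when the gradient term exceeds `lam` in magnitude, which is how the lasso penalty keeps uninformative features at exactly zero.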

classify()

classify(array  $x) 

Returns the pseudo-probability that a new instance is a positive example of the class the beta vector was trained to recognize. It only makes sense to try classification after at least some training has been done on a dataset that includes both positive and negative examples of the target class.

Parameters

array $x

feature vector represented by an associative array mapping features to their weights
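As a rough illustration of how a pseudo-probability can be obtained from the beta vector, the following Python sketch takes the sparse dot product of the weights with the feature vector and passes it through the logistic function. This is the standard logistic regression prediction rule, not a transcription of the PHP method. Representing `beta` as a dictionary is an assumption.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(beta, x):
    """Pseudo-probability that x is a positive example.

    beta: dict mapping feature -> learned weight (assumed representation)
    x:    dict mapping feature -> weight in the new instance
    """
    # Features never seen in training contribute nothing to the score.
    log_odds = sum(w * beta.get(f, 0.0) for f, w in x.items())
    return sigmoid(log_odds)
```

With an all-zero beta vector every instance scores 0.5, which is one way to see why classification is only meaningful after some training.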

computeApproxLikelihood()

computeApproxLikelihood(object  $Xj, array  $y, array  $r, float  $d) : array

Computes the approximate likelihood of y given a single feature, and returns it as a pair <numerator, denominator>.

Parameters

object $Xj

iterator over the non-zero entries in column j of the data

array $y

labels corresponding to entries in $Xj; each label is 1 if example i has the target label, and -1 otherwise

array $r

cached dot products of the beta vector and feature weights for each example i

float $d

trust region for feature j

Returns

array —

two-element array containing the numerator and denominator of the likelihood
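The numerator and denominator can be sketched in Python as follows, based on the formulas given by Genkin et al. for the trust-region bound. This is not the Yioop PHP code. The sketch assumes that `xj` is an iterable of `(row index, value)` pairs and that `r[i]` holds the dot product of beta with example i, not yet multiplied by the label.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def compute_approx_likelihood(xj, y, r, d):
    """Return (numerator, denominator) for one feature column.

    xj: iterable of (i, x_ij) for the non-zero entries of column j
    y:  labels, +1 or -1, indexed by example
    r:  cached dot products beta . x_i, indexed by example
    d:  trust-region width for feature j
    """
    numer = 0.0
    denom = 0.0
    for i, x_ij in xj:
        # Gradient term: x_ij * y_i / (1 + exp(y_i * r_i)).
        numer += x_ij * y[i] * sigmoid(-y[i] * r[i])
        # Upper bound on the curvature anywhere within the trust region.
        a = abs(r[i])
        delta = d * abs(x_ij)
        if a <= delta:
            f = 0.25
        else:
            gap = min(a - delta, 700.0)     # clamp to keep exp() finite
            f = 1.0 / (2.0 + math.exp(gap) + math.exp(-gap))
        denom += x_ij * x_ij * f
    return numer, denom
```

The lasso penalty itself is not applied here. In this sketch it is left to the caller, which adjusts the numerator by lambda before dividing.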

score()

score(array  $r, array  $y, array  $beta) : float

Computes an approximate score that can be used to get an idea of how much a given optimization step improved the likelihood of the data set.

Parameters

array $r

cached dot products of the beta vector and feature weights for each example i

array $y

labels for each example

array $beta

beta vector of feature weights (used to penalize large weights)

Returns

float —

value proportional to the likelihood of the data, penalized by the magnitude of the beta vector
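One plausible form of such a score is the penalized log-likelihood, sketched below in Python. The exact expression used by the library is not stated in this documentation, so treat this as an assumption. The `lam` argument is hypothetical, since the documented signature takes no lambda. `r[i]` is assumed to hold the raw dot product for example i.

```python
import math


def score(r, y, beta, lam=1.0):
    """Penalized log-likelihood: higher is better.

    r:    cached dot products beta . x_i, indexed by example
    y:    labels, +1 or -1, indexed by example
    beta: feature weights (list of floats)
    lam:  assumed lasso penalty weight
    """
    log_lik = 0.0
    for r_i, y_i in zip(r, y):
        m = y_i * r_i                       # margin of example i
        # -log(1 + exp(-m)), computed without overflow for either sign.
        if m >= 0.0:
            log_lik -= math.log1p(math.exp(-m))
        else:
            log_lik -= -m + math.log1p(math.exp(m))
    penalty = lam * sum(abs(b) for b in beta)
    return log_lik - penalty
```

Comparing this value before and after an optimization round gives a rough signal of progress without evaluating the classifier on held-out data.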

estimateLambdaNorm()

estimateLambdaNorm(object  $invX) : float

Estimates the lambda parameter from the dataset.

Parameters

object $invX

inverted X matrix for dataset (essentially a posting list of features in X)

Returns

float —

lambda estimate
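The method name suggests a norm-based estimate. The Python sketch below assumes the default heuristic suggested by Genkin et al.: set the prior variance to the number of features divided by the mean squared Euclidean norm of the training examples, then convert that variance to a Laplace lambda. Whether the PHP method computes exactly this is not confirmed by this documentation. The `num_docs` argument and the dictionary layout of the posting lists are hypothetical.

```python
import math


def estimate_lambda_norm(postings, num_docs):
    """Norm-based lambda estimate from an inverted matrix.

    postings: dict mapping feature -> list of (doc index, weight)
    num_docs: number of training examples (hypothetical extra argument)
    """
    # The sum of squared weights over all postings equals the sum of the
    # squared norms of all examples, so the row layout is not needed.
    sum_sq = sum(w * w for plist in postings.values() for _, w in plist)
    if sum_sq == 0.0 or num_docs == 0 or not postings:
        return 0.0                          # nothing to estimate from
    mean_sq_norm = sum_sq / num_docs
    sigma_sq = len(postings) / mean_sq_norm # assumed prior variance
    # A Laplace prior with parameter lambda has variance 2 / lambda^2.
    return math.sqrt(2.0 / sigma_sq)
```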