$vocab
$vocab : array
Maps terms to their feature indices, which start at 1.
A concrete Features subclass that represents a document as a binary vector where a one indicates that a feature is present in the document, and a zero indicates that it is not. The absent features are ignored, so the binary vector is actually sparse, containing only those feature indices where the value is one.
Each document in the training set is expected to be fed through an instance of a subclass of this abstract class in order to convert it to a feature vector. Terms are replaced with feature indices (e.g., 'Pythagorean' => 1, 'theorem' => 2, and so on), which are contiguous. The value at a feature index is determined by the subclass; one might weight terms according to how often they occur in the document, while another might use a simple binary representation. The feature index 0 is reserved for an intercept term, which always has a value of one.
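The difference between two possible weighting schemes can be sketched as follows. This is a minimal illustration only: the function names `binaryWeights` and `frequencyWeights` are hypothetical, and storing the intercept explicitly at index 0 of the vector is an assumption about how a subclass might represent it.

```php
<?php
// Hypothetical helpers for illustration; not part of the documented class.

/** Binary weighting: every feature that is present gets the value 1. */
function binaryWeights(array $counts): array
{
    $vector = [0 => 1]; // assumption: the intercept is stored at index 0
    foreach ($counts as $index => $count) {
        if ($count > 0) {
            $vector[$index] = 1;
        }
    }
    ksort($vector);
    return $vector;
}

/** Frequency weighting: each feature gets its share of the document's terms. */
function frequencyWeights(array $counts): array
{
    $total = array_sum($counts);
    $vector = [0 => 1]; // assumption: the intercept is stored at index 0
    foreach ($counts as $index => $count) {
        if ($count > 0) {
            $vector[$index] = $count / $total;
        }
    }
    ksort($vector);
    return $vector;
}

// 'Pythagorean' => 1 occurs three times, 'theorem' => 2 occurs once.
$counts = [1 => 3, 2 => 1];
$binary = binaryWeights($counts);      // [0 => 1, 1 => 1, 2 => 1]
$weighted = frequencyWeights($counts); // [0 => 1, 1 => 0.75, 2 => 0.25]
```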
addExample(array $terms, integer $label) : array
Maps a new example to a feature vector, adding any new terms to the vocabulary, and updating term and label statistics. The example should be an array of terms and their counts, and the output simply replaces terms with feature indices.
array | $terms | array of terms mapped to the number of times they occur in the example |
integer | $label | label for this example, either -1 or 1 |
input example with terms replaced by feature indices
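A rough sketch of the behaviour described for addExample, under stated assumptions: the class name `VocabSketch` and the properties `$labelCounts` and `$termCounts` are hypothetical, and only `$vocab` is named in this documentation. The real class may store its statistics differently.

```php
<?php
// Hypothetical stand-in for the documented class; every name except
// $vocab and addExample is an assumption made for this illustration.
class VocabSketch
{
    public array $vocab = [];                      // term => feature index, starting at 1
    public array $labelCounts = [-1 => 0, 1 => 0]; // label => # examples
    public array $termCounts = [];                 // feature index => [label => # examples]

    public function addExample(array $terms, int $label): array
    {
        $features = [];
        foreach ($terms as $term => $count) {
            if (!isset($this->vocab[$term])) {
                // Assign the next contiguous index; index 0 stays reserved.
                $this->vocab[$term] = count($this->vocab) + 1;
            }
            $j = $this->vocab[$term];
            $features[$j] = $count;
            // A term counts once per example, regardless of its frequency.
            $this->termCounts[$j][$label] = ($this->termCounts[$j][$label] ?? 0) + 1;
        }
        $this->labelCounts[$label]++;
        return $features;
    }
}

$sketch = new VocabSketch();
$first = $sketch->addExample(['Pythagorean' => 2, 'theorem' => 1], 1);
// $first is [1 => 2, 2 => 1]
$second = $sketch->addExample(['theorem' => 4, 'proof' => 1], -1);
// $second is [2 => 4, 3 => 1]
```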
updateExampleLabel(array $features, integer $old_label, integer $new_label)
Updates the label and term statistics to reflect a label change for an example from the training set. A new label of 0 indicates that the example is being removed entirely. Note that term statistics only count one occurrence of a term per example.
array | $features | feature vector from when the example was originally added |
integer | $old_label | old example label in {-1, 1} |
integer | $new_label | new example label in {-1, 0, 1}, where 0 indicates that the example should be removed entirely |
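The bookkeeping for a label change might look roughly like the sketch below. The free function `relabelExample` and the layout of `$labelCounts` and `$termCounts` are assumptions made for illustration; only the behaviour (one count per term per example, and 0 meaning removal) comes from the description above.

```php
<?php
// Hypothetical free function; the statistics layout is an assumption.
function relabelExample(
    array &$labelCounts,
    array &$termCounts,
    array $features,
    int $oldLabel,
    int $newLabel
): void {
    $labelCounts[$oldLabel]--;
    if ($newLabel !== 0) {
        $labelCounts[$newLabel]++;
    }
    foreach (array_keys($features) as $j) {
        // Each term counts once per example, whatever its frequency.
        $termCounts[$j][$oldLabel]--;
        if ($newLabel !== 0) {
            $termCounts[$j][$newLabel] = ($termCounts[$j][$newLabel] ?? 0) + 1;
        }
    }
}

// Two examples labelled 1: one contains features 1 and 2, the other only 1.
$labelCounts = [-1 => 0, 1 => 2];
$termCounts = [1 => [1 => 2], 2 => [1 => 1]];

// Flip the first example from label 1 to label -1.
relabelExample($labelCounts, $termCounts, [1 => 3, 2 => 1], 1, -1);

// Remove the second example entirely (new label 0).
relabelExample($labelCounts, $termCounts, [1 => 5], 1, 0);
```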
varStats(integer $j, integer $label) : array
Returns the statistics for a particular feature and label in the training set. The statistics are counts of how often the term appears or fails to appear in examples with or without the target label. They are returned in a flat array, in the following order:
0 => # examples where feature present, label matches
1 => # examples where feature present, label doesn't match
2 => # examples where feature absent, label matches
3 => # examples where feature absent, label doesn't match
integer | $j | feature index |
integer | $label | target label |
feature statistics in 4-element flat array
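The four counts form a 2x2 contingency table, and one way to derive them from per-label and per-feature example counts is sketched below. The function `varStatsSketch` and its two array arguments are hypothetical. The documented method takes only `$j` and `$label`, and presumably reads its statistics from the instance.

```php
<?php
// Hypothetical helper; $labelCounts maps label => # examples, and
// $termCounts maps feature index => [label => # examples containing it].
function varStatsSketch(array $labelCounts, array $termCounts, int $j, int $label): array
{
    $total = array_sum($labelCounts);
    $withLabel = $labelCounts[$label] ?? 0;
    $present = array_sum($termCounts[$j] ?? []);
    $presentMatch = $termCounts[$j][$label] ?? 0;
    $presentOther = $present - $presentMatch;
    return [
        $presentMatch,                          // 0: present, label matches
        $presentOther,                          // 1: present, label doesn't match
        $withLabel - $presentMatch,             // 2: absent, label matches
        ($total - $withLabel) - $presentOther,  // 3: absent, label doesn't match
    ];
}

// 10 examples: 6 labelled 1 and 4 labelled -1. Feature 7 occurs in
// 4 of the examples labelled 1 and in 1 of the examples labelled -1.
$labelCounts = [1 => 6, -1 => 4];
$termCounts = [7 => [1 => 4, -1 => 1]];
$stats = varStatsSketch($labelCounts, $termCounts, 7, 1); // [4, 1, 2, 3]
```

The four entries always sum to the total number of examples, which is a useful sanity check.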
restrict(object $fs) : object
Given a FeatureSelection instance, return a new clone of this Features instance using a restricted feature subset. The new Features instance is augmented with a feature map that it can use to convert feature indices from the larger feature set to indices for the reduced set.
object | $fs | FeatureSelection instance to be used to select the most informative terms |
new Features instance using the restricted feature set
mapToRestrictedFeatures(array $features) : array
Maps the indices of a feature vector to those used by a restricted feature set, dropping any features that aren't in the map. If this Features instance isn't restricted, then the passed-in features are returned unmodified.
array | $features | feature vector mapping feature indices to frequencies |
original feature vector with indices mapped according to the feature_map property, and any features that don't occur in feature_map dropped
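A minimal sketch of this mapping, assuming the feature map is a plain array from old index to new index and that an unrestricted instance is signalled by a null map. The function name `mapToRestrictedSketch` is hypothetical.

```php
<?php
// Hypothetical helper; a null $featureMap stands for an unrestricted instance.
function mapToRestrictedSketch(array $features, ?array $featureMap): array
{
    if ($featureMap === null) {
        return $features; // not restricted: return the input unmodified
    }
    $mapped = [];
    foreach ($features as $j => $value) {
        if (isset($featureMap[$j])) {
            $mapped[$featureMap[$j]] = $value; // renumber to the reduced set
        }
        // Features missing from the map are dropped.
    }
    return $mapped;
}

// Keep features 2 and 5 of the larger set, renumbered to 1 and 2.
$featureMap = [2 => 1, 5 => 2];
$restricted = mapToRestrictedSketch([1 => 3, 2 => 1, 5 => 4], $featureMap);
// $restricted is [1 => 1, 2 => 4]
$unchanged = mapToRestrictedSketch([1 => 3, 2 => 1], null);
// $unchanged is [1 => 3, 2 => 1]
```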
mapTrainingSet(array $docs) : object
Replaces term counts with 1, indicating only that a feature occurs in a document. When a Features instance is a subset of a larger instance, it will have a feature_map member that maps feature indices from the larger feature set to the smaller one. The indices must be mapped in this way so that the training set can retain complete information, only throwing away features just before training. See the abstract parent class for a more thorough introduction to the interface.
array | $docs | array of training examples represented as feature vectors where the values are per-example counts |
SparseMatrix instance whose rows are the transformed feature vectors
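The transformation applied to each training example could be sketched as below. Because the SparseMatrix interface is not described here, this sketch returns a plain array of sparse rows instead. The function name `mapTrainingSetSketch` and the optional `$featureMap` argument are assumptions.

```php
<?php
// Hypothetical helper; returns plain arrays rather than a SparseMatrix.
function mapTrainingSetSketch(array $docs, ?array $featureMap = null): array
{
    $rows = [];
    foreach ($docs as $features) {
        $row = [];
        foreach ($features as $j => $count) {
            if ($featureMap !== null) {
                if (!isset($featureMap[$j])) {
                    continue; // feature was not selected for the subset
                }
                $j = $featureMap[$j]; // index in the reduced feature set
            }
            $row[$j] = 1; // record presence only; the count is discarded
        }
        $rows[] = $row;
    }
    return $rows;
}

$docs = [[1 => 3, 2 => 1], [2 => 2, 3 => 7]];
$all = mapTrainingSetSketch($docs);
// $all is [[1 => 1, 2 => 1], [2 => 1, 3 => 1]]
$subset = mapTrainingSetSketch($docs, [2 => 1, 3 => 2]);
// $subset is [[1 => 1], [1 => 1, 2 => 1]]
```

Note that `$docs` itself is left untouched, which matches the stated goal of keeping complete information in the training set and discarding features only just before training.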
mapDocument(array $tokens) : array
Converts a map from terms to within-document term counts into the corresponding sparse binary feature vector used for classification.
array | $tokens | associative array of terms mapped to their within-document counts |
feature vector corresponding to the tokens, mapped according to the implementation of a particular Features subclass
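For a single document, the conversion might look like the sketch below. The function `mapDocumentSketch` is hypothetical, and skipping terms that are missing from the vocabulary is an assumption, since this documentation does not say how unseen terms are handled.

```php
<?php
// Hypothetical helper; $vocab maps term => feature index as described above.
function mapDocumentSketch(array $vocab, array $tokens): array
{
    $vector = [];
    foreach ($tokens as $term => $count) {
        // Assumption: terms not in the vocabulary are silently skipped.
        if (isset($vocab[$term]) && $count > 0) {
            $vector[$vocab[$term]] = 1;
        }
    }
    ksort($vector); // order entries by feature index
    return $vector;
}

$vocab = ['Pythagorean' => 1, 'theorem' => 2, 'proof' => 3];
$tokens = ['theorem' => 5, 'lemma' => 2, 'Pythagorean' => 1];
$vector = mapDocumentSketch($vocab, $tokens); // [1 => 1, 2 => 1]
```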