The details of vowpal wabbit’s (vw) feature representation are specified here, but they can be tricky to grok immediately.
All features in vw are numeric features. It doesn’t do anything special for text or categorical features; they are just numeric features like all the others.
Consider this vw formatted example:
|doc this is some text |stats views:10 |type post
The | character indicates a namespace declaration, and the characters after it are the name of the namespace [1]. What follows the namespace declaration is a list of features and their values.
So in this example we have the doc namespace, the stats namespace, and the type namespace.
The core of vw’s feature representation is that every feature is numeric and consists of a name and a value. This is specified in an example as name:value.
If name is specified and value is omitted, the value defaults to 1 [2].
So for our example of |doc this is some text, vw treats doc as a namespace and the other tokens as features. Since no value is specified for each feature, this is the same as |doc this:1 is:1 some:1 text:1. Similarly, |type post is equivalent to |type post:1.
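To make the default-value rule concrete, here is a minimal Python sketch of a parser for the feature part of a line. It is not vw's own parser: the function name parse_vw_features is made up for illustration, and it assumes that a namespace name follows every | directly and that there are no labels, tags, or namespace weights.

```python
def parse_vw_features(line):
    """Parse the feature part of a vw-format line into {namespace: {name: value}}.

    Illustrative sketch only: assumes each '|' is directly followed by a
    namespace name and ignores labels, tags, and namespace weights.
    """
    namespaces = {}
    # Everything before the first '|' (label, tag) is skipped in this sketch.
    for section in line.split("|")[1:]:
        tokens = section.split()
        if not tokens:
            continue
        namespace, feature_tokens = tokens[0], tokens[1:]
        features = namespaces.setdefault(namespace, {})
        for token in feature_tokens:
            name, sep, value = token.partition(":")
            # A feature written without ':value' gets the default value 1.
            features[name] = float(value) if sep else 1.0
    return namespaces


example = "|doc this is some text |stats views:10 |type post"
parsed = parse_vw_features(example)
print(parsed)
```

Parsing |doc this is some text and |doc this:1 is:1 some:1 text:1 with this sketch gives the same dictionary, which is the equivalence described above.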
So vw doesn’t do anything special for text features (or categorical features): they are treated as numeric features with the default value 1, since the value was omitted. It’s just the way vw encodes features, with defaults for unspecified values, that makes it appear as if it supports text features.
It is also worth understanding how vw encodes features internally. This is useful if you start using ngram features or cross-features, or if you generate a readable model and want to interpret which features are important.
Internally vw stores namespace names as single characters, so the above example is equivalent to:
|d this:1 is:1 some:1 text:1 |s views:10 |t post
vw hashes features by namespace, so features in different namespaces end up being hashed as different tokens.
Internally, vw combines the namespace and feature name before hashing them. The features that get hashed in the above example would be:
d^this
d^is
d^some
d^text
s^views
t^post
Note that a text token most likely has different hashes depending on which namespace it is in. So the same text token in two different namespaces is treated as two different features.
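The idea of hashing the combined namespace and feature name can be sketched in a few lines of Python. This is only an illustration: it uses CRC32 from the standard library as a stand-in hash, which is not the hash function vw uses, and feature_index is a hypothetical helper. The 18-bit table size matches vw's default for the -b option.

```python
import zlib

NUM_BITS = 18  # vw's default table size is 2**18 slots (the -b option)


def feature_index(namespace, name, num_bits=NUM_BITS):
    """Map a namespaced feature to a slot in a table of 2**num_bits weights.

    Stand-in hash: CRC32 of "namespace^name". vw uses a different hash
    function, so the actual indices will not match vw's.
    """
    key = f"{namespace}^{name}".encode("utf-8")
    # Keep only the low num_bits bits so the index fits in the table.
    return zlib.crc32(key) & ((1 << num_bits) - 1)


# The same token in two namespaces is hashed from two different keys.
doc_index = feature_index("d", "post")
type_index = feature_index("t", "post")
print(doc_index, type_index)
```

Because the key includes the namespace, the token post in the d namespace and the token post in the t namespace are hashed from different strings and will usually land in different slots, although collisions remain possible in any fixed-size table.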
So that’s the default featurization for text. The ability to generate ngrams and skipgrams is optional. If you provide the option --ngram d2, it will also generate 2-gram features for the doc namespace.
So in addition to the above features, it would also generate these features and then hash them:
d^this*d^is
d^is*d^some
d^some*d^text
...
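As a rough sketch of how such 2-gram names could be built from consecutive tokens within one namespace (the helper ngram_features and the * separator follow the notation used in this post, not vw's internal code):

```python
def ngram_features(namespace, tokens, n=2):
    """Build n-gram feature names from consecutive tokens in one namespace.

    Illustrative only: names follow this post's "ns^token" notation.
    """
    prefixed = [f"{namespace}^{token}" for token in tokens]
    # Slide a window of n tokens over the sequence and join each window.
    return ["*".join(prefixed[i:i + n]) for i in range(len(prefixed) - n + 1)]


bigrams = ngram_features("d", ["this", "is", "some", "text"])
print(bigrams)
# ['d^this*d^is', 'd^is*d^some', 'd^some*d^text']
```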
n-gram features are generated within a namespace. If you generate quadratic features, e.g. with -q dt or --interactions dt, you would also end up with features across those two namespaces, e.g.:
d^this*t^post
d^is*t^post
...
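A small Python sketch of crossing two namespaces, assuming each crossed feature takes the product of the two original values. The function quadratic_features and its input layout are hypothetical and only mirror the notation used in this post:

```python
from itertools import product


def quadratic_features(left, right):
    """Cross every feature in `left` with every feature in `right`.

    Each argument is a (namespace_char, {name: value}) pair. The crossed
    value is taken to be the product of the two values (an assumption of
    this sketch).
    """
    left_ns, left_feats = left
    right_ns, right_feats = right
    crossed = {}
    for (l_name, l_val), (r_name, r_val) in product(left_feats.items(), right_feats.items()):
        crossed[f"{left_ns}^{l_name}*{right_ns}^{r_name}"] = l_val * r_val
    return crossed


doc = ("d", {"this": 1.0, "is": 1.0, "some": 1.0, "text": 1.0})
post_type = ("t", {"post": 1.0})
crossed = quadratic_features(doc, post_type)
print(list(crossed))
# ['d^this*t^post', 'd^is*t^post', 'd^some*t^post', 'd^text*t^post']
```

With four features in d and one in t, the cross produces four new features, one per pair.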
For more on working with text in vw, see this repository.
This is also a good starting point for getting familiar with vw.
1. Namespaces allow you to group features together. The main reason for grouping features in namespaces is so that you can easily generate cross-features.
2. Also, the absence of a feature indicates that the feature has value 0. So for text classification tasks, 0 is assumed for all tokens not listed in an example.