In machine learning, we often need to map category strings such as ‘dog’ and ‘cat’ to integers.
In this post, I list some tools for converting strings to categories.
sklearn
sklearn follows a common pattern for this:
- get the dataset
- feed the dataset to the encoder with `fit()` so it learns the full mapping
- feed the dataset to the encoder with `transform()` to get the mapped dataset
`LabelEncoder` converts strings to integers, but not to one-hot vectors.
e.g., [‘cat’, ‘dog’] maps to [0, 1], because the codes start at 0.
Use numpy to convert the integers to one-hot vectors.
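As a minimal sketch of the steps above, the snippet below fits a `LabelEncoder`, transforms the labels, and then builds one-hot rows by indexing an identity matrix. The variable names and the `np.eye` trick are my own choices for illustration; there are other ways to do the one-hot step.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["cat", "dog", "cat"]

encoder = LabelEncoder()
encoder.fit(labels)                 # learn the string -> integer mapping
codes = encoder.transform(labels)   # apply the mapping

# Classes are sorted, so 'cat' gets 0 and 'dog' gets 1.
# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(encoder.classes_), dtype=int)[codes]

print(codes)
print(one_hot)
```

`encoder.inverse_transform(codes)` maps the integers back to the original strings.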
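From 0.20, `OneHotEncoder` converts strings directly to one-hot vectors. (The `CategoricalEncoder` from the 0.20 development version was never released under that name; its functionality was split between `OneHotEncoder` and `OrdinalEncoder`.)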
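Here is a small sketch of the direct string to one-hot route, assuming scikit-learn 0.20 or later, where `OneHotEncoder` accepts string input. The encoder expects a 2-D array (one column per feature), so the labels are wrapped as a single column; the sample data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One feature column with three samples.
animals = np.array([["cat"], ["dog"], ["cat"]])

encoder = OneHotEncoder()
encoder.fit(animals)

# transform() returns a sparse matrix by default; convert it to a dense array.
one_hot = encoder.transform(animals).toarray()

print(encoder.categories_)
print(one_hot)
```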
keras
keras uses `Tokenizer` to map strings to integers.
It follows a pattern similar to sklearn:
- get the dataset
- feed the dataset to the tokenizer with `fit_on_texts()`
- feed the dataset to the tokenizer with `texts_to_sequences()`
Unlike sklearn, the tokenizer does not use index 0.
e.g., [‘cat’, ‘dog’] maps to [1, 2]
`Tokenizer` does not convert to one-hot format.
Use `keras.utils.to_categorical` to convert the integer sequence to one-hot vectors.
Because the tokenizer never assigns index 0 but `to_categorical` still reserves a column for it, the number of classes must be the vocabulary size plus 1.
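To show why the extra 1 is needed, here is a NumPy-only stand-in, not actual keras code. The hand-built `word_index` dictionary only imitates the tokenizer's 1-based index, and `np.eye` imitates the one-hot step. With keras itself, the equivalent would roughly be passing `num_classes=len(tokenizer.word_index) + 1` to `to_categorical`.

```python
import numpy as np

texts = ["cat", "dog", "cat"]

# Hypothetical stand-in for the tokenizer's index: numbering starts at 1,
# and 0 is never assigned to a word.
word_index = {}
for word in texts:
    if word not in word_index:
        word_index[word] = len(word_index) + 1

sequence = [word_index[word] for word in texts]

# Two words, but the largest index is 2, so three columns (0, 1, 2) are needed.
num_classes = len(word_index) + 1
one_hot = np.eye(num_classes, dtype=int)[sequence]

print(sequence)
print(one_hot)
```

Column 0 of the result is always zero. With only `len(word_index)` columns, the highest index would fall outside the matrix.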