Secrets of TensorFlow Tokenizer word_index 0
Recently I have been picking up TensorFlow 2.0.
In the example code, the vocabulary size is calculated as `total_words = len(tokenizer.word_index) + 1`; the `+ 1` is necessary, otherwise the
`to_categorical` function will not work.
We can easily find that
`word_index` is a dictionary mapping words to unique integer values, that the values start from 1, and that 0 is reserved.
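To see what this indexing looks like, here is a minimal pure-Python sketch that mimics what `Tokenizer.fit_on_texts` does (the real class lives at `tf.keras.preprocessing.text.Tokenizer`; the tie-breaking between equally frequent words here is only an approximation of its behavior):

```python
from collections import Counter

def build_word_index(sentences):
    """Mimic Tokenizer.fit_on_texts: rank words by frequency and
    assign indices starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() orders by descending frequency, as Tokenizer does
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["my dog loves my cat", "hi there"]
word_index = build_word_index(sentences)
print(word_index)   # {'my': 1, 'dog': 2, 'loves': 3, 'cat': 4, 'hi': 5, 'there': 6}

# Index 0 never appears in word_index, which is why the vocabulary
# size needs the + 1:
total_words = len(word_index) + 1
print(total_words)  # 7
```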
The reason for this is straightforward, but it took me a long time to work out.
Suppose we have a list of sentences whose maximum length is 6, and one of them is "Hi there!". When padding, all the pads are filled with 0, so the padded sequence looks like
`[0, 0, 0, 0, 34, 371]`.
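In Keras this padding is done by `pad_sequences`, whose defaults pad at the front with 0. A small sketch of that behavior (the indices 34 and 371 are hypothetical values for "hi" and "there"):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad a sequence with `value` up to maxlen, truncating from
    the front if it is too long (mirrors pad_sequences' 'pre' defaults)."""
    return [value] * max(0, maxlen - len(seq)) + seq[-maxlen:]

print(pad_pre([34, 371], maxlen=6))  # [0, 0, 0, 0, 34, 371]
```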
`word_index` will be the one described above. If the values in
`word_index` started from 0, say
`"my": 0`, the padding could not be reversed properly: a 0 in a padded sequence would be ambiguous between a pad and the word "my". Therefore,
0 is intentionally reserved, and when reversing a sequence of values back to words,
0 is mapped to
`None`. Meanwhile,
`to_categorical` has a
`num_classes` argument, which by default is set to the largest value in the input plus one, so the
pads are one-hot encoded as class 0. This is why
`Tokenizer` cooperates with
`to_categorical` so well.
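Putting the two halves together: reversing a padded sequence with `dict.get` naturally yields `None` for the reserved 0, and one-hot encoding needs `num_classes = len(word_index) + 1` so that class 0 exists. A self-contained sketch, using a toy two-word vocabulary and a pure-Python stand-in for `keras.utils.to_categorical`:

```python
def to_categorical(indices, num_classes):
    """One-hot encode integer indices (sketch of keras.utils.to_categorical)."""
    return [[1 if i == idx else 0 for i in range(num_classes)] for idx in indices]

word_index = {"my": 1, "dog": 2}   # toy vocabulary; the real one comes from Tokenizer
index_word = {i: w for w, i in word_index.items()}

padded = [0, 0, 1, 2]
# dict.get maps the reserved pad value 0 to None
print([index_word.get(i) for i in padded])  # [None, None, 'my', 'dog']

num_classes = len(word_index) + 1  # + 1 so index 0 has its own one-hot slot
one_hot = to_categorical(padded, num_classes)
print(one_hot[0])  # pad  -> [1, 0, 0]
print(one_hot[2])  # 'my' -> [0, 1, 0]
```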