
【505】Using keras for word-level one-hot encoding


Reference: Text Preprocessing - Tokenizer

Reference: Preprocessing » Text Preprocessing

  The input consumed by an Embedding layer is a matrix of integers, not true one-hot vectors; the Tokenizer class is used to produce it.
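To make this distinction concrete, here is a minimal pure-Python sketch (no Keras required; the names `sequences`, `vocab_size`, and `to_one_hot` are ours for illustration) showing that the one-hot form is just the "expanded" view of the integer indices an Embedding layer actually receives:

```python
# Integer-index form: what an Embedding layer consumes.
# Each integer is a word's index in the vocabulary.
sequences = [[1, 2, 3], [1, 4]]
vocab_size = 5  # valid indices are 0..4

def to_one_hot(sequence, vocab_size):
    """Expand a list of integer indices into one-hot rows."""
    matrix = []
    for index in sequence:
        row = [0] * vocab_size
        row[index] = 1  # a single 1 at the word's index
        matrix.append(row)
    return matrix

# Each row of the result has exactly one 1.
print(to_one_hot(sequences[0], vocab_size))
```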

1. Tokenizer 

1.1 Syntax

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~ ', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

  A text-tokenization utility class. It can vectorize a text corpus in either of two ways: turning each text into a sequence of integers (each integer being the index of a token in a dictionary), or turning each text into a vector in which the coefficient for each token may be binary, a word count, a TF-IDF weight, and so on.

1.2 Parameters

  By default, all punctuation is removed and the text is converted into a space-separated sequence of words (a word may contain the ' character). These sequences are then split into lists of tokens, which are subsequently indexed or vectorized.

  0 is a reserved index that is never assigned to any word.
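The indexing rule can be sketched in a few lines of pure Python (a rough imitation of fit_on_texts for illustration, not Keras's actual implementation; `build_word_index` and `FILTERS` are our own names, with `FILTERS` set to the default filter characters): words are assigned indices by descending frequency, starting at 1, so index 0 stays unassigned.

```python
from collections import Counter

# Characters stripped before splitting (the Tokenizer default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Assign indices by descending word frequency, starting at 1."""
    counts = Counter()
    for text in texts:
        # Lowercase, replace filter characters with spaces, split on whitespace.
        cleaned = text.lower().translate(str.maketrans(FILTERS, ' ' * len(FILTERS)))
        counts.update(cleaned.split())
    # most_common() orders by count; index 0 is never handed out.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
print(build_word_index(samples))
```

For these samples (the same ones used in the example below), this reproduces the word_index that Tokenizer itself computes: 'the', the most frequent word, gets index 1, and every remaining word gets an index from 2 upward.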

1.3 Class methods

1.4 Attributes

1.5 Example

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)

print("word_counts: \n", tokenizer.word_counts)
print("\ntotal words: \n", len(tokenizer.word_counts))

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

print("\nsequences:\n", sequences)

# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

print("\none_hot_results:\n", one_hot_results)

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index

print("\nword_index:\n", word_index)
print('\nFound %s unique tokens.' % len(word_index))

  outputs:

word_counts: 
 OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])

total words: 
 9

sequences:
 [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

one_hot_results:
 [[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]

word_index:
 {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}

Found 9 unique tokens.
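The one_hot_results matrix above uses mode='binary', where column j is 1 if the word with index j occurs in the sample; texts_to_matrix also supports other modes such as 'count', 'freq', and 'tfidf'. As an illustration of what mode='count' produces, here is a hand-computed sketch built from the sequences shown above (`sequences_to_count_matrix` is our own helper, not part of the Keras API):

```python
def sequences_to_count_matrix(sequences, num_words):
    """Column j holds how often word index j occurs; column 0 is never used."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for index in seq:
            if index < num_words:  # num_words caps the vocabulary, as in Tokenizer
                row[index] += 1.0
        matrix.append(row)
    return matrix

# The integer sequences produced for the two samples above.
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
matrix = sequences_to_count_matrix(sequences, 10)
print(matrix[0])  # 'the' (index 1) occurred twice in the first sample
```

With mode='binary' every nonzero entry would be clamped to 1, which is exactly the difference between the count and one-hot representations.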

Source: https://www.cnblogs.com/alex-bn-lee/p/14193254.html