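Natural Language Processing (NLP) Star Space Intelligent Dialogue Robot Series: A Transformer Model for Language Understanding - Subword Tokenizer
Author: Internet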
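This article walks through an advanced example that translates Portuguese into English.
Contents
- Installing and deploying TensorFlow
- Setting up the input pipeline
- Creating a custom subword tokenizer from the training dataset
- If a word is not in the vocabulary, the tokenizer encodes the string by breaking the word into subwords
- Adding start and end tokens to the input and target
- To keep the example small and relatively fast, drop examples longer than 40 tokens
- Appendix: final results
- References
- Star Space Intelligent Dialogue Robot series blog posts
Installing and deploying TensorFlow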
import tensorflow_datasets as tfds
import tensorflow as tf
import time
import numpy as np
import matplotlib.pyplot as plt
Running these imports fails with the following error:
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-4c94d8100fcf> in <module>
----> 1 import tensorflow_datasets as tfds
2 import tensorflow as tf
3
4 import time
5 import numpy as np
ModuleNotFoundError: No module named 'tensorflow_datasets'
Install tensorflow_datasets with pip in the active environment:
(base) C:\Users\admin>activate my_star_space
(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Collecting tensorflow-datasets
Using cached tensorflow_datasets-4.4.0-py3-none-any.whl (4.0 MB)
Requirement already satisfied: dill in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.3.4)
Collecting tensorflow-metadata
Downloading tensorflow_metadata-1.2.0-py3-none-any.whl (48 kB)
|████████████████████████████████| 48 kB 21 kB/s
Requirement already satisfied: dataclasses in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: importlib-resources in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (5.2.2)
Requirement already satisfied: promise in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: tqdm in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (4.62.2)
Requirement already satisfied: attrs>=18.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: requests>=2.19.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: six in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.16.0)
Requirement already satisfied: future in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: absl-py in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.13.0)
Requirement already satisfied: typing-extensions in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: protobuf>=3.12.2 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: termcolor in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.11)
Requirement already satisfied: charset-normalizer~=2.0.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: zipp>=3.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
Collecting absl-py
Downloading absl_py-0.12.0-py3-none-any.whl (129 kB)
|████████████████████████████████| 129 kB 14 kB/s
Requirement already satisfied: colorama in e:\anaconda3\envs\my_star_space\lib\site-packages (from tqdm->tensorflow-datasets) (0.4.4)
Installing collected packages: absl-py, tensorflow-metadata, tensorflow-datasets
Attempting uninstall: absl-py
Found existing installation: absl-py 0.13.0
Uninstalling absl-py-0.13.0:
Successfully uninstalled absl-py-0.13.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-datasets-4.4.0 tensorflow-metadata-1.2.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'e:\anaconda3\envs\my_star_space\python.exe -m pip install --upgrade pip' command.
(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Requirement already satisfied: tensorflow-datasets in e:\anaconda3\envs\my_star_space\lib\site-packages (4.4.0)
Requirement already satisfied: promise in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: future in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: absl-py in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.12.0)
Requirement already satisfied: termcolor in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: six in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.16.0)
Requirement already satisfied: tensorflow-metadata in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.2.0)
Requirement already satisfied: dataclasses in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: requests>=2.19.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: importlib-resources in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (5.2.2)
Requirement already satisfied: typing-extensions in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: protobuf>=3.12.2 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: tqdm in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (4.62.2)
Requirement already satisfied: dill in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.3.4)
Requirement already satisfied: attrs>=18.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: charset-normalizer~=2.0.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.11)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (3.2)
Requirement already satisfied: zipp>=3.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
Requirement already satisfied: colorama in e:\anaconda3\envs\my_star_space\lib\site-packages (from tqdm->tensorflow-datasets) (0.4.4)
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'e:\anaconda3\envs\my_star_space\python.exe -m pip install --upgrade pip' command.
(my_star_space) C:\Users\admin>
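Before rerunning the imports, it can help to confirm which packages are still missing from the active environment. The following is a minimal sketch that uses only the standard library; the helper name `missing_modules` is hypothetical and the list of required packages is simply taken from the import cell above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported at the start of this article.
required = ["tensorflow_datasets", "tensorflow", "numpy", "matplotlib"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Note that the pip package name (tensorflow-datasets, with a hyphen) differs from the import name (tensorflow_datasets, with an underscore), so the list above uses import names.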
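Setting up the input pipeline
Use TFDS to load the Portuguese-English translation dataset from the TED Talks Open Translation Project. The dataset contains roughly 52,000 training examples, 1,200 validation examples, and 1,800 test examples (51,785, 1,193, and 1,803 according to the dataset_info.json file shown later in this section).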
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
The download takes a long time (almost three hours in this run). The output is as follows:
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0...
Dl Completed...: 100%
1/1 [2:57:36<00:00, 10649.11s/ url]
Dl Size...: 100%
124/124 [2:57:36<00:00, 93.26s/ MiB]
Extraction completed...: 100%
1/1 [2:57:36<00:00, 10656.49s/ file]
Dataset ted_hrlr_translate downloaded and prepared to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0. Subsequent calls will reuse this data.
The downloaded files are saved locally.
The content of dataset_info.json is:
{
"citation": "@inproceedings{Ye2018WordEmbeddings,\n author = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},\n title = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},\n booktitle = {HLT-NAACL},\n year = {2018},\n }",
"configDescription": "Translation dataset from pt to en in plain text.",
"configName": "pt_to_en",
"description": "Data sets derived from TED talk transcripts for comparing similar language pairs\nwhere one is high resource and the other is low resource.",
"downloadSize": "131005909",
"fileFormat": "tfrecord",
"location": {
"urls": [
"https://github.com/neulab/word-embeddings-for-nmt"
]
},
"moduleName": "tensorflow_datasets.translate.ted_hrlr",
"name": "ted_hrlr_translate",
"splits": [
{
"name": "train",
"numBytes": "10806586",
"shardLengths": [
"51785"
]
},
{
"name": "validation",
"numBytes": "231285",
"shardLengths": [
"1193"
]
},
{
"name": "test",
"numBytes": "383883",
"shardLengths": [
"1803"
]
}
],
"supervisedKeys": {
"input": "pt",
"output": "en"
},
"version": "1.0.0"
}
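The split sizes can be read directly from this metadata. The sketch below parses an excerpt of the JSON shown above (only the fields it needs) and assumes that each entry in shardLengths is the number of examples in one shard; the helper name `split_sizes` is hypothetical.

```python
import json

# Excerpt of the dataset_info.json shown above (only the fields used here).
DATASET_INFO = """
{
  "name": "ted_hrlr_translate",
  "configName": "pt_to_en",
  "downloadSize": "131005909",
  "splits": [
    {"name": "train", "shardLengths": ["51785"]},
    {"name": "validation", "shardLengths": ["1193"]},
    {"name": "test", "shardLengths": ["1803"]}
  ]
}
"""


def split_sizes(info):
    """Map each split name to its example count (sum of its shard lengths)."""
    return {
        split["name"]: sum(int(n) for n in split["shardLengths"])
        for split in info["splits"]
    }


info = json.loads(DATASET_INFO)
sizes = split_sizes(info)
download_mib = round(int(info["downloadSize"]) / 2**20, 1)
print(sizes)
print("download size (MiB):", download_mib)
```

The computed download size is consistent with the 124 MiB reported in the download log.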
The content of features.json is:
{
"type": "tensorflow_datasets.core.features.translation_feature.Translation",
"content": {
"languages": [
"en",
"pt"
]
}
}
Format of ted_hrlr_translate-test.tfrecord-00000-of-00001
Creating a custom subword tokenizer from the training dataset
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
(en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
(pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)
Running this script produces the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-c90f5c60daf2> in <module>
----> 1 tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
2 (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
3
4 tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
5 (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)
AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
The call fails because this version of tensorflow_datasets (4.4.0) no longer exposes tfds.features.text. Use tfds.deprecated.text instead:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))
original_string = tokenizer_en.decode(tokenized_string)
print('The original string: {}'.format(original_string))
# Encoding followed by decoding should round-trip to the original text.
assert original_string == sample_string
The output is as follows:
Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string: Transformer is awesome.
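Code that has to run against both older and newer tensorflow_datasets releases could look up the encoder class under either module path. The following sketch shows the lookup pattern only; the helper `first_available` is hypothetical, and the demonstration uses the standard-library os module so that it runs without TFDS installed.

```python
import os


def first_available(root, dotted_paths):
    """Return the first attribute chain in `dotted_paths` that resolves on `root`."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this path does not exist, try the next candidate
        return obj
    raise AttributeError("none of {} exist on {!r}".format(dotted_paths, root))


# Demonstration on a standard-library module: the first path does not exist,
# so the helper falls back to the second one.
join = first_available(os, ["no_such_attr.join", "path.join"])
print(join("a", "b"))

# With TFDS installed, the same idea would read (assumption, not executed here):
# encoder_cls = first_available(
#     tfds,
#     ["features.text.SubwordTextEncoder", "deprecated.text.SubwordTextEncoder"])
```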
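If a word is not in the vocabulary, the tokenizer encodes the string by breaking the word into subwords
Consider the following example, which decodes each token ID separately:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
The output is as follows: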
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
7877 ----> .
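The splitting behavior can be illustrated with a toy greedy longest-match tokenizer. This is a simplified sketch and not the actual SubwordTextEncoder algorithm: the function `greedy_subword_split` and the tiny vocabulary are assumptions chosen so that "Transformer" splits into the same pieces as in the output above.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest known subwords, falling back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first and shrink it one character at a time.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # A single character is always accepted so the loop is guaranteed to advance.
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


# Hypothetical toy vocabulary; a real subword vocabulary has thousands of entries.
toy_vocab = {"ran", "former", "is", "awesome"}
for word in "Transformer is awesome".split():
    print(word, "---->", greedy_subword_split(word, toy_vocab))
```

Because every piece is a substring of the input, concatenating the pieces always reproduces the original word, which mirrors the round-trip assertion used earlier.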
Adding start and end tokens to the input and target
BUFFER_SIZE = 20000
BATCH_SIZE = 64
def encode(lang1, lang2):
    # Use vocab_size as the start token ID and vocab_size + 1 as the end token ID.
    lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
        lang1.numpy()) + [tokenizer_pt.vocab_size+1]
    lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
        lang2.numpy()) + [tokenizer_en.vocab_size+1]
    return lang1, lang2
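The ID arithmetic can be checked without the tokenizer. In the sketch below, the helper `add_start_end` is hypothetical, and the value 8087 is an assumption inferred from the English start ID that appears in the batch output of this article.

```python
def add_start_end(token_ids, vocab_size):
    """Wrap token IDs with a start ID (vocab_size) and an end ID (vocab_size + 1)."""
    return [vocab_size] + list(token_ids) + [vocab_size + 1]


# Assumed English vocabulary size, inferred from the start ID 8087 in the batch output.
EN_VOCAB_SIZE = 8087
ids = [7915, 1248, 7946, 7194, 13, 2799, 7877]  # 'Transformer is awesome.'
wrapped = add_start_end(ids, EN_VOCAB_SIZE)
print(wrapped)
```

Since the tokenizer itself only produces IDs below vocab_size, these two values do not collide with any subword. A model consuming these sequences therefore needs an embedding table of at least vocab_size + 2 entries.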
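To keep the example small and relatively fast, drop examples longer than 40 tokens
MAX_LENGTH = 40
def filter_max_length(x, y, max_length=MAX_LENGTH):
    # Keep a pair only if both sequences have at most max_length tokens.
    return tf.logical_and(tf.size(x) <= max_length,
                          tf.size(y) <= max_length)
def tf_encode(pt, en):
    # Wrap the Python-side encode() so that it can be used inside Dataset.map.
    result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en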
train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache the dataset in memory to speed up reading.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)
pt_batch, en_batch = next(iter(val_dataset))
pt_batch, en_batch
The output is as follows:
(<tf.Tensor: shape=(64, 38), dtype=int64, numpy=
array([[8214, 342, 3032, ..., 0, 0, 0],
[8214, 95, 198, ..., 0, 0, 0],
[8214, 4479, 7990, ..., 0, 0, 0],
...,
[8214, 584, 12, ..., 0, 0, 0],
[8214, 59, 1548, ..., 0, 0, 0],
[8214, 118, 34, ..., 0, 0, 0]], dtype=int64)>,
<tf.Tensor: shape=(64, 40), dtype=int64, numpy=
array([[8087, 98, 25, ..., 0, 0, 0],
[8087, 12, 20, ..., 0, 0, 0],
[8087, 12, 5453, ..., 0, 0, 0],
...,
[8087, 18, 2059, ..., 0, 0, 0],
[8087, 16, 1436, ..., 0, 0, 0],
[8087, 15, 57, ..., 0, 0, 0]], dtype=int64)>)
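The two tensors have different widths (38 and 40) because each side of the batch is padded with zeros to the longest sequence on that side. The pure-Python sketch below imitates the filter and padding steps on a few made-up ID lists; the helpers `keep_pair` and `pad_batch` are hypothetical stand-ins for illustration, not the tf.data implementation.

```python
MAX_LENGTH = 40


def keep_pair(src, tgt, max_length=MAX_LENGTH):
    """Keep a pair only if both sequences have at most max_length tokens."""
    return len(src) <= max_length and len(tgt) <= max_length


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest sequence in the batch."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (width - len(seq)) for seq in sequences]


# Made-up (source, target) ID pairs; 8214/8215 and 8087/8088 play the start/end roles.
pairs = [
    ([8214, 342, 8215], [8087, 98, 25, 8088]),
    ([8214] + [5] * 50 + [8215], [8087, 12, 8088]),  # source too long, dropped
    ([8214, 95, 198, 7, 8215], [8087, 12, 8088]),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
src_batch = pad_batch([src for src, _ in kept])
tgt_batch = pad_batch([tgt for _, tgt in kept])
print(src_batch)
print(tgt_batch)
```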
Appendix: final results
Portuguese translated into English
References
https://tensorflow.google.cn/tutorials/text/transformer
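Star Space Intelligent Dialogue Robot series blog posts
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Multi-Head Attention Architecture: Computing Q, K, V with a Python Example
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Multi-Head Attention Architecture: Concatenation of the output of the heads
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Positional Encoding (positional_encoding)
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, KantaiBERT ByteLevelBPETokenizer
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, KantaiBERT Initializing model
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, KantaiBERT Exploring the parameters
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, KantaiBERT Initializing the trainer
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, KantaiBERT Language modeling with FillMaskPipeline
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, GLUE Winograd schemas and NER
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Workshop on Machine Translation (WMT)
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Pattern-Exploiting Training (PET)
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, The philosophy of Pattern-Exploiting Training (PET)
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, It's time to make a decision
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Text completion with GPT-2
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Text completion with GPT-2 step 3-5
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Text completion with GPT-2 step 6-8
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Text completion with GPT-2 step 9
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Training a GPT-2 language model
- NLP Star Space Intelligent Dialogue Robot Series: Paper Study, Do Transformers Really Perform Bad for Graph Representation
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Training a GPT-2 language model Steps 2 to 6
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Training a GPT-2 language model Steps 7 to 9
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Training a GPT-2 language model Steps 10
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, T5-large transformer model
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Architecture of the T5 model
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Summarizing documents with T5-large
- Natural Language Processing (NLP) Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Matching datasets and tokenizers
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Standard NLP tasks with specific vocabulary
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, T5 Bill of Rights Sample
- NLP Star Space Intelligent Dialogue Robot Series: Paper Review, How Good is Your Tokenizer? (a study of the monolingual performance of multilingual language models)
- NLP Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, Semantic Role Labeling (SRL)
- Natural Language Processing (NLP) Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, SRL (Semantic Role Labeling) Sample 2
- Natural Language Processing (NLP) Star Space Intelligent Dialogue Robot Series: Deep Understanding of Transformers for NLP, SRL (Semantic Role Labeling)
- NLP Star Space Intelligent Dialogue Robot Series: NLP on Transformers 101, Chapter 21: Classic Foundational Models and Algorithms for MRC Text Understanding Based on Bayesian Theory
- NLP Star Space Intelligent Dialogue Robot Series: NLP on Transformers 101, Chapter 7: Complete Source Implementation, Debugging, and Hands-On Case Study of BERT Fine-tuning
- NLP Star Space Intelligent Dialogue Robot Series: NLP on Transformers 101, Chapter 8: Analysis of the Lightweight ALBERT Model and Common Model Optimizations in BERT Variants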
Tags: NLP, Transformer, datasets, Star Space, robot, tokenizer, tensorflow. Source: https://blog.csdn.net/duan_zhihua/article/details/121479623