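PyTorch pack_padded_sequence and pad_packed_sequence
Author: Internet
The problem
When a batch of training examples is processed together, the examples usually differ in length, so the natural fix is padding: every shorter sentence is padded to the length of the longest one.
The catch is that a sentence such as "Yes" contains a single word yet is followed by 5 pad symbols. The LSTM then runs over all of those meaningless tokens before producing the sentence representation, so the representation it returns is distorted.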
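The effect described above can be checked with a small sketch. The layer sizes, the random input, and the use of all-zero vectors as padding are assumptions made for illustration; they are not taken from the original post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# A "sentence" with one real time step, and the same sentence followed by
# 5 all-zero padding steps (assumed padding representation).
real = torch.randn(1, 1, 4)
padded = torch.cat([real, torch.zeros(1, 5, 4)], dim=1)

_, (h_real, _) = lstm(real)
_, (h_padded, _) = lstm(padded)

# The LSTM keeps updating its state on the padding steps, so the final
# hidden state no longer matches the one for the real sentence alone.
differs = not torch.allclose(h_real, h_padded, atol=1e-6)
print(differs)
```

Even zero-valued padding changes the state, because the biases and recurrent weights are still applied at every step.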
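How RNNs handle variable-length sequences
The main tools are the functions torch.nn.utils.rnn.pack_padded_sequence() and torch.nn.utils.rnn.pad_packed_sequence(), together with the class torch.nn.utils.rnn.PackedSequence that sits between them.
The two functions are easy to confuse at first, so read the names literally:
step 1
.pack_padded_sequence(): "padded" means the variable-length sequences have already been padded; "pack" means those padded sequences are then packed. Its parameters: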
input (Tensor) – padded batch of variable length sequences.
lengths (Tensor) – list of sequences lengths of each batch element.
batch_first (bool, optional) – if True, the input is expected in B x T x * format.
enforce_sorted (bool, optional) – if True, the input is expected to contain sequences sorted by length in a decreasing order. If False, the input will get sorted unconditionally. Default: True.
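lengths: the length of each sequence in the batch. Only with these lengths can the RNN know at which step to stop processing each sequence.
batch_first: the usual flag that says which dimension holds the batch.
enforce_sorted: only when this is True must the caller sort the sequences by length beforehand, in which case (for T x B x * input) input[:,0] should be the longest sequence, and input[:,B-1] the shortest one. When it is False, the function sorts them internally.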
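As a rough sketch of what enforce_sorted changes, the snippet below packs the same batch in two ways: sorting by hand and passing enforce_sorted=True, or letting the function sort with enforce_sorted=False. The toy tensor and variable names are illustrative assumptions, and the RuntimeError on unsorted input is the behaviour I would expect from current PyTorch releases.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seq = torch.tensor([[1, 2, 0], [3, 0, 0], [4, 5, 6]])
lens = torch.tensor([2, 1, 3])  # not sorted in descending order

# Manual route: sort by length ourselves, then pack with enforce_sorted=True.
sorted_lens, order = lens.sort(descending=True)
packed_manual = pack_padded_sequence(seq[order], sorted_lens,
                                     batch_first=True, enforce_sorted=True)

# Automatic route: hand over the unsorted batch and let the function sort it.
packed_auto = pack_padded_sequence(seq, lens,
                                   batch_first=True, enforce_sorted=False)

same_data = torch.equal(packed_manual.data, packed_auto.data)

# Unsorted lengths combined with enforce_sorted=True are rejected.
try:
    pack_padded_sequence(seq, lens, batch_first=True, enforce_sorted=True)
    raised = False
except RuntimeError:
    raised = True

print(same_data, raised)
```

With the manual route the PackedSequence carries no sorted_indices, so restoring the original batch order afterwards is the caller's job.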
step 2
The function returns a PackedSequence object,
with these attributes:
PackedSequence.data (Tensor) – Tensor containing packed sequence
PackedSequence.batch_sizes (Tensor) – Tensor of integers holding information about the batch size at each sequence step
PackedSequence.sorted_indices (Tensor, optional) – Tensor of integers holding how this PackedSequence is constructed from sequences.
PackedSequence.unsorted_indices (Tensor, optional) – Tensor of integers holding how to recover the original sequences with correct order.
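batch_sizes: not the length of each sequence, but the number of sequences still being processed at each time step. In the figure's example, the batch size is 5 at the first time step, 3 at the third, and so on.
sorted_indices: the indices that reorder the original batch into descending order of length (longest to shortest).
unsorted_indices: the indices that restore the sorted batch to its original order.
An example makes this clear: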
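The relationship between the two index attributes can be sketched in plain Python: unsorted_indices is simply the inverse permutation of sorted_indices. The lengths used here are an assumed toy batch, and the code mimics the idea rather than calling PyTorch.

```python
lens = [2, 1, 3]

# Indices that sort the batch from longest to shortest (stable for ties).
sorted_indices = sorted(range(len(lens)), key=lambda i: -lens[i])

# Inverse permutation: position k of the sorted batch came from
# sorted_indices[k], so that original element now lives at position k.
unsorted_indices = [0] * len(lens)
for k, original_index in enumerate(sorted_indices):
    unsorted_indices[original_index] = k

sorted_lens = [lens[i] for i in sorted_indices]
restored = [sorted_lens[i] for i in unsorted_indices]

print(sorted_indices)    # [2, 0, 1]
print(unsorted_indices)  # [1, 2, 0]
print(restored)          # [2, 1, 3]
```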
>>> import torch
>>> from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
>>> seq = torch.tensor([[1,2,0], [3,0,0], [4,5,6]])
>>> lens = [2, 1, 3]
>>> packed = pack_padded_sequence(seq, lens, batch_first=True, enforce_sorted=False)
>>> packed
PackedSequence(data=tensor([4, 1, 3, 5, 2, 6]), batch_sizes=tensor([3, 2, 1]),
               sorted_indices=tensor([2, 0, 1]), unsorted_indices=tensor([1, 2, 0]))
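After pack_padded_sequence sorts the batch by length, the rows are arranged as follows (the same layout as the figure above):
[4,5,6]
[1,2,0]
[3,0,0]
batch_sizes: reading column by column, the time steps process 3, 2 and 1 sequences in turn.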
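To make the column-by-column reading concrete, here is a pure-Python sketch that rebuilds data and batch_sizes for the same toy batch. The helper name pack_layout is hypothetical; it imitates the layout only and is not how PyTorch implements packing.

```python
def pack_layout(sequences, lengths):
    """Return (data, batch_sizes) in the time-major order of a packed batch."""
    # Longest sequence first, as in the sorted layout shown above.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # Only sequences that are still "alive" at step t contribute a value.
        step = [sequences[i][t] for i in order if lengths[i] > t]
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes


data, batch_sizes = pack_layout([[1, 2, 0], [3, 0, 0], [4, 5, 6]], [2, 1, 3])
print(data)         # [4, 1, 3, 5, 2, 6]
print(batch_sizes)  # [3, 2, 1]
```

The padding zeros never appear in data, which is exactly why the RNN does not see them.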
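An attentive reader may notice that RNN input normally has 3 dimensions, [batch, seq_len, feature_size], whereas the example above has only 2.
The reason it still works: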
input can be of size T x B x * where T is the length of the longest sequence (equal to lengths[0]), B is the batch size, and * is any number of dimensions (including 0).
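A short sketch of the 3-D case: packing a batch whose trailing dimension holds features. The feature size of 4 and the random values are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# [batch=3, seq_len=3, feature=4], e.g. the output of an embedding layer.
batch = torch.randn(3, 3, 4)
lens = [2, 1, 3]

packed = pack_padded_sequence(batch, lens, batch_first=True, enforce_sorted=False)

# Only the 2 + 1 + 3 = 6 real time steps are kept; the feature dim is untouched.
print(packed.data.shape)   # torch.Size([6, 4])
print(packed.batch_sizes)  # tensor([3, 2, 1])
```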
step 3
embed_input_x_packed = pack_padded_sequence(embed_input_x, sentence_lens, batch_first=True)
encoder_outputs_packed, (h_last, c_last) = self.lstm(embed_input_x_packed)
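The returned h_last and c_last are the hidden state and cell state with the padding positions excluded; both are ordinary Tensors (Variable in very old PyTorch versions). In other words, for each sentence the LSTM runs only over its true length and never passes through the meaningless padding tokens.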
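The two lines of step 3 rely on names defined elsewhere (embed_input_x, sentence_lens, self.lstm). Below is a self-contained sketch under assumed sizes (vocabulary 10, embedding 4, hidden 5) that also checks the claim: each sentence's final hidden state from the packed batch matches what the LSTM returns when it sees that sentence alone, without padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)          # assumed vocabulary and embedding size
lstm = nn.LSTM(4, 5, batch_first=True)   # assumed hidden size

input_x = torch.tensor([[1, 2, 0], [3, 0, 0], [4, 5, 6]])
sentence_lens = [2, 1, 3]

embedded = embedding(input_x)
packed = pack_padded_sequence(embedded, sentence_lens,
                              batch_first=True, enforce_sorted=False)
packed_out, (h_last, c_last) = lstm(packed)

# h_last has shape [num_layers * num_directions, batch, hidden] and is
# returned in the original batch order.
matches = []
for i, n in enumerate(sentence_lens):
    # Run sentence i on its own, truncated to its true length.
    _, (h_single, _) = lstm(embedded[i:i + 1, :n])
    matches.append(torch.allclose(h_last[:, i], h_single[:, 0], atol=1e-5))

print(all(matches))
```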
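The returned output is a PackedSequence, so it has to be brought back to a uniform length for later processing.
.pad_packed_sequence() is the inverse of pack_padded_sequence(): it pads the packed sequences back out. Its parameters: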
sequence (PackedSequence) – batch to pad
batch_first (bool, optional) – if True, the output will be in B x T x * format.
padding_value (float, optional) – values for padded elements.
total_length (int, optional) – if not None, the output will be padded to have length total_length. This method will throw ValueError if total_length is less than the max sequence length in sequence.
padding_value defaults to 0.0.
Return value:
Tuple of Tensor containing the padded sequence, and a Tensor containing the list of lengths of each sequence in the batch. Batch elements will be re-ordered as they were ordered originally when the batch was passed to pack_padded_sequence or pack_sequence.
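It returns two values: (1) the padded sequences, all of the same length, with the batch restored from descending order to its original order; (2) the true length of each sequence.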
>>> seq_unpacked, lens_unpacked = pad_packed_sequence(packed, batch_first=True)
>>> seq_unpacked
tensor([[1, 2, 0],
        [3, 0, 0],
        [4, 5, 6]])
>>> lens_unpacked
tensor([2, 1, 3])
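The padding_value and total_length parameters listed above are not exercised by that example. A small sketch, assuming a batch that was originally padded to width 4 while its longest real sequence has length 3:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

seq = torch.tensor([[1., 2., 0., 0.],
                    [3., 0., 0., 0.],
                    [4., 5., 6., 0.]])
lens = [2, 1, 3]
packed = pack_padded_sequence(seq, lens, batch_first=True, enforce_sorted=False)

# Default: padded only up to the longest real sequence (3), filled with 0.0.
default_out, _ = pad_packed_sequence(packed, batch_first=True)

# total_length restores the original width; padding_value picks the filler.
full_out, out_lens = pad_packed_sequence(packed, batch_first=True,
                                         padding_value=-1.0, total_length=4)

print(default_out.shape)  # torch.Size([3, 3])
print(full_out)
print(out_lens)           # tensor([2, 1, 3])
```

Passing total_length is useful when later layers expect a fixed sequence length regardless of the longest sequence in the current batch.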
A concrete code walk-through
The following example is quoted from another blogger.
import torch
input_tensor = torch.tensor([[1, 3, 5, 6, 2, 0, 0],
[1, 3, 5, 0, 0, 0, 0],
[1, 3, 0, 0, 0, 0, 0]])
embe = torch.nn.Embedding(10, 6)
out = embe(input_tensor)
print(out.shape)
# Output
# torch.Size([3, 7, 6])
gru = torch.nn.GRU(6, 8, batch_first=True)
hidden_normal = torch.zeros(1, 3, 8)
output_normal, _ = gru(out, hidden_normal)
print(output_normal.shape)
print(output_normal)
# Output
torch.Size([3, 7, 8])
tensor([[[-0.3121, 0.0188, -0.1041, -0.1437, -0.4423, 0.2555, 0.3690,
0.2136],
[ 0.1832, -0.2063, -0.0339, -0.3196, -0.6962, 0.2769, 0.3495,
0.0115],
[ 0.3326, -0.3881, 0.0615, -0.2771, -0.4755, 0.2857, 0.3597,
-0.4412],
[ 0.1384, -0.0065, 0.2262, -0.4853, -0.6944, -0.0467, 0.5761,
-0.3320],
[ 0.2038, 0.0938, -0.1772, -0.4974, -0.5730, -0.3191, 0.6605,
-0.3210],
[ 0.2620, 0.1287, -0.4169, -0.4849, -0.5390, -0.4803, 0.6889,
-0.2553],
[ 0.3085, 0.1449, -0.5499, -0.4641, -0.5323, -0.5623, 0.6914,
-0.1874]],
[[-0.3121, 0.0188, -0.1041, -0.1437, -0.4423, 0.2555, 0.3690,
0.2136],
[ 0.1832, -0.2063, -0.0339, -0.3196, -0.6962, 0.2769, 0.3495,
0.0115],
[ 0.3326, -0.3881, 0.0615, -0.2771, -0.4755, 0.2857, 0.3597,
-0.4412],
[ 0.3520, -0.0209, -0.2198, -0.3820, -0.4272, -0.1440, 0.5585,
-0.3779],
[ 0.3638, 0.0925, -0.4146, -0.4223, -0.4525, -0.3903, 0.6477,
-0.2921],
[ 0.3740, 0.1334, -0.5358, -0.4313, -0.4838, -0.5156, 0.6777,
-0.2139],
[ 0.3830, 0.1506, -0.6060, -0.4272, -0.5031, -0.5743, 0.6832,
-0.1545]],
[[-0.3121, 0.0188, -0.1041, -0.1437, -0.4423, 0.2555, 0.3690,
0.2136],
[ 0.1832, -0.2063, -0.0339, -0.3196, -0.6962, 0.2769, 0.3495,
0.0115],
[ 0.2611, 0.0369, -0.2880, -0.3972, -0.4885, -0.1832, 0.5467,
-0.1531],
[ 0.3099, 0.1091, -0.4588, -0.4242, -0.4651, -0.4192, 0.6352,
-0.1683],
[ 0.3431, 0.1368, -0.5622, -0.4278, -0.4822, -0.5314, 0.6664,
-0.1414],
[ 0.3657, 0.1500, -0.6208, -0.4224, -0.4980, -0.5815, 0.6743,
-0.1111],
[ 0.3810, 0.1572, -0.6524, -0.4151, -0.5059, -0.6024, 0.6746,
-0.0875]]], grad_fn=<TransposeBackward1>)
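hidden = torch.zeros(1, 3, 8)
# Note: the lengths passed here are [4, 3, 2], so only the first 4 tokens of row 0 are used,
# even though that row holds 5 non-padding tokens (its true length is 5).
out_pad = torch.nn.utils.rnn.pack_padded_sequence(out, torch.tensor([4, 3, 2]), batch_first=True)
output, _ = gru(out_pad, hidden)
encoder_outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
# The output is padded to the longest length in `lengths` (4), not to the padded input length (7).
print(encoder_outputs.shape)
print(encoder_outputs)
# Output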
torch.Size([3, 4, 8])
tensor([[[-0.3147, 0.2937, 0.3170, 0.0374, 0.0856, 0.1972, 0.1793,
-0.1815],
[-0.1413, -0.2737, 0.4023, -0.0043, -0.1145, 0.0961, 0.0909,
-0.1149],
[-0.2327, 0.0745, 0.5349, 0.0076, 0.1540, 0.1582, 0.2454,
-0.2582],
[-0.1467, -0.2010, 0.4935, 0.0996, -0.3427, 0.2260, 0.0455,
0.0056]],
[[-0.3147, 0.2937, 0.3170, 0.0374, 0.0856, 0.1972, 0.1793,
-0.1815],
[-0.1413, -0.2737, 0.4023, -0.0043, -0.1145, 0.0961, 0.0909,
-0.1149],
[-0.2327, 0.0745, 0.5349, 0.0076, 0.1540, 0.1582, 0.2454,
-0.2582],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000]],
[[-0.3147, 0.2937, 0.3170, 0.0374, 0.0856, 0.1972, 0.1793,
-0.1815],
[-0.1413, -0.2737, 0.4023, -0.0043, -0.1145, 0.0961, 0.0909,
-0.1149],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000]]], grad_fn=<TransposeBackward0>)
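In the packed run above, every position past a sequence's length comes back as zeros. A common follow-up, sketched here with random inputs and the same assumed sizes (feature 6, hidden 8, lengths [4, 3, 2]), is to pick each sequence's last valid output using the returned lengths; for a single-layer, unidirectional GRU it should equal the final hidden state.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(6, 8, batch_first=True)
x = torch.randn(3, 7, 6)              # stands in for the embedded batch
lengths = torch.tensor([4, 3, 2])     # CPU tensor, sorted in descending order

packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, h_n = gru(packed)
out, out_lens = pad_packed_sequence(out_packed, batch_first=True)

# Row i, time step out_lens[i] - 1 is the last real output of sequence i.
last = out[torch.arange(out.size(0)), out_lens - 1]

print(out.shape)                                # torch.Size([3, 4, 8])
print(torch.allclose(last, h_n[-1], atol=1e-6))
print(bool((out[2, 2:] == 0).all()))            # padded steps are zero
```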
An encoder example
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, n_layer=1, drop_out=0):
        # input_size is the vocabulary size; hidden_size is the GRU hidden
        # feature size (also used as the embedding size)
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, self.hidden_size)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size, bidirectional=True,
                          num_layers=n_layer, dropout=(0 if n_layer == 1 else drop_out))

    def forward(self, input_seq, length, hidden=None):
        # input_seq should have shape [seq_len, batch]
        # length: true (unpadded) lengths of the sequences in input_seq,
        # sorted in descending order, list(int)
        # embedd.shape = [seq_len, batch, hidden_size]
        embedd = self.embedding(input_seq)
        pack = torch.nn.utils.rnn.pack_padded_sequence(embedd, length)
        output, hidden = self.gru(pack, hidden)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output)
        # encoder_output.shape = [max(length), batch, hidden_size]
        # hidden.shape = [2 * n_layer, batch, hidden_size]
        # Sum the forward and backward direction outputs.
        encoder_output = output[:, :, :self.hidden_size] + output[:, :, self.hidden_size:]
        return encoder_output, hidden
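The shapes in the comments above can be checked with a standalone sketch of the same steps (embed, pack, bidirectional GRU, pad, sum the two directions). The vocabulary size of 20, the hidden size of 8, and the toy batch are assumptions; the code repeats the pipeline inline instead of importing the class.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
hidden_size = 8
embedding = nn.Embedding(20, hidden_size)
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

# Time-major input [seq_len=3, batch=3]; each column is one sequence,
# already sorted by length (3, 2, 1) and padded with 0.
input_seq = torch.tensor([[1, 4, 7],
                          [2, 5, 0],
                          [3, 0, 0]])
lengths = [3, 2, 1]

packed = pack_padded_sequence(embedding(input_seq), lengths)
output, hidden = gru(packed)
output, _ = pad_packed_sequence(output)

# The last dimension holds [forward | backward]; summing halves it.
summed = output[:, :, :hidden_size] + output[:, :, hidden_size:]

print(output.shape)  # torch.Size([3, 3, 16])
print(summed.shape)  # torch.Size([3, 3, 8])
print(hidden.shape)  # torch.Size([2, 3, 8])
```

Summing the two directions keeps the encoder output at hidden_size; concatenation would be the other common choice, at the cost of doubling the feature dimension.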
Source: https://blog.csdn.net/qq_41329791/article/details/110731127