
Deep Learning Week13 Notes


1. Attention for Memory and Sequence Translation

Attention mechanisms aggregate features with an importance score that depends on the features themselves rather than on their positions in the tensor.

Neural Turing Machine

\(\large\textbf{Illustration: refer }\) Lecture-P6

This memory module has a hidden internal state that takes the form of a tensor:

\[M_t\in \mathbb{R}^{N\times M} \]

where \(t\) is the time step, \(N\) is the number of entries in the memory and \(M\) is their dimension.

A “controller” is implemented as a standard feed-forward or recurrent model and at every iteration \(t\) it computes activations that modulate the reading / writing operations.

More formally, the memory module implements:

\[r_t = \sum_{n=1}^Nw_t(n)M_t(n) \]

\[\forall n, M_t(n) = M_{t-1}(n)[1-w_t(n)e_t]+w_t(n)a_t \]

The controller has multiple “heads”. At every step \(t\) it computes, for each writing head, the quantities \(w_t, e_t, a_t\), and for each reading head a weighting \(w_t\), from which it gets back a read value \(r_t\).
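A minimal PyTorch sketch of this read / write mechanism (illustrative only: the addressing scheme and the controller of Graves et al. are omitted, and the head outputs \(w_t, e_t, a_t\) are random placeholders):

import torch

N, M = 8, 4                                   # number of memory entries, their dimension
memory = torch.zeros(N, M)                    # M_{t-1}

w = torch.softmax(torch.randn(N), dim = 0)    # addressing weights w_t over the N entries
e = torch.sigmoid(torch.randn(M))             # erase vector e_t in [0, 1]^M
a = torch.randn(M)                            # add vector a_t

# Read: r_t = sum_n w_t(n) M_t(n)
r = (w.unsqueeze(1) * memory).sum(dim = 0)

# Write: M_t(n) = M_{t-1}(n) (1 - w_t(n) e_t) + w_t(n) a_t
memory = memory * (1 - w.unsqueeze(1) * e) + w.unsqueeze(1) * a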

Attention for seq2seq

Given an input sequence \(x_1,...,x_T\), the standard approach for sequence-to-sequence translation (Sutskever et al., 2014) uses a recurrent model:

\[h_t = f(x_t,h_{t-1}) \]

and considers that the final hidden state:

\[v = h_T \]

carries enough information to drive an auto-regressive generative model:

\[y_t\sim p(y_t\mid y_1,...,y_{t-1},v) \]

itself implemented with another RNN.

$\LARGE \star $ The main weakness of such an approach is that all the information has to flow through a single state \(v\), whose capacity has to accommodate any situation. There are no direct “channels” to transport local information from the input sequence to the place where it is useful in the resulting sequence.

Attention mechanisms (Bahdanau et al., 2014) can transport information from parts of the signal to other parts specified dynamically.

Bahdanau et al. (2014) proposed to extend a standard recurrent model with such a mechanism. They first run a bi-directional RNN to get a hidden state:

\[h_{i}=\left(h_{i}^{\rightarrow}, h_{i}^{\leftarrow}\right), \quad i=1, \ldots, T \]

From this, they compute a new process \(s_i,i = 1,...,T\), which looks at weighted averages of the \(h_j\), where the weights are functions of the signal.

Given \(y_1,...,y_{i-1}\) and \(s_1,...,s_{i-1}\), first compute an attention:

\[\forall j, \alpha_{i, j}=\operatorname{softmax}_{j} a\left(s_{i-1}, h_{j}\right) \]

where \(a\) is a one-hidden-layer \(\tanh\) MLP. Then compute the context vector from the \(h_j\):

\[c_i = \sum_{j=1}^T \alpha_{i,j} h_j \]

The model can now make the prediction:

\[\begin{align} s_i &= f(s_{i-1},y_{i-1},c_i)\\ y_i&\sim g(y_{i-1},s_i,c_i) \end{align} \]

where \(f\) is a GRU.

\(\Large\textbf{Illustration: refer }\) Lecture-P20
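A minimal sketch of one such attention step (the sizes and the scoring MLP below are illustrative assumptions, not the exact configuration of Bahdanau et al.):

import torch
from torch import nn

T, H, S = 10, 16, 32              # input length, encoder state dim., decoder state dim.
h = torch.randn(T, H)             # bi-directional encoder states h_1, ..., h_T
s_prev = torch.randn(S)           # previous decoder state s_{i-1}

# a(s_{i-1}, h_j): a one-hidden-layer tanh MLP giving one score per position j
a = nn.Sequential(nn.Linear(S + H, 64), nn.Tanh(), nn.Linear(64, 1))

scores = a(torch.cat([s_prev.expand(T, S), h], dim = 1)).squeeze(1)   # (T,)
alpha = torch.softmax(scores, dim = 0)                                # attention weights
c = (alpha.unsqueeze(1) * h).sum(dim = 0)                             # context vector c_i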

2. Attention Mechanisms

Given an attention function

\[a:\mathbb{R}^{D'}\times\mathbb{R}^D\rightarrow \mathbb{R} \]

and model parameters:

\[\theta\in \mathbb{R}^{T\times D} \]

this operation takes a “value” tensor as input:

\[V\in \mathbb{R}^{T'\times D'} \]

and computes the output:

\[Y\in\mathbb{R}^{T\times D'} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \frac{\exp \left(a\left(V_{i} ; \theta_{j}\right)\right)}{\sum_{k=1}^{T^{\prime}} \exp \left(a\left(V_{k} ; \theta_{j}\right)\right)} V_{i} \\ &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(V_{i} ; \theta_{j}\right)\right) V_{i} \end{aligned} \]

A context-dependent version parameterizes the attention function with a single \(\theta\) shared across output positions, and takes as input a “context” tensor:

\[C\in \mathbb{R}^{T\times D} \]

and a "value" tensor:

\[V\in \mathbb{R}^{T'\times D} \]

computes a tensor

\[Y\in \mathbb{R}^{T\times D} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(C_j,V_{i} ; \theta\right)\right) V_{i} \end{aligned} \]

\(\large\text{Illustration of the difference: refer }\) Lecture-P4
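A minimal numeric sketch of the context-dependent case, taking a plain dot product as the attention function \(a\) purely for illustration (in general it is a small trainable network with parameters \(\theta\)):

import torch

T, Tp, D = 4, 6, 8
C = torch.randn(T, D)                    # "context" tensor
V = torch.randn(Tp, D)                   # "value" tensor

A = torch.softmax(C @ V.t(), dim = 1)    # (T, T'), softmax over i of a(C_j, V_i)
Y = A @ V                                # (T, D),  Y_j = sum_i softmax_i(...) V_i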

Using the terminology of Graves et al. (2014), attention is an averaging of values associated with keys that match a query. Hence the keys used for computing attention and the values to average are different quantities.

Given a query sequence \(Q\in\mathbb{R}^{T\times D}\), a key sequence \(K\in \mathbb{R}^{T'\times D}\), and a value sequence \(V\in\mathbb{R}^{T'\times D'}\), compute a matrix \(A\in \mathbb{R}^{T\times T'}\) by matching \(Q\) to \(K\), and weight \(V\) with it to get the result sequence \(Y\in\mathbb{R}^{T\times D'}\):

\[\begin{align} \forall i, A_i &= \text{softmax}(\frac{KQ_i}{\sqrt{D}})\\ Y_i &= V^TA_i \end{align} \]

or

\[\begin{align} A &= \text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})\in \mathbb{R}^{T\times T'}\\ Y&= AV\in\mathbb{R}^{T\times D'} \end{align} \]

The queries and keys have the same dimension \(D\), and there are as many keys \(T'\) as there are values. The result \(Y\) has as many rows \(T\) as there are queries, and its rows have the same dimension \(D'\) as the values.

\(\large\text{Illustration: refer }\) Lecture-P9.
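A quick check, with illustrative sizes, that the per-query and matrix formulations above agree:

import math
import torch

T, Tp, D, Dp = 3, 5, 8, 4
Q = torch.randn(T, D)
K = torch.randn(Tp, D)
V = torch.randn(Tp, Dp)

# Matrix form: A = softmax_row(Q K^T / sqrt(D)), Y = A V
A = torch.softmax(Q @ K.t() / math.sqrt(D), dim = 1)    # (T, T')
Y = A @ V                                                # (T, D')

# Per-query form for i = 0: A_0 = softmax(K Q_0 / sqrt(D)), Y_0 = V^T A_0
A0 = torch.softmax(K @ Q[0] / math.sqrt(D), dim = 0)
assert torch.allclose(Y[0], V.t() @ A0)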

A standard attention layer takes as input two sequences \(X\) and \(X'\), and computes the tensors \(K,V,Q\) as the linear functions:

\[\begin{align} K&= W^KX\\ V&=W^VX\\ Q&=W^QX'\\ Y&=\text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})V \end{align} \]

When \(X = X'\), this is self-attention, otherwise it is cross-attention.

Multi-head attention combines several such operations in parallel, and \(Y\) is the concatenation of the results along the feature dimension.
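For instance, PyTorch's torch.nn.MultiheadAttention implements such a multi-head layer; a minimal usage sketch with made-up sizes:

import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim = 64, num_heads = 4, batch_first = True)

X  = torch.randn(2, 10, 64)    # sequence providing keys and values, (batch, T', D)
Xp = torch.randn(2, 15, 64)    # sequence providing queries,         (batch, T,  D)

Y_self, A_self = mha(X, X, X)       # self-attention: Q, K, V all computed from X
Y_cross, A_cross = mha(Xp, X, X)    # cross-attention: Q from X', K and V from X
print(Y_cross.size(), A_cross.size())   # torch.Size([2, 15, 64]) torch.Size([2, 15, 10])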

\(\Large\textbf{Note:}\)

\(\text{batch matrix product}\): torch.matmul()

>>> a = torch.rand(11, 9, 2, 3)
>>> b = torch.rand(11, 9, 3, 4)
>>> m = a.matmul(b)
>>> m.size()
torch.Size([11, 9, 2, 4])
>>>
>>> m[7, 1]
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>> a[7, 1].mm(b[7, 1])
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>>
>>> m[3, 0]
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])
>>> a[3, 0].mm(b[3, 0])
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])

\(\text{Attention layer Code:}\)

import torch
from torch import nn

class AttentionLayer(nn.Module):
    def __init__(self, in_channels, out_channels, key_channels):
        super().__init__()
        # 1x1 convolutions computing Q, K, V from an input of shape (N, C, T)
        self.conv_Q = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_K = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_V = nn.Conv1d(in_channels, out_channels, kernel_size = 1, bias = False)

    def forward(self, x):
        Q = self.conv_Q(x)                                 # (N, key_channels, T)
        K = self.conv_K(x)                                 # (N, key_channels, T)
        V = self.conv_V(x)                                 # (N, out_channels, T)
        A = Q.transpose(1, 2).matmul(K).softmax(2)         # (N, T, T) attention matrix
        y = A.matmul(V.transpose(1, 2)).transpose(1, 2)    # (N, out_channels, T)
        return y

The computation of the attention matrix \(A\) and the layer’s output \(Y\) could also be expressed somewhat more clearly with Einstein summations:

A = torch.einsum('nct,ncs->nts', Q, K).softmax(2)
y = torch.einsum('nts,ncs->nct', A, V)

Positional Encoding

Since an attention layer is invariant to a permutation of its input positions, positional information has to be provided explicitly, e.g. by encoding each index in binary:

>>> len = 20
>>> c = math.ceil(math.log(len) / math.log(2.0))
>>> o = 2**torch.arange(c).unsqueeze(1)
>>> pe = (torch.arange(len).unsqueeze(0).div(o, rounding_mode = 'floor')) % 2
>>> pe
tensor([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

3. Transformer Networks

\(\Large\text{Illustration: refer }\) Lecture-P2

\[\begin{aligned} \operatorname{Attention}(Q, K, V) &=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V \\ \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(H_{1}, \ldots, H_{h}\right) W^{O} \\ H_{i} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right), i=1, \ldots, h \end{aligned} \]

where

\[W_{i}^{Q} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{K} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{V} \in \mathbb{R}^{d_{\text {model }} \times d_{v}}, \quad W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text {model }}} \]

\(\textbf{Positional information:}\)

\[\begin{gathered} PE_{t, 2 i}=\sin \left(\frac{t}{10,000^{\frac{2 i}{d_{\text{model}}}}}\right) \\ PE_{t, 2 i+1}=\cos \left(\frac{t}{10,000^{\frac{2 i}{d_{\text{model}}}}}\right) \end{gathered} \]
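A minimal sketch of this sinusoidal encoding (a hypothetical helper, assuming an even \(d_{\text{model}}\)); the resulting rows are added to the token embeddings:

import math
import torch

def positional_encoding(T, d_model):
    # PE[t, 2i] = sin(t / 10000^(2i / d_model)), PE[t, 2i+1] = cos(t / 10000^(2i / d_model))
    t = torch.arange(T, dtype = torch.float).unsqueeze(1)                  # (T, 1)
    inv_freq = torch.exp(torch.arange(0, d_model, 2, dtype = torch.float)
                         * (-math.log(10000.0) / d_model))                 # (d_model / 2,)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(t * inv_freq)
    pe[:, 1::2] = torch.cos(t * inv_freq)
    return pe

pe = positional_encoding(20, 64)    # one row per position t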

\(\Large\text{Overall Illustration: refer }\) Lecture-P5

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is a transformer pre-trained with two self-supervised objectives, masked language modeling and next-sentence prediction:

\(\Large\text{Illustration: refer }\) Lecture-P14

\(\text{GPT: a transformer trained for auto-regressive text generation}\) Lecture-P18

We can use HuggingFace’s pre-trained models:

import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100): # generate no more than 100 tokens
    outputs = model(torch.tensor([tokens])).logits        # (1, len(tokens), vocab_size)
    next_token = torch.argmax(outputs[0, -1]).item()      # greedy choice of the next token
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.': break

print(tokenizer.decode(tokens))

Vision Transformers

\(\Large\text{Illustration: refer }\) Lecture-P31
