
Modern Recurrent Neural Networks


Reference:

https://d2l.ai/chapter_recurrent-modern/index.html 9.1-9.4


Gated Recurrent Units (GRU)

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

A GRU supports gating of the hidden state: dedicated mechanisms decide when the hidden state should be updated and when it should be reset.

Reset Gate and Update Gate

Given the input of the current time step and the hidden state of the previous time step, the outputs of two gates are given by two fully-connected layers with a sigmoid activation function:

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\mathbf H_{t-1}\in \mathbb R^{n\times h}$: the hidden state of the previous time step ($h$ = number of hidden units)

$\mathbf R_t\in \mathbb R^{n\times h}$: the reset gate

$\mathbf Z_t \in \mathbb R^{n\times h}$: the update gate

$\mathbf W_{xr},\mathbf W_{xz}\in \mathbb R^{d\times h}$ and $\mathbf W_{hr}, \mathbf W_{hz}\in \mathbb R^{h\times h}$: weight parameters

$\mathbf b_r, \mathbf b_z\in \mathbb R^{1\times h}$: biases

$$
\begin{aligned}
\mathbf R_t&=\sigma(\mathbf X_t \mathbf W_{xr}+\mathbf H_{t-1}\mathbf W_{hr}+\mathbf b_r)\\
\mathbf Z_t&=\sigma(\mathbf X_t \mathbf W_{xz}+\mathbf H_{t-1}\mathbf W_{hz}+\mathbf b_z)
\end{aligned} \tag{GRU.1}
$$
![Computing the reset gate and the update gate in a GRU model](https://d2l.ai/_images/gru-1.svg)

Hidden State

Next, let us integrate the reset gate $\mathbf R_t$ with the regular latent state updating mechanism. It leads to the following candidate hidden state $\tilde{\mathbf H}_t\in \mathbb R^{n\times h}$ at time step $t$:

$$
\tilde{\mathbf H}_t = \tanh(\mathbf X_t \mathbf W_{xh} + (\mathbf R_t \odot \mathbf H_{t-1}) \mathbf W_{hh} + \mathbf b_h) \tag{GRU.2}
$$

where $\mathbf W_{xh} \in \mathbb R^{d \times h}$ and $\mathbf W_{hh} \in \mathbb R^{h \times h}$ are weight parameters, $\mathbf b_h \in \mathbb R^{1 \times h}$ is the bias, and the symbol $\odot$ is the Hadamard (elementwise) product operator.

![Computing the candidate hidden state in a GRU model](https://d2l.ai/_images/gru-2.svg)

Finally, we need to incorporate the effect of the update gate $\mathbf Z_t$. This determines the extent to which the new hidden state $\mathbf H_t\in \mathbb R^{n\times h}$ is just the old state $\mathbf H_{t-1}$ and by how much the new candidate state $\tilde{\mathbf H}_t$ is used:

$$
\mathbf H_t = \mathbf Z_t \odot \mathbf H_{t-1}+(1-\mathbf Z_t)\odot \tilde{\mathbf H}_t \tag{GRU.3}
$$
![Computing the hidden state in a GRU model](https://d2l.ai/_images/gru-3.svg)
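
To make the update rule concrete, here is a minimal PyTorch sketch of a single GRU time step implementing (GRU.1)-(GRU.3). The parameter names, dictionary layout, and toy shapes are illustrative assumptions, not part of the original text.

```python
import torch

def gru_step(X_t, H_prev, params):
    """One GRU time step following (GRU.1)-(GRU.3).

    X_t: (n, d) minibatch, H_prev: (n, h) previous hidden state,
    params: dict with W_xr, W_xz, W_xh (d, h); W_hr, W_hz, W_hh (h, h); b_r, b_z, b_h (1, h).
    """
    # Reset and update gates (GRU.1)
    R_t = torch.sigmoid(X_t @ params["W_xr"] + H_prev @ params["W_hr"] + params["b_r"])
    Z_t = torch.sigmoid(X_t @ params["W_xz"] + H_prev @ params["W_hz"] + params["b_z"])
    # Candidate hidden state (GRU.2): the reset gate scales the old state elementwise
    H_tilde = torch.tanh(X_t @ params["W_xh"] + (R_t * H_prev) @ params["W_hh"] + params["b_h"])
    # New hidden state (GRU.3): convex combination of old state and candidate
    return Z_t * H_prev + (1 - Z_t) * H_tilde

# Toy shapes: batch size n=2, d=5 inputs, h=4 hidden units
n, d, h = 2, 5, 4
params = {name: 0.01 * torch.randn(d, h) for name in ("W_xr", "W_xz", "W_xh")}
params.update({name: 0.01 * torch.randn(h, h) for name in ("W_hr", "W_hz", "W_hh")})
params.update({name: torch.zeros(1, h) for name in ("b_r", "b_z", "b_h")})

H_t = gru_step(torch.randn(n, d), torch.zeros(n, h), params)
print(H_t.shape)  # torch.Size([2, 4])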

Long Short-Term Memory (LSTM)

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

The LSTM introduces a memory cell that has the same shape as the hidden state (some works in the literature treat the memory cell as a special type of hidden state), engineered to record additional information.

Input Gate, Forget Gate, and Output Gate

To control the memory cell we need a number of gates.

The motivation for such a design is the same as that of GRUs, namely to be able to decide when to remember and when to ignore inputs in the hidden state via a dedicated mechanism.

Just like in GRUs, the data feeding into the LSTM gates are the input at the current time step and the hidden state of the previous time step. They are processed by three fully-connected layers with a sigmoid activation function to compute the values of the input, forget, and output gates. As a result, the values of the three gates are in the range $(0,1)$.

$\mathbf I_t\in \mathbb R^{n\times h}$: the input gate

$\mathbf O_t\in \mathbb R^{n\times h}$: the output gate

$\mathbf F_t \in \mathbb R^{n\times h}$: the forget gate

$\mathbf W_{xi},\mathbf W_{xo},\mathbf W_{xf}\in \mathbb R^{d\times h}$ and $\mathbf W_{hi}, \mathbf W_{ho}, \mathbf W_{hf}\in \mathbb R^{h\times h}$: weight parameters

$\mathbf b_i, \mathbf b_o, \mathbf b_f\in \mathbb R^{1\times h}$: biases

$$
\begin{aligned}
\mathbf I_t&=\sigma(\mathbf X_t \mathbf W_{xi}+\mathbf H_{t-1}\mathbf W_{hi}+\mathbf b_i)\\
\mathbf O_t&=\sigma(\mathbf X_t \mathbf W_{xo}+\mathbf H_{t-1}\mathbf W_{ho}+\mathbf b_o)\\
\mathbf F_t&=\sigma(\mathbf X_t \mathbf W_{xf}+\mathbf H_{t-1}\mathbf W_{hf}+\mathbf b_f)
\end{aligned} \tag{LSTM.1}
$$
![Computing the input gate, the forget gate, and the output gate in an LSTM model](https://d2l.ai/_images/lstm-0.svg)

Memory Cell

Next we design the memory cell. Since we have not specified the action of the various gates yet, we first introduce the candidate memory cell $\tilde{\mathbf C}_t\in\mathbb R^{n\times h}$. Its computation is similar to that of the three gates described above, but uses a $\tanh$ function with a value range of $(-1,1)$ as the activation function:

$$
\tilde{\mathbf C}_t =\tanh(\mathbf X_t \mathbf W_{xc}+\mathbf H_{t-1}\mathbf W_{hc}+\mathbf b_c) \tag{LSTM.2}
$$

where $\mathbf W_{xc}\in \mathbb R^{d\times h}$ and $\mathbf W_{hc}\in \mathbb R^{h\times h}$ are weight parameters and $\mathbf b_c \in \mathbb R^{1\times h}$ is a bias parameter.

![Computing the candidate memory cell in an LSTM model](https://d2l.ai/_images/lstm-1.svg)

In GRUs, we have a mechanism to govern input and forgetting (or skipping). Similarly, in LSTMs we have two dedicated gates for such purposes: the input gate $\mathbf I_t$ governs how much we take new data into account via $\tilde{\mathbf C}_t$, and the forget gate $\mathbf F_t$ addresses how much of the old memory cell content $\mathbf C_{t-1}\in \mathbb R^{n\times h}$ we retain:

$$
\mathbf C_t=\mathbf F_t\odot \mathbf C_{t-1}+\mathbf I_t \odot \tilde{\mathbf C}_t \tag{LSTM.3}
$$
This design is introduced to alleviate the vanishing gradient problem and to better capture long range dependencies within sequences.

![Computing the memory cell in an LSTM model](https://d2l.ai/_images/lstm-2.svg)

Hidden State

Last, we need to define how to compute the hidden state $\mathbf H_t \in \mathbb R^{n\times h}$. This is where the output gate comes into play. In the LSTM it is simply a gated version of the $\tanh$ of the memory cell. This ensures that the values of $\mathbf H_t$ are always in the interval $(-1,1)$:

$$
\mathbf H_t=\mathbf O_t \odot \tanh (\mathbf C_t) \tag{LSTM.4}
$$
Whenever the output gate is close to 1, we effectively pass all memory information through to the predictor, whereas when the output gate is close to 0, we retain all the information within the memory cell and perform no further processing.

![Computing the hidden state in an LSTM model](https://d2l.ai/_images/lstm-3.svg)
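
As a companion to the equations above, the following is a minimal PyTorch sketch of one LSTM time step covering (LSTM.1)-(LSTM.4). The parameter dictionary and toy shapes are illustrative assumptions, not part of the original text.

```python
import torch

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step following (LSTM.1)-(LSTM.4); returns (H_t, C_t)."""
    # Input, output, and forget gates (LSTM.1), each in (0, 1)
    I_t = torch.sigmoid(X_t @ params["W_xi"] + H_prev @ params["W_hi"] + params["b_i"])
    O_t = torch.sigmoid(X_t @ params["W_xo"] + H_prev @ params["W_ho"] + params["b_o"])
    F_t = torch.sigmoid(X_t @ params["W_xf"] + H_prev @ params["W_hf"] + params["b_f"])
    # Candidate memory cell (LSTM.2), in (-1, 1)
    C_tilde = torch.tanh(X_t @ params["W_xc"] + H_prev @ params["W_hc"] + params["b_c"])
    # Memory cell (LSTM.3): forget gate keeps old content, input gate admits the candidate
    C_t = F_t * C_prev + I_t * C_tilde
    # Hidden state (LSTM.4): gated tanh of the memory cell, so H_t stays in (-1, 1)
    H_t = O_t * torch.tanh(C_t)
    return H_t, C_t

# Toy shapes: batch size n=2, d=5 inputs, h=4 hidden units
n, d, h = 2, 5, 4
params = {f"W_x{g}": 0.01 * torch.randn(d, h) for g in "iofc"}
params.update({f"W_h{g}": 0.01 * torch.randn(h, h) for g in "iofc"})
params.update({f"b_{g}": torch.zeros(1, h) for g in "iofc"})

H_t, C_t = lstm_step(torch.randn(n, d), torch.zeros(n, h), torch.zeros(n, h), params)
print(H_t.shape, C_t.shape)  # torch.Size([2, 4]) torch.Size([2, 4])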

Deep Recurrent Neural Networks

We can stack multiple RNN layers on top of each other. The hidden state of each layer is passed both to the next time step within the same layer and to the current time step of the layer above.

![Architecture of a deep RNN](https://d2l.ai/_images/deep-rnn.svg)

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\mathbf H_{t}^{(l)}\in \mathbb R^{n\times h}$: the hidden state of the $l^{\text{th}}$ hidden layer ($h$ = number of hidden units), with $\mathbf H_{t}^{(0)}=\mathbf X_t$

$$
\mathbf H_{t}^{(l)}=\phi_l\left(\mathbf H_{t}^{(l-1)}\mathbf W_{xh}^{(l)}+\mathbf H_{t-1}^{(l)} \mathbf W_{hh}^{(l)}+\mathbf b_{h}^{(l)}\right) \tag{DRNN.1}
$$

In the end, the calculation of the output layer is based only on the hidden state of the final ($L^{\text{th}}$) hidden layer:

$$
\mathbf O_t=\mathbf H_{t}^{(L)}\mathbf W_{hq}+\mathbf b_q \tag{DRNN.2}
$$
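
As a concrete illustration of (DRNN.1)-(DRNN.2), the sketch below uses PyTorch's built-in stacked `nn.GRU`. Choosing a GRU as the per-layer recurrence and the particular shapes and layer counts are assumptions made only for this example.

```python
import torch
from torch import nn

# Stacked recurrence (DRNN.1): each layer passes its hidden state to the next time step
# of the same layer and to the current time step of the layer above.
d, h, q, L = 5, 4, 3, 2            # inputs, hidden units per layer, outputs, layers
T, n = 7, 2                        # time steps, batch size

rnn = nn.GRU(input_size=d, hidden_size=h, num_layers=L)
output_layer = nn.Linear(h, q)     # (DRNN.2): uses only the top layer's hidden state

X = torch.randn(T, n, d)           # shape (time steps, batch, inputs)
H_top, H_last = rnn(X)             # H_top: top-layer states at every step, (T, n, h)
O = output_layer(H_top)            # outputs at every step, (T, n, q)
print(O.shape, H_last.shape)       # torch.Size([7, 2, 3]) torch.Size([2, 2, 4])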

Bidirectional Recurrent Neural Networks

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

To motivate why one might pick this specific architecture, we can first take a detour to probabilistic models.

Dynamic Programming in Hidden Markov Models

![A hidden Markov model](https://d2l.ai/_images/hmm.svg)

For a sequence of $T$ observations we have the following joint probability distribution over the observed and hidden states:

$$
P(x_1,\cdots, x_T,h_1,\cdots,h_T)=\prod_{t=1}^T P(h_t\mid h_{t-1})P(x_t\mid h_t) \tag{HMM.1}
$$

where $P(h_1\mid h_0)=P(h_1)$.

Now assume that we observe all $x_i$ with the exception of some $x_j$, and our goal is to compute $P(x_j\mid x_{-j})$, where $x_{-j}=(x_1,\cdots, x_{j-1},x_{j+1},\cdots, x_T)$.

Since there is no latent variable in $P(x_j\mid x_{-j})$, we consider summing over all possible combinations of choices for $h_1,\cdots,h_T$. If each $h_i$ can take on $k$ distinct values, this means that we need to sum over $k^T$ terms, which is usually infeasible. Fortunately, there is an elegant solution for this: dynamic programming.

The trick is to push the sums inward: a forward recursion $\pi_{t+1}(h_{t+1})=\sum_{h_t}\pi_t(h_t)P(x_t\mid h_t)P(h_{t+1}\mid h_t)$ with $\pi_1(h_1)=P(h_1)$ accumulates the observations before step $t+1$, and a backward recursion $\rho_{t-1}(h_{t-1})=\sum_{h_t}P(h_t\mid h_{t-1})P(x_t\mid h_t)\rho_t(h_t)$ with $\rho_T(h_T)=1$ accumulates those after step $t-1$. Combining the two at position $j$, we are able to compute

$$
P(x_j\mid x_{-j})\propto \sum_{h_j} \pi_j(h_j) P(x_j\mid h_j) \rho_j(h_j) \tag{HMM.7}
$$
These recursions allow us to sum over the $T$ latent variables in $\mathcal O(kT)$ (linear) time over all values of $(h_1,\cdots,h_T)$ rather than in exponential time.
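
A small sketch of the forward-backward recursions above for computing $P(x_j\mid x_{-j})$ with discrete emissions. The transition matrix, emission matrix, initial distribution, and the observed sequence are made-up toy values used only for illustration.

```python
import torch

# Toy HMM with k=3 hidden states and 2 observation symbols; all parameters are made up.
k = 3
A = torch.tensor([[0.7, 0.2, 0.1],    # A[i, j] = P(h_{t+1} = j | h_t = i)
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
B = torch.tensor([[0.9, 0.1],         # B[i, o] = P(x_t = o | h_t = i)
                  [0.5, 0.5],
                  [0.1, 0.9]])
p_h1 = torch.tensor([0.4, 0.3, 0.3])  # P(h_1)

x = [0, 1, 1, 0, 1]                   # observed symbols; position j is the one we predict
T, j = len(x), 2                      # 0-indexed, so j=2 is the third observation

# Forward recursion: pi_{t+1}(h') = sum_h pi_t(h) P(x_t | h) P(h' | h), starting from P(h_1)
pi = p_h1.clone()
for t in range(j):                    # folds in x[0], ..., x[j-1]
    pi = (pi * B[:, x[t]]) @ A

# Backward recursion: rho_{t-1}(h) = sum_h' P(h' | h) P(x_t | h') rho_t(h'), starting from 1
rho = torch.ones(k)
for t in range(T - 1, j, -1):         # folds in x[T-1], ..., x[j+1]
    rho = A @ (B[:, x[t]] * rho)

# (HMM.7): P(x_j = o | x_{-j}) is proportional to sum_h pi_j(h) P(x_j = o | h) rho_j(h)
scores = torch.stack([(pi * B[:, o] * rho).sum() for o in range(B.shape[1])])
print(scores / scores.sum())          # distribution over the missing observation x_j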

Bidirectional Model

If we want a mechanism in RNNs that offers look-ahead ability comparable to that of hidden Markov models, we need to modify the RNN design we have seen so far. Instead of running an RNN only in the forward mode starting from the first token, we start another one from the last token running from back to front. Bidirectional RNNs add a hidden layer that passes information in a backward direction to more flexibly process such information.

![Architecture of a bidirectional RNN](https://d2l.ai/_images/birnn.svg)

In fact, this is not too dissimilar to the forward and backward recursions in the dynamic programming of hidden Markov models.

$\mathbf X_t \in \mathbb R^{n\times d}$: a minibatch with batch size $n$ and $d$ inputs

$\overrightarrow{\mathbf H}_t\in \mathbb R^{n\times h}$, $\overleftarrow{\mathbf H}_t\in \mathbb R^{n\times h}$: the forward and backward hidden states for this time step

$\mathbf W_{xh}^{(f)} \in \mathbb R^{d \times h}$, $\mathbf W_{hh}^{(f)} \in \mathbb R^{h \times h}$, $\mathbf W_{xh}^{(b)} \in \mathbb R^{d \times h}$, $\mathbf W_{hh}^{(b)} \in \mathbb R^{h \times h}$: weights

$\mathbf b_h^{(f)} \in \mathbb R^{1 \times h}$, $\mathbf b_h^{(b)} \in \mathbb R^{1 \times h}$: biases

$$
\begin{aligned}
\overrightarrow{\mathbf H}_t &= \phi(\mathbf X_t \mathbf W_{xh}^{(f)} + \overrightarrow{\mathbf H}_{t-1} \mathbf W_{hh}^{(f)} + \mathbf b_h^{(f)}),\\
\overleftarrow{\mathbf H}_t &= \phi(\mathbf X_t \mathbf W_{xh}^{(b)} + \overleftarrow{\mathbf H}_{t+1} \mathbf W_{hh}^{(b)} + \mathbf b_h^{(b)}),
\end{aligned} \tag{BM.1}
$$
Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf H}_t$ and $\overleftarrow{\mathbf H}_t$ to obtain the hidden state $\mathbf H_t \in \mathbb R^{n \times 2h}$ that is fed into the output layer. Last, the output layer computes the output $\mathbf O_t \in \mathbb R^{n \times q}$ ($q$: number of outputs):

$$
\mathbf O_t = \mathbf H_t \mathbf W_{hq} + \mathbf b_q \tag{BM.2}
$$

Here, the weight matrix $\mathbf W_{hq} \in \mathbb R^{2h \times q}$ and the bias $\mathbf b_q \in \mathbb R^{1 \times q}$ are the model parameters of the output layer. In fact, the two directions can have different numbers of hidden units.
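
A from-scratch sketch of (BM.1)-(BM.2): run one recurrence forward and one backward over the sequence, then concatenate their states before the output layer. For brevity the recurrent update here is a plain tanh RNN step rather than a GRU or LSTM, and all names and shapes are illustrative assumptions.

```python
import torch

def rnn_step(X_t, H_prev, W_xh, W_hh, b_h):
    """Vanilla recurrent update, used for both directions (phi = tanh)."""
    return torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)

def birnn(X, fwd, bwd, W_hq, b_q):
    """X: (T, n, d). Returns outputs of shape (T, n, q) following (BM.1)-(BM.2)."""
    T, n, _ = X.shape
    h = fwd[1].shape[0]
    H_f = [torch.zeros(n, h)]
    for t in range(T):                              # forward pass over t = 1, ..., T
        H_f.append(rnn_step(X[t], H_f[-1], *fwd))
    H_b = [torch.zeros(n, h)]
    for t in reversed(range(T)):                    # backward pass over t = T, ..., 1
        H_b.append(rnn_step(X[t], H_b[-1], *bwd))
    H_b = H_b[1:][::-1]                             # reorder so H_b[t] belongs to time step t
    outputs = []
    for t in range(T):
        H_t = torch.cat([H_f[t + 1], H_b[t]], dim=1)   # concatenate -> (n, 2h)
        outputs.append(H_t @ W_hq + b_q)               # (BM.2)
    return torch.stack(outputs)

d, h, q, T, n = 5, 4, 3, 6, 2
fwd = (0.01 * torch.randn(d, h), 0.01 * torch.randn(h, h), torch.zeros(1, h))
bwd = (0.01 * torch.randn(d, h), 0.01 * torch.randn(h, h), torch.zeros(1, h))
W_hq, b_q = 0.01 * torch.randn(2 * h, q), torch.zeros(1, q)
print(birnn(torch.randn(T, n, d), fwd, bwd, W_hq, b_q).shape)  # torch.Size([6, 2, 3])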
