Deep Learning Review

Author: Internet

8-2 image classification

1x1 Conv

filter:

\[(F_1, C, 1, 1) \]

where \(F_1\) is the number of filters (output channels) and \(C\) is the number of input channels. Original input:

\[(N,C,H,W) \]

then it's transformed to:

\[(N,C,H,W)\rightarrow (N,F_1,H,W) \]

So 1x1 conv filters can be used to change the dimensionality in the filter space.

In Inception modules, 1x1 convolutions are used to reduce the number of channels before the expensive 3x3 and 5x5 convolutions.
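The shape change described above can be checked directly in PyTorch. This is a minimal sketch; the sizes `N, C, H, W` and `F1` are hypothetical values chosen only for illustration.

```python
import torch
from torch import nn

# Hypothetical sizes chosen only for illustration.
N, C, H, W = 2, 64, 8, 8
F1 = 16  # number of 1x1 filters, i.e. output channels

x = torch.randn(N, C, H, W)
conv1x1 = nn.Conv2d(C, F1, kernel_size=1)  # weight shape: (F1, C, 1, 1)
y = conv1x1(x)

print(tuple(conv1x1.weight.shape))  # (16, 64, 1, 1)
print(tuple(y.shape))               # (2, 16, 8, 8)

# A 1x1 convolution applies the same linear map across channels at every pixel.
w = conv1x1.weight.view(F1, C)
y_ref = torch.einsum('fc,nchw->nfhw', w, x) + conv1x1.bias.view(1, F1, 1, 1)
print(torch.allclose(y, y_ref, atol=1e-4))
```

The spatial size is untouched; only the channel dimension goes from `C` to `F1`, which is what makes the operation cheap as a reduction step.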

Auxiliary Classifier

Auxiliary classifiers are a type of architectural component that seeks to improve the convergence of very deep networks.

They are classifier heads we attach to layers before the end of the network.

The motivation is to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combatting the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks.

This is motivated by the reasonable performance of shallow networks that indicates early layers already encode informative and invariant features.

AlexNet

data augmentation during training to reduce over-fitting:

During test: the prediction is averaged over five crops (the four corners and the center) and their horizontal reflections.

Convolutionization

The convolutionized version of a fully connected network re-uses computation of early layers to do the computation of the classifier
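As a rough illustration of convolutionization, a fully connected layer that reads a flattened \(C \times h \times w\) feature map can be rewritten as a convolution with an \(h \times w\) kernel. The sizes below are hypothetical, and the sketch only covers a single layer, not a full classifier.

```python
import torch
from torch import nn

torch.manual_seed(0)
C, h, w, nb_classes = 8, 5, 5, 10  # hypothetical sizes

# Original classifier layer working on a flattened C x h x w feature map.
fc = nn.Linear(C * h * w, nb_classes)

# Equivalent convolution: one h x w filter per output unit, same weights.
conv = nn.Conv2d(C, nb_classes, kernel_size=(h, w))
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(nb_classes, C, h, w))
    conv.bias.copy_(fc.bias)

# On an input of the original size, both give the same scores.
x = torch.randn(1, C, h, w)
out_fc = fc(x.flatten(1))                 # shape (1, 10)
out_conv = conv(x)                        # shape (1, 10, 1, 1)
same = torch.allclose(out_fc, out_conv.flatten(1), atol=1e-4)
print(same)

# On a larger feature map, the convolution yields a map of predictions,
# one per h x w window, while sharing the computation of overlapping windows.
x_large = torch.randn(1, C, 9, 12)
out_map = conv(x_large)
print(tuple(out_map.shape))               # (1, 10, 5, 8)
```

Each spatial position of `out_map` equals what the fully connected layer would return on the corresponding window of the larger input.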

Overfeat

In their “overfeat” approach, Sermanet et al. (2013) combined this with a stride 1 final max-pooling to get multiple predictions.

They could afford parsing the scene at \(6\) scales to improve invariance.

The “overfeat” version of AlexNet computes the convolutions on the full image once, and only “moves” the fully connected layers over the output of the feature extractor.

\(\textbf{Advantages:}\)

Summary

• standard ones are extensions of LeNet5,
• everybody loves ReLU,
• state-of-the-art networks have \(100\)s of channels and \(10\)s of layers,
• they can (should?) be fully convolutional,
• pass-through connections allow deeper “residual” nets,
• bottleneck local structures reduce the number of parameters,
• aggregated pathways reduce the number of parameters.

8-3 object detection

This was mitigated in overfeat (Sermanet et al., 2013) by adding a regression part to predict the object’s bounding box.

In the single-object case, the convolutional layers are frozen, and the localization layers are trained with an \(L_2\) loss.

\(\textbf{Note:}\)

Region proposals

Other approaches:

\(\textbf{Disadvantages:}\)
These methods suffer from the cost of the region proposal computation, which is non-convolutional and not implementable on GPU.

They were improved by Ren et al. (2015) in “Faster R-CNN” by replacing the region proposal algorithm with a convolutional processing similar to Overfeat.

YOLO

Details: refer to the blog.

\(\textbf{Notes:}\)

\(\large\text{Engineering Tricks}\)

• pre-train the first 20 convolutional layers on ImageNet classification,
• use 448 × 448 input for detection, instead of 224 × 224,
• use Leaky ReLU for all layers,
• dropout after the first fully connected layer,
• normalize bounding box parameters in \([0, 1]\),
• use a quadratic loss not only for the bounding box coordinates, but also for the confidence and the class scores,
• reduce the weight of large bounding boxes by using the square roots of the size in the loss,
• reduce the importance of empty cells by weighting less the confidence-related loss on them,
• use momentum \(0.9\), weight decay \(5 \times 10^{-4}\),
• data augmentation with scaling, translation, and HSV transformation.

SSD

The \(\textbf{Single Shot Multi-box Detector}\) (SSD, Liu et al., 2015) improves upon YOLO with a fully convolutional architecture and multi-scale maps.

Summary for 'One-shot'

9-2 Looking at the activations

We have already seen PCA and \(k\)-means as two standard methods for dimension reduction, but they poorly convey the structure of a smooth, low-dimensional, non-flat manifold.

\(\textbf{Notes:}\)

t-SNE

t-SNE optimizes the \(y_i\)s with SGD so that the distributions of distances to close neighbors of each point are preserved.

More precisely, it minimizes the KL divergence \(D_{KL}\) between two distance-dependent distributions: Gaussian in the original space, and Student's t-distribution in the low-dimensional one.

9-3 Visualizing in Input

Another approach to understanding the functioning of a network is to look at the behavior of the network “around” an image.

Guided Back-Propagation

Discarding structures which would not contribute positively to the final response, and discarding structures which are not already present:

\[\mathbf{1}_{\{s>0\}} \mathbf{1}_{\left\{\frac{\partial \ell}{\partial x}>0\right\}} \frac{\partial \ell}{\partial x} \]

which keeps only units that have both a positive activation and a positive gradient.
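The masking rule above can be written as a ReLU with a modified backward pass. This is a minimal sketch using `torch.autograd.Function`; the class name `GuidedReLU` and the toy tensors are illustrative assumptions, not part of the original notes.

```python
import torch

class GuidedReLU(torch.autograd.Function):
    """ReLU whose backward pass also zeroes negative incoming gradients."""

    @staticmethod
    def forward(ctx, s):
        ctx.save_for_backward(s)
        return s.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (s,) = ctx.saved_tensors
        # 1_{s > 0} * 1_{dl/dx > 0} * dl/dx
        return (s > 0).float() * (grad_output > 0).float() * grad_output

# Toy check: only the entry with s > 0 AND a positive upstream gradient survives.
s = torch.tensor([-1.0, 2.0, 3.0, -4.0], requires_grad=True)
upstream = torch.tensor([1.0, -1.0, 0.5, 2.0])
GuidedReLU.apply(s).backward(upstream)
print(s.grad)
```

In practice one would substitute this function for every ReLU of the network before back-propagating from the output of interest to the input image.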

Grad-CAM

It computes a sum of the activations weighted by the average gradient of the output of interest w.r.t. individual channels.
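A minimal sketch of that computation on a tiny, randomly initialized network follows. The two-part model (`features`, `head`), the input size and the class index are hypothetical. The final ReLU follows the usual Grad-CAM formulation and is not stated in the sentence above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical tiny network, split at the layer whose activations we inspect.
features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 5))

x = torch.randn(1, 3, 16, 16)
A = features(x)          # activations, shape (1, 4, 16, 16)
A.retain_grad()          # keep the gradient of this non-leaf tensor
score = head(A)[0, 2]    # output of interest (class index 2, arbitrary)
score.backward()

# Average gradient per channel, then weighted sum of the activation maps.
alpha = A.grad.mean(dim=(2, 3), keepdim=True)   # shape (1, 4, 1, 1)
cam = torch.relu((alpha * A).sum(dim=1))        # shape (1, 16, 16)
print(tuple(cam.shape))
```

The resulting map has the spatial resolution of the chosen layer and is usually upsampled to the input size for display.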

Details: refer to the blog.

Optimizing Inputs

Since \(f\) is trained in a discriminative manner, a sample \(\hat{x}\) maximizing it has no reason to be “realistic”.

We can mitigate this by adding a penalty \(h\) corresponding to a “realistic” prior, that is compute:

\[x^{*}=\underset{x}{\operatorname{argmax}} f(x ; w)-h(x) \]

A reasonable \(h\) penalizes too much energy in the high frequencies by integrating edge amplitude at multiple scales.

This can be formalized as a penalty function \(h\) of the form:

\[h(x)=\sum_{s \geq 0}\left\|\delta^{s}(x)-g \circledast \delta^{s}(x)\right\|^{2} \]

where \(g\) is a Gaussian kernel, and \(\delta\) is a downscale-by-two operator.

The quadratic form of this penalty makes it lower when the energy is spread-out across terms.
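One possible way to compute such a penalty is sketched below. Several choices are assumptions made for the example: the sum is truncated to `nb_scales` terms, \(\delta\) is implemented as 2x2 average pooling, \(g\) is a 5x5 Gaussian with \(\sigma = 1\), and borders use replicate padding so that a constant image has zero penalty.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    # Normalized 2D Gaussian kernel, shaped for F.conv2d with one channel.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g2d = g1d[:, None] * g1d[None, :]
    return (g2d / g2d.sum()).view(1, 1, size, size)

def high_freq_penalty(x, nb_scales=3):
    """x: tensor of shape (N, 1, H, W). Returns a scalar tensor."""
    g = gaussian_kernel().to(x)
    pad = g.size(-1) // 2
    penalty = x.new_zeros(())
    for _ in range(nb_scales):
        padded = F.pad(x, (pad, pad, pad, pad), mode='replicate')
        blurred = F.conv2d(padded, g)
        penalty = penalty + (x - blurred).pow(2).sum()  # ||d^s(x) - g * d^s(x)||^2
        x = F.avg_pool2d(x, 2)                          # downscale by two
    return penalty

torch.manual_seed(0)
noise = torch.randn(1, 1, 16, 16)
smooth = torch.ones(1, 1, 16, 16)
print(high_freq_penalty(noise).item(), high_freq_penalty(smooth).item())
```

The noisy image is penalized heavily, while the constant image is not, which is the behavior expected from a prior against high-frequency energy.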

10-1 Auto-regression

Auto-regression methods model components of a signal serially, each one conditionally on the ones already modeled.

They rely on the chain rule from probability theory:

\[\begin{aligned} &\forall x_{1}, \ldots, x_{T}, P\left(X_{1}=x_{1}, \ldots, X_{T}=x_{T}\right)= \\ &P\left(X_{1}=x_{1}\right) P\left(X_{2}=x_{2} \mid X_{1}=x_{1}\right) \ldots P\left(X_{T}=x_{T} \mid X_{1}=x_{1}, \ldots, X_{T-1}=x_{T-1}\right) \end{aligned} \]

10-2 Causal convolutions

Auto-regression: During training, even though the full sequence is known, common computation is lost.
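Causal convolutions recover this shared computation by processing the whole sequence at once while making sure that output \(t\) only depends on inputs up to \(t\). A minimal sketch with left padding follows; the class name `CausalConv1d` and the sizes are illustrative assumptions. For next-step prediction the input would additionally be shifted by one position.

```python
import torch
from torch import nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution padded on the left only, so no output sees the future."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (N, C, T)
        return self.conv(F.pad(x, (self.left_pad, 0)))

torch.manual_seed(0)
layer = CausalConv1d(1, 1, kernel_size=3, dilation=2)
x = torch.randn(1, 1, 10)
y = layer(x)

# Changing the input from time step 6 onward must leave outputs 0..5 unchanged.
x_mod = x.clone()
x_mod[0, 0, 6:] += 1.0
y_mod = layer(x_mod)
print(tuple(y.shape), torch.allclose(y[..., :6], y_mod[..., :6]))
```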

\(\textbf{Notes:}\)

11-1 Adversarial Network

The approach is adversarial since the two networks have antagonistic objectives.

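import torch
from torch import nn, optim

# Assumption: the notes do not define the data or the number of epochs;
# a toy 2D Gaussian blob and a small epoch count stand in for them.
real_samples = torch.randn(1000, 2) * 0.5 + 2.0
nb_epochs = 10

z_dim = 8
nb_hidden = 100

# Generator: maps a latent vector z to a 2D sample.
model_G = nn.Sequential(nn.Linear(z_dim, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 2))

# Discriminator: estimates the probability that a 2D sample is real.
model_D = nn.Sequential(nn.Linear(2, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 1),
                        nn.Sigmoid())

batch_size, lr = 10, 1e-3

optimizer_G = optim.Adam(model_G.parameters(), lr=lr)
optimizer_D = optim.Adam(model_D.parameters(), lr=lr)

for e in range(nb_epochs):
    for t, real_batch in enumerate(real_samples.split(batch_size)):
        # Sample latent vectors with the same device and dtype as the data.
        z = torch.randn(real_batch.size(0), z_dim,
                        device=real_batch.device, dtype=real_batch.dtype)
        fake_batch = model_G(z)

        D_scores_on_real = model_D(real_batch)
        D_scores_on_fake = model_D(fake_batch)

        if t % 2 == 0:
            # Generator step: push D to score fake samples as real.
            loss = (1 - D_scores_on_fake).log().mean()
            optimizer_G.zero_grad()
            loss.backward()
            optimizer_G.step()
        else:
            # Discriminator step: binary cross-entropy, real vs. fake.
            loss = - (1 - D_scores_on_fake).log().mean() \
                   - D_scores_on_real.log().mean()
            optimizer_D.zero_grad()
            loss.backward()
            optimizer_D.step()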

\(\textbf{Notes:}\)

\(\large\text{Notes:}\)
Training a standard GAN often results in two pathological behaviors:

11-2 Wasserstein-GAN

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\min _{q \in \Pi\left(\mu, \mu^{\prime}\right)} \mathbb{E}_{\left(X, X^{\prime}\right) \sim q}\left[\left\|X-X^{\prime}\right\|\right] \]

It would therefore make a lot of sense to look for a generator matching the density for this metric, that is:

\[G^* = \arg\min_G \mathbb{W}\left(\mu, \mu_G\right) \]

By the Kantorovich-Rubinstein duality, the distance can be rewritten as:

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\max _{\|f\|_{L} \leq 1} \mathbb{E}_{X \sim \mu}[f(X)]-\mathbb{E}_{X \sim \mu^{\prime}}[f(X)] \]

where

\[\|f\|_{L}=\max _{x, x^{\prime}} \frac{\left\|f(x)-f\left(x^{\prime}\right)\right\|}{\left\|x-x^{\prime}\right\|} \]

As a result:

\[\begin{aligned} \mathbf{G}^{*} &=\underset{\mathbf{G}}{\operatorname{argmin}} \mathbb{W}\left(\mu, \mu_{\mathbf{G}}\right) \\ &=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\|\mathbf{D}\|_{L} \leq 1}\left(\mathbb{E}_{X \sim \mu}[\mathbf{D}(X)]-\mathbb{E}_{X \sim \mu_{\mathbf{G}}}[\mathbf{D}(X)]\right), \end{aligned} \]

2 benefits:

\(\large\textbf{Notes:}\)

Spectral Normalization

Spectral Normalization is a layer normalization that estimates the largest singular value of a weight matrix, and rescales the matrix accordingly.
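The largest singular value is typically estimated with power iteration. The sketch below is only an illustration of that idea on a random matrix: the function name `spectral_norm_estimate` and the number of iterations are assumptions. PyTorch ships its own implementation as `torch.nn.utils.spectral_norm`.

```python
import torch

def spectral_norm_estimate(W, nb_iters=200):
    """Estimate the largest singular value of a 2D tensor by power iteration."""
    u = torch.randn(W.size(0))
    for _ in range(nb_iters):
        v = W.t() @ u
        v = v / v.norm()
        u = W @ v
        u = u / u.norm()
    return u @ W @ v  # scalar tensor, u^T W v

torch.manual_seed(0)
W = torch.randn(20, 30)
sigma = spectral_norm_estimate(W)
W_sn = W / sigma  # rescaled matrix, largest singular value close to 1

print(sigma.item())
```

Dividing a linear layer's weight by this estimate bounds its Lipschitz constant by roughly one, which is what a Wasserstein critic needs.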

11-3 conditional-GAN

However, most of the practical applications require the ability to sample a conditional distribution. E.g.:

The Conditional GAN proposed by Mirza and Osindero (2014) consists of parameterizing both \(G\) and \(D\) by a conditioning quantity \(Y\):

\[V(\mathbf{D}, \mathbf{G})=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(X, Y)]+\mathbb{E}_{Z \sim \mathcal{N}(0, I), Y \sim \mu_{Y}}[\log (1-\mathbf{D}(\mathbf{G}(Z, Y), Y))] \]

Define:

\[\begin{aligned} V(\mathbf{D}, \mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(Y, X)]+\mathbb{E}_{Z \sim \mu_{Z}, X \sim \mu_{X}}[\log (1-\mathbf{D}(\mathbf{G}(Z, X), X))] \\ \mathscr{L}_{L^{1}}(\mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu, Z \sim \mathcal{N}(0, I)}\left[\|Y-\mathbf{G}(Z, X)\|_{1}\right] \end{aligned} \]

and

\[\mathbf{G}^{*}=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\mathbf{D}} V(\mathbf{D}, \mathbf{G})+\lambda \mathscr{L}_{L^{1}}(\mathbf{G}) . \]

\(\Large\textbf{Notes:}\) Note that contrary to Mirza and Osindero’s convention, here \(X\) is the conditioning quantity and \(Y\) the signal to generate.

The key aspect of the GAN here is the “perceptual loss” that the discriminator implements, more than the theoretical convergence to the true distribution.

12-1 RNN

Temporal Convolutions

The simplest approach to sequence processing is to use Temporal Convolutional Networks.

Thanks to dilated convolutions, the model size is \(O(\log T)\). The memory footprint and computation are \(O(T \log T)\).
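The logarithmic depth can be checked with a short calculation. The sketch assumes kernel size 2 and dilations doubling at every layer (1, 2, 4, ...), which is a common choice but is not spelled out in the notes.

```python
def receptive_field(kernel_size, dilations):
    # Each layer extends the receptive field by (kernel_size - 1) * dilation.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def nb_layers_needed(T, kernel_size=2):
    # Stack layers with dilations 1, 2, 4, ... until the field covers T steps.
    dilations = []
    while receptive_field(kernel_size, dilations) < T:
        dilations.append(2 ** len(dilations))
    return len(dilations)

for T in (16, 1024, 65536):
    print(T, nb_layers_needed(T))
# 16 4
# 1024 10
# 65536 16
```

The number of layers grows like \(\log_2 T\), and since each layer processes all \(T\) positions, the activations and the computation grow like \(T \log T\).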

12-2 LSTM and GRU

\[c_t = c_{t-1} + i_t\odot g_t \]

where \(c_t\) is a recurrent state, \(i_t\) is a gating function and \(g_t\) is a full update. This ensures that the derivatives of the loss w.r.t. \(c_t\) do not vanish.

\[\begin{aligned} &f_{t}=\operatorname{sigm}\left(W_{(x f)}{x_{t}}+W_{(h f)} h_{t-1}+b_{(f)}\right) \quad \text { (forget gate) }\\ &i_{t}=\operatorname{sigm}\left(W_{(x i)} x_{t}+W_{(h i)} h_{t-1}+b_{(i)}\right) \quad \text { (input gate) }\\ &g_{t}=\tanh \left(W_{(x c)}{x_{t}}+W_{(h c)} h_{t-1}+b_{(c)}\right) \quad \text { (full cell state update) }\\ &c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot g_{t} \quad \text { (cell state) }\\ &o_{t}=\operatorname{sigm}\left(W_{(xo) }x_{t}+W_{(h o)} h_{t-1}+b_{(o)}\right) \quad \text { (output gate) }\\ &h_{t}=o_{t} \odot \tanh \left(c_{t}\right) \quad \text { (output state) } \end{aligned} \]
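The six equations can be transcribed almost line by line. The module below is a sketch for reading purposes only; the class name `LSTMCellSketch` and the choice to fuse the four weight matrices into one linear layer per input are assumptions. `torch.nn.LSTM` is the practical implementation.

```python
import torch
from torch import nn

class LSTMCellSketch(nn.Module):
    def __init__(self, dim_x, dim_h):
        super().__init__()
        # The four W_(x.) and four W_(h.) matrices are stacked in two layers;
        # the biases b_(.) live in W_x.
        self.W_x = nn.Linear(dim_x, 4 * dim_h)
        self.W_h = nn.Linear(dim_h, 4 * dim_h, bias=False)

    def forward(self, x_t, h_prev, c_prev):
        f, i, g, o = (self.W_x(x_t) + self.W_h(h_prev)).chunk(4, dim=-1)
        f_t = torch.sigmoid(f)             # forget gate
        i_t = torch.sigmoid(i)             # input gate
        g_t = torch.tanh(g)                # full cell state update
        c_t = f_t * c_prev + i_t * g_t     # cell state
        o_t = torch.sigmoid(o)             # output gate
        h_t = o_t * torch.tanh(c_t)        # output state
        return h_t, c_t

torch.manual_seed(0)
cell = LSTMCellSketch(dim_x=3, dim_h=5)
x_seq = torch.randn(7, 2, 3)               # (T, batch, dim_x), hypothetical sizes
h = torch.zeros(2, 5)
c = torch.zeros(2, 5)
for x_t in x_seq:
    h, c = cell(x_t, h, c)
print(tuple(h.shape), tuple(c.shape))
```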

\(\Large\textbf{Note:}\)

\(\LARGE\textbf{Notes:}\)

\[\widetilde{\nabla f}=\frac{\nabla f}{\|\nabla f\|} \min (\|\nabla f\|, \delta) \]
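The formula above is gradient norm clipping: the direction of the gradient is kept and its norm is capped at \(\delta\). A small sketch on a single tensor follows. The helper name `clip_gradient_norm` is hypothetical, and the small constant guarding against a zero norm is an addition. For the parameters of a model, PyTorch provides `torch.nn.utils.clip_grad_norm_`.

```python
import torch

def clip_gradient_norm(grad, delta):
    norm = grad.norm()
    # Same as grad / norm * min(norm, delta), written to stay safe when norm == 0.
    scale = torch.clamp(delta / (norm + 1e-12), max=1.0)
    return grad * scale

g = torch.tensor([3.0, 4.0])              # norm 5
clipped = clip_gradient_norm(g, 1.0)      # direction kept, norm rescaled to 1
untouched = clip_gradient_norm(g, 10.0)   # norm already below delta: unchanged
print(clipped, untouched)
```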

12-3 word-embeddings-and-translation

The geometry after embedding should account for synonymy, but also for identical word classes, etc. E.g. we would like such an embedding to make “cat” and “tiger” close, but also “red” and “blue”, or “eat” and “work”, etc.

A common word embedding is the Continuous Bag of Words (CBOW) version of word2vec:

\(\Large\textbf{Details:}\)

\[\sum_{t}-\log \left(\frac{\exp \psi(t)_{k_t}}{\sum_{k=1}^{W} \exp \psi(t)_{k}}\right) \]

However, given the vocabulary size, doing so is numerically unstable and computationally demanding.

\[\mathrm{loss} = \sum_{t}\left(\log \left(1+e^{-\psi(t)_{k_{t}}}\right)+\sum_{q=1}^{Q} \log \left(1+e^{\psi(t)_{\lambda_{t, q}}}\right)\right) \]

\[\sum_{t} y_{t} \log \left(1+\exp \left(-x_{t}\right)\right)+\left(1-y_{t}\right) \log \left(1+\exp \left(x_{t}\right)\right) \]
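The two expressions above coincide when the score of the correct word gets label \(1\) and the scores of the \(Q\) sampled words get label \(0\). The check below uses random scores in place of \(\psi\); the sizes `T` and `Q` are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, Q = 4, 3                      # hypothetical: 4 positions, 3 negatives each
pos = torch.randn(T)             # stands for psi(t)_{k_t}
neg = torch.randn(T, Q)          # stands for psi(t)_{lambda_{t,q}}

# Negative-sampling loss: softplus(z) = log(1 + exp(z)).
loss = F.softplus(-pos).sum() + F.softplus(neg).sum()

# Same quantity as a binary cross-entropy on logits with labels 1 / 0.
x = torch.cat([pos, neg.flatten()])
y = torch.cat([torch.ones(T), torch.zeros(T * Q)])
loss_bce = F.binary_cross_entropy_with_logits(x, y, reduction='sum')

print(torch.allclose(loss, loss_bce))
```

Using the built-in logit-based loss avoids computing the sigmoid and the logarithm separately, which is the numerically safer route.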

13-1 Attention for Memory and Sequence Translation

Attention-based processing: to transport information from parts of the signal to other parts dynamically identified

\(\Large\textbf{Notes:}\) Attention mechanisms aggregate features with an importance score that

Neural Turing Machine

Graves et al. (2014) proposed to equip a deep model with an explicit memory to allow for long-term storage and retrieval.

\(\Large\textbf{Notes:}\)

\[M_t\in \mathbb{R}^{N\times M} \]

where \(t\) is the time step, \(N\) is the number of entries in the memory and \(M\) is their dimension.

\[r_t = \sum_n w_t(n)M_t(n) \]

\[\forall n, M_{t}(n)=M_{t-1}(n)\left(1-w_{t}(n) e_{t}\right)+w_{t}(n) a_{t} \]
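The read and write rules can be sketched in a few lines. The function names are hypothetical, and the one-hot weights with a full erase vector are chosen only because they make the result easy to predict: the addressed entry is replaced by \(a_t\) and all others are left untouched.

```python
import torch

def ntm_read(memory, w):
    # r_t = sum_n w_t(n) M_t(n); memory: (N, M), w: (N,)
    return w @ memory

def ntm_write(memory, w, e, a):
    # M_t(n) = M_{t-1}(n) * (1 - w_t(n) e_t) + w_t(n) a_t
    return memory * (1 - w[:, None] * e[None, :]) + w[:, None] * a[None, :]

torch.manual_seed(0)
N, M = 5, 3
memory = torch.randn(N, M)

w = torch.zeros(N)
w[2] = 1.0                         # attend to entry 2 only
e = torch.ones(M)                  # erase everything at the attended entry
a = torch.tensor([1.0, 2.0, 3.0])  # content to add

new_memory = ntm_write(memory, w, e, a)
r = ntm_read(new_memory, w)
print(r)
```

With soft weights, the same code blends erasing and adding across several entries, which keeps the whole operation differentiable.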

Attention Mechanisms

\[a: \mathbb{R}^{D'}\times \mathbb{R}^D\rightarrow \mathbb{R} \]

The model parameters are \(\theta\in \mathbb{R}^{T\times D}\). The operation takes a “value” tensor \(V\in \mathbb{R}^{T'\times D'}\) as input and computes an output \(Y\in\mathbb{R}^{T\times D'}\):

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \frac{\exp \left(a\left(V_{i} ; \theta_{j}\right)\right)}{\sum_{k=1}^{T^{\prime}} \exp \left(a\left(V_{k} ; \theta_{j}\right)\right)} V_{i} \\ &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(V_{i} ; \theta_{j}\right)\right) V_{i} \end{aligned} \]

\[\forall j=1, \ldots, T, Y_{j}=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(C_{j}, V_{i} ; \theta\right)\right) V_{i} \]
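The context-dependent form can be sketched with matrix products. The notes leave the attention function \(a\) unspecified; the bilinear score \(a(C_j, V_i; \theta) = C_j^{\top} \theta V_i\) with \(\theta \in \mathbb{R}^{D \times D'}\) used below is an assumption made for the example, as are the sizes.

```python
import torch

torch.manual_seed(0)
T, T_prime, D, D_prime = 4, 6, 5, 3   # hypothetical sizes

C = torch.randn(T, D)                 # context tensor, one row per output j
V = torch.randn(T_prime, D_prime)     # value tensor, one row per input i
theta = torch.randn(D, D_prime)       # assumed bilinear attention parameters

scores = C @ theta @ V.t()                  # a(C_j, V_i; theta), shape (T, T')
weights = torch.softmax(scores, dim=1)      # softmax over i, for each j
Y = weights @ V                             # Y_j = sum_i weights[j, i] V_i

print(tuple(Y.shape))                       # (4, 3)
```

Each output row is a convex combination of the rows of `V`, with weights that depend on the corresponding context row.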

Source: https://www.cnblogs.com/xinyu04/p/16402074.html