
Deep Learning Review

2022-06-22 20:04:16 Author: Internet

8-2 image classification

1x1 Conv

filter:

\[(F_1, C, 1, 1) \]

where \(F_1\) is the number of filters (output channels) and \(C\) is the number of input channels. Original input:

\[(N,C,H,W) \]

then it's transformed to:

\[(N,C,H,W)\rightarrow (N,F_1,H,W) \]

So 1x1 conv filters can be used to change the dimensionality in the filter space.

In Inception modules, 1x1 convolutions are used to reduce the number of channels before the expensive 3x3 and 5x5 convolutions.
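As a minimal sketch of the shape change described above, a 1x1 convolution can be written in PyTorch as follows. The sizes `N`, `C`, `H`, `W` and `F1` are arbitrary values chosen for illustration, not values from the notes.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not from the text)
N, C, H, W, F1 = 2, 64, 28, 28, 16

x = torch.randn(N, C, H, W)
conv1x1 = nn.Conv2d(C, F1, kernel_size=1)  # F1 filters, each of shape (C, 1, 1)
y = conv1x1(x)

print(tuple(conv1x1.weight.shape))  # (16, 64, 1, 1)
print(tuple(y.shape))               # (2, 16, 28, 28)
```

Only the channel dimension changes; the spatial size is untouched because each output value mixes the channels of a single pixel.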

Auxiliary Classifier

Auxiliary Classifiers are a type of architectural component that seeks to improve the convergence of very deep networks.

They are classifier heads we attach to layers before the end of the network.

The motivation is to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combatting the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks.

This is motivated by the reasonable performance of shallow networks that indicates early layers already encode informative and invariant features.

AlexNet

data augmentation during training to reduce over-fitting:

crop a 224 × 224 image at a random position in the original 256 × 256, and randomly reflect it horizontally,
apply a color transformation using a PCA model of the color distribution

During test: the prediction is averaged over five crops (the four corners and the center) and their horizontal reflections.

Convolutionization

The convolutionized version of a fully connected network re-uses computation of early layers to do the computation of the classifier

AlexNet: if one wants to apply the network at multiple locations of a large image, it should be done in a \(\textbf{sliding window fashion}\): Each position would be processed separately, and no computation would be shared.
Fully convolutional version: the computation of the early layers is performed only once at each location, and the classifier part receives an activation map which can be used by the convolutional filters.
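The equivalence between a fully connected classifier and its convolutionized version can be checked numerically. The sketch below uses a hypothetical tiny head (the sizes `C`, `K`, `nb_classes` are assumptions for illustration): the weights of a linear layer are viewed as a \(K \times K\) convolution, which gives the same scores on a \(K \times K\) input and a map of scores on a larger one.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical head: feature maps of shape (C, K, K) -> nb_classes scores
C, K, nb_classes = 8, 4, 10
fc = nn.Linear(C * K * K, nb_classes)

# Same weights, viewed as a K x K convolution with nb_classes output channels
conv = nn.Conv2d(C, nb_classes, kernel_size=K)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(nb_classes, C, K, K))
    conv.bias.copy_(fc.bias)

# On an input of the original size, both give the same scores
x = torch.randn(1, C, K, K)
y_fc = fc(x.flatten(1))       # shape (1, 10)
y_conv = conv(x).flatten(1)   # shape (1, 10)

# On a larger input, the convolution yields one score vector per location
x_large = torch.randn(1, C, 7, 9)
y_map = conv(x_large)         # shape (1, 10, 4, 6)
```

The larger input is processed in a single pass, so the computation that overlapping windows have in common is not repeated.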

Overfeat

In their “overfeat” approach, Sermanet et al. (2013) combined this with a stride 1 final max-pooling to get multiple predictions.

They could afford parsing the scene at \(6\) scales to improve invariance.

The “overfeat” version of AlexNet computes the convolutions on the full image, and only “moves” the fully connected layers over the output of the feature extractor.

\(\textbf{Advantages:}\)

we can now re-use classification networks for dense prediction without re-training
it blurs the conceptual boundary between “features” and “classifier” and leads to an intuitive understanding of convnet activations as gradually transitioning from appearance to semantic.

Summary

• standard ones are extensions of LeNet5,
• everybody loves ReLU,
• state-of-the-art networks have \(100\)s of channels and \(10\)s of layers,
• they can (should?) be fully convolutional,
• pass-through connections allow deeper “residual” nets,
• bottleneck local structures reduce the number of parameters,
• aggregated pathways reduce the number of parameters

8-3 object detection

The cost of classifying many sliding windows was mitigated in Overfeat (Sermanet et al., 2013) by adding a regression part to predict the object’s bounding box.

In the single-object case, the convolutional layers are frozen, and the localization layers are trained with an \(L_2\) loss.

\(\textbf{Note:}\)

This architecture can be applied directly to detection by adding a class “Background” to the object classes.
Negative samples are taken in each scene either at random or by selecting the ones with the worst misclassification.
Using class-specific localization layers did not provide better results than having a single one shared across classes

Region proposals

Other approaches:

Generate thousands of proposal bounding boxes with a non-CNN “objectness” approach such as Selective search
feed to an AlexNet-like network sub-images cropped and warped from the input image (“R-CNN”, Girshick et al., 2013), or from the convolutional feature maps to share computation (“Fast R-CNN”, Girshick, 2015).

\(\textbf{Disadvantages:}\)
These methods suffer from the cost of the region proposal computation, which is non-convolutional and not implementable on GPU.

They were improved by Ren et al. (2015) in “Faster R-CNN” by replacing the region proposal algorithm with a convolutional processing similar to Overfeat.

YOLO

Details: refer Blog

\(\textbf{Notes:}\)

Comes back to a classical architecture with a series of convolutional layers followed by a few fully connected layers.
It uses leaky ReLU, and its convolutional layers make use of the \(1 × 1\) bottleneck filters (Lin et al., 2013) to control the memory footprint and computational cost

\(\large\text{Engineering Tricks}\)

• Pre-train the first 20 convolutional layers on ImageNet classification,
• use 448 × 448 input for detection, instead of 224 × 224,
• use Leaky ReLU for all layers,
• dropout after the first fully connected layer,
• normalize bounding boxes parameters in \([0, 1]\),
• use a quadratic loss not only for the bounding box coordinates, but also for the confidence and the class scores,
• reduce the weight of large bounding boxes by using the square roots of the size in the loss,
• reduce the importance of empty cells by weighting less the confidence-related loss on them,
• use momentum \(0.9\), decay \(5 \times 10^{-4}\),
• data augmentation with scaling, translation, and HSV transformation.
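The effect of using square roots of the box size in the loss can be seen on a toy example. This is only an illustrative sketch: the function `size_loss` and the box sizes are assumptions, not part of the YOLO code.

```python
import math

def size_loss(w_true, w_pred, use_sqrt):
    """Quadratic loss on one box dimension, optionally on its square root."""
    if use_sqrt:
        return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2
    return (w_pred - w_true) ** 2

# Same absolute error (0.05) on a small and on a large box, sizes in [0, 1]
small = (0.10, 0.15)
large = (0.80, 0.85)

plain_small, plain_large = size_loss(*small, False), size_loss(*large, False)
sqrt_small, sqrt_large = size_loss(*small, True), size_loss(*large, True)

print(plain_small, plain_large)  # nearly identical
print(sqrt_small, sqrt_large)    # the large box is penalized less
```

With the plain quadratic loss both errors cost the same, while with square roots the same absolute error weighs less on the large box.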

SSD

The \(\textbf{Single Shot Multi-box Detector}\) (SSD, Liu et al., 2015) improves upon YOLO with a fully convolutional architecture and multi-scale maps.

Summary for 'One-shot'

networks trained on image classification capture localization information,
regression layers can be attached to classification-trained networks,
object localization does not have to be class-specific,
multiple detections are estimated at each location to account for different aspect ratios and scales.

9-2 Looking at the activations

We have already seen PCA and \(k\)-means as two standard methods for dimension reduction, but they poorly convey the structure of a smooth low-dimension and non-flat manifold.

\(\textbf{Notes:}\)

\(k\)-means is a good method when we have clusters, but not when we have a smooth and continuous manifold,
When the data is distributed along a curved manifold, PCA “wastes” dimensions to capture the curvature, even if its intrinsic dimension is small.

t-SNE

t-SNE optimizes with SGD the low-dimensional points \(y_i\) so that the distributions of distances to close neighbors of each point are preserved.

It actually matches for \(D_{KL}\) two distance-dependent distributions: Gaussian in the original space, and Student t-distribution in the low-dimension one

9-3 Visualizing in Input

Another approach to understanding the functioning of a network is to look at the behavior of the network “around” an image.

Guided Back-Propagation

Discarding structures which would not contribute positively to the final response, and discarding structures which are not already present:

\[\mathbf{1}_{\{s>0\}} \mathbf{1}_{\left\{\frac{\partial \ell}{\partial x}>0\right\}} \frac{\partial \ell}{\partial x} \]

which keeps only units which have a positive contribution and activation
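The masking rule above can be sketched for a single ReLU as follows. The function name `guided_relu_backward` is hypothetical, and the sketch assumes that \(s\) is the input of the ReLU and that the incoming gradient is the one w.r.t. its output.

```python
import torch

def guided_relu_backward(s, grad_out):
    """Gradient passed below a ReLU under guided back-propagation.

    s: input of the ReLU, grad_out: gradient w.r.t. the ReLU output.
    """
    return (s > 0).float() * (grad_out > 0).float() * grad_out

s = torch.tensor([-1.0, 2.0, 3.0, 0.5])
grad_out = torch.tensor([1.0, -2.0, 3.0, 0.5])
guided = guided_relu_backward(s, grad_out)

print(guided.tolist())  # [0.0, 0.0, 3.0, 0.5]
```

The first unit is dropped because it was not active, the second because its gradient is negative; only units that are both active and positively contributing survive.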

Grad-CAM

It computes a sum of the activations weighted by the average gradient of the output of interest w.r.t. individual channels.
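A minimal sketch of this weighted sum, assuming the activations and the gradients of one layer are already available as tensors of shape \((C, H, W)\). The function name `grad_cam` is hypothetical, and the final ReLU follows the usual Grad-CAM formulation rather than the sentence above.

```python
import torch

def grad_cam(activations, gradients):
    """activations, gradients: tensors of shape (C, H, W) for one layer."""
    # One weight per channel: the spatial average of the gradient
    weights = gradients.mean(dim=(1, 2))                     # (C,)
    # Sum of the activation maps weighted channel by channel
    cam = (weights[:, None, None] * activations).sum(dim=0)  # (H, W)
    # Keep only the positive evidence (assumption, see lead-in)
    return torch.relu(cam)

torch.manual_seed(0)
A = torch.randn(8, 7, 7)  # stand-in for real activations
G = torch.randn(8, 7, 7)  # stand-in for real gradients
cam = grad_cam(A, G)
```

In practice the two tensors would come from a forward pass and from the gradient of the class score w.r.t. the chosen feature maps.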

Details Refer to Blog

Optimizing Inputs

Since \(f\) is trained in a discriminative manner, a sample \(\hat{x}\) maximizing it has no reason to be “realistic”.

We can mitigate this by adding a penalty \(h\) corresponding to a “realistic” prior, that is, computing:

\[x^{*}=\underset{x}{\operatorname{argmax}} f(x ; w)-h(x) \]

A reasonable \(h\) penalizes too much energy in the high frequencies by integrating edge amplitude at multiple scales.

This can be formalized as a penalty function \(h\) of the form:

\[h(x)=\sum_{s \geq 0}\left\|\delta^{s}(x)-g \circledast \delta^{s}(x)\right\|^{2} \]

where \(g\) is a Gaussian kernel, and \(\delta\) is a downscale-by-two operator

The quadratic form of this penalty makes it lower when the energy is spread-out across terms.
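The penalty \(h\) can be sketched as below, under several assumptions: the sum over scales is truncated to `nb_scales` terms, \(\delta\) is implemented with a 2x2 average pooling, and \(g\) is a 5x5 Gaussian kernel with \(\sigma = 1\). None of these choices come from the notes.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2d Gaussian kernel of shape (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g2d = g1d[:, None] * g1d[None, :]
    return (g2d / g2d.sum()).view(1, 1, size, size)

def penalty(x, nb_scales=3):
    """x: images of shape (N, 1, H, W). Returns a scalar tensor."""
    g = gaussian_kernel()
    h = x.new_zeros(())
    for _ in range(nb_scales):
        blurred = F.conv2d(x, g, padding=2)   # g convolved with delta^s(x)
        h = h + (x - blurred).pow(2).sum()    # high-frequency energy at this scale
        x = F.avg_pool2d(x, 2)                # delta: downscale by two
    return h

torch.manual_seed(0)
noise = torch.randn(1, 1, 32, 32)
smooth = torch.ones(1, 1, 32, 32)
print(penalty(noise).item(), penalty(smooth).item())
```

A noisy image gets a much larger penalty than a constant one, whose only contribution comes from the zero padding at the borders.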

10-1 Auto-regression

Auto-regression methods model components of a signal serially, each one conditionally to the ones already modeled.

They rely on the chain rule from probability theory:

\[\begin{aligned} &\forall x_{1}, \ldots, x_{T}, P\left(X_{1}=x_{1}, \ldots, X_{T}=x_{T}\right)= \\ &P\left(X_{1}=x_{1}\right) P\left(X_{2}=x_{2} \mid X_{1}=x_{1}\right) \ldots P\left(X_{T}=x_{T} \mid X_{1}=x_{1}, \ldots, X_{T-1}=x_{T-1}\right) \end{aligned} \]
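The chain rule can be illustrated with a toy model over binary sequences. The conditional distribution used here is an arbitrary assumption made for the example; what matters is that multiplying the conditionals gives a valid joint distribution.

```python
import itertools

def conditional(x_t, history):
    """Toy P(X_t = x_t | history) for binary symbols (an assumption):
    the probability of a 1 grows with the number of 1s seen so far."""
    p_one = (1 + sum(history)) / (2 + len(history))
    return p_one if x_t == 1 else 1 - p_one

def joint(sequence):
    """Chain rule: product of the conditionals, one per component."""
    p = 1.0
    for t, x_t in enumerate(sequence):
        p *= conditional(x_t, sequence[:t])
    return p

T = 4
total = sum(joint(seq) for seq in itertools.product([0, 1], repeat=T))
print(total)          # sums to 1 up to rounding
print(joint((1, 1)))  # 1/2 * 2/3
```

Sampling works the same way: each component is drawn from its conditional given the components already drawn.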

10-2 Causal convolutions

Auto-regression: during training, even though the full sequence is known, each position is processed separately, so computation that could be shared across positions is lost.

\(\textbf{Notes:}\)

With the models we saw previously, the input differs from one position to another: when predicting a new component, both the mask and the value tensor are recomputed.
The precursors of all the state-of-the-art methods for voice synthesis are based on autoregressive models with dilated convolutions.
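A causal convolution can be sketched by padding the sequence on the left only, so that the whole sequence is processed in one pass and no output looks at the future. The class name `CausalConv1d` is hypothetical. Note that in this sketch the output at time \(t\) depends on inputs up to and including \(t\); shifting the input by one step would give a strict predictor.

```python
import torch
from torch import nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1d convolution whose output at time t ignores inputs after t."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):
        # Pad on the left only, so the output keeps the input length
        return self.conv(F.pad(x, (self.pad, 0)))

torch.manual_seed(0)
layer = CausalConv1d(1, 1, kernel_size=3, dilation=2)
x = torch.randn(1, 1, 16)
y = layer(x)

# Changing the input from t = 10 onward leaves the outputs before t = 10 intact
x_changed = x.clone()
x_changed[:, :, 10:] += 1.0
y_changed = layer(x_changed)
```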

11-1 Adversarial Network

The approach is adversarial since the two networks have antagonistic objectives.

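import torch
from torch import nn, optim

z_dim = 8
nb_hidden = 100

# Generator: maps a latent vector of dimension z_dim to a 2d sample
model_G = nn.Sequential(nn.Linear(z_dim, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 2))

# Discriminator: maps a 2d sample to the estimated probability that it is real
model_D = nn.Sequential(nn.Linear(2, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 1),
                        nn.Sigmoid())

batch_size, lr = 10, 1e-3

optimizer_G = optim.Adam(model_G.parameters(), lr = lr)
optimizer_D = optim.Adam(model_D.parameters(), lr = lr)

# nb_epochs (int) and real_samples (tensor of shape (nb_samples, 2)) are
# assumed to be defined beforehand
for e in range(nb_epochs):
    for t, real_batch in enumerate(real_samples.split(batch_size)):
        # Latent vectors on the same device and with the same dtype as the data
        z = real_batch.new_empty(real_batch.size(0), z_dim).normal_()
        fake_batch = model_G(z)

        D_scores_on_real = model_D(real_batch)
        D_scores_on_fake = model_D(fake_batch)

        if t % 2 == 0:
            # Generator step: push D to score the fake samples as real
            loss = (1 - D_scores_on_fake).log().mean()
            optimizer_G.zero_grad()
            loss.backward()
            optimizer_G.step()
        else:
            # Discriminator step: score real samples high and fake ones low
            loss = - (1 - D_scores_on_fake).log().mean() \
                   - D_scores_on_real.log().mean()
            optimizer_D.zero_grad()
            loss.backward()
            optimizer_D.step()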

\(\textbf{Notes:}\)

For even batches, the loss is only computed on the fake samples to optimize the generator, and D_scores_on_real is not used.
For odd batches, the loss is computed on all samples to optimize the discriminator.

\(\large\text{Notes:}\)
Training a standard GAN often results in two pathological behaviors:

Oscillations without convergence. Contrary to standard loss minimization, we have no guarantee here that it will actually decrease
The infamous “mode collapse”, when \(G\) models very well a small sub-population, concentrating on a few modes
Additionally, performance is hard to assess.
The Inception Score checks that when generated images are classified by an inception model (Szegedy et al., 2015) the estimated posterior distribution of classes is similar to the real class distribution, which in particular penalizes a missing class
The Fréchet Inception Distance looks at the distributions of the features in one of the feature maps of the inception model, for the real and synthetic samples, and estimates their similarity under a Gaussian model.

11-2 Wasserstein-GAN

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\min _{q \in \Pi\left(\mu, \mu^{\prime}\right)} \mathbb{E}_{\left(X, X^{\prime}\right) \sim q}\left[\left\|X-X^{\prime}\right\|\right] \]

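It would make a lot of sense to look for a generator matching the density under this metric, that is:

\[G^* = \arg\min_G \mathbb{W}\left(\mu, \mu_G\right) \]

However, \(\mathbb{W}\) is itself defined through an optimization problem. By the Kantorovich-Rubinstein duality, it can be rewritten as: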

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\max _{\|f\|_{L} \leq 1} \mathbb{E}_{X \sim \mu}[f(X)]-\mathbb{E}_{X \sim \mu^{\prime}}[f(X)] \]

where

\[\|f\|_{L}=\max _{x, x^{\prime}} \frac{\left\|f(x)-f\left(x^{\prime}\right)\right\|}{\left\|x-x^{\prime}\right\|} \]

As a result:

\[\begin{aligned} \mathbf{G}^{*} &=\underset{\mathbf{G}}{\operatorname{argmin}} \mathbb{W}\left(\mu, \mu_{\mathbf{G}}\right) \\ &=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\|\mathbf{D}\|_{L} \leq 1}\left(\mathbb{E}_{X \sim \mu}[\mathbf{D}(X)]-\mathbb{E}_{X \sim \mu_{\mathbf{G}}}[\mathbf{D}(X)]\right), \end{aligned} \]

2 benefits:

A greater stability of the learning process, both in principle and in their experiments: they do not witness “mode collapse”.
A greater interpretability of the loss, which is a better indicator of the quality of the samples

\(\large\textbf{Notes:}\)

In the original GAN, no constraint is imposed on \(D\), which can easily be optimized to discriminate real from fake images. This makes \(G\) hard to train: the response of \(D\) is very confident, so the resulting gradient of the loss is very small, and consequently the gradient w.r.t. \(G\)’s parameters is also very small.
With Wasserstein GAN, due to the constraint on the discriminator, it does not saturate and there is always a gradient flowing back to the generator. However now the discriminator is harder to train since the gradient w.r.t. its parameters has to be clipped or projected in some way and may be set to zero. In some way the Wasserstein GAN trades the difficulty to optimize the generator for the difficulty to train the [regularized] discriminator.
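One discriminator ("critic") step of a Wasserstein GAN can be sketched as follows. This is only an illustration under assumptions: the Lipschitz constraint is enforced crudely by clipping every weight to \([-c, c]\), the values `clip_value = 0.01` and `lr = 5e-5` and the use of RMSprop are choices made for the example, and the two batches are random stand-ins for real data and generator outputs.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Critic without a final sigmoid: it outputs an unbounded score
critic = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer_D = optim.RMSprop(critic.parameters(), lr=5e-5)
clip_value = 0.01  # assumed clipping threshold c

real_batch = torch.randn(10, 2) + 2.0  # stand-in for real samples
fake_batch = torch.randn(10, 2)        # stand-in for detached generator outputs

# The critic maximizes E[D(real)] - E[D(fake)], so we minimize its negative
loss_D = -(critic(real_batch).mean() - critic(fake_batch).mean())
optimizer_D.zero_grad()
loss_D.backward()
optimizer_D.step()

# Crude enforcement of the Lipschitz constraint: clip every parameter
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-clip_value, clip_value)
```

The generator step would then minimize `-critic(fake_batch).mean()` with the critic held fixed.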

Spectral Normalization

Spectral Normalization is a normalization of a layer's weights that estimates the largest singular value of the weight matrix and rescales the matrix accordingly.
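The estimation of the largest singular value is commonly done with a power iteration. The sketch below assumes that approach; the function name `spectral_norm_estimate` and the number of iterations are illustrative choices.

```python
import torch

def spectral_norm_estimate(W, nb_iterations=100):
    """Power-iteration estimate of the largest singular value of W."""
    v = torch.randn(W.size(1))
    v = v / v.norm()
    for _ in range(nb_iterations):
        u = W @ v
        u = u / u.norm()
        v = W.t() @ u
        v = v / v.norm()
    return (u @ (W @ v)).item()

torch.manual_seed(0)
W = torch.randn(20, 30)
sigma = spectral_norm_estimate(W)

# Rescaled matrix, whose largest singular value is close to 1
W_sn = W / sigma
```

PyTorch also provides `torch.nn.utils.spectral_norm`, which wraps a module and applies this kind of rescaling to its weight.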

11-3 conditional-GAN

However, most of the practical applications require the ability to sample a conditional distribution. E.g.:

Next frame prediction where a frame is sampled given the preceding frames.
Image “in-painting”, where the missing part of an image is sampled given the available one.
Semantic segmentation, where the label map is sampled given the image.
Style transfer, where a picture in a certain style (e.g. a la Renoir), is sampled given the same image in another style (e.g. a la Picasso)

The Conditional GAN proposed by Mirza and Osindero (2014) consists of parameterizing both \(G\) and \(D\) by a conditioning quantity \(Y\):

\[V(\mathbf{D}, \mathbf{G})=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(X, Y)]+\mathbb{E}_{Z \sim \mathcal{N}(0, I), Y \sim \mu_{Y}}[\log (1-\mathbf{D}(\mathbf{G}(Z, Y), Y))] \]

Define:

\[\begin{aligned} V(\mathbf{D}, \mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(Y, X)]+\mathbb{E}_{Z \sim \mu_{Z}, X \sim \mu_{X}}[\log (1-\mathbf{D}(\mathbf{G}(Z, X), X))] \\ \mathscr{L}_{L^{1}}(\mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu, Z \sim \mathcal{N}(0, I)}\left[\|Y-\mathbf{G}(Z, X)\|_{1}\right] \end{aligned} \]

and

\[\mathbf{G}^{*}=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\mathbf{D}} V(\mathbf{D}, \mathbf{G})+\lambda \mathscr{L}_{L^{1}}(\mathbf{G}) . \]

\(\Large\textbf{Notes:}\) Note that contrary to Mirza and Osindero’s convention, here \(X\) is the conditioning quantity and \(Y\) the signal to generate

The key aspect of the GAN here is the “perceptual loss” that the discriminator implements, more than the theoretical convergence to the true distribution.

12-1 RNN

Temporal Convolutions

The simplest approach to sequence processing is to use Temporal Convolutional Networks.

Thanks to dilated convolutions, the model size is \(O(\log T)\). The memory footprint and computation are \(O(T \log T)\).
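The logarithmic model size comes from the receptive field doubling at each layer. A small sketch, assuming kernels of size 2 and dilations \(1, 2, 4, \ldots\) (the function names are hypothetical):

```python
def receptive_field(nb_layers, kernel_size=2):
    """Receptive field of stacked convolutions with dilations 1, 2, 4, ..."""
    return 1 + sum((kernel_size - 1) * 2 ** l for l in range(nb_layers))

def layers_needed(T, kernel_size=2):
    """Smallest number of layers whose receptive field covers T time steps."""
    nb_layers = 0
    while receptive_field(nb_layers, kernel_size) < T:
        nb_layers += 1
    return nb_layers

print(receptive_field(10))  # 1024
print(layers_needed(1000))  # 10
```

Each layer still produces an activation per time step, which is where the \(O(T \log T)\) memory and computation come from.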

12-2 LSTM and GRU

\[c_t = c_{t-1} + i_t\odot g_t \]

where \(c_t\) is a recurrent state, \(i_t\) is a gating function and \(g_t\) is a full update. This ensures that the derivatives of the loss w.r.t. \(c_t\) do not vanish.

\[\begin{aligned} &f_{t}=\operatorname{sigm}\left(W_{(x f)}{x_{t}}+W_{(h f)} h_{t-1}+b_{(f)}\right) \quad \text { (forget gate) }\\ &i_{t}=\operatorname{sigm}\left(W_{(x i)} x_{t}+W_{(h i)} h_{t-1}+b_{(i)}\right) \quad \text { (input gate) }\\ &g_{t}=\tanh \left(W_{(x c)}{x_{t}}+W_{(h c)} h_{t-1}+b_{(c)}\right) \quad \text { (full cell state update) }\\ &c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot g_{t} \quad \text { (cell state) }\\ &o_{t}=\operatorname{sigm}\left(W_{(xo) }x_{t}+W_{(h o)} h_{t-1}+b_{(o)}\right) \quad \text { (output gate) }\\ &h_{t}=o_{t} \odot \tanh \left(c_{t}\right) \quad \text { (output state) } \end{aligned} \]

\(\Large\textbf{Note:}\)

the forget bias \(b_{(f)}\) should be initialized with large values so that initially \(f_t\simeq 1\) and the gating has no effect.
the weight \(f_t\) of the previous cell state, and the weight \(i_t\) of the full update are independent of each other. In particular, they can both be zero, resulting in a reset of the state.
Multi-layer LSTM: When several layers of LSTM are combined, the first layer takes as input the sequence \(x_t\) itself, while each following layer takes as input the output state \(h_t\) of the previous layer.
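The LSTM equations above can be checked against `torch.nn.LSTMCell` with a hand-written step. The sketch relies on the PyTorch convention that the stacked weights are ordered as input gate, forget gate, cell update, output gate; the function name `lstm_step` and the sizes are assumptions for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim_x, dim_h = 3, 5
cell = nn.LSTMCell(dim_x, dim_h)

def lstm_step(x, h, c, cell):
    """One LSTM step written out from the equations, using cell's weights."""
    gates = (x @ cell.weight_ih.t() + cell.bias_ih
             + h @ cell.weight_hh.t() + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)  # PyTorch order: i, f, g, o
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_new = f * c + i * g               # cell state
    h_new = o * torch.tanh(c_new)       # output state
    return h_new, c_new

x = torch.randn(2, dim_x)
h0, c0 = torch.randn(2, dim_h), torch.randn(2, dim_h)

h_manual, c_manual = lstm_step(x, h0, c0, cell)
h_ref, c_ref = cell(x, (h0, c0))
```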

\(\LARGE\textbf{Notes:}\)

The specific form of these units prevents the gradient from vanishing, but it may still be excessively large on certain mini-batches.
The standard strategy to solve this issue is gradient norm clipping (Pascanu et al., 2013), which consists of re-scaling the gradient so that its norm equals a fixed threshold \(\delta\) when it is above it. In PyTorch this is done by torch.nn.utils.clip_grad_norm_:

\[\widetilde{\nabla f}=\frac{\nabla f}{\|\nabla f\|} \min (\|\nabla f\|, \delta) \]
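A minimal usage sketch of `torch.nn.utils.clip_grad_norm_` (note the trailing underscore, the function works in place). The model, the data and the artificially large loss are assumptions chosen so that clipping actually triggers.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.ones(8, 4)

# Artificially large loss so that the gradient norm exceeds the threshold
loss = 1000.0 * model(x).sum()
loss.backward()

delta = 1.0
# Rescales the gradients in place and returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=delta)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

The call goes between `loss.backward()` and `optimizer.step()`, since it modifies the `.grad` fields that the optimizer then uses.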

12-3 word-embeddings-and-translation

The geometry after embedding should account for synonymy, but also for identical word classes, etc. E.g. we would like such an embedding to make “cat” and “tiger” close, but also “red” and “blue”, or “eat” and “work”, etc

A common word embedding is the Continuous Bag of Words (CBOW) version of word2vec:

In this model, the embedding vectors are chosen so that a word can be [linearly] predicted from the sum of the embeddings of words around it

\(\Large\textbf{Details:}\)

Embedding vectors \(E_k\in\mathbb{R}^D\) are optimized jointly with an array \(M\in\mathbb{R}^{W\times D}\), where \(W\) is the vocabulary size, so that the vector of scores \(\psi(t) = M\sum_{k\in C_t}E_k\in \mathbb{R}^W\) is a good predictor of the value of \(k_t\), where \(C_t = \{k_{t-l},\ldots,k_{t-1},k_{t+1},\ldots,k_{t+l} \}\) is the context around \(k_t\).
Ideally we would minimize the cross-entropy between the vector of scores \(\psi(t)\in \mathbb{R}^W\) and the class \(k_t\):

\[\sum_{t}-\log \left(\frac{\exp \psi(t)_{k_t}}{\sum_{k=1}^{W} \exp \psi(t)_{k}}\right) \]

However, given the vocabulary size, doing so is numerically unstable and computationally demanding.

Negative sampling is therefore recommended: it uses the prediction for the correct class \(k_t\) and only \(Q \ll W\) incorrect classes \(\lambda_{t,1},\ldots,\lambda_{t,Q}\) sampled at random:

\[\text{loss} = \sum_{t}\left(\log \left(1+e^{-\psi(t)_{k_{t}}}\right)+\sum_{q=1}^{Q} \log \left(1+e^{\psi(t)_{\lambda_{t, q}}}\right)\right) \]

Regarding the loss, we can use nn.BCEWithLogitsLoss which, with reduction='sum', implements:

\[\sum_{t} y_{t} \log \left(1+\exp \left(-x_{t}\right)\right)+\left(1-y_{t}\right) \log \left(1+\exp \left(x_{t}\right)\right) \]
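The link between the negative sampling loss and `nn.BCEWithLogitsLoss` can be checked on one position \(t\). In this sketch the scores are random stand-ins for \(\psi(t)\) at the correct class and at \(Q\) sampled incorrect classes, and `reduction='sum'` is used so that the result is a sum rather than the default mean.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
Q = 5
score_pos = torch.randn(1)  # stand-in for psi(t) at the correct class k_t
score_neg = torch.randn(Q)  # stand-in for psi(t) at the Q sampled classes

# Direct transcription of the loss: log(1 + exp(x)) is softplus(x)
loss_formula = F.softplus(-score_pos).sum() + F.softplus(score_neg).sum()

# Same quantity with target 1 for the correct class and 0 for the others
scores = torch.cat([score_pos, score_neg])
targets = torch.cat([torch.ones(1), torch.zeros(Q)])
loss_bce = nn.BCEWithLogitsLoss(reduction='sum')(scores, targets)

print(loss_formula.item(), loss_bce.item())
```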

13-1 Attention for Memory and Sequence Translation

Attention-based processing: to transport information from parts of the signal to other parts dynamically identified

\(\Large\textbf{Notes:}\) Attention mechanisms aggregate features with an importance score that

depends on the features themselves, not on their positions in the tensor,
relax locality constraints.

Neural Turing Machine

Graves et al. (2014) proposed to equip a deep model with an explicit memory to allow for long-term storage and retrieval.

\(\Large\textbf{Notes:}\)

The said module has a hidden internal state:

\[M_t\in \mathbb{R}^{N\times M} \]

where \(t\) is the time step, \(N\) is the number of entries in the memory and \(M\) is their dimension.

Reading: given attention weights \(w_t\in\mathbb{R}_+^N\) with \(\sum_n w_t(n) = 1\), the module returns:

\[r_t = \sum_n w_t(n)M_t(n) \]

Writing: given an erase vector \(e_t\in[0,1]^M\) and an add vector \(a_t\in\mathbb{R}^M\), the memory is updated by:

\[\forall n, M_{t}(n)=M_{t-1}(n)\left(1-w_{t}(n) e_{t}\right)+w_{t}(n) a_{t} \]
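The read and write operations can be sketched directly from the two formulas. The function names and the toy memory below are assumptions for illustration; the attention weights are taken as given rather than computed by a controller.

```python
import torch

def read(memory, w):
    """r_t = sum_n w_t(n) M_t(n), with memory of shape (N, M) and w of shape (N,)."""
    return w @ memory

def write(memory, w, e, a):
    """Erase then add, row by row, weighted by the attention w."""
    return memory * (1 - w[:, None] * e[None, :]) + w[:, None] * a[None, :]

N, M = 4, 3
memory = torch.arange(N * M, dtype=torch.float32).view(N, M)
w = torch.tensor([0.0, 1.0, 0.0, 0.0])  # sharp attention on entry 1
e = torch.ones(M)                       # erase the attended entry completely
a = torch.tensor([7.0, 8.0, 9.0])       # new content

new_memory = write(memory, w, e, a)
r = read(new_memory, w)

print(r.tolist())  # [7.0, 8.0, 9.0]
```

With sharp attention the memory behaves like a standard array; with soft weights both operations stay differentiable w.r.t. \(w_t\), \(e_t\) and \(a_t\).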

Attention Mechanisms

Simplest way: content-based attention. The attention is given by:

\[a: \mathbb{R}^{D'}\times \mathbb{R}^D\rightarrow \mathbb{R} \]

model params: \(\theta\in \mathbb{R}^{T\times D}\), this operation takes a “value” tensor as input: \(V\in \mathbb{R}^{T'\times D'}\), and computes an output: \(Y\in\mathbb{R}^{T\times D'}\)

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \frac{\exp \left(a\left(V_{i} ; \theta_{j}\right)\right)}{\sum_{k=1}^{T^{\prime}} \exp \left(a\left(V_{k} ; \theta_{j}\right)\right)} V_{i} \\ &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(V_{i} ; \theta_{j}\right)\right) V_{i} \end{aligned} \]

Context Attention: context tensor: \(C\in \mathbb{R}^{T\times D}\)

\[\forall j=1, \ldots, T, Y_{j}=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(C_{j}, V_{i} ; \theta\right)\right) V_{i} \]
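These attention formulas can be sketched with a dot product as the score function. This is an assumption: the notes leave \(a\) unspecified, and the dot product requires \(D = D'\). The same function covers both cases, with `queries` playing the role of \(\theta\) or of the context \(C\).

```python
import torch

def attention(queries, values):
    """Attention with a dot-product score a(V_i, q_j) = <V_i, q_j> (assumption).

    queries: (T, D), values: (T', D) -> output (T, D) and weights (T, T').
    """
    scores = queries @ values.t()           # (T, T')
    weights = torch.softmax(scores, dim=1)  # normalized over the T' values
    return weights @ values, weights

torch.manual_seed(0)
T, T_prime, D = 3, 5, 4
theta = torch.randn(T, D)    # stand-in for the parameters or the context
V = torch.randn(T_prime, D)  # value tensor

Y, A = attention(theta, V)
```

Each output row is a convex combination of the value rows, with weights that depend on content rather than on position.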

Source: https://www.cnblogs.com/xinyu04/p/16402074.html