Deep Learning Week10 Notes
1. Auto-Regression
Auto-regression methods model the components of a signal serially, each one conditioned on those already modeled.
They rely on the chain rule:
\[\begin{align} P(X_1 = x_1,...,X_T= x_T) = P(X_1 = x_1)P(X_2=x_2|X_1=x_1)...P(X_T=x_T|X_{T-1}=x_{T-1},...,X_1=x_1) \end{align} \]In practice, the conditioning is represented with two tensors of dimension \(T\): the first a Boolean mask stating which variables are conditioned on, and the second the actual conditioning values.
Now we consider finite distributions over \(C\) real values. Hence we can model a conditional distribution with a mapping that takes a (mask, known values) pair to a distribution over the next value of the sequence:
\[f:\{ 0,1 \}^T\times \mathbb{R}^T\rightarrow\mathbb{R}^C \]where the \(C\) output values can be either probabilities or, as we will prefer, logits.
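As a concrete, purely illustrative example, such a mapping can be realized by a small MLP that takes the Boolean mask and the masked values concatenated as input and outputs \(C\) logits; the sizes and architecture below are arbitrary, not the lecture's:
import torch
from torch import nn

T, C = 784, 256  # arbitrary sequence length and number of possible values

# Toy conditional model: concatenate the Boolean mask and the known values
# (unknown entries set to 0), output C logits for the next component.
model = nn.Sequential(
    nn.Linear(2 * T, 128),
    nn.ReLU(),
    nn.Linear(128, C),
)

mask = torch.zeros(1, T)    # nothing is conditioned on yet
values = torch.zeros(1, T)  # unknown values filled with 0
logits = model(torch.cat((mask, values), dim = 1))  # tensor of shape (1, C)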
\(\Large\text{Note:}\)
- In math: the logit is the log-odds, \(\operatorname{logit}(p)=\log\frac{p}{1-p}\), i.e. the inverse of the sigmoid function.
- In ML:
the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class, as in the small example below.
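For instance, a minimal illustration of turning logits into probabilities with a softmax (arbitrary values):
logits = torch.tensor([2.0, 0.5, -1.0])
probs = torch.softmax(logits, dim = 0)  # non-negative, sums to 1; the largest logit gets the largest probability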
Given such a model and a sampling procedure \(\text{sample}\), the generative process can be written as:
\[\begin{align} x_1&\leftarrow \text{sample}(f(\{\}))\\ x_2&\leftarrow \text{sample}(f(\{X_1=x_1\}))\\ &...\\ x_T&\leftarrow \text{sample}(f(\{X_1=x_1,X_2=x_2,...,X_{T-1} =x_{T-1}\})) \end{align} \]A sampling procedure takes as input the probabilities (or \(\text{logits}\)) output by the model (a tensor in \(\mathbb{R}^C\)) and outputs a value sampled randomly according to the provided probabilities or logits.
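With the toy model above and a Categorical distribution playing the role of \(\text{sample}\), this loop can be sketched as follows (again illustrative, not the lecture's code):
# Generate one sequence component by component, feeding back sampled values.
mask = torch.zeros(1, T)
values = torch.zeros(1, T)
for t in range(T):
    logits = model(torch.cat((mask, values), dim = 1))           # shape (1, C)
    x_t = torch.distributions.Categorical(logits = logits).sample()
    values[0, t] = x_t.item()   # sampled value in {0, ..., C-1}
    mask[0, t] = 1.0            # mark this component as known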
\(\text{Details: see }\)Lecture-P6
Sampling can be done with torch.distributions:
>>> import torch
>>> from math import log
>>> l = torch.tensor([ log(0.8), log(0.1), log(0.1) ])
>>> dist = torch.distributions.categorical.Categorical(logits = l)
>>> s = dist.sample((10000,))
>>> (s.view(-1, 1) == torch.arange(3).view(1, -1)).float().mean(0)
tensor([0.8037, 0.0988, 0.0975])
This can also be done in a batch:
>>> l = torch.tensor([[ log(0.90), log(0.10) ],
... [ log(0.50), log(0.50) ],
... [ log(0.25), log(0.75) ],
... [ log(0.01), log(0.99) ]])
>>> dist = torch.distributions.categorical.Categorical(logits = l)
>>> dist.sample((8,))
tensor([[0, 1, 1, 1],
[0, 1, 1, 1],
[0, 0, 1, 1],
[0, 1, 0, 1],
[1, 0, 1, 1],
[0, 1, 1, 1],
[0, 1, 1, 1],
[0, 0, 1, 1]])
In the batch case, the sampler is parameterized by a tensor of size
\[M_1\times M_2\times...\times M_K\times C \]that represents
\[M_1\times M_2\times...\times M_K \]vectors of logits over \(C\) classes.
The sampling itself takes \((N_1,...,N_L)\) as input and returns a tensor of size
\[N_1\times N_2\times...\times N_L\times M_1\times...\times M_K \]of values in \(\{0,1,...,C-1 \}\).
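For instance, with \(M_1=4\), \(C=3\) and a sample shape of \((2, 5)\) (arbitrary logits, only the shapes matter here):
>>> l = torch.randn(4, 3)
>>> dist = torch.distributions.categorical.Categorical(logits = l)
>>> dist.sample((2, 5)).size()
torch.Size([2, 5, 4])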
\(\text{Minimize the loss:}\)
\[\mathscr{L}(f)=\sum_{n}\sum_{t} \ell\Big(f\big(\{X_1=x_{n,1},\ldots,X_{t-1}=x_{n,t-1}\}\big),\, x_{n,t}\Big) \]where \(\ell\) is the cross-entropy loss.
For the training procedure in practice, refer to Lecture P9-15.
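A rough sketch of one training step in this setting, reusing the toy model, \(T\) and \(C\) from above (a hypothetical helper, not the lecture's code): draw one target position per sample, mask out everything from that position on, and apply the cross-entropy loss to the predicted logits.
# x: batch of sequences of shape (B, T) with integer values in {0, ..., C-1}.
def training_step(model, optimizer, x):
    B = x.size(0)
    t = torch.randint(T, (B,))                              # one target position per sample
    mask = (torch.arange(T)[None, :] < t[:, None]).float()  # 1 before the target, 0 after
    values = x.float() * mask                               # unknown entries set to 0
    logits = model(torch.cat((mask, values), dim = 1))      # shape (B, C)
    loss = nn.functional.cross_entropy(logits, x[torch.arange(B), t])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()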
Image auto-regression
MNIST samples are \(28 × 28\) gray-scale images. Pixels are in \([0, 255]\). For auto-regression, such a \(28 × 28\) image will be interpreted as a sequence of length \(784\), corresponding to the pixels visited from top to bottom, and from left to right.
Define two functions to convert between image tensors and the corresponding pixel sequences:
def seq2tensor(s):
    # sequences of 784 values -> 28 x 28 image tensors
    return s.reshape(-1, 1, 28, 28)

def tensor2seq(s):
    # 28 x 28 image tensors -> sequences of 784 values
    return s.reshape(-1, 28 * 28)
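For instance, with a hypothetical batch of MNIST images stored as an \(N \times 1 \times 28 \times 28\) tensor of integer pixel values in \([0, 255]\):
images = torch.randint(256, (64, 1, 28, 28))  # stand-in for real MNIST data
seq = tensor2seq(images)    # shape (64, 784), values in {0, ..., 255}
back = seq2tensor(seq)      # shape (64, 1, 28, 28), identical to images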
For the whole training process, see Lecture P20-22.
2. Causal convolution
Instead of predicting [the distribution of] one component, the model could make a prediction at every position of the sequence, that is
\[f:\mathbb{R}^T\rightarrow\mathbb{R}^{T\times C} \]In detail:
\[\begin{aligned} x_{1} & \leftarrow \text { sample }\left(f_{1}(0, \ldots, 0)\right) \\ x_{2} & \leftarrow \text { sample }\left(f_{2}\left(x_{1}, 0, \ldots, 0\right)\right) \\ x_{3} & \leftarrow \operatorname{sample}\left(f_{3}\left(x_{1}, x_{2}, 0, \ldots, 0\right)\right) \\ & \ldots \\ x_{T} & \leftarrow \text { sample }\left(f_{T}\left(x_{1}, x_{2}, \ldots, x_{T-1}, 0\right)\right) \end{aligned} \]where the \(0\)s simply fill in for unknown values, and the mask is not needed.
If additionally, the model is such that “future values” do not influence the prediction at a certain time, that is
\[\begin{aligned} \forall t, x_{1}, \ldots, x_{t}, \alpha_{1}, \ldots, \alpha_{T-t}, \beta_{1}, \ldots, \beta_{T-t} \\ & f_{t+1}\left(x_{1}, \ldots, x_{t}, \alpha_{1}, \ldots, \alpha_{T-t}\right)=f_{t+1}\left(x_{1}, \ldots, x_{t}, \beta_{1}, \ldots, \beta_{T-t}\right) \end{aligned} \]then in particular:
\[\begin{aligned} f_{1}(0, \ldots, 0) &=f_{1}\left(x_{1}, \ldots, x_{T}\right) \\ f_{2}\left(x_{1}, 0, \ldots, 0\right) &=f_{2}\left(x_{1}, \ldots, x_{T}\right) \\ f_{3}\left(x_{1}, x_{2}, 0, \ldots, 0\right) &=f_{3}\left(x_{1}, \ldots, x_{T}\right) \\ & \cdots \\ f_{T}\left(x_{1}, x_{2}, \ldots, x_{T-1}, 0\right) &=f_{T}\left(x_{1}, \ldots, x_{T}\right) \end{aligned} \]which provides a tremendous computational advantage during training, since
\[\begin{aligned} \ell(f, x) &=\sum_{t} \ell\left(f_{t}\left(x_{1}, \ldots, x_{t-1}, 0, \ldots, 0\right), x_{t}\right) \\ &=\sum_{t} \ell(\underbrace{f_{t}\left(x_{1}, \ldots, x_{T}\right)}_{f \text { is computed once }}, x_{t}) . \end{aligned} \]\(\large\text{For more details and illustrations, see the }\) Lecture.
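A common way of obtaining such a model with convolutions (a sketch, not necessarily the lecture's exact module) is to left-pad the input so that the output at a given position depends only on strictly earlier inputs:
# Strictly causal 1d convolution: pad the left end by kernel_size and drop
# the last output, so the output at position t only sees inputs at positions < t.
class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                                # x: (B, in_channels, T)
        x = nn.functional.pad(x, (self.kernel_size, 0))  # pad the left end only
        return self.conv(x)[:, :, :-1]                   # (B, out_channels, T)
Subsequent layers only need a left padding of kernel_size - 1, since their inputs at position \(t\) already depend only on \(x_1, \ldots, x_{t-1}\).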
3. Non-volume preserving networks
Original paper: NVP Networks
Related blog post: Blog
Given a dimension \(d\), a Boolean vector \(b \in \{0, 1\}^d\) and two mappings:
\[\begin{align} s &: \mathbb{R^d}\rightarrow\mathbb{R^d}\\ t &: \mathbb{R^d}\rightarrow\mathbb{R^d} \end{align} \]define a [fully connected] coupling layer as the transformation:
\[\begin{align} c: \mathbb{R^d}&\rightarrow \mathbb{R^d}\\ x&\rightarrow b \odot x+(1-b) \odot(x \odot \exp (s(b \odot x))+t(b \odot x)) \end{align} \]where \(\text{exp}\) is component-wise, and \(\odot\) is the Hadamard component-wise product. The quantities \(t\) and \(s\) stand respectively for translation and scale.
For clarity in what follows, \(b\) has all \(1\)s first, followed by \(0\)s, but this is not required:
\[b=(\underbrace{1,1, \ldots, 1}_{\Delta}, \underbrace{0,0, \ldots, 0}_{d-\Delta}) \]\(\large\text{Illustration: }\) Lecture-P14
The first property of this mapping is that it is invertible: since \(b \odot c(x)=b\odot x\), the components selected by \(b\) are unchanged, and for \(y=c(x)\) the others are recovered with \(x=b\odot y+(1-b)\odot\big((y-t(b\odot y))\odot \exp (-s(b\odot y))\big)\).
The second property of this mapping is the simplicity of its Jacobian, which is triangular by blocks (see Lecture-P16), and we have
\[\log \left|J_{c}(x)\right|=\sum_{i}\left(\left(1-b\right) \odot s\left(b \odot x\right)\right)_{i} \]
\(\text{Code:}\)
import torch
from torch import nn, autograd

dim = 6
x = torch.randn(1, dim).requires_grad_()
b = torch.zeros(1, dim)
b[:, :dim//2] = 1.0
s = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
t = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
c = b * x + (1 - b) * (x * torch.exp(s(b * x)) + t(b * x))
# Flexing a bit
j = torch.cat([autograd.grad(c_k, x, retain_graph=True)[0] for c_k in c[0]])
print(j)
prints
tensor([[ 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000],
[ 0.4001, -0.3774, -0.9410, 1.0074, 0.0000, 0.0000],
[-0.1756, 0.0409, 0.0808, 0.0000, 1.2412, 0.0000],
[ 0.0875, -0.3724, -0.1542, 0.0000, 0.0000, 0.6186]])
To recap, with \(f^{(k)},k=1,2,...,K\) coupling layers:
\[\begin{align} f = f^{(K)}\circ ... \circ f^{(1)} \end{align} \]and \(x_n^{(0)} = x_n, x_n^{(k)} = f^{(k)}(x_n^{(k-1)})\), we train by minimizing
\[\mathscr{L}(f)=-\sum_{n}\left(-\frac{1}{2}\left(\left\|x_{n}^{(K)}\right\|^{2}+d \log 2 \pi\right)+\sum_{k=1}^{K} \log \left|J_{f^{(k)}}\left(x_{n}^{(k-1)}\right)\right|\right) \]with
\[\log \left|J_{f^{(k)}}(x)\right|=\sum_{i}\left(\left(1-b^{(k)}\right) \odot s^{(k)}\left(x \odot b^{(k)}\right)\right)_{i} \]A coupling layer can be implemented with:
class NVPCouplingLayer(nn.Module):
    def __init__(self, map_s, map_t, b):
        super().__init__()
        self.map_s = map_s
        self.map_t = map_t
        self.register_buffer('b', b.unsqueeze(0))

    def forward(self, x, ldj): # ldj for log det Jacobian
        s, t = self.map_s(self.b * x), self.map_t(self.b * x)
        ldj = ldj + ((1 - self.b) * s).sum(1)
        y = self.b * x + (1 - self.b) * (torch.exp(s) * x + t)
        return y, ldj

    def invert(self, y):
        s, t = self.map_s(self.b * y), self.map_t(self.b * y)
        return self.b * y + (1 - self.b) * (torch.exp(-s) * (y - t))
We can then define a complete network with one-hidden layer tanh MLPs for the \(s\) and \(t\) mappings:
class NVPNet(nn.Module):
    def __init__(self, dim, hidden_dim, depth):
        super().__init__()
        b = torch.empty(dim)
        self.layers = nn.ModuleList()
        for d in range(depth):
            if d % 2 == 0:
                i = torch.randperm(b.numel())[0:b.numel() // 2]
                b.zero_()[i] = 1
            else:
                b = 1 - b
            map_s = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                  nn.Linear(hidden_dim, dim))
            map_t = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Tanh(),
                                  nn.Linear(hidden_dim, dim))
            self.layers.append(NVPCouplingLayer(map_s, map_t, b.clone()))

    def forward(self, x, ldj):
        for m in self.layers: x, ldj = m(x, ldj)
        return x, ldj

    def invert(self, y):
        for m in reversed(self.layers): y = m.invert(y)
        return y
- torch.randperm(n): Returns a random permutation of integers from \(0\) to \(n - 1\).
- torch.numel(input): Returns the total number of elements in the input tensor.
And the log-proba of individual samples of a batch:
import math

def LogProba(x, ldj):
    log_p = - 0.5 * (x**2 + math.log(2 * math.pi)).sum(1) + ldj
    return log_p
Training is achieved by maximizing the mean log-proba:
from torch import optim

batch_size = 100

model = NVPNet(dim = 2, hidden_dim = 2, depth = 4)
optimizer = optim.Adam(model.parameters(), lr = 1e-2)

for e in range(args.nb_epochs):
    for input in train_input.split(batch_size):
        output, ldj = model(input, 0)
        loss = - LogProba(output, ldj).mean()
        model.zero_grad()
        loss.backward()
        optimizer.step()
Finally, we can sample according to \(\mu_X\) with
z = torch.randn(nb_generated_samples, 2)
x = model.invert(z)
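A quick sanity check of invertibility (with the trained model above): mapping samples forward and then back should recover them up to numerical precision.
x = torch.randn(5, 2)
y, _ = model(x, 0)
print((model.invert(y) - x).abs().max())  # close to 0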