Deep Learning Week8 Notes

1. Computer Vision Task

\(\textbf{ROC}:\) The ROC curve shows the true positive rate as a function of the false positive rate. Each point on the curve corresponds to a \(\textbf{threshold}\) (not shown on the curve itself), which is the value above which a sample is predicted to be of class 1.
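
To make the role of the threshold concrete, here is a minimal pure-Python sketch that sweeps a threshold over a handful of made-up scores and records one (false positive rate, true positive rate) pair per threshold. The name `roc_points`, the toy data, and the choice of `>=` for "above the threshold" are illustrative assumptions, not part of the lecture.

```python
def roc_points(scores, labels):
    """Return one (FPR, TPR) pair per candidate threshold.

    A sample is predicted as class 1 when its score is >= the threshold.
    """
    nb_pos = sum(labels)
    nb_neg = len(labels) - nb_pos
    points = []
    # Use each distinct score as a threshold, from the strictest to the loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / nb_neg, tp / nb_pos))
    return points


# Toy data (made up): a higher score means "more likely class 1".
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add positive predictions, so both rates are non-decreasing along the sweep and the last point is always (1, 1).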

Object Detection

Let \(\hat{B}\) be the predicted bounding box and \(B\) the annotated bounding box. A detection is considered correct if the \(\textbf{Intersection over Union}\) (IoU) is large enough:

\[\frac{area(B\cap \hat{B})}{area(B\cup \hat{B})}\geq \frac{1}{2} \]
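
The IoU criterion can be sketched in a few lines of plain Python. The box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


annotated = (0.0, 0.0, 2.0, 2.0)
predicted = (1.0, 0.0, 3.0, 2.0)  # same size, shifted by half its width

score = iou(annotated, predicted)
is_correct = score >= 0.5
print(score, is_correct)
```

Here the two boxes overlap on half of their area, yet the IoU is only 1/3, so the detection would be rejected under the 1/2 criterion.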

Image segmentation

Image segmentation consists of labeling each pixel with the class of the object it belongs to, and may also involve predicting the instance it belongs to.
\(\textbf{Segmentation Accuracy (SA)}\) for class \(c\) is defined as:

\[SA = \frac{N_{Y=c,\hat{Y}=c}}{N_{Y=c,\hat{Y}=c}+N_{Y\neq c,\hat{Y}=c}+N_{Y=c,\hat{Y}\neq c}} \]

where \(Y\) is the annotated class of a pixel, \(\hat{Y}\) is its predicted class, and \(N\) is the number of pixels satisfying the subscripted condition.
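
As a rough illustration of the SA formula, the sketch below counts the three kinds of pixels over two flat lists of labels. The function name, the toy label maps, and the choice of returning `nan` when class \(c\) appears in neither list are assumptions for this example.

```python
def segmentation_accuracy(y_true, y_pred, c):
    """SA for class c, given flat sequences of annotated and predicted pixel labels."""
    pairs = list(zip(y_true, y_pred))
    both = sum(1 for t, p in pairs if t == c and p == c)       # Y = c,  Y_hat = c
    false_pos = sum(1 for t, p in pairs if t != c and p == c)  # Y != c, Y_hat = c
    false_neg = sum(1 for t, p in pairs if t == c and p != c)  # Y = c,  Y_hat != c
    denom = both + false_pos + false_neg
    # Assumption: SA is undefined (nan) when class c is absent from both maps.
    return both / denom if denom > 0 else float("nan")


# Toy 6-pixel image with three classes.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

sa_per_class = {c: segmentation_accuracy(y_true, y_pred, c) for c in (0, 1, 2)}
print(sa_per_class)
```

Since both kinds of error appear in the denominator, SA for a class is the IoU between the annotated and predicted pixel sets of that class.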

2. Networks for image classification

Standard model: the LeNet family. These networks share a common structure: several convolutional layers, seen as a feature extractor, followed by fully connected layers acting as the classifier.

For example, \(\textbf{AlexNet}\):

import torchvision
alexnet = torchvision.models.alexnet()

\(\textbf{LeNet5}\): \(10\) classes, input \(1\times 28\times 28\):

(features): Sequential (
(0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(1): ReLU (inplace)
(2): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(4): ReLU (inplace)
(5): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
)

(classifier): Sequential (
(0): Linear (256 -> 120)
(1): ReLU (inplace)
(2): Linear (120 -> 84)
(3): ReLU (inplace)
(4): Linear (84 -> 10)
)

\(\textbf{AlexNet}\): \(1,000\) classes, input \(3\times 224\times 224\).

(features): Sequential (
(0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
(1): ReLU (inplace)
(2): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
(3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(4): ReLU (inplace)
(5): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
(6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(7): ReLU (inplace)
(8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): ReLU (inplace)
(10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU (inplace)
(12): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
)

(classifier): Sequential (
(0): Dropout (p = 0.5)
(1): Linear (9216 -> 4096)
(2): ReLU (inplace)
(3): Dropout (p = 0.5)
(4): Linear (4096 -> 4096)
(5): ReLU (inplace)
(6): Linear (4096 -> 1000)
)
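
The input sizes of the first `Linear` layers (256 for LeNet5, 9216 for AlexNet) follow from the kernel sizes, strides and paddings listed above. The short sketch below redoes that arithmetic. It assumes square inputs, dilation 1, floor rounding, and the standard \(224\times 224\) ImageNet crop for AlexNet; the helper names are made up for this example.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def features_output_size(size, layers):
    """Apply a list of (kernel, stride, padding) layers to a square input."""
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
    return size


# (kernel, stride, padding) of the conv and pooling layers; ReLU keeps the size.
lenet5_layers = [(5, 1, 0), (2, 2, 0), (5, 1, 0), (2, 2, 0)]
alexnet_layers = [(11, 4, 2), (3, 2, 0), (5, 1, 2), (3, 2, 0),
                  (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 0)]

lenet5_map = features_output_size(28, lenet5_layers)     # 28 -> 24 -> 12 -> 8 -> 4
alexnet_map = features_output_size(224, alexnet_layers)  # 224 -> 55 -> 27 -> 27 -> 13 -> ... -> 6

# Number of features fed to the classifier: channels x height x width.
lenet5_features = 16 * lenet5_map ** 2
alexnet_features = 256 * alexnet_map ** 2
print(lenet5_features, alexnet_features)
```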

\(\text{Data Augmentation}\) is used to reduce over-fitting.

At test time, the prediction is averaged over five random crops and their horizontal reflections.
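
The test-time averaging can be sketched without any deep-learning library. In the example below the image is a nested list of gray levels and `toy_predict` is a hypothetical stand-in for the network; only the crop, flip and average logic is meant to be illustrative.

```python
import random


def crop(image, top, left, size):
    """Square crop of a 2d image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Horizontal reflection: reverse every row."""
    return [row[::-1] for row in image]


def averaged_prediction(image, predict, crop_size, nb_crops=5, seed=0):
    """Average class scores over random crops and their horizontal reflections."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    views = []
    for _ in range(nb_crops):
        top = rng.randint(0, height - crop_size)
        left = rng.randint(0, width - crop_size)
        patch = crop(image, top, left, crop_size)
        views += [patch, hflip(patch)]
    all_scores = [predict(v) for v in views]
    nb_classes = len(all_scores[0])
    return [sum(s[k] for s in all_scores) / len(all_scores) for k in range(nb_classes)]


def toy_predict(patch):
    """Hypothetical two-class 'model': scores derived from the mean gray level."""
    values = [v for row in patch for v in row]
    m = sum(values) / len(values)
    return [m, 1.0 - m]


image = [[((r * 6 + c) % 7) / 6 for c in range(6)] for r in range(6)]
result = averaged_prediction(image, toy_predict, crop_size=4)
print(result)
```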

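\(\text{Example:}\) Using a pre-trained model for image classification.

import ast
import torch, torchvision
from PIL import Image

# Load the image as a float tensor with values in [0, 1]
to_tensor = torchvision.transforms.ToTensor()
img = to_tensor(Image.open('../example_images/blacklab.jpg'))

img = img.unsqueeze(0) # (batch_size, C, H, W)
# Normalize to mean 0.5 and standard deviation 0.5
img = 0.5 + 0.5 * (img - img.mean()) / img.std()

# Load and evaluate the network
# (recent torchvision versions prefer the `weights` argument over `pretrained`)
alexnet = torchvision.models.alexnet(pretrained = True)
alexnet.eval() # disable dropout at inference time
with torch.no_grad():
    output = alexnet(img) # (1, 1000)

# Print the 12 highest-scoring classes
scores, indexes = output.view(-1).sort(descending = True)
# The file is expected to hold a Python dict literal mapping class id to name;
# ast.literal_eval parses it without executing arbitrary code
with open('imagenet1000_clsid_to_human.txt', 'r') as f:
    class_names = ast.literal_eval(f.read())
for k in range(12):
    print(f'#{k+1} {scores[k].item():.02f} {class_names[indexes[k].item()]}')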

Fully convolutional networks

See Lecture from P\(14\), StackExchange

Standard convolutional networks reshape the tensor \(x^{(l)}\) produced by the convolutional layers into a \(1d\) tensor before feeding it to the fully connected layers that make up the classifier of the model.

\(\textbf{Conversely:}\) we can replace the fully connected layers with convolutional layers whose filters are as large as their input tensors. The resulting network can then process inputs larger than the original one, producing a map of outputs instead of a single vector.

\(\text{Code:}\)

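import torch

def convolutionize(layers, input_size):
    result_layers = []
    # Dummy input used to track the activation shape through the layers
    x = torch.zeros((1, ) + input_size)

    for m in layers:
        if isinstance(m, torch.nn.Linear):
            # Replace the linear layer by a convolution whose kernel
            # covers the whole input map
            n = torch.nn.Conv2d(in_channels = x.size(1),
                                out_channels = m.weight.size(0),
                                kernel_size = (x.size(2), x.size(3)))

            # Copy the parameters; both layouts share the same flattened ordering
            with torch.no_grad():
                n.weight.view(-1).copy_(m.weight.view(-1))
                n.bias.view(-1).copy_(m.bias.view(-1))
            m = n

        result_layers.append(m)
        x = m(x)
    return result_layers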
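
The equivalence that `convolutionize` relies on can be checked by hand on a tiny case. The following pure-Python sketch (all names and numbers are made up for illustration) applies the same weights once as a linear layer on a flattened \(1\times 2\times 2\) input, and once as a convolution with a \(2\times 2\) kernel, then reuses the convolution on a larger input.

```python
def flatten(x):
    """Flatten a C x H x W nested list in row-major order, like tensor.view(-1)."""
    return [v for channel in x for row in channel for v in row]


def linear(weight, bias, x_flat):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x_flat)) + b
            for row, b in zip(weight, bias)]


def conv2d_full_kernel(weight, bias, x, kh, kw):
    """Convolution (no padding, stride 1) whose kernels span all channels.

    Each row of `weight` is one kernel of shape C x kh x kw, stored flattened.
    """
    nb_channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for w_row, b in zip(weight, bias):
        plane = []
        for i in range(height - kh + 1):
            line = []
            for j in range(width - kw + 1):
                window = [x[c][i + di][j + dj]
                          for c in range(nb_channels)
                          for di in range(kh)
                          for dj in range(kw)]
                line.append(sum(w * v for w, v in zip(w_row, window)) + b)
            plane.append(line)
        out.append(plane)
    return out


weight = [[1, 0, -1, 2], [0, 1, 1, 0]]  # 2 output units, 1 * 2 * 2 inputs
bias = [1, -1]

x_small = [[[1, 2], [3, 4]]]            # same size as the kernel
linear_out = linear(weight, bias, flatten(x_small))
conv_out = conv2d_full_kernel(weight, bias, x_small, 2, 2)

x_large = [[[1, 2, 0], [3, 4, 1], [0, 1, 2]]]  # larger input: the output becomes a map
map_out = conv2d_full_kernel(weight, bias, x_large, 2, 2)
print(linear_out, conv_out, map_out)
```

On the small input the convolution produces a \(1\times 1\) map per output channel that matches the linear layer; on the larger input each output channel becomes a \(2\times 2\) map, one value per position of the sliding window.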

3. Networks for object detection

While image classification aims at predicting the class of the main object in the image, object detection aims at not only predicting the classes of all the objects which are visible, but also their locations.

\(\large\text{Overfeat}:\) adds a regression part to predict the object's bounding box. (See Lecture-P3)

In the single-object case, the convolutional layers are \(\textbf{frozen}\), and the localization layers are trained with an \(L_2\) loss.

For multiple boxes, using class-specific localization layers did not provide better results than having a \(\textbf{single one shared}\) across classes.

\(\large\text{One of the most famous algorithms: } \textbf{YOLO (You Only Look Once)}\). It comes back to a classical architecture with a series of convolutional layers followed by a few fully connected layers.

In detail, it uses leaky ReLU, and its convolutional layers make use of \(1\times 1\) bottleneck filters (Lin et al., 2013) to control the memory footprint and computational cost.

Illustration: Lecture-P8

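During training, YOLO assumes that each of the \(S^2\) cells contains at most one object.
For every image, let \(i=1,...,S^2\) be the cell index, \(j=1,...,B\) the predicted box index, and \(c = 1,...,C\) the class index. The indicator \(1_{i}^{obj}\) is \(1\) if cell \(i\) contains an object, and \(1_{i,j}^{obj}\) is \(1\) if, in addition, predicted box \(j\) is the one responsible for that object; both are \(0\) otherwise.

Training then minimizes: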

\[\begin{align} &\lambda_{coord}\sum_{i=1}^{S^2}\sum_{j=1}^B1_{i,j}^{obj}[(x_i-\hat{x}_{i,j})^2+(y_i-\hat{y}_{i,j})^2+(\sqrt{w_i}-\sqrt{\hat{w}_{i,j}})^2+(\sqrt{h_i}-\sqrt{\hat{h}_{i,j}})^2]\\ &+\lambda_{obj}\sum_{i=1}^{S^2}\sum_{j=1}^B1_{i,j}^{obj}(c_{i,j}-\hat{c}_{i,j})^2+\lambda_{noobj}\sum_{i=1}^{S^2}\sum_{j=1}^B(1-1_{i,j}^{obj})\hat{c}_{i,j}^2\\ &+\lambda_{classes}\sum_{i=1}^{S^2}1_{i}^{obj}\sum_{c=1}^C(p_{i,c}-\hat{p}_{i,c})^2 \end{align} \]

The first part of the loss aims at minimizing the localization error of the detection. The square root reduces the weight of the width and height (\(w_i, h_i\)) relative to the box location (\(x_i, y_i\)).

The second part of the loss trains the confidence \(\hat{c}_{i,j}\) of a detection to reflect the intersection over union \(c_{i,j}\) of that bounding box with the ground truth. When there is no object, we want the confidence \(\hat{c}_{i,j}\) of that bounding box to be low; this part is weighted by \(\lambda_{noobj}\).

The last part of the loss is for the class score. Note that while a natural choice is the cross-entropy, it is here a quadratic error.
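
The three parts of the loss can be traced on a toy example. The sketch below is a direct, unoptimized transcription of the formula for a single image. The data layout (one dict per cell, with the responsible box index and the target IoU supplied by hand) and all numbers are assumptions for illustration; the weights \(\lambda_{coord}=5\) and \(\lambda_{noobj}=0.5\) are the values reported in the YOLO paper, while the other two are simply set to 1 here.

```python
import math


def yolo_loss(cells, l_coord, l_obj, l_noobj, l_classes):
    """Sum of the localization, confidence and class terms over all cells."""
    loss = 0.0
    for cell in cells:
        target = cell["target"]  # None when the cell contains no object
        for j, (x, y, w, h, conf) in enumerate(cell["boxes"]):
            if target is not None and j == target["box"]:
                # Responsible box: localization error and confidence vs. IoU.
                loss += l_coord * ((target["x"] - x) ** 2
                                   + (target["y"] - y) ** 2
                                   + (math.sqrt(target["w"]) - math.sqrt(w)) ** 2
                                   + (math.sqrt(target["h"]) - math.sqrt(h)) ** 2)
                loss += l_obj * (target["iou"] - conf) ** 2
            else:
                # Any other box should have a low confidence.
                loss += l_noobj * conf ** 2
        if target is not None:
            # Quadratic error on the class scores, only for cells with an object.
            loss += l_classes * sum((p - p_hat) ** 2
                                    for p, p_hat in zip(target["classes"],
                                                        cell["class_scores"]))
    return loss


weights = dict(l_coord=5.0, l_obj=1.0, l_noobj=0.5, l_classes=1.0)

object_cell = {
    "target": {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.16,
               "box": 0, "iou": 1.0, "classes": [1.0, 0.0]},
    "boxes": [(0.5, 0.5, 0.25, 0.16, 1.0), (0.2, 0.2, 0.04, 0.04, 0.0)],
    "class_scores": [1.0, 0.0],
}
empty_cell = {
    "target": None,
    "boxes": [(0.1, 0.1, 0.01, 0.01, 0.0), (0.3, 0.3, 0.09, 0.09, 0.0)],
    "class_scores": [0.5, 0.5],
}
# Same empty cell, but its first box wrongly reports a confidence of 0.2.
overconfident_cell = dict(empty_cell,
                          boxes=[(0.1, 0.1, 0.01, 0.01, 0.2), (0.3, 0.3, 0.09, 0.09, 0.0)])

perfect_loss = yolo_loss([object_cell, empty_cell], **weights)
noisy_loss = yolo_loss([object_cell, overconfident_cell], **weights)
print(perfect_loss, noisy_loss)
```

The class scores of the empty cell do not contribute at all, since the last term is gated by \(1_{i}^{obj}\); only the spurious confidence is penalized, through \(\lambda_{noobj}\).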

\(\large\textbf{Tricks for training:}\) Lecture-P13

\(\large\textbf{Summarize: how 'one shot' can be achieved}\)

4. Networks for semantic segmentation

The deep-learning approach re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional.

A “\(\textbf{background}\)” class is added for pixels that do not belong to any of the defined objects, which avoids forcing the network to make an inconsistent choice.

Since segmentation aims at classifying individual pixels, the final tensor should have the same size as the input image. Because the activation maps have been reduced by pooling operations, their size has to be increased back.
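
One common way to increase the size back is a transposed convolution, which is not detailed in these notes; the 1d sketch below is therefore only an illustration under that assumption, with made-up helper names. It shows a pooling step halving the length of a signal and a transposed convolution with stride 2 restoring it.

```python
def max_pool1d(x, size):
    """Non-overlapping max pooling: the length is divided by `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def conv_transpose1d(x, kernel, stride):
    """1d transposed convolution without padding.

    Each input value adds a scaled copy of the kernel to the output,
    and the output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out


signal = [1, 3, 2, 5, 4, 4, 0, 1]          # length 8
pooled = max_pool1d(signal, 2)              # length 4
restored = conv_transpose1d(pooled, [1, 1], stride=2)  # back to length 8
print(pooled, restored)
```

With the fixed kernel `[1, 1]` this reduces to nearest-neighbor upsampling; in a segmentation network the kernel would be learned.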

5. DataLoader

torch.utils.data.DataLoader

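import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

data_dir = './data/mnist/' # assumed location, adapt as needed

train_transforms = transforms.Compose(
      [
          transforms.ToTensor(),
          # Normalize the pixel values with the given mean and standard deviation
          transforms.Normalize(mean = (0.1302,), std = (0.3069, ))
      ]
)

train_loader = DataLoader(
      datasets.MNIST(root = data_dir, train = True, download = True,
                     transform = train_transforms),
      batch_size = 100,
      num_workers = 4,
      shuffle = True,
      pin_memory = torch.cuda.is_available()
)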

num_workers: the number of worker subprocesses used on the CPU to load and prepare the mini-batches.

pin_memory: useful when training on the GPU. It allocates the samples in page-locked memory, which speeds up the transfer from CPU to GPU.
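
The effect of `batch_size` and `shuffle` can be pictured with a small stand-in written in plain Python. This is not how `DataLoader` is implemented; `mini_batches` is a hypothetical helper that only mimics the order and sizes of the index batches, assuming the last, smaller batch is kept.

```python
import random


def mini_batches(nb_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per mini-batch."""
    indices = list(range(nb_samples))
    if shuffle:
        # A new random order would be drawn at every epoch in practice.
        random.Random(seed).shuffle(indices)
    for start in range(0, nb_samples, batch_size):
        yield indices[start:start + batch_size]


batches = list(mini_batches(nb_samples=250, batch_size=100, shuffle=True))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

Every sample is visited exactly once per pass, and the final batch holds the 50 remaining samples.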

\(\large\text{Example:}\)

import os
import torch, torchvision
from torch import nn, utils
from torch.nn import functional as F
from torchvision import datasets

data_dir = os.environ.get('PYTORCH_DATA_DIR') or './data/cifar10/'

num_workers = 4
batch_size = 64

transform = torchvision.transforms.ToTensor()

train_set = datasets.CIFAR10(root = data_dir, train = True,
                             download = True, transform = transform)

train_loader = utils.data.DataLoader(train_set, batch_size = batch_size,
                                     shuffle = True, num_workers = num_workers)

test_set = datasets.CIFAR10(root = data_dir, train = False,
                            download = True, transform = transform)

test_loader = utils.data.DataLoader(test_set, batch_size = batch_size,
                                    shuffle = False, num_workers = num_workers)

class ResBlock(nn.Module):
  def __init__(self, nb_channels, kernel_size):
      super().__init__()
      
      self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                              padding = (kernel_size-1)//2)
      
      self.bn1 = nn.BatchNorm2d(nb_channels)
      
      self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                            padding = (kernel_size-1)//2)
      
      self.bn2 = nn.BatchNorm2d(nb_channels)
  
  def forward(self, x):
      y = self.bn1(self.conv1(x))
      y = F.relu(y)
      y = self.bn2(self.conv2(y))
      y += x
      y = F.relu(y)
      return y

class Monster(nn.Module):
      def __init__(self, nb_blocks, nb_channels):
          super().__init__()
          
          alexnet = torchvision.models.alexnet(pretrained = True)

          self.features = nn.Sequential(alexnet.features[0], nn.ReLU(inplace = True))
          
          dummy = self.features(torch.zeros(1, 3, 32, 32)).size()
          alexnet_nb_channels = dummy[1]
          alexnet_map_size = tuple(dummy[2:4])
          
          self.conv = nn.Conv2d(alexnet_nb_channels, nb_channels, kernel_size = 1)
          
          self.resblocks = nn.Sequential(
              *(ResBlock(nb_channels, kernel_size = 3) for _ in range(nb_blocks))
          )

          self.avg = nn.AvgPool2d(kernel_size = alexnet_map_size)
          self.fc = nn.Linear(nb_channels, 10)
      
      def forward(self, x):
          x = self.features(x)
          x = F.relu(self.conv(x))
          x = self.resblocks(x)
          x = F.relu(self.avg(x))
          x = x.view(x.size(0), -1)
          x = self.fc(x)
          return x

nb_epochs = 50
nb_blocks, nb_channels = 8, 64

model, criterion = Monster(nb_blocks, nb_channels), nn.CrossEntropyLoss()

# Use the GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model.to(device)
criterion.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)

for e in range(nb_epochs):
    # Freeze the features during the first half of the epochs
    for p in model.features.parameters():
        p.requires_grad = e >= nb_epochs // 2

    acc_loss = 0.0

    for input, targets in train_loader:
        input, targets = input.to(device), targets.to(device)
        output = model(input)
        loss = criterion(output, targets)
        acc_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(e, acc_loss)

nb_test_errors, nb_test_samples = 0, 0

model.eval()

# No gradients are needed for evaluation
with torch.no_grad():
    for input, targets in test_loader:
        input, targets = input.to(device), targets.to(device)
        output = model(input)

        # Winner-take-all: index of the highest score for each sample
        wta = torch.argmax(output, 1)

        nb_test_samples += targets.size(0)
        nb_test_errors += (wta != targets).sum().item()

test_error = 100 * nb_test_errors / nb_test_samples
print(f'test_error {test_error:.02f}% ({nb_test_errors}/{nb_test_samples})')
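
The shapes inside `Monster` can be checked by hand instead of with a dummy forward pass. The sketch below redoes the arithmetic, assuming \(32\times 32\) CIFAR10 images and the first AlexNet convolution with kernel 11, stride 4 and padding 2; `out_size` is a helper made up for this example.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# First AlexNet convolution applied to a 32x32 image.
alexnet_map_size = out_size(32, kernel=11, stride=4, padding=2)

# ResBlock convolutions: kernel_size 3 with padding (3 - 1) // 2 = 1 keeps the size,
# which is what allows the residual sum y += x.
resblock_map_size = out_size(alexnet_map_size, kernel=3, padding=(3 - 1) // 2)

# Average pooling with a kernel as large as the map leaves one value per channel.
pooled_size = out_size(resblock_map_size, kernel=alexnet_map_size,
                       stride=alexnet_map_size)

print(alexnet_map_size, resblock_map_size, pooled_size)
```

After the pooling, each sample is a vector of `nb_channels` values, which matches the input size of the final `nn.Linear(nb_channels, 10)`.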

Source: https://www.cnblogs.com/xinyu04/p/16332984.html