Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs)?

We systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for the backbone, feature network, and box/class prediction networks at the same time.


Challenge 1: efficient multi-scale feature fusion
Since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. We propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features.
【CC】The observation is that features at different scales contribute unequally to the final output; based on this, the authors design a bidirectional pyramid structure with learnable weights for feature fusion. Could an MLP be used to learn the weights instead? Likewise, could self-attention work? Someone has already done this.

Challenge 2: model scaling
Recently, [36] demonstrates remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. We observe that scaling up the feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. We propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for the backbone, feature network, and box/class prediction network.
【CC】This builds on prior work: jointly scaling the backbone, head, and resolution has a considerable effect on final accuracy. Following this idea, the authors scale EfficientNet + BiFPN + head + resolution together at different levels, producing their own model family called EfficientDet.


Multi-scale feature fusion aims to aggregate features at different resolutions. Formally, given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, our goal is to find a transformation $f$ that can effectively aggregate different features and output a list of new features: $\vec{P}^{out} = f(\vec{P}^{in})$.
【CC】A formal statement of the FPN fusion problem: given a set of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, find an efficient function $f$ such that $\vec{P}^{out} = f(\vec{P}^{in})$.
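To make the abstract transformation f concrete, here is a minimal sketch of the conventional top-down FPN as one instance of it: each level's output is its own input plus the upsampled output of the coarser level above. Everything in it is illustrative and assumed rather than taken from these notes: features are 1D lists instead of tensors, `conv` is an identity stand-in, and the names `top_down_fpn`, `upsample2x`, and `conv` are hypothetical.

```python
def upsample2x(x):
    """Nearest-neighbor upsampling of a 1D feature: repeat each value twice."""
    return [v for v in x for _ in range(2)]


def conv(x):
    """Stand-in for the per-level convolution (identity in this toy sketch)."""
    return list(x)


def top_down_fpn(p_in):
    """One possible f: maps {level: feature} to fused {level: feature}.

    The coarsest level is processed first; every finer level adds the
    upsampled output of the level directly above it.
    """
    levels = sorted(p_in, reverse=True)          # e.g. [7, 6, 5, 4, 3]
    p_out = {levels[0]: conv(p_in[levels[0]])}
    for upper, level in zip(levels, levels[1:]):
        resized = upsample2x(p_out[upper])       # match the finer resolution
        fused = [a + b for a, b in zip(p_in[level], resized)]
        p_out[level] = conv(fused)
    return p_out


# Toy inputs (assumed): level l holds 2**(7 - l) values, all equal to 1.0.
p_in = {level: [1.0] * (2 ** (7 - level)) for level in range(3, 8)}
p_out = top_down_fpn(p_in)
for level in sorted(p_out, reverse=True):
    print(level, len(p_out[level]), p_out[level][0])
```

Information flows in one direction only here (coarse to fine), which is the limitation the bidirectional designs discussed next try to address.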

This paper proposes several optimizations for cross-scale connections:
First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will contribute less to a feature network that aims at fusing different features.

Second, we add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.

Third, unlike PANet [23] that only has one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.
【CC】The bidirectional connection idea itself is very similar to PANet; what the authors emphasize is stacking the BiFPN block multiple times to perform multi-scale feature fusion.

When fusing features with different resolutions, a common way is to first resize them to the same resolution and then sum them up.
We observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input, and let the network learn the importance of each input feature.
【CC】The usual way to fuse features of different scales: convert the other scales to the current one by up/downsampling, then add them. Because features at different scales contribute unequally to the final output, learnable weights are added to improve performance.

Unbounded fusion:
$O = \sum_i w_i \cdot I_i$
where $w_i$ is a learnable weight that can be a scalar and $I_i$ is the i-th input feature. However, since the scalar weight is unbounded, it could potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.

Softmax-based fusion:
$O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$
An intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with value range from 0 to 1, representing the importance of each input. However, the extra softmax leads to significant slowdown on GPU hardware.

Fast normalized fusion:
$O = \sum_i \frac{w_i}{\tau + \sum_j w_j} \cdot I_i$
where $w_i \ge 0$ is ensured by applying a ReLU after each $w_i$, and $\tau = 0.0001$ is a small value to avoid numerical instability. Similarly, the value of each normalized weight also falls between 0 and 1, but since there is no softmax operation here, it is much more efficient.
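The three weighting schemes can be compared side by side with a small pure-Python sketch. It assumes one scalar weight per input and inputs that have already been resized to the same shape (flat lists here); the function names and the toy numbers are hypothetical, chosen only for illustration.

```python
import math


def unbounded_weights(w):
    """Unbounded fusion: use the raw learnable scalars unchanged."""
    return list(w)


def softmax_weights(w):
    """Softmax-based fusion: exponentiate and normalize so weights sum to 1."""
    m = max(w)                                   # shift for numerical stability
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


def fast_normalized_weights(w, eps=1e-4):
    """Fast normalized fusion: ReLU each weight, divide by (eps + sum)."""
    relu = [max(0.0, x) for x in w]
    total = eps + sum(relu)
    return [x / total for x in relu]


def fuse(inputs, weights):
    """Weighted sum of equally sized features (assumed already resized)."""
    return [sum(wt * feat[k] for wt, feat in zip(weights, inputs))
            for k in range(len(inputs[0]))]


# Toy data (assumed): three inputs with two values each, one raw weight each.
raw_w = [2.0, -1.0, 0.5]
features = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]

for name, fn in [("unbounded", unbounded_weights),
                 ("softmax", softmax_weights),
                 ("fast normalized", fast_normalized_weights)]:
    weights = fn(raw_w)
    print(name, weights, fuse(features, weights))
```

In this sketch the negative raw weight is clipped to zero by the ReLU in the fast variant, whereas softmax still assigns it a small positive share; both normalized variants keep every weight between 0 and 1.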

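Our final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
As a concrete example, here we describe the two fused features at level 6 for BiFPN shown in Figure 2(d):
$P^{td}_6 = \mathrm{Conv}\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot \mathrm{Resize}(P^{in}_7)}{w_1 + w_2 + \tau}\right)$
$P^{out}_6 = \mathrm{Conv}\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot \mathrm{Resize}(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \tau}\right)$
where $P^{td}_6$ is the intermediate feature at level 6 on the top-down pathway, and $P^{out}_6$ is the output feature at level 6 on the bottom-up pathway.
【CC】This example walks through the computation at P6. $P^{td}_6$ is the output of the first blue node counting from the left, computed from this level's input $P^{in}_6$ plus the resized original input of the level above, Resize($P^{in}_7$) (an upsampling, since P7 is coarser than P6); the result is passed both to the second blue node and down to the P5 level. The second blue node takes $P^{td}_6$, the resized level-5 output Resize($P^{out}_5$) (a downsampling), and this level's skip connection $P^{in}_6$.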
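The level-6 computation can be traced with a toy sketch. Several things are assumptions made only for illustration: feature maps are tiny nested lists, Resize is nearest-neighbor upsampling for the coarser P7 input and 2x2 max pooling for the finer P5 output, the convolution is replaced by an identity placeholder, and all weights are set to 1. The helper names are hypothetical.

```python
def fast_fuse(features, weights, eps=1e-4):
    """Fast normalized fusion of same-shape 2D feature maps."""
    w = [max(0.0, x) for x in weights]           # ReLU keeps weights >= 0
    norm = eps + sum(w)
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(wi * f[r][c] for wi, f in zip(w, features)) / norm
             for c in range(cols)] for r in range(rows)]


def upsample2x(x):
    """Nearest-neighbor upsampling: every value becomes a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(x):
    """2x2 max pooling with stride 2 (assumed choice of Resize)."""
    return [[max(x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1])
             for c in range(0, len(x[0]), 2)]
            for r in range(0, len(x), 2)]


def conv(x):
    """Placeholder for the convolution applied after fusion (identity here)."""
    return x


# Toy maps (assumed): P7 is 1x1, P6 is 2x2, P5 is 4x4.
p7_in = [[4.0]]
p6_in = [[1.0, 2.0], [3.0, 4.0]]
p5_out = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

w_td = [1.0, 1.0]            # weights for the top-down node
w_out = [1.0, 1.0, 1.0]      # weights for the output node

# Top-down node: level-6 input fused with the upsampled level-7 input.
p6_td = conv(fast_fuse([p6_in, upsample2x(p7_in)], w_td))
# Output node: skip connection + top-down feature + downsampled level-5 output.
p6_out = conv(fast_fuse([p6_in, p6_td, downsample2x(p5_out)], w_out))
print(p6_td)
print(p6_out)
```

Note that `p6_in` enters twice: once into the top-down node and once more through the same-level skip edge into the output node.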


We employ ImageNet-pretrained EfficientNets as the backbone network. Our proposed BiFPN serves as the feature network, which takes level 3-7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively.
Figure 3: EfficientDet architecture – It employs EfficientNet [36] as the backbone network, BiFPN as the feature network, and shared class/box prediction network. Both BiFPN layers and class/box net layers are repeated multiple times based on different resource constraints as shown in Table 1.

Compound Scaling
Aiming at optimizing both accuracy and efficiency, we would like to develop a family of models that can meet a wide spectrum of resource constraints.

We propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
【CC】φ acts as the scaling factor and affects every dimension of the network: backbone + BiFPN + head + resolution.

  1. Backbone network

We reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6.
【CC】Directly reuses the EfficientNet backbone family.

  2. BiFPN network

Formally, the BiFPN width $W_{bifpn}$ (#channels) and depth $D_{bifpn}$ (#layers) are scaled according to Equation 1.

  3. Box/class prediction network

We fix their width to always be the same as BiFPN (i.e., $W_{pred} = W_{bifpn}$), but linearly increase the depth (#layers) according to Equation 2.

  4. Input image resolution

Since feature levels 3-7 are used in BiFPN, the input resolution must be divisible by $2^7 = 128$, so we linearly increase resolutions according to Equation 3.
Notably, our scaling is heuristic-based and might not be optimal, but we will show that this simple scaling method can significantly improve efficiency compared with other single-dimension scaling methods.
Table 1: Scaling configs for EfficientDet D0-D6. φ is the compound coefficient that controls all other scaling dimensions; BiFPN, box/class net, and input size are scaled up using Equations 1, 2, and 3 respectively.
【CC】This is simply the table derived from Equations 1-3.
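These notes do not reproduce Equations 1-3 themselves. As a rough sketch of how a single φ could drive every dimension, the snippet below assumes the constants recalled from the published EfficientDet paper (W_bifpn = 64 · 1.35^φ, D_bifpn = 3 + φ, D_box = D_class = 3 + ⌊φ/3⌋, R_input = 512 + 128 · φ); they are not confirmed by the excerpt above and should be checked against the paper. The function name `compound_scaling` and the dictionary keys are hypothetical.

```python
def compound_scaling(phi):
    """Raw scaling values for compound coefficient phi.

    All constants are ASSUMED from the published paper, not from these notes.
    The BiFPN width is returned unrounded; a real config would likely round
    it to a hardware-friendly channel count.
    """
    return {
        "backbone": f"EfficientNet-B{phi}",      # reuse backbone B0..B6
        "bifpn_width": 64 * (1.35 ** phi),       # assumed Equation 1 (width)
        "bifpn_depth": 3 + phi,                  # assumed Equation 1 (depth)
        "head_depth": 3 + phi // 3,              # assumed Equation 2
        "resolution": 512 + 128 * phi,           # assumed Equation 3
    }


for phi in range(7):                             # D0 .. D6
    cfg = compound_scaling(phi)
    # Levels 3-7 imply a total stride of 2**7, so the input side length
    # has to be a multiple of 128.
    assert cfg["resolution"] % 128 == 0
    print(f"D{phi}", cfg)
```

Stepping the resolution in multiples of 128 is what keeps every scaled input compatible with the five pyramid levels; the other dimensions grow linearly or exponentially from the same φ.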
Figure 1: Model FLOPs vs. COCO accuracy – All numbers are for single-model single-scale. Our EfficientDet achieves new state-of-the-art 52.2% COCO AP with much fewer parameters and FLOPs than previous detectors. More studies on different backbones and FPN/NAS-FPN/BiFPN are in Table 4 and 5. Complete results are in Table 2.
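【CC】This is the comparison shown at the very start of the paper, and it is quite striking: faster than the competition, with higher accuracy as well.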

