Making VGG-style convnets great again with RepVGG

Sieun Park · Published in CodeX · Oct 26, 2021


VGG is considered one of the cornerstones of deep CNNs. The authors of VGG provided widely accepted principles for designing CNNs. In particular, they suggest that stacking multiple 3 × 3 convolutions is more efficient than using a single convolution with a larger kernel.
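As a quick back-of-the-envelope check of that principle (my own example, not from the paper): two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution, but use fewer weights and add an extra nonlinearity in between.

```python
# Weight count for C -> C channel convolutions (biases ignored).
channels = 64
stacked_3x3 = 2 * (3 * 3 * channels * channels)  # two 3x3 layers, 5x5 receptive field
single_5x5 = 5 * 5 * channels * channels         # one 5x5 layer
print(stacked_3x3, single_5x5)                   # 73728 vs 102400
```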

However, the VGG architecture, composed of nothing but successive stacks of 3 × 3 convolutions and ReLU, can seem crude and weak compared to complicated modern state-of-the-art CNN architectures, most of which involve complex convolution variants and neural architecture search (NAS). A recent paper suggests that, with a few techniques, this plain design can deliver comparable inference performance for image recognition.

RepVGG is an architecture that is designed like a multi-branch model (e.g. ResNet, Inception), but can be converted via structural re-parameterization into a VGG-like model with successive stacks of 3 × 3 convolutions and ReLU that yields the same results at inference. In this post, we will discuss the details of the proposed RepVGG architecture and why it can be useful.

In the paper, the authors present:

  • The inference-time advantages of simple architectures over certain components used in modern CNN architectures.
  • A smart re-parameterization trick for designing multi-branch models that can be converted into plain models after training.
  • RepVGG, a multi-branch architecture that converts into a VGG-style network and outperforms many more complicated models.

From the paper:

This paper is not merely a demonstration that plain models can converge reasonably well, and does not intend to train extremely deep ConvNets like ResNets. Rather, we aim to build a simple model with reasonable depth and favorable accuracy-speed trade-off, which can be simply implemented with the most common components (e.g., regular conv and BN) and simple algebra.

Original paper: RepVGG: Making VGG-style ConvNets Great Again

Possible advantages of VGG: fast, memory-economical, and flexible

While previous methods have proposed complicated but theoretically efficient network structures, they often don't perform as expected on real devices. Specifically, the authors argue that:

  1. Multi-branch designs such as skip connections in ResNets and branch concatenation in Inception networks are memory-inefficient because the results of every branch need to be kept until the addition or concatenation.
  2. Some components, typically the depthwise convolutions used in many efficient architectures (Xception, MobileNets, EfficientNetV2) and the channel shuffle in ShuffleNets, drastically increase the memory access cost and aren't fully optimized on hardware accelerators such as GPUs. As described in the table below, many recent multi-branch architectures have lower theoretical FLOPs than VGG but may not run faster.
Speed is measured in samples/second on a 1080Ti; higher is faster.

  3. Multi-branch networks are less flexible and harder to modify, because the architectural specification poses constraints when applying some techniques. For example, the multi-branch topology limits the application of channel pruning. In contrast, a plain architecture allows us to freely configure every conv layer according to our requirements.

These can be some of the reasons why VGG and the original versions of ResNets are still heavily used for real-world applications in academia and industry, despite their weaker accuracy.

The second issue, the weak GPU utilization of specific architectural designs and operations, is an especially well-known problem. Moreover, the 3 × 3 convolution, the main operation in VGG, is highly optimized by modern computing libraries like NVIDIA cuDNN and Intel MKL on GPU and CPU using the Winograd algorithm. The table below illustrates the practical advantage in computing time of 3 × 3 convolutions.
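If you want to observe this effect on your own hardware, a rough PyTorch timing sketch like the one below (my own, not from the paper) compares the average forward latency of a 3 × 3 conv against a larger kernel. Exact numbers depend heavily on the GPU, cuDNN version, and tensor shapes.

```python
import time
import torch

def time_conv(kernel_size, channels=128, size=56, batch=32, iters=50):
    """Rough average forward latency (seconds) of a single conv layer on GPU."""
    conv = torch.nn.Conv2d(channels, channels, kernel_size,
                           padding=kernel_size // 2).cuda().eval()
    x = torch.randn(batch, channels, size, size, device="cuda")
    with torch.no_grad():
        for _ in range(10):              # warm-up so cuDNN picks its algorithm
            conv(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            conv(x)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

# e.g. compare time_conv(3) against time_conv(5) or time_conv(7)
```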

Overview & Intuition

The intuition behind RepVGG is that the benefits of multi-branch networks apply only at training time, while their drawbacks, mainly slower inference, matter at test time.

However, one of the authors' claims seems questionable to me. From the paper:

An explanation is that a multi-branch topology, e.g., ResNet, makes the model an implicit ensemble of numerous shallower models [36], so that training a multi-branch model avoids the gradient vanishing problem.

Since the benefits of multi-branch architecture are all for training and the drawbacks are undesired for inference, …

The authors argue that the benefits of multi-branch models are all for training, yet they suggest only one case where the topology actually helps training. Still, they do make some convincing points, considering that the multi-branch architecture described in the paper will likely run faster after being converted into a plain architecture.

RepVGG is an architecture that is designed like a multi-branch model (e.g. ResNet, Inception), but can be converted via structural re-parameterization into a VGG-like model with successive stacks of 3 × 3 convolutions and ReLU, which yields the same results at inference while being highly optimized by modern computing libraries. This design benefits from the advantages of multi-branch models in training and the advantages of plain models in inference.

[36] Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.

RepVGG Architecture

ResNets construct a shortcut and model the information flow as y = x + f(x), where f is the learned residual block. When the dimensions of x and f(x) don't match because of strides, the flow is modeled as y = g(x) + f(x), where g is a 1 × 1 convolution. The building blocks of the training-time RepVGG network are constructed with similar shortcuts and 1 × 1 convolutions, resulting in the model y = x + g(x) + f(x). The architecture is illustrated in the figure above.

One difference is that the residual block in ResNets typically consists of two conv and ReLU layers, while RepVGG uses only one conv layer per branch. Batch normalization is applied after the operation of each branch, so the building block is precisely defined as y = ReLU( BN(x) + BN(g(x)) + BN(f(x)) ).
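To make the block definition concrete, here is a minimal PyTorch sketch of a training-time RepVGG block (stride 1, no groups; the class and names are mine, not the official implementation):

```python
import torch
import torch.nn as nn

class RepVGGBlockSketch(nn.Module):
    """y = ReLU( BN(x) + BN(1x1 conv(x)) + BN(3x3 conv(x)) ), training-time form."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1x1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)   # identity branch is BN only

    def forward(self, x):
        return torch.relu(self.bn_id(x)
                          + self.bn1x1(self.conv1x1(x))
                          + self.bn3x3(self.conv3x3(x)))
```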

Structural Re-parameterization

Structural re-parameterization is a sequence of steps to convert a trained block into a single 3 × 3 conv layer for inference by exploiting the design characteristics of the building block of RepVGG.

The first step of structural re-parameterization is fusing batch normalization into the convolution parameters.

Applying batch normalization to an input M at inference time can be described by the equation above, where μ, σ, γ, and β are, respectively, the accumulated mean, the accumulated standard deviation, and the learned scaling factor and bias of the BN layer.

A convolutional layer followed by batch normalization, BN(M∗W), can be expressed as a conv with a bias vector, as described in the equation below.

Specifically, W′ and b′ are defined by the equations below. We can easily verify that both sides of the equation above are equal given these parameters:

bn(M∗W, μ, σ, γ, β) = ((M∗W) − μ) × (γ/σ) + β

With W′ = W × (γ/σ) and b′ = β − μ × (γ/σ), we get

(M∗W′) + b′ = (M∗W) × (γ/σ) − μ × (γ/σ) + β = ((M∗W) − μ) × (γ/σ) + β
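In code, folding an inference-mode BN into the preceding conv amounts to scaling the kernel by γ/σ per output channel and building the bias β − μγ/σ. A minimal sketch (my own naming, assuming an nn.BatchNorm2d in eval mode):

```python
import torch

def fuse_conv_bn(conv_weight, bn):
    """Return (W', b') such that conv(x, W') + b' == bn(conv(x, W))."""
    sigma = torch.sqrt(bn.running_var + bn.eps)       # per-channel std
    scale = bn.weight / sigma                         # gamma / sigma
    fused_w = conv_weight * scale.reshape(-1, 1, 1, 1)
    fused_b = bn.bias - bn.running_mean * scale       # beta - mu * gamma / sigma
    return fused_w, fused_b
```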

This transformation is also applied to the identity branch, which can be viewed as a 1×1 conv with a fixed identity kernel. Thus, we convert the parameters of every branch using the definitions of W′ and b′ to get one 3×3 kernel, two 1×1 kernels, and three bias vectors.

After fusing the batch normalization operations by transforming the kernels, we add the kernels up into a single 3×3 conv kernel. This is done by first zero-padding the two 1×1 kernels to 3×3 and then adding the three kernels together, as shown in (B) in the figure above. The result is a single 3×3 conv layer that is equivalent to the three branches with their batch normalization. Cool!
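The merging step itself is just zero-padding and tensor addition. A sketch continuing the fusion above (stride 1, equal input and output channels, no groups; the identity branch is first written as a 1×1 kernel that is 1 on its channel diagonal):

```python
import torch
import torch.nn.functional as F

def identity_as_1x1_kernel(channels):
    """1x1 kernel that reproduces the identity mapping for a C -> C convolution."""
    w = torch.zeros(channels, channels, 1, 1)
    for i in range(channels):
        w[i, i, 0, 0] = 1.0
    return w

def merge_branches(w3, b3, w1, b1, w_id, b_id):
    """Collapse the fused 3x3, 1x1 and identity branches into one 3x3 conv."""
    w1_padded = F.pad(w1, [1, 1, 1, 1])        # zero-pad 1x1 kernels to 3x3
    w_id_padded = F.pad(w_id, [1, 1, 1, 1])
    return w3 + w1_padded + w_id_padded, b3 + b1 + b_id
```

At deployment time, the whole block then reduces to a single 3×3 conv (loaded with the merged weight and bias) followed by ReLU.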

Details & Scaling

RepVGG is VGG-style in the sense that it adopts a plain topology and heavily uses 3×3 conv, but it does not use max-pooling as the original VGG does. In fact, the architecture and training configurations are very different, since the authors utilize more modern improvements and design principles. RepVGG is designed following these principles (in the authors' words):

  • The first stage operates with large resolution, so we use only one layer for lower latency.
  • The last stage shall have more channels, so we use only one layer to save the parameters.
  • We put the most layers into the second last stage, following ResNet and its recent variants.

The result is an architecture like the table above. While these hyper-parameters leave room for improvement, they form a reasonable design similar to many deep CNNs. Additionally, groupwise convolutions with g = 1, 2, or 4 (the same g used globally) are applied to odd-numbered layers (3rd, 5th, 7th, …, 21st, …) to accelerate training and inference.

The authors scale the width of the model by increasing or decreasing the values of a and b, and scale the depth of the model with two configurations, RepVGG-A and RepVGG-B. The tricks and configurations used to further improve performance (e.g. learning rate decay, cosine annealing, data augmentation) are described in the paper.
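As I read the paper's width table, the per-stage channel counts are derived from the two multipliers roughly as follows; treat the base widths and the cap on the first stage as my reading of the table rather than an exact specification:

```python
def stage_widths(a, b):
    # Stages 1-4 scale with `a`; the final stage uses the larger multiplier `b`.
    # The first stage is additionally capped at 64 channels to keep the
    # high-resolution stage cheap.
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

# e.g. a RepVGG-B1-like setting with a=2, b=4 gives [64, 128, 256, 512, 2048]
```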

Experiments

The authors provide experiments on the performance of the RepVGG model in various settings. The results are "as expected," since the authors essentially train a ResNet variant, just one that is optimized to leverage the benefits of a plain architecture at inference time. Thus, it is significantly better in terms of the speed-accuracy tradeoff.

In an ablation study, the authors show that a plain training-time model of RepVGG-B0 only yields 72.39% accuracy, while the two additional branches improve it to 75.14%.

The authors also compare their structural re-parameterization, which does no harm to the final performance, with similar variants for fusing branches via re-parameterization. The variants are not described in detail here, but the comparisons with the proposed structural re-parameterization (full-featured reparam) demonstrate the effectiveness of RepVGG.

Discussion

The authors demonstrate the benefits of plain architectures as being fast, flexible, and memory-economical. To combine the benefits of both designs, they propose a network that is trained as a multi-branch network but can be converted into a plain network for inference. Coupled with modern training configurations, they show that a plain VGG-style network can match the speed-accuracy curves of popular modern CNNs.

I think of the RepVGG architecture as a compression method that accelerates ResNets by converting them into a plain architecture, which can be beneficial in many respects. It differs in practice from many previous re-parameterization techniques, which are often applied at initialization or during training, in that the method has no negative impact on the final performance.
