Swin Transformers: The most powerful tool in Computer Vision

Sieun Park
Sep 1, 2021 · 7 min read

The title is catchy, but it is also true (at least as of writing this article). Swin Transformers have demonstrated their potential as a game-changer in network architecture for many computer vision tasks, including object detection, image classification, semantic segmentation, and potentially any vision task. Ever since the original ViT (Vision Transformer) showed promising performance on image classification, it has seemed that Transformers could replace CNNs in vision problems a couple of papers down the line.

However, adapting Transformers to fully supplant convolutions was a non-trivial challenge, because the computational complexity of self-attention grows quadratically with image size. As the authors put it:

Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

In particular, the authors argue that compared to regular ViTs, Swin Transformers are far more computationally efficient on dense prediction tasks such as semantic segmentation, where pixel-level predictions are made. The paper describes the Swin Transformer as a hierarchical Transformer whose representation is computed with Shifted WINdows. In this post, we will review the key ideas behind Swin Transformers and see why they perform well on so many tasks.

This paper suggests the following:

  • A hierarchical representation that starts from small patches and gradually merges them into larger ones to achieve scale-invariance.
  • Efficient, linear computational complexity, achieved by computing self-attention locally within windows (the shifted window approach).
  • Extensive experiments on various tasks and model architectures.

Original paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Why CNNs Are Well Suited for Vision

The paper first revisits why CNNs are effective for modeling visual domains. The main disadvantages of Transformers for image data include:

  • Unlike word tokens in NLP, visual elements vary widely in scale, especially in object detection.
  • Images have much higher resolution than text, and vision tasks that require pixel-level predictions are intractable for Transformers because the computational complexity (and memory usage) of self-attention is quadratic in image size.

Network Architecture: Hierarchical Representation

Overview of the Swin-T (tiny) architecture

Looking at the pipeline as a whole, we can identify four distinct building blocks in the diagram above. First, the patch partition layer splits the input RGB image into non-overlapping patches. Each patch covers 4 × 4 pixels with 3 RGB channels (48 values) and is treated as a “token”. A linear embedding layer then projects each patch to a C-dimensional token, as in ViT.
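To make the tokenization concrete, here is a minimal PyTorch sketch of the patch partition + linear embedding step. The strided-convolution trick and the `PatchEmbed` name are implementation choices of mine, not taken from the paper's official code; patch_size=4 and embed_dim=96 follow the Swin-T configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of patch partition + linear embedding (Swin-T defaults)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution splits the image into non-overlapping
        # 4x4 patches and projects each 4*4*3 = 48-dim patch to embed_dim.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C) tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -> 56x56 tokens
```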

The main architecture is composed of multiple stages (four for Swin-T), each of which is built by connecting a patch merging layer to several Swin Transformer blocks.

The Swin Transformer block is built around a modified self-attention, which we will review shortly. Each block consists of a windowed multi-head self-attention (MSA) module, layer normalization (LN), and a 2-layer MLP, and these attention-based blocks serve as the computational backbone of the network. The number of tokens stays constant within a stage (H/4 × W/4 in the first stage).
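Here is a rough structural sketch of a single block in PyTorch. Note the assumptions: I use plain global nn.MultiheadAttention only as a stand-in for the (shifted-)window attention described later, and the class name and hyperparameters are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    """Block layout: LN -> MSA -> residual, then LN -> 2-layer MLP -> residual.
    The real Swin block replaces global attention with (S)W-MSA."""
    def __init__(self, dim=96, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # 2-layer MLP with GELU
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                         # x: (B, num_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]             # attention sub-block
        x = x + self.mlp(self.norm2(x))           # MLP sub-block
        return x

out = BlockSketch()(torch.randn(1, 56 * 56, 96))
print(out.shape)                                  # torch.Size([1, 3136, 96])
```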

The hierarchical representation is implemented through the patch merging layers. A patch merging layer concatenates the features of each 2 × 2 group of neighboring patches, which reduces the number of tokens by a factor of 4, and applies a linear transformation that sets the output dimension to twice the per-patch input dimension. Every time a merging layer is applied, the feature map resolution is halved and the channel dimension doubled, so the representation becomes increasingly coarse as the network gets deeper.
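The following is a minimal sketch of what a patch merging layer could look like, assuming the common "slice the 2 × 2 neighbors, concatenate, then project 4C -> 2C" formulation; names and the (x, H, W) calling convention are mine.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches (4C features)
    and project to 2C: token count drops 4x, channels double."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                    # x: (B, H*W, C)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four patches of every 2x2 neighborhood.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        x = x.view(B, -1, 4 * C)                   # (B, H/2 * W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2 * W/2, 2C)

merged = PatchMerging(96)(torch.randn(1, 56 * 56, 96), 56, 56)
print(merged.shape)  # torch.Size([1, 784, 192]) -> 28x28 tokens, 2C channels
```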

Illustration of the hierarchical representation

In the context of CNNs, we can intuitively understand the merging layers as pooling and transformer blocks as convolution layers. This approach enables the network to detect objects of various scales efficiently.

Shifted Windows

The shifted window approach starts from the observation that standard (and vision) Transformers compute self-attention over a global receptive field, which makes their computational complexity quadratic in the number of tokens. This limits applications such as semantic segmentation that require dense, high-resolution predictions.

The shifted window approach instead computes self-attention within local windows. A window contains M × M non-overlapping patches (M = 7 by default), and self-attention is computed inside each window. The complexity therefore drops from

Ω(MSA) = 4hwC² + 2(hw)²C

to

Ω(W-MSA) = 4hwC² + 2M²hwC,

where h × w is the number of patches: the original MSA is quadratic in the patch number hw, while window-based MSA is linear.
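As a quick sketch of where the linear cost comes from, the feature map can be reshaped into independent M × M windows and attention computed per window. The `window_partition` helper below is my own illustration (the official code keeps the 2-D window shape; here I flatten directly to tokens).

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows,
    returning (num_windows * B, M*M, C). Self-attention is then computed
    independently inside each window, so the cost grows linearly with H*W
    instead of quadratically."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

feat = torch.randn(1, 56, 56, 96)   # stage-1 feature map, M = 7
win = window_partition(feat, 7)
print(win.shape)                    # torch.Size([64, 49, 96]) -> 8x8 windows
```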

To model connections between windows while retaining this efficiency, the paper adopts a shifted window partitioning scheme that alternates between two configurations in consecutive Swin Transformer blocks. As illustrated in the figure below, the first block uses the standard window configuration, computing self-attention locally in evenly partitioned windows starting from the top-left pixel. The next block adopts a window configuration that is shifted by (⌊M/2⌋, ⌊M/2⌋) pixels relative to the preceding layer.

Illustration of shifted window approach

Consecutive Swin Transformer blocks therefore alternate between the standard window configuration (W-MSA) and the shifted window configuration (SW-MSA). This introduces connections between neighboring windows, much like the overlapping receptive fields of deep convolutions. The ablation in the table below shows that the shifted window approach significantly improves the performance and efficiency of the Transformer.

According to the paper, adding a relative position bias when computing self-attention is also effective. The modified self-attention is

Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V,

i.e. the usual Q, K, V attention with a bias term B that encodes the relative position between each pair of tokens within the window. This brings significant performance improvements, as described in the table below.
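A minimal sketch of this modified attention is shown below. For simplicity the bias B is passed in directly as a dense M² × M² tensor (zeros in the demo); in the actual model it is gathered from a small learned table indexed by relative positions, which this sketch does not reproduce.

```python
import math
import torch

def window_attention_with_bias(Q, K, V, B):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V, where B is a
    relative position bias shared across all windows."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (..., M^2, M^2)
    scores = scores + B                              # add relative position bias
    return torch.softmax(scores, dim=-1) @ V

M = 7
Q = K = V = torch.randn(64, M * M, 32)   # 64 windows, 49 tokens, head dim 32
bias = torch.zeros(M * M, M * M)         # learned in practice, zeros here
out = window_attention_with_bias(Q, K, V, bias)
print(out.shape)                         # torch.Size([64, 49, 32])
```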

Efficient shifting

After shifting, the windows at the boundary become smaller than M × M. For efficient processing of these edge windows, the paper applies a cyclic shift toward the top-left before computing self-attention, as illustrated in the figure below. A masking mechanism is then applied so that attention stays confined within each original window.
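The cyclic shift itself is just a roll of the feature map; a minimal sketch follows. The attention mask and the reverse shift that the real block applies afterwards are only mentioned in the comments, not implemented here.

```python
import torch

def cyclic_shift(x, M):
    """Roll the (B, H, W, C) feature map by (-M//2, -M//2) so the shifted
    windows again tile the map into regular M x M blocks. In the real block,
    an attention mask keeps tokens from attending across their original
    window boundaries, and a roll by (+M//2, +M//2) undoes the shift."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))

feat = torch.randn(1, 56, 56, 96)    # (B, H, W, C) feature map
shifted = cyclic_shift(feat, 7)
print(shifted.shape)                 # torch.Size([1, 56, 56, 96])
```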

The paper shows that the shifted window approach, implemented with this simple trick, achieves lower real-world latency than Performer, one of the fastest Transformer architectures. Reducing inference time matters because it can later be traded for accuracy by using larger networks.

Experiments

The paper evaluates four model variants: Swin-T, Swin-S, Swin-B, and Swin-L (in increasing order of scale) on image classification, object detection, and semantic segmentation. Details on hyperparameters, training configurations for each task, and the architecture variants are provided in the original paper. The table below summarizes the scale of Swin Transformers compared to other ImageNet CNNs and Transformers; the complexity of Swin-T and Swin-S is similar to that of ResNet-50 and ResNet-101, respectively.

Image classification on ImageNet

On regular ImageNet (table a above), Swin Transformers outperform DeiT by about 1.5% top-1 accuracy and offer a slightly better speed-accuracy trade-off than RegNet and EfficientNet. The authors point out that those CNNs were obtained via architecture search, so the hand-designed Swin Transformer likely has even more room for improvement.

Pre-training on ImageNet-22K before fine-tuning brings roughly 2% additional accuracy. Larger model variants also contribute significantly to performance.

Object detection

The Swin Transformer is tested as a backbone for four object detection frameworks: Cascade Mask R-CNN, ATSS, RepPoints v2, and Sparse R-CNN, on the COCO object detection dataset.

Table (a) compares the Swin-T model with ResNet-50. Table (b) compares Swin Transformers with ResNeXt under different model capacities on Cascade Mask R-CNN. In both cases, Swin Transformers perform significantly better than previous CNN and vision Transformer backbones for object detection.

Semantic segmentation on ADE20K

Since I am not very familiar with semantic segmentation, I can't say much about how the backbone is integrated into the framework (the paper uses UperNet), but Swin Transformers outperform previous works and run very efficiently on ADE20K semantic segmentation.

Summary

Swin Transformers are all about integrating the characteristic advantages of CNNs in vision with the efficient, powerful architecture of Transformers. The paper proposes a hierarchical representation to achieve scale-invariance, and the shifted window approach to compute self-attention efficiently within local windows while still propagating information across them.
