Segformer B0 Demonstrates Powerful and Lightweight Multitask Performance


Category : Segformer

The application of transformers to computer vision tasks is a relatively recent phenomenon [1], but it is now a rapidly developing area of research likely to keep gaining popularity thanks to compelling performance, improving architectures, and computational innovations that avoid the $O(n^2)$ complexity of conventional self-attention [2, 3]. Self-attention enables transformers to capture global dependencies in a way that CNNs traditionally cannot, as evidenced by previous efforts to incorporate self-attention into CNNs [4, 5] or to enlarge their receptive fields through dilation [6, 7]. This makes transformers a natural choice for vision tasks in domains where global context carries significant semantic and geometric meaning, such as autonomous navigation, and they may soon challenge the dominance of CNNs in perception stacks.

Segformer may be seen as a harbinger of this revolution. It demonstrates impressive semantic segmentation performance with lightweight designs in both the encoder and decoder, which simultaneously improve model efficiency and flexibility, and it adopts the efficient self-attention mechanism from [3], which reduces the complexity of self-attention from $O(n^2)$ to $O(\frac{n^2}{R})$ using a sequence reduction ratio $R$. This makes Segformer an attractive option for semantic segmentation: it provides the global dependencies and contextual awareness of vision transformers while avoiding their earlier weaknesses, and it comes in six model sizes (B0-B5), so practitioners can choose an accuracy/efficiency balance suited to their application.
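For readers curious about the mechanism, below is a minimal PyTorch sketch of the sequence-reduction attention that Segformer adopts from [3]: a strided convolution shrinks the key/value sequence by a reduction ratio before attention is computed. The module and argument names (`EfficientSelfAttention`, `sr_ratio`) are my own for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Illustrative sketch of sequence-reduction self-attention.

    Keys and values are spatially downsampled by sr_ratio before attention,
    cutting the cost from O(n^2) to roughly O(n^2 / R) with R = sr_ratio**2.
    """

    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided conv shrinks the (H, W) grid that keys/values see.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N = H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # N -> N / sr_ratio^2
            x_ = self.norm(x_)
        else:
            x_ = x
        k, v = (
            self.kv(x_)
            .reshape(B, -1, 2, self.num_heads, C // self.num_heads)
            .permute(2, 0, 3, 1, 4)
        )
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)  # (B, heads, N, N/R)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the attention matrix is only $N \times \frac{N}{R}$ rather than $N \times N$, the early, high-resolution stages of the encoder stay affordable.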

Making things even more exciting, the authors of the Global-Local Path Network (GLPN) showed that the same hierarchical transformer encoder used in Segformer can be paired with an efficient decoding head for monocular depth estimation, producing models with head-turning performance and generalization ability. It is therefore relatively simple surgery to construct a multi-task Segformer that predicts semantic segmentation and depth simultaneously, so I set about doing this for my recent work with CARLA and the SHIFT dataset, and the results are impressive.
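To make the "surgery" concrete, here is a rough sketch of how such a multi-task model can be wired up with the Hugging Face transformers library: a shared Segformer encoder feeding both the standard all-MLP segmentation head and a depth head. The `MultiTaskSegformer` wrapper and its simplified depth head are my own illustrative stand-ins; the real GLPN decoder uses progressive upsampling with selective feature fusion rather than the bare convolutional regressor shown here.

```python
import torch.nn as nn
from transformers import SegformerConfig, SegformerModel, SegformerDecodeHead

class MultiTaskSegformer(nn.Module):
    """Sketch: shared Segformer-B0 encoder with segmentation and depth heads."""

    def __init__(self, num_labels, max_depth=80.0):
        super().__init__()
        config = SegformerConfig.from_pretrained("nvidia/mit-b0", num_labels=num_labels)
        self.encoder = SegformerModel.from_pretrained("nvidia/mit-b0", config=config)
        self.seg_head = SegformerDecodeHead(config)
        self.max_depth = max_depth  # assumed depth cap in meters
        # Simplified depth head: regress depth from the finest feature map.
        self.depth_head = nn.Sequential(
            nn.Conv2d(config.hidden_sizes[0], 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid(),  # normalized depth in (0, 1), scaled by max_depth
        )

    def forward(self, pixel_values):
        # hidden_states holds the four multi-scale feature maps, one per stage.
        feats = self.encoder(pixel_values, output_hidden_states=True).hidden_states
        seg_logits = self.seg_head(feats)                    # (B, num_labels, H/4, W/4)
        depth = self.depth_head(feats[0]) * self.max_depth   # (B, 1, H/4, W/4)
        return seg_logits, depth
```

Both heads emit predictions at 1/4 of the input resolution, so they are bilinearly upsampled to the label resolution for training and inference.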

Since one of the objectives of the project is to keep things as light as possible, I chose the B0 (smallest) version of the Segformer architecture, which has only 3.7 million parameters. First, the model was trained for 95,000 steps (5 epochs) on semantic segmentation alone, with a linear learning rate schedule decaying from 6e-5 to zero, using the front camera of the SHIFT dataset's discrete/images training set. Then the GLPN depth head was added and the model was trained for another 5 epochs on both tasks with a starting learning rate of 5e-5, which improved performance on both tasks. After further fine-tuning, the final evaluation scores were a mean IoU of 0.828 and a SiLog loss of 3.07. For a full breakdown of the training process and results, refer to the Weights & Biases report embedded at the bottom of this post. The code is available on GitHub.
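As a sketch of what one training step looks like under this setup, the snippet below pairs a cross-entropy segmentation loss with the scale-invariant log (SiLog) depth loss and the linear decay schedule described above, reusing the `MultiTaskSegformer` sketch from earlier. The `silog_loss` and `train_step` helpers and the class-count placeholder are hypothetical illustrations (the λ = 0.5 variance term follows the Hugging Face GLPN implementation), not the exact training code.

```python
import torch
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup

def silog_loss(pred, target, lambd=0.5, eps=1e-6):
    # Scale-invariant log loss on valid (positive-depth) pixels, mirroring
    # the SiLogLoss used in the Hugging Face GLPN implementation.
    mask = target > 0
    diff_log = torch.log(target[mask] + eps) - torch.log(pred[mask] + eps)
    return torch.sqrt((diff_log ** 2).mean() - lambd * diff_log.mean() ** 2)

NUM_CLASSES = 22  # placeholder; set to the SHIFT semantic class count
model = MultiTaskSegformer(num_labels=NUM_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
# Linear decay to zero over 95,000 steps, no warmup.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=95_000
)

def train_step(batch):
    seg_logits, depth = model(batch["pixel_values"])
    # Heads predict at 1/4 resolution; upsample to label resolution for the losses.
    seg_logits = F.interpolate(
        seg_logits, size=batch["labels"].shape[-2:], mode="bilinear", align_corners=False
    )
    depth = F.interpolate(
        depth, size=batch["depth"].shape[-2:], mode="bilinear", align_corners=False
    )
    loss = F.cross_entropy(seg_logits, batch["labels"]) + silog_loss(
        depth.squeeze(1), batch["depth"]
    )
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```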

Below are videos of the model performing in the various driving conditions offered by the dataset. In the easier daytime example, performance is very strong with little error, but as the operational domain becomes more challenging, with less light and more adverse weather, performance degrades. Even in the most challenging settings, where the video is difficult to parse even for the human eye, the model still captures most of the scene structure, which can likely be attributed in part to the spatial reasoning capabilities of transformers. Keeping in mind that just 4.07M parameters are jointly estimating two tasks on full-resolution images, these results look quite good, and a larger Segformer backbone would likely improve them further.


Inference Videos

Clear weather scene:

Overcast scene:

Rain and fog scene:

Rainy night scene:

Foggy forest scene:


Project Report

