Exploring Cross-Image Pixel Contrast for Semantic Segmentation

Code: https://github.com/tfzhou/ContrastiveSeg

Main idea. Current segmentation models learn to map pixels (b) to an embedding space (c), yet ignore the intrinsic structure of the labeled data (i.e., inter-image relations among pixels from the same class, marked with the same color in (b)). Pixel-wise contrastive learning is introduced to foster a new training paradigm (d) by explicitly addressing intra-class compactness and inter-class dispersion: each pixel (embedding) i is pulled closer to pixels of the same class and pushed away from pixels of other classes. A better-structured embedding space (e) is thus derived, eventually boosting the performance of segmentation models.

Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore the "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by recent advances in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HRNet), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.

Current semantic segmentation models focus on mining local context, e.g., dependencies between pixels within a single image via context-aggregation modules, or structure-aware optimization criteria (e.g., an IoU-like loss). However, they ignore the global context of the training data, i.e., the semantic relations between pixels across different images.

This paper proposes a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to force pixel embeddings belonging to the same semantic class to be more similar than those from different classes. This yields a pixel-level metric learning paradigm, realized by explicitly exploring the structure of labeled pixels.
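Below is a minimal PyTorch sketch of this kind of supervised, pixel-wise InfoNCE-style loss. The function and argument names are hypothetical, and the paper's anchor sampling and cross-image memory bank are omitted; this only illustrates the pull-together/push-apart objective over labeled pixel embeddings.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, ignore_index=255):
    """embeddings: (N, D) pixel embeddings sampled across a batch (possibly
    from several images); labels: (N,) ground-truth class id per pixel."""
    valid = labels != ignore_index
    emb = F.normalize(embeddings[valid], dim=1)       # (M, D), unit length
    lab = labels[valid]                               # (M,)

    sim = emb @ emb.t() / temperature                 # (M, M) scaled cosine sims
    diag = torch.eye(lab.numel(), dtype=torch.bool, device=emb.device)
    pos_mask = (lab.unsqueeze(0) == lab.unsqueeze(1)) & ~diag  # same-class pairs

    sim = sim.masked_fill(diag, float('-inf'))        # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_cnt = pos_mask.sum(1)
    has_pos = pos_cnt > 0                             # anchors with >=1 positive
    sum_log_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    # average over positives per anchor, then over anchors
    return -(sum_log_pos[has_pos] / pos_cnt[has_pos]).mean()

# usage: flatten a (B, D, H, W) embedding map and its (B, H, W) labels
emb_map = torch.randn(2, 256, 16, 16)
gt = torch.randint(0, 19, (2, 16, 16))
loss = pixel_contrastive_loss(
    emb_map.permute(0, 2, 3, 1).reshape(-1, 256), gt.reshape(-1))
```

In practice, computing this over all H×W pixels is expensive, which is why the paper samples anchors and draws cross-image positives/negatives from a memory bank rather than from the current batch alone.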

The proposed method can be effortlessly incorporated into existing segmentation frameworks, and it incurs no extra overhead at test time.
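To make the "no test-time overhead" point concrete, here is a hedged sketch of one way the integration can look (module names and head structure are assumptions, not the released code): a small projection head produces the embeddings for the contrastive loss during training and is simply not executed at inference, so the deployed network is unchanged.

```python
import torch.nn as nn

class SegWithContrastBranch(nn.Module):
    """Hypothetical wrapper: the projection head exists only for the
    training-time contrastive loss and is skipped at inference."""
    def __init__(self, backbone, seg_head, feat_dim=512, proj_dim=256):
        super().__init__()
        self.backbone = backbone      # any trunk, e.g., ResNet / HRNet
        self.seg_head = seg_head      # the usual per-pixel classifier
        self.proj_head = nn.Sequential(   # training-only branch
            nn.Conv2d(feat_dim, feat_dim, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, proj_dim, 1),
        )

    def forward(self, x):
        feat = self.backbone(x)
        logits = self.seg_head(feat)
        if self.training:
            # embeddings feed the contrastive loss alongside the CE loss
            return logits, self.proj_head(feat)
        return logits                 # same test-time cost as the base model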
