Exploring Cross-Image Pixel Contrast for Semantic Segmentation


Main idea. Current segmentation models learn to map pixels (b) to an embedding space (c), yet ignoring intrinsic structures of labeled data (i.e., inter-image relations among pixels from a same class, noted with same color in(b)). Pixel-wise contrastive learning is introduced to foster a new training paradigm (d), by explicitly addressing intra-class compactness and inter-class dispersion. Each pixel (embedding) i is pulled closer to pixels of the same class, but pushed far from pixels from other classes. Thus a better-structured embedding space (e) is de- rived, eventually boosting the performance of segmentation models.

Current semantic segmentation methods focus only on mining “local” context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure- aware optimization criteria (e.g., IoU-like loss). However, they ignore “global” context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frame- works without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3,HRNet,OCR) and backbones(i.e., ResNet, HR- Net), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL- Context, COCO-Stuff). We expect this work will encourage our community to rethink the current defacto training paradigm in fully supervised semantic segmentation1.

当前的语义分割模型关注挖掘局部上下文,例如:单个图像中像素之间的依赖,或者结构-感知的优化策略(IoU-like loss)。然而,他们忽略了训练数据中的全局上下文,例如,不同图像中限速之间的语义关系。




邮箱地址不会被公开。 必填项已用*标注