Author archive: Jessica Chan

Exploring Cross-Image Pixel Contrast for Semantic Segmentation


Main idea. Current segmentation models learn to map pixels (b) to an embedding space (c), yet ignore the intrinsic structure of labeled data (i.e., inter-image relations among pixels of the same class, shown in the same color in (b)). Pixel-wise contrastive learning is introduced to foster a new training paradigm (d) that explicitly addresses intra-class compactness and inter-class dispersion. Each pixel (embedding) i is pulled closer to pixels of the same class and pushed away from pixels of other classes. Thus a better-structured embedding space (e) is derived, eventually boosting the performance of segmentation models.

Current semantic segmentation methods focus only on mining “local” context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore “global” context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HRNet), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.

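Current semantic segmentation models focus on mining local context, for example the dependencies between pixels within a single image, or on structure-aware optimization criteria (IoU-like loss). However, they ignore the global context of the training data, for example the semantic relations between pixels in different images.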
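The core idea above (embeddings of same-class pixels pulled together, embeddings of different-class pixels pushed apart) can be illustrated with a generic InfoNCE-style loss. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function name `pixel_contrast_loss`, the temperature value, and the choice to contrast each positive pair against all negatives of the anchor are illustrative, and the exact formulation and pixel sampling strategy in the paper may differ.

```python
import numpy as np

def pixel_contrast_loss(embeddings, labels, temperature=0.1):
    """InfoNCE-style supervised contrastive loss over pixel embeddings.

    embeddings: (N, D) pixel embeddings, possibly sampled from several images.
    labels:     (N,) integer class label of each pixel.
    Assumes at least one class occurs twice, so that positive pairs exist.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                              # (N, N)

    same_class = labels[:, None] == labels[None, :]
    positives = same_class & ~np.eye(len(labels), dtype=bool)
    negatives = ~same_class

    # Sum of exp(similarity) over the negatives of each anchor pixel.
    neg_sum = np.where(negatives, np.exp(sim), 0.0).sum(axis=1, keepdims=True)

    # -log( exp(s_ip) / (exp(s_ip) + neg_sum_i) ) for every (anchor, positive) pair.
    pair_loss = np.log(np.exp(sim) + neg_sum) - sim
    return float(pair_loss[positives].mean())
```

Because the embeddings and labels can come from any number of images in a batch, the positives and negatives of an anchor pixel are not restricted to its own image, which is the cross-image aspect the abstract emphasises.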




In this paper, a hierarchical attention network is proposed to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed approach, frame-level encoder and attention are applied on segments of an input utterance and generate individual segment vectors. Then, segment level attention is applied on the segment vectors to construct an utterance representation. To evaluate the effectiveness of the proposed approach, the data of the NIST SRE2008 Part1 is used for training, and two datasets, the Switchboard Cellular (Part1) and the CallHome American English Speech, are used to evaluate the quality of extracted utterance embeddings on speaker identification and verification tasks. In comparison with two baselines, X-vectors and X-vectors+Attention, the obtained results show that the use of H-vectors can achieve a significantly better performance. Furthermore, the learned utterance-level embeddings are more discriminative than the two baselines when mapped into a 2D space using t-SNE.
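The two-level pooling described above (frame-level attention inside each segment, then segment-level attention over the segment vectors) can be sketched as follows. This is only an illustration under assumptions: the frame-level encoder is omitted, dot-product attention against a query vector stands in for whatever attention the paper uses, the segments are fixed-length and non-overlapping, and all names (`attention_pool`, `utterance_embedding`, `frame_query`, `segment_query`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(vectors, query):
    """Weighted average of the rows of `vectors`; weights are softmax(vectors @ query)."""
    weights = softmax(vectors @ query)
    return weights @ vectors

def utterance_embedding(frames, segment_len, frame_query, segment_query):
    """frames: (T, D) frame-level features of one utterance."""
    # Split the utterance into fixed-length segments (trailing frames are dropped).
    n_segments = len(frames) // segment_len
    segments = frames[: n_segments * segment_len].reshape(n_segments, segment_len, -1)

    # Local step: one vector per segment from frame-level attention.
    segment_vectors = np.stack([attention_pool(s, frame_query) for s in segments])

    # Global step: one utterance vector from segment-level attention.
    return attention_pool(segment_vectors, segment_query)
```

With both queries set to zero the attention weights become uniform and the result reduces to plain mean pooling, which makes a convenient sanity check.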




See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR2019)


We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better.

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.




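  • idea:
    • The authors propose a co-attention mechanism that takes a global view of the whole video sequence to improve UVOS accuracy. (According to the figures on the DAVIS website, it does outperform many current models.) Earlier methods either find the target to segment through saliency detection or rely on optical flow computed between a limited number of frames. COSNet instead considers the entire video sequence when deciding which object should be segmented. At test time, COSNet combines the information obtained from all preceding frames to infer which object in the current frame is both salient and frequently recurring. The co-attention module mines the rich contextual information between video frames. Building on co-attention, the authors propose COSNet (co-attention Siamese network) to model UVOS from a global perspective. What this global perspective means may still be unclear at this point; it is explained in the method section.
  • contribution:
    • COSNet is trained on pairs, each consisting of any two frames from the same video, which greatly increases the amount of training data. Temporal order does not need to be respected and the frames do not have to be fed in sequentially; the data can be shuffled and combined at random.
    • Explicitly models the relations between frames, without relying on optical flow.
    • A unified, end-to-end trainable, and efficient network.
  • unsupervised:
    • In UVOS, "unsupervised" means that the foreground object is not given and the network decides automatically which object is the foreground. It does not carry the traditional meaning that labels take no part in training.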
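To make the co-attention idea in these notes more concrete, here is a minimal NumPy sketch of a bilinear co-attention between the features of two frames. It is written under assumptions and is not the COSNet implementation: the feature maps are taken to be already flattened to shape (C, N), the affinity is assumed to be a learned bilinear form, and the names `co_attention`, `feat_a`, `feat_b`, and `weight` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(feat_a, feat_b, weight):
    """feat_a: (C, Na) and feat_b: (C, Nb) flattened feature maps; weight: (C, C)."""
    # Affinity between every location of frame b and every location of frame a.
    affinity = feat_b.T @ weight @ feat_a                  # (Nb, Na)

    # For each location in a, attend over the locations of b (columns sum to 1).
    attended_a = feat_b @ softmax(affinity, axis=0)        # (C, Na)

    # For each location in b, attend over the locations of a (rows sum to 1).
    attended_b = feat_a @ softmax(affinity, axis=1).T      # (C, Nb)
    return attended_a, attended_b
```

Since the affinity is computed between any two frames, the pair does not have to be adjacent in time, which matches the note above that training frames can be shuffled and paired at random.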

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.
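A mixture-of-experts posterior combines the per-modality encoders as an equally weighted mixture, so drawing a latent sample amounts to picking one modality's posterior and sampling from it. The sketch below illustrates that reading only; Gaussian experts, uniform weights of 1/M, and the name `sample_moe_posterior` are assumptions made here, and the paper's choice of distribution and sampling scheme may differ.

```python
import numpy as np

def sample_moe_posterior(means, stds, n_samples, rng):
    """Sample from an equally weighted mixture of M diagonal Gaussian experts.

    means, stds: (M, D) parameters of the unimodal posteriors, one row per modality.
    Returns the (n_samples, D) samples and the index of the expert behind each one.
    """
    # Pick one expert per sample with uniform probability 1/M.
    experts = rng.integers(len(means), size=n_samples)
    # Reparameterised draw from the chosen expert.
    noise = rng.standard_normal((n_samples, means.shape[1]))
    return means[experts] + stds[experts] * noise, experts
```

A sample that originates from one modality's expert can then be decoded into every modality, which is one way to picture the cross-generation criterion listed above.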








Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.








ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation

Method overview: in each round of active selection, we first train a semantic segmentation network on the existing labeled data. Second, we use the trained network to compute a view entropy and a view divergence score for each unlabeled superpixel. We then select a batch of superpixels based on these scores, and finally request their respective labels from the oracle. This is repeated until the labeling budget is exhausted or all training data is labeled.

We propose ViewAL, a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. Our core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well irrespective of the viewpoint under which objects are observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose uncertainty computations on a superpixel level, which exploits inherently localized signal in the segmentation task, directly lowering the annotation costs. This combination of viewpoint entropy and the use of superpixels allows to efficiently select samples that are highly informative for improving the network. We demonstrate that our proposed active learning strategy not only yields the best-performing models for the same amount of required labeled data, but also significantly reduces labeling effort. Our method achieves 95% of maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method achieves the same performance with 14%, 27% and 33% labeled data. Finally, we demonstrate that labeling using superpixels yields the same quality of ground-truth compared to labeling whole images, but requires 25% less time.

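We propose ViewAL, a new active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. The core idea is that inconsistencies between model predictions from different viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well regardless of the viewpoint from which an object is observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose computing uncertainty at the superpixel level, which exploits the inherently localized signal in the segmentation task and directly lowers annotation cost. Combining viewpoint entropy with superpixels makes it possible to efficiently select highly informative samples for improving the network. We show that the proposed active learning strategy not only yields the best-performing models for the same amount of labeled data, but also significantly reduces labeling effort. Our method reaches 95% of the maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method needs 14%, 27%, and 33% labeled data to reach the same performance. Finally, we show that labeling with superpixels yields ground truth of the same quality as labeling whole images, but takes 25% less time.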
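One plausible reading of the viewpoint-based uncertainty is sketched below: the class distributions predicted for the same surface point from several viewpoints are averaged, and the entropy of that average is used as the score. This is an assumption made for illustration, not the paper's exact formulation; the cross-view projection that brings predictions into correspondence is omitted, and the name `view_entropy` is hypothetical.

```python
import numpy as np

def view_entropy(view_probs, eps=1e-12):
    """Entropy of the mean class distribution across viewpoints.

    view_probs: (V, C) class probabilities predicted for the same surface point
                from V different viewpoints (each row sums to 1).
    """
    mean_prob = view_probs.mean(axis=0)
    # eps guards against log(0) for classes that no view predicts.
    return float(-(mean_prob * np.log(mean_prob + eps)).sum())
```

When all views agree confidently the score is close to zero, and when the views confidently disagree it grows, so the points or superpixels with the highest scores are natural candidates to send to the oracle.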

Variational Adversarial Active Learning

Our model learns the distribution of labeled data in a latent space using a VAE optimized using both reconstruction and adversarial losses. A binary classifier predicts unlabeled examples and sends them to an oracle for annotations. The VAE is trained to fool the adversarial network to believe that all the examples are from the labeled data while the adversarial classifier is trained to differentiate labeled from unlabeled samples.


Active learning aims to develop label-efficient algorithms by sampling the most representative queries to be labeled by an oracle. We describe a pool-based semi-supervised active learning algorithm that implicitly learns this sampling mechanism in an adversarial manner. Our method learns a latent space using a variational autoencoder (VAE) and an adversarial network trained to discriminate between unlabeled and labeled data. The mini-max game between the VAE and the adversarial network is played such that while the VAE tries to trick the adversarial network into predicting that all data points are from the labeled pool, the adversarial network learns how to discriminate between dissimilarities in the latent space. We extensively evaluate our method on various image classification and semantic segmentation benchmark datasets and establish a new state of the art on CIFAR10/100, Caltech-256, ImageNet, Cityscapes, and BDD100K. Our results demonstrate that our adversarial approach learns an effective low dimensional latent space in large-scale settings and provides for a computationally efficient sampling method.
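The sampling step implied by this setup can be sketched as follows: once the adversarial network outputs, for every unlabeled sample, the probability that it comes from the labeled pool, the samples it is most confident are unlabeled are sent to the oracle. This is a minimal sketch under that assumption; the function name `select_for_labeling` and the fixed per-round `budget` are hypothetical, and training of the VAE and the discriminator is not shown.

```python
import numpy as np

def select_for_labeling(labeled_prob, budget):
    """Pick the unlabeled samples that look least like the labeled pool.

    labeled_prob: (N,) discriminator probability that each unlabeled sample
                  belongs to the labeled pool.
    budget:       number of samples to send to the oracle in this round.
    Returns the indices of the selected samples, lowest probability first.
    """
    return np.argsort(labeled_prob)[:budget]
```

For example, with probabilities `[0.9, 0.1, 0.5, 0.2]` and a budget of 2, the samples at indices 1 and 3 are selected.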