
Exploring Cross-Image Pixel Contrast for Semantic Segmentation

Code: https://github.com/tfzhou/ContrastiveSeg

Main idea. Current segmentation models learn to map pixels (b) to an embedding space (c), yet ignoring intrinsic structures of labeled data (i.e., inter-image relations among pixels from a same class, noted with same color in (b)). Pixel-wise contrastive learning is introduced to foster a new training paradigm (d), by explicitly addressing intra-class compactness and inter-class dispersion. Each pixel (embedding) i is pulled closer to pixels of the same class, but pushed far from pixels from other classes. Thus a better-structured embedding space (e) is derived, eventually boosting the performance of segmentation models.

Current semantic segmentation methods focus only on mining “local” context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore “global” context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HRNet), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.

Current semantic segmentation models focus on mining local context, e.g., dependencies between pixels within a single image (via context-aggregation modules) or structure-aware optimization criteria (IoU-like losses). However, they ignore the global context of the training data, i.e., the semantic relations between pixels across different images.

This paper proposes a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting. The core idea is to force pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. This yields a pixel-wise metric learning paradigm, realized by explicitly exploring the structure of labeled pixels.

The proposed method can be effortlessly incorporated into existing segmentation frameworks without any extra overhead at test time.
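To make the pixel-wise contrastive idea concrete, here is a minimal, hedged sketch of a supervised InfoNCE-style loss over sampled pixel embeddings; the function name, temperature value, and sampling strategy are illustrative assumptions, not the authors' exact implementation (see the linked ContrastiveSeg repository for that).

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised InfoNCE-style loss over sampled pixel embeddings (sketch).

    embeddings: (N, D) pixel embeddings sampled from one or more images.
    labels:     (N,)   semantic class of each sampled pixel.
    """
    emb = F.normalize(embeddings, dim=1)            # work in cosine-similarity space
    sim = emb @ emb.t() / temperature               # (N, N) similarity logits
    n = emb.size(0)

    # Positives: pixels sharing the same class label (self excluded).
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    self_mask = torch.eye(n, device=emb.device)
    pos_mask = pos_mask - self_mask

    # Log-softmax over all other pixels; the diagonal is pushed to -inf.
    logits = sim - 1e9 * self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-likelihood of positives per anchor, skipping anchors
    # with no positive in the sampled batch.
    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0
    loss = -(pos_mask * log_prob).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()
```

In practice this loss would be added to the standard cross-entropy term, with pixel embeddings drawn both from the current image and from a memory of other images to realize the cross-image contrast.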

H-VECTORS: UTTERANCE-LEVEL SPEAKER EMBEDDING USING A HIERARCHICAL ATTENTION MODEL (ICASSP 2020)

In this paper, a hierarchical attention network is proposed to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed approach, frame-level encoder and attention are applied on segments of an input utterance and generate individual segment vectors. Then, segment level attention is applied on the segment vectors to construct an utterance representation. To evaluate the effectiveness of the proposed approach, the data of the NIST SRE2008 Part1 is used for training, and two datasets, the Switchboard Cellular (Part1) and the CallHome American English Speech, are used to evaluate the quality of extracted utterance embeddings on speaker identification and verification tasks. In comparison with two baselines, X-vectors and X-vectors+Attention, the obtained results show that the use of H-vectors can achieve a significantly better performance. Furthermore, the learned utterance-level embeddings are more discriminative than the two baselines when mapped into a 2D space using t-SNE.

utterance: a spoken segment (a single recording of speech, not the speaker)

The authors propose a hierarchical attention network to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance contribute differently to speaker identity, the hierarchical structure is used to learn speaker-related information both locally and globally.

In the proposed approach, a frame-level encoder and attention are applied to segments of the input utterance to generate individual segment vectors. Then segment-level attention is applied to the segment vectors to construct an utterance-level representation.
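As a rough illustration of this two-level attention, the following sketch applies attentive pooling over frames within each segment and then over the resulting segment vectors; the encoder choice (a GRU), hidden sizes, and class names are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Weighted average over the time axis using a learned scoring vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                            # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)      # (batch, time, 1)
        return (w * x).sum(dim=1)                    # (batch, dim)

class HVectorSketch(nn.Module):
    """Frame-level attention within segments, then segment-level attention."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.frame_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.frame_pool = AttentivePool(hidden)
        self.segment_pool = AttentivePool(hidden)

    def forward(self, segments):                     # (batch, n_seg, n_frames, feat_dim)
        b, s, t, d = segments.shape
        frames, _ = self.frame_enc(segments.reshape(b * s, t, d))
        seg_vecs = self.frame_pool(frames).reshape(b, s, -1)  # one vector per segment
        return self.segment_pool(seg_vecs)           # utterance-level embedding (b, hidden)
```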

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR2019)


We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better.

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.

This paper emphasizes the inherent correlation among video frames and incorporates a global co-attention mechanism to improve deep-learning-based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

By jointly computing and appending co-attention responses into a joint feature space, the co-attention layers provide efficient stages for capturing global correlations and scene context.

COSNet is trained on pairs of video frames, which naturally augments the training data and increases learning capacity. During segmentation, the co-attention model encodes useful information by processing multiple reference frames together, and this information is leveraged to better infer frequently reappearing and salient foreground objects.

  • idea:
    • The authors propose a co-attention mechanism that improves UVOS accuracy from the global perspective of a whole video sequence (it does lead many current models on the DAVIS leaderboard). Previous methods obtain the object to segment via saliency detection or via optical flow computed between a limited number of frames, whereas COSNet considers the whole video sequence when deciding which object should be segmented. At test time, COSNet aggregates information from all preceding frames to infer which object in the current frame is both salient and frequently reappearing. The co-attention module mines the rich contextual information between video frames; building on it, the authors construct COSNet (co-attention Siamese network) to model UVOS from a global view, which is made precise in the method section. A rough sketch of the co-attention computation follows this list.
  • contribution:
    • COSNet is trained on pairs consisting of any two frames from the same video, which greatly increases the amount of training data; frames need not be fed in temporal order, so they can be shuffled and combined at random.
    • Frame-to-frame relations are modeled explicitly, without relying on optical flow.
    • A unified, end-to-end trainable, and efficient network.
  • unsupervised:
    • In UVOS, "unsupervised" means that no foreground object is specified and the network decides on its own which object is the foreground; it does not mean that labels are excluded from training in the conventional sense.
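As a rough sketch of the vanilla co-attention operation, the affinity between two frames' feature maps can be computed through a learned weight matrix and used to attend each frame with features from the other; the paper additionally derives symmetric, channel-wise, and gated variants that are not shown, and the class and tensor names below are illustrative.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Vanilla co-attention between two frame feature maps (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Parameter(torch.eye(channels))  # learnable W in S = Fa^T W Fb

    def forward(self, feat_a, feat_b):                   # (B, C, H, W) each
        b, c, h, w = feat_a.shape
        fa = feat_a.reshape(b, c, h * w)                 # (B, C, N)
        fb = feat_b.reshape(b, c, h * w)
        affinity = fa.transpose(1, 2) @ self.weight @ fb # (B, N, N)

        # Attend each frame with information from the other frame.
        attn_a = fb @ torch.softmax(affinity, dim=2).transpose(1, 2)  # (B, C, N)
        attn_b = fa @ torch.softmax(affinity, dim=1)                  # (B, C, N)
        za = torch.cat([attn_a.reshape(b, c, h, w), feat_a], dim=1)
        zb = torch.cat([attn_b.reshape(b, c, h, w), feat_b], dim=1)
        return za, zb                                    # (B, 2C, H, W) each
```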

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.

The motivation for learning generative models spanning multiple data modalities, such as vision and language, is to learn more useful, generalisable representations that capture common underlying factors between the modalities. This work characterises successful learning of such models as fulfilling four criteria:

(1) implicit latent decomposition into shared and private subspaces

(2) coherent joint generation over all modalities

(3) coherent cross-generation across individual modalities

(4) improved learning of individual modalities through multi-modal integration

The authors propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset.
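A hedged sketch of the mixture-of-experts joint posterior, assuming equal mixture weights and Gaussian unimodal posteriors; the function name and the use of torch.distributions are illustrative, and the paper's importance-weighted objective and encoder/decoder architectures are omitted.

```python
import torch
import torch.distributions as dist

def moe_joint_posterior(mus, logvars):
    """Mixture-of-experts joint posterior q(z | x_1..x_M) (sketch).

    mus, logvars: lists of (B, D) tensors, one pair per modality encoder.
    Returns a MixtureSameFamily distribution with equal weights 1/M.
    """
    mu = torch.stack(mus, dim=1)                          # (B, M, D)
    std = torch.exp(0.5 * torch.stack(logvars, dim=1))
    components = dist.Independent(dist.Normal(mu, std), 1)  # M Gaussians per item
    weights = dist.Categorical(logits=torch.zeros(mu.shape[:2], device=mu.device))
    return dist.MixtureSameFamily(weights, components)
```

In the paper, latents sampled from each unimodal expert are decoded into every modality, which is what supports coherent cross-generation between modalities.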

LEARNING FACTORIZED MULTIMODAL REPRESENTATIONS

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

Multiple modalities provide additional valuable information, but learning from multimodal data still poses two key challenges:

(1) models must learn the complex intra-modal and cross-modal interactions needed for prediction

(2) models must be robust to unexpected missing or noisy modalities at test time

This paper proposes to optimize a joint generative-discriminative objective across multimodal data and labels. The proposed model factorizes representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors.

Multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as sentiment prediction.

Modality-specific generative factors are unique to each modality and contain the information required for generating the data.

Experiments show that the model learns meaningful multimodal representations and achieves competitive performance on six multimodal datasets. By conditioning on independent factors, the model demonstrates flexible generative capabilities and can reconstruct missing modalities without significantly hurting performance.
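A minimal sketch of the factorization, assuming simple linear encoders/decoders and mean fusion for the shared factor; the module and dimension names are hypothetical, and the paper's inference networks and full generative-discriminative objective are not reproduced here.

```python
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    """Shared discriminative factor + per-modality generative factors (sketch)."""
    def __init__(self, dims, d_shared=64, d_private=32, n_classes=2):
        super().__init__()
        # dims: list of input feature sizes, one per modality.
        self.shared_enc = nn.ModuleList(nn.Linear(d, d_shared) for d in dims)
        self.private_enc = nn.ModuleList(nn.Linear(d, d_private) for d in dims)
        self.decoders = nn.ModuleList(
            nn.Linear(d_shared + d_private, d) for d in dims)
        self.classifier = nn.Linear(d_shared, n_classes)

    def forward(self, xs):                      # xs: list of (B, dims[m]) tensors
        # Multimodal discriminative factor: fuse the per-modality projections.
        shared = torch.stack([enc(x) for enc, x in zip(self.shared_enc, xs)]).mean(0)
        # Modality-specific generative factors, one per modality.
        privates = [enc(x) for enc, x in zip(self.private_enc, xs)]
        # Each modality is reconstructed from the shared + its private factor.
        recons = [dec(torch.cat([shared, p], dim=1))
                  for dec, p in zip(self.decoders, privates)]
        logits = self.classifier(shared)        # discriminative task (e.g. sentiment)
        return logits, recons
```

Reconstructing a missing modality then amounts to decoding from the shared factor together with a prior sample of that modality's private factor.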

ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation

Method overview: in each round of active selection, we first train a semantic segmentation network on the existing labeled data. Second, we use the trained network to compute a view entropy and a view divergence score for each unlabeled superpixel. We then select a batch of superpixels based on these scores, and finally request their respective labels from the oracle. This is repeated until the labeling budget is exhausted or all training data is labeled.

We propose ViewAL, a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. Our core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well irrespective of the viewpoint under which objects are observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose uncertainty computations on a superpixel level, which exploits inherently localized signal in the segmentation task, directly lowering the annotation costs. This combination of viewpoint entropy and the use of superpixels allows to efficiently select samples that are highly informative for improving the network. We demonstrate that our proposed active learning strategy not only yields the best-performing models for the same amount of required labeled data, but also significantly reduces labeling effort. Our method achieves 95% of maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method achieves the same performance with 14%, 27% and 33% labeled data. Finally, we demonstrate that labeling using superpixels yields the same quality of ground-truth compared to labeling whole images, but requires 25% less time.

ViewAL is a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. The core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well regardless of the viewpoint from which objects are observed. To incorporate this uncertainty measure, a new viewpoint entropy formulation is introduced as the basis of the active learning strategy. In addition, uncertainty is computed at the superpixel level, exploiting the inherently localized signal in the segmentation task and directly lowering annotation costs. Combining viewpoint entropy with superpixels allows efficient selection of highly informative samples for improving the network. The proposed strategy not only yields the best-performing models for the same amount of labeled data but also significantly reduces labeling effort: it reaches 95% of the maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively, whereas the best state-of-the-art method needs 14%, 27%, and 33% labeled data for the same performance. Finally, labeling with superpixels yields the same ground-truth quality as labeling whole images while requiring 25% less time.
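To give a concrete flavor of the scores mentioned in the overview, a hedged sketch: the view entropy of a superpixel can be taken as the entropy of its class distribution averaged over views, and a simplified stand-in for view divergence as the mean KL between each view's distribution and that average. The paper's actual formulation cross-projects per-pixel predictions between views and uses pairwise divergences, which is not shown here.

```python
import torch

def view_entropy(view_probs):
    """Entropy of the mean class distribution across views (sketch).

    view_probs: (V, C) softmax class probabilities of the same surface
                region as predicted in each of V views.
    """
    mean_p = view_probs.mean(dim=0)                     # (C,)
    return -(mean_p * torch.log(mean_p + 1e-12)).sum()

def view_divergence(view_probs):
    """Simplified divergence score: mean KL(view || mean of views)."""
    mean_p = view_probs.mean(dim=0, keepdim=True)       # (1, C)
    kl = (view_probs * (torch.log(view_probs + 1e-12)
                        - torch.log(mean_p + 1e-12))).sum(dim=1)
    return kl.mean()
```

Superpixels with high view entropy (the views disagree or are uncertain) are the ones selected for annotation.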

Variational Adversarial Active Learning

Our model learns the distribution of labeled data in a latent space using a VAE optimized using both reconstruction and adversarial losses. A binary classifier predicts unlabeled examples and sends them to an oracle for annotations. The VAE is trained to fool the adversarial network to believe that all the examples are from the labeled data while the adversarial classifier is trained to differentiate labeled from unlabeled samples.

Code: https://github.com/sinhasam/vaal

Active learning aims to develop label-efficient algorithms by sampling the most representative queries to be labeled by an oracle. We describe a pool-based semi-supervised active learning algorithm that implicitly learns this sampling mechanism in an adversarial manner. Our method learns a latent space using a variational autoencoder (VAE) and an adversarial network trained to discriminate between unlabeled and labeled data. The mini-max game between the VAE and the adversarial network is played such that while the VAE tries to trick the adversarial network into predicting that all data points are from the labeled pool, the adversarial network learns how to discriminate between dissimilarities in the latent space. We extensively evaluate our method on various image classification and semantic segmentation benchmark datasets and establish a new state of the art on CIFAR10/100, Caltech-256, ImageNet, Cityscapes, and BDD100K. Our results demonstrate that our adversarial approach learns an effective low dimensional latent space in large-scale settings and provides for a computationally efficient sampling method.

Active learning aims to develop label-efficient algorithms by sampling the most representative queries to be labeled by an oracle. The paper describes a pool-based semi-supervised active learning algorithm that implicitly learns this sampling mechanism in an adversarial manner. The method learns a latent space using a variational autoencoder (VAE) and an adversarial network trained to distinguish labeled from unlabeled data. The mini-max game between the VAE and the adversarial network works as follows: the VAE tries to trick the adversarial network into predicting that all data points come from the labeled pool, while the adversarial network learns to discriminate between them in the latent space. The method is evaluated on image classification and semantic segmentation datasets.
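A rough sketch of the adversarial signal and the sampling step, assuming a discriminator disc that maps latent codes to a single logit; the VAE reconstruction and KL terms, network architectures, and training schedule from the full VAAL implementation (see the linked repo) are omitted.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, z_labeled, z_unlabeled):
    """Train the discriminator to separate labeled (1) from unlabeled (0) latents."""
    pred_l = disc(z_labeled)
    pred_u = disc(z_unlabeled)
    return (F.binary_cross_entropy_with_logits(pred_l, torch.ones_like(pred_l)) +
            F.binary_cross_entropy_with_logits(pred_u, torch.zeros_like(pred_u)))

def vae_adversarial_loss(disc, z_labeled, z_unlabeled):
    """The VAE tries to make *all* latents look labeled to the discriminator."""
    pred_l = disc(z_labeled)
    pred_u = disc(z_unlabeled)
    return (F.binary_cross_entropy_with_logits(pred_l, torch.ones_like(pred_l)) +
            F.binary_cross_entropy_with_logits(pred_u, torch.ones_like(pred_u)))

def select_for_labeling(disc, z_unlabeled, budget):
    """Query the unlabeled points the discriminator is most sure are unlabeled."""
    with torch.no_grad():
        scores = torch.sigmoid(disc(z_unlabeled)).squeeze(-1)  # low = 'unlabeled'
    return torch.topk(-scores, budget).indices
```

The selected indices are sent to the oracle for annotation, the labeled pool grows, and the VAE/discriminator pair is retrained for the next round.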