
SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art.
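The core idea of attending sparsely over a spatiotemporal memory can be sketched as follows. This is a hedged illustration, not the paper's implementation: it simply lets each query pixel attend to only its top-k highest-scoring keys across all memory frames (the function name and the plain dot-product affinity are assumptions for the sketch).

```python
import numpy as np

def sparse_spatiotemporal_attention(query, memory_keys, memory_values, k=4):
    """Illustrative sketch: each query pixel attends only to its top-k
    highest-scoring keys across the spatiotemporal memory (all T*H*W
    entries flattened), instead of attending densely to all of them.
    query: (N, d); memory_keys, memory_values: (M, d)."""
    scores = query @ memory_keys.T  # (N, M) dot-product affinities
    # keep only the top-k scores per query row; mask the rest to -inf
    topk_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, topk_idx,
                      np.take_along_axis(scores, topk_idx, axis=1), axis=1)
    # softmax over the sparse scores; masked entries get exactly zero weight
    weights = np.exp(mask - mask.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory_values  # (N, d) aggregated features
```

Because each query uses only k memory entries, the attention cost no longer grows with the full size of the frame history, which is the scalability point the abstract makes.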



See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR2019)


We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve state-of-the-art deep-learning-based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.
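A vanilla co-attention layer of this kind can be sketched as below. This is a minimal illustration under assumed shapes, not the paper's code: two frames' pixel features are correlated through a learnable matrix `W`, each frame is summarized by attending over the other, and the responses are appended to the original features.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(Fa, Fb, W):
    """Illustrative co-attention between two frames' flattened pixel
    features Fa (Na, d) and Fb (Nb, d), correlated through a learnable
    weight matrix W (d, d). Names are assumptions for this sketch."""
    S = Fa @ W @ Fb.T            # (Na, Nb) affinity between all pixel pairs
    attn_a = softmax(S, axis=1)  # each pixel of frame a weights frame b
    attn_b = softmax(S, axis=0)  # each pixel of frame b weights frame a
    Za = attn_a @ Fb             # frame-a features enhanced by frame b
    Zb = attn_b.T @ Fa           # frame-b features enhanced by frame a
    # append the co-attention responses to the originals (joint feature space)
    return np.concatenate([Fa, Za], axis=1), np.concatenate([Fb, Zb], axis=1)
```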

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects.
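The inference-stage aggregation over several reference frames might look like the following hedged sketch. It uses a plain dot-product affinity and a simple average for illustration; the function name and these choices are assumptions, but they convey why objects that reappear across references get reinforced.

```python
import numpy as np

def infer_with_references(frame_feat, reference_feats):
    """Illustrative sketch: the query frame (Nq, d) is attended against
    several reference frames from the same video, and the responses are
    averaged, so objects that keep reappearing accumulate support."""
    responses = []
    for ref in reference_feats:                     # each ref: (Nr, d)
        S = frame_feat @ ref.T                      # (Nq, Nr) affinity
        attn = np.exp(S - S.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)     # softmax over ref pixels
        responses.append(attn @ ref)                # reference-conditioned summary
    return np.mean(responses, axis=0)               # aggregate across references
```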

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.




  • Idea:
    • The authors propose a co-attention mechanism that improves UVOS accuracy by modeling a video sequence from a global perspective (and it does lead many current models on the DAVIS leaderboard). Earlier methods either obtain the target to segment via saliency detection or rely on optical flow computed between a limited number of frames; COSNet instead considers the entire video sequence when deciding which object should be segmented. At test time, COSNet aggregates the information from all preceding frames to infer which object in the current frame is both salient and frequently reappearing. The co-attention module mines the rich contextual information between video frames. Building on co-attention, the authors propose COSNet (Co-attention Siamese Network) to model UVOS from this global perspective; what exactly the global perspective means is explained in the method section.
  • Contributions:
    • COSNet is trained on pairs consisting of any two frames from the same video, which greatly increases the amount of training data; there is no need to respect temporal order and feed frames in sequence, since frames can be shuffled and paired at random.
    • It explicitly models frame-to-frame relationships without relying on optical flow.
    • A unified, end-to-end trainable, and efficient network.
  • Unsupervised:
    • In UVOS, "unsupervised" means the foreground object is not given and the network must decide on its own which object is the foreground; it does not mean, in the traditional sense, that labels are excluded from training.
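The pair-sampling scheme described in the contributions can be sketched as follows. The data structure (a mapping from video id to frame count) and the function name are assumptions for illustration; the point is that any two distinct frames of one video form a valid training pair, drawn in arbitrary order.

```python
import random

def sample_training_pairs(video_lengths, num_pairs):
    """Illustrative sketch of the pair-sampling scheme: any two distinct
    frames from the same video form a training pair, so pairs can be
    drawn at random rather than fed sequentially.
    video_lengths: {video_id: frame_count} (assumed structure)."""
    pairs = []
    for _ in range(num_pairs):
        vid = random.choice(list(video_lengths))
        # two distinct frame indices from the chosen video
        i, j = random.sample(range(video_lengths[vid]), 2)
        pairs.append((vid, i, j))
    return pairs
```

Compared with feeding consecutive frames, this turns every video of n frames into on the order of n² candidate pairs, which is the data-augmentation effect the notes mention.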