Tag Archive: Video Object Segmentation

SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides suitable inductive bias for performing correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art.

https://arxiv.org/abs/2101.08833

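The abstract does not spell out the exact sparsity pattern, but as a rough illustration of sparse attention over spatiotemporal features (my own sketch, not the authors' implementation), the snippet below lets every pixel of the current frame attend only to a small spatial window in each past frame. The `radius`, the tensor shapes, and the scaling are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def local_spatiotemporal_attention(curr_feat, past_feats, radius=3):
    """Each pixel of the current frame attends to a (2*radius+1)^2 spatial
    window in every past frame -- a sparse alternative to full
    spatiotemporal attention. Shapes and radius are illustrative only.

    curr_feat:  (C, H, W)    query features of the current frame
    past_feats: (T, C, H, W) key/value features of T past frames
    returns:    (C, H, W)    attention-aggregated features
    """
    C, H, W = curr_feat.shape
    T = past_feats.shape[0]
    k = 2 * radius + 1

    # gather a k x k neighbourhood around every location of every past frame
    neigh = F.unfold(past_feats, kernel_size=k, padding=radius)   # (T, C*k*k, H*W)
    neigh = neigh.view(T, C, k * k, H * W).permute(3, 0, 2, 1)    # (H*W, T, k*k, C)
    neigh = neigh.reshape(H * W, T * k * k, C)

    q = curr_feat.view(C, H * W).t().unsqueeze(1)                 # (H*W, 1, C)
    attn = torch.softmax(q @ neigh.transpose(1, 2) / C ** 0.5, dim=-1)
    out = (attn @ neigh).squeeze(1).t().reshape(C, H, W)
    return out

# toy usage: 4 past frames, 64-channel features on a 32x32 grid
curr = torch.randn(64, 32, 32)
past = torch.randn(4, 64, 32, 32)
print(local_spatiotemporal_attention(curr, past).shape)  # torch.Size([64, 32, 32])
```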

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR 2019)


We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.
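As a minimal sketch of the co-attention described above (a plain variant, not the authors' code): an affinity matrix relates every pixel of one frame to every pixel of the other, each frame attends to the other through it, and the attended responses are appended to the original features. The bilinear weight and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Minimal co-attention between two frames' feature maps: compute a
    pixel-to-pixel affinity, attend in both directions, and append the
    responses to the original features."""

    def __init__(self, channels):
        super().__init__()
        # learnable weight of the bilinear affinity S = Fa^T W Fb
        self.weight = nn.Parameter(torch.eye(channels))

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W)
        B, C, H, W = feat_a.shape
        fa = feat_a.flatten(2)                                   # (B, C, HW)
        fb = feat_b.flatten(2)                                   # (B, C, HW)

        # affinity between every pixel pair of the two frames: (B, HW, HW)
        S = fa.transpose(1, 2) @ self.weight @ fb

        # frame A attends over frame B, and vice versa
        za = fb @ torch.softmax(S, dim=2).transpose(1, 2)        # (B, C, HW)
        zb = fa @ torch.softmax(S, dim=1)                        # (B, C, HW)

        # append co-attention responses to the original features
        out_a = torch.cat([fa, za], dim=1).view(B, 2 * C, H, W)
        out_b = torch.cat([fb, zb], dim=1).view(B, 2 * C, H, W)
        return out_a, out_b

# toy usage
coatt = CoAttention(channels=64)
a, b = torch.randn(2, 64, 30, 30), torch.randn(2, 64, 30, 30)
print(coatt(a, b)[0].shape)  # torch.Size([2, 128, 30, 30])
```

The paper notes that different co-attention variants can be derived within the same framework; the version above is only the plainest one.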

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better.
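A minimal sketch of the pair-wise training setup described here (the data container and the sampling budget are assumptions): any two frames of the same video form one training example, so the number of usable pairs grows roughly quadratically with video length and can be drawn in arbitrary order.

```python
import random
from torch.utils.data import Dataset

class FramePairDataset(Dataset):
    """Samples random pairs of frames from the same video.

    `videos` is assumed to be a dict mapping video name -> list of
    preprocessed frame tensors; loading details are placeholders.
    """

    def __init__(self, videos, pairs_per_epoch=1000):
        self.videos = [f for f in videos.values() if len(f) >= 2]
        self.pairs_per_epoch = pairs_per_epoch

    def __len__(self):
        return self.pairs_per_epoch

    def __getitem__(self, idx):
        frames = random.choice(self.videos)
        # any two frames of the same video, not necessarily adjacent
        i, j = random.sample(range(len(frames)), 2)
        return frames[i], frames[j]
```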

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.


  • Idea:
    • The authors propose a co-attention mechanism that improves UVOS accuracy by taking a global view of the whole video sequence (and it does lead many current models on the official DAVIS leaderboard). Earlier methods identify the object to segment via saliency detection, or via optical flow computed between a limited number of frames; COSNet instead decides which object should be segmented by considering the entire video sequence. At test time, COSNet aggregates the information gathered from all preceding frames to infer which object in the current frame is both salient and frequently reappearing (see the sketch after this list). The co-attention module mines the rich contextual information shared across video frames, and on top of it the authors build COSNet (Co-attention Siamese Network) to model UVOS from this global perspective; what "global" means concretely is explained in the method section.
  • Contributions:
    • COSNet is trained on pairs of arbitrary frames drawn from the same video, which greatly enlarges the training data; frames need not be fed in temporal order, so the data can be shuffled and recombined at will.
    • It explicitly models frame-to-frame relations without relying on optical flow.
    • It is a unified, end-to-end trainable, efficient network.
  • Unsupervised:
    • "Unsupervised" in UVOS means that no foreground object is given in advance and the network determines the foreground on its own; it does not mean, in the traditional sense, that labels are excluded from training.
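Following up on the inference step mentioned in the idea note above, here is a hedged sketch of how the multi-reference aggregation could look. The function, the plain averaging, and the decoder head are assumptions for illustration, not the paper's exact procedure; it reuses the CoAttention module sketched earlier.

```python
import torch

@torch.no_grad()
def segment_frame(query_feat, reference_feats, coattention, decoder):
    """Fuse co-attention responses of the query frame against several
    reference frames of the same video, then decode a mask.

    query_feat:      (1, C, H, W) backbone features of the current frame
    reference_feats: list of (1, C, H, W) features of earlier frames
    coattention:     e.g. the CoAttention module sketched earlier
    decoder:         any head mapping (1, 2C, H, W) -> mask logits
    """
    responses = []
    for ref in reference_feats:
        out_q, _ = coattention(query_feat, ref)          # (1, 2C, H, W)
        responses.append(out_q)
    # plain averaging: objects that keep reappearing across reference
    # frames accumulate consistent evidence, occasional distractors do not
    fused = torch.stack(responses, dim=0).mean(dim=0)
    return decoder(fused)
```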