See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR2019)


We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve the state-of-the-art deep learning based solutions, which primarily focus on learning discriminative foreground representations over appearance and motion within short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects.

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin.

This paper focuses on the inherent correlation among video frames and uses a global co-attention mechanism to improve deep learning based methods, which learn discriminative foreground features over appearance and motion within short-term segments.

By jointly computing co-attention responses and appending them into a joint feature space, the co-attention layers provide an effective stage for capturing global correlations and scene context.
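To make this concrete, below is a minimal PyTorch sketch of a vanilla co-attention layer between two frame embeddings: an affinity matrix S = Vaᵀ W Vb is normalized in both directions, each frame attends to the other, and the attended summaries are appended to the original features. The class name, the 1x1 fusion conv, and the exact shapes are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of vanilla co-attention between two frames,
# in the spirit of COSNet. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Learnable weight W for the affinity S = Va^T W Vb.
        self.weight = nn.Linear(channels, channels, bias=False)
        # 1x1 conv fusing [attended || original] back to `channels`.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W) embeddings of two frames.
        b, c, h, w = feat_a.shape
        va = feat_a.view(b, c, h * w)                       # (B, C, HW)
        vb = feat_b.view(b, c, h * w)                       # (B, C, HW)
        # Affinity between every position of frame a and frame b.
        s = torch.bmm(self.weight(va.transpose(1, 2)), vb)  # (B, HW, HW)
        # Normalize in each direction to attend both ways.
        attn_b = F.softmax(s, dim=2)   # each a-position attends over b
        attn_a = F.softmax(s, dim=1)   # each b-position attends over a
        za = torch.bmm(vb, attn_b.transpose(1, 2)).view(b, c, h, w)
        zb = torch.bmm(va, attn_a).view(b, c, h, w)
        # Append the co-attention summaries to the original features.
        out_a = self.fuse(torch.cat([za, feat_a], dim=1))
        out_b = self.fuse(torch.cat([zb, feat_b], dim=1))
        return out_a, out_b
```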

COSNet is trained on pairs of video frames, which augments the training data and increases the learning capacity. During the segmentation stage, the co-attention model encodes useful information by jointly processing multiple reference frames, and this information is leveraged to infer the foreground objects that are both salient and frequently reappearing.
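A hedged sketch of what this inference stage could look like: the query frame attends to several reference frames from the same video, and the resulting co-attention summaries are pooled before the segmentation head. Here `backbone`, `co_attention`, and `seg_head` are assumed components, and mean pooling stands in for the paper's exact aggregation.

```python
# Illustrative COSNet-style inference over multiple reference frames.
# Component names and the mean-pooling step are assumptions.
import torch

@torch.no_grad()
def segment_frame(query, references, backbone, co_attention, seg_head):
    """query: (1, 3, H, W); references: list of (1, 3, H, W) frames."""
    q_feat = backbone(query)
    summaries = []
    for ref in references:
        r_feat = backbone(ref)
        q_att, _ = co_attention(q_feat, r_feat)  # query enriched by ref
        summaries.append(q_att)
    # Pool over references so frequently reappearing objects dominate.
    pooled = torch.stack(summaries, dim=0).mean(dim=0)
    return seg_head(pooled)  # foreground probability map
```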

  • idea:
    • The authors propose a co-attention mechanism that tackles UVOS from a global view of the whole video sequence, improving accuracy (it does lead many current models on the DAVIS leaderboard). Earlier methods find the target object via saliency detection, or via optical flow computed between a limited number of frames. COSNet instead considers the entire video sequence when deciding which object should be segmented: at test time it aggregates the information from all preceding frames to infer which object in the current frame is both salient and frequently appearing. The co-attention module mines the rich contextual information among video frames, and on top of it the authors build COSNet (co-attention Siamese network) to model UVOS from a global view. If this "global view" is still unclear at this point, it is explained in the method section.
  • contribution:
    • COSNet is trained on pairs consisting of any two frames from the same video, which greatly enlarges the training data. There is no need to respect temporal order and feed frames sequentially; the data can be shuffled and combined at random (see the pair-sampling sketch after this list).
    • It explicitly models frame-to-frame relations without relying on optical flow.
    • It is a unified, end-to-end trainable, and efficient network.
  • unsupervised:
    • In UVOS, "unsupervised" means that no foreground object is given at test time and the network must decide on its own which object is the foreground; it does not mean, in the traditional sense, that labels are excluded from training.
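As referenced in the contribution list, here is an illustrative sampler for the pair-based training scheme: any two frames of the same video form a pair, so pairs can be shuffled freely. The function name and the dataset layout (a dict mapping video name to frame paths) are hypothetical.

```python
# Hypothetical pair sampler for pair-based training: any two frames of
# the same video form a training pair, with no temporal ordering.
import random

def sample_pairs(videos, num_pairs):
    """videos: {video_name: [frame_path, ...]}; returns random pairs."""
    names = list(videos)
    pairs = []
    for _ in range(num_pairs):
        frames = videos[random.choice(names)]
        f1, f2 = random.sample(frames, 2)  # any two frames, order-free
        pairs.append((f1, f2))
    random.shuffle(pairs)  # pairs can be fed in any order
    return pairs
```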
