An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State-of-the-Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers only a small fraction of the input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames and thus better exploits the salient information in each frame. Our approach is therefore highly input-efficient and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach 78.8% top-1 accuracy with 30× fewer frames per video and 40× faster inference than the current leading method.
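As a rough illustration of the idea described above, the following PyTorch sketch pairs a per-frame 2D backbone with a temporal transformer whose self-attention spans all sampled frames. It is a minimal sketch under assumptions, not the authors' implementation: the class name, dimensions, and the ResNet-50 stand-in for the paper's spatial backbone are all illustrative.

```python
# Minimal sketch (assumed design): per-frame spatial features + a temporal
# transformer applying global attention across all sampled frames.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalTransformerClassifier(nn.Module):
    def __init__(self, num_classes=400, embed_dim=768, depth=6,
                 num_heads=12, num_frames=16):
        super().__init__()
        # Per-frame spatial feature extractor (a 2D CNN stands in here for the
        # ViT-style spatial backbone).
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.frame_encoder = backbone

        # Learnable classification token and temporal position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames + 1, embed_dim))

        # Temporal transformer: global self-attention over all frame tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):                  # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)           # (B*T, C, H, W)
        tokens = self.frame_encoder(frames)    # (B*T, D), one token per frame
        tokens = tokens.view(b, t, -1)         # (B, T, D)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.temporal_pos[:, : t + 1]
        x = self.temporal_transformer(x)       # global attention over all frames
        return self.head(x[:, 0])              # classify from the CLS token


model = TemporalTransformerClassifier()
clip = torch.randn(2, 16, 3, 224, 224)         # 16 frames sampled per video
logits = model(clip)                           # (2, 400)
```

Because attention is global over the whole set of sampled frames, a single pass over a handful of frames spread across the video can replace the many densely sampled clips a 3D CNN needs.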

https://arxiv.org/abs/2103.13915

The leading methods in action recognition all need to extract both spatial and temporal information from a video. Methods that reach SOTA accuracy typically rely on 3D convolution layers to capture the temporal information. Using such convolutions means the video has to be cut into short clips before processing, where each clip is a set of closely sampled adjacent frames. To cover the entire video, many such short clips therefore have to be sampled, which drives up the computational cost and makes real-world deployment impractical. We greatly reduce the computation by cutting the number of sampled frames. Our method uses a temporal transformer that covers the video frames with global attention and can therefore better exploit the salient information in each frame. As a result, our approach processes its input much more efficiently while still reaching SOTA performance.
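To make the inference-cost argument concrete, here is a small sketch contrasting dense multi-clip sampling with sampling a few frames spread uniformly over the whole video. The helper names, clip counts, clip lengths, and strides are illustrative assumptions, not numbers from the paper.

```python
# Illustrative comparison (assumed numbers): how many frames must be decoded
# for multi-clip 3D-CNN inference vs. a few uniformly spread frames.
def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole video."""
    step = num_video_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def dense_clip_indices(num_video_frames: int, clips: int,
                       clip_len: int, stride: int) -> list[list[int]]:
    """Pick several short clips of closely spaced frames (multi-clip style)."""
    starts = uniform_frame_indices(num_video_frames, clips)
    return [[min(s + j * stride, num_video_frames - 1) for j in range(clip_len)]
            for s in starts]


# e.g. a 300-frame (~10 s) video:
print(len(uniform_frame_indices(300, 16)))                       # 16 frames total
print(sum(len(c) for c in dense_clip_indices(300, 10, 32, 2)))   # 320 frames total
```

Even with these made-up settings, the multi-clip scheme has to decode and process an order of magnitude more frames than sampling a small, uniformly spread set, which is the gap the paper's frame-efficiency and latency numbers reflect.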
