Tag archive: Action Recognition

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.

https://arxiv.org/abs/2103.15691

We present a pure-transformer model for video classification, building on the success such models have had in image classification. The model extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers. To handle the resulting long token sequences, we propose several efficient variants of the model that factorise the input along the spatial and temporal dimensions. Although transformer-based models are generally thought to be effective only with large training datasets, we show that with suitable regularisation and pretrained image models they can be trained on comparatively small datasets and still perform strongly. Experiments on several benchmarks show the model outperforming 3D convolutional networks.
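A minimal PyTorch sketch of the factorised-encoder idea: tubelet embedding turns a clip into spatio-temporal tokens, a spatial transformer encodes each frame's tokens, and a temporal transformer encodes the per-frame representations. All module names and dimensions are illustrative, and CLS tokens/positional embeddings are omitted; this is my own simplification, not the authors' code.

```python
import torch
import torch.nn as nn

class FactorisedViViT(nn.Module):
    def __init__(self, dim=192, heads=3, depth=2, num_classes=400,
                 tubelet=(2, 16, 16)):
        super().__init__()
        # Tubelet embedding: a 3D conv whose stride equals its kernel size.
        self.embed = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        def enc():
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            return nn.TransformerEncoder(layer, depth)
        self.spatial, self.temporal = enc(), enc()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                            # (B, 3, T, H, W)
        x = self.embed(video)                            # (B, D, t, h, w)
        b, d, t, h, w = x.shape
        x = x.flatten(3).permute(0, 2, 3, 1)             # (B, t, h*w, D)
        x = x.reshape(b * t, h * w, d)
        x = self.spatial(x).mean(dim=1)                  # per-frame embedding
        x = self.temporal(x.reshape(b, t, d)).mean(dim=1)  # clip embedding
        return self.head(x)

logits = FactorisedViViT()(torch.randn(1, 3, 16, 224, 224))  # -> (1, 400)
```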

An Image is Worth 16×16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method. 

https://arxiv.org/abs/2103.13915

Leading action recognition methods extract information from both the spatial and temporal dimensions of a video. Those reaching SOTA accuracy usually rely on 3D convolution layers to capture temporal information, which means the video must be split into short clips of closely sampled frames before processing. Because each clip covers only a small fraction of the video, many clips have to be sampled to cover its full temporal extent, which increases the computational load and makes real-world deployment impractical. We address this by drastically reducing the number of sampled frames: a temporal transformer applies global attention over the video frames and therefore exploits the salient information in each frame more effectively. As a result, the method is far more input-efficient and still reaches SOTA performance.
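The core idea lends itself to a short sketch: embed a handful of frames with a 2D image backbone and let a temporal transformer attend globally across them. The ResNet-50 backbone (the paper uses a ViT), dimensions, and mean pooling are illustrative assumptions, and a recent torchvision is assumed for the weights=None argument.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrameTransformerClassifier(nn.Module):
    def __init__(self, dim=2048, heads=8, depth=2, num_classes=400):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                 # (B, T, 3, H, W), T kept small (e.g. 8)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))          # (B*T, 2048, 1, 1)
        x = x.flatten(1).reshape(b, t, -1)               # (B, T, 2048)
        x = self.temporal(x)                             # global attention over frames
        return self.head(x.mean(dim=1))

logits = FrameTransformerClassifier()(torch.randn(2, 8, 3, 224, 224))  # -> (2, 400)
```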

Video Transformer Network

This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence information. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall runtime, it trains 16.1× faster and runs 5.1× faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole video analysis, via a single end-to-end pass, while requiring 1.5× fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain.

https://arxiv.org/abs/2102.00719

In this paper we present VTN, a transformer-based framework for video recognition. Inspired by recent vision transformers, we observe that existing action recognition methods rely on 3D ConvNets, and instead propose a method that predicts actions by attending to the information of the entire video sequence. The approach can be combined with any 2D spatial network; compared with current SOTA methods it trains 16.1× faster and runs 5.1× faster at inference while maintaining comparable accuracy, and whole-video inference requires 1.5× fewer GFLOPs. We report competitive results on Kinetics-400, and our ablation study examines the properties of VTN and the trade-off between accuracy and inference speed. We hope the method will serve as a new baseline and earn a place in the video recognition field.
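A rough sketch of a VTN-style temporal head that sits on top of any 2D spatial backbone: per-frame features plus a classification token, encoded over the whole sequence. The paper uses Longformer-style attention; the plain transformer, dimensions, and max_len here are simplifications of mine.

```python
import torch
import torch.nn as nn

class TemporalAttentionHead(nn.Module):
    def __init__(self, feat_dim=768, heads=12, depth=3, num_classes=400, max_len=512):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))          # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, heads, 4 * feat_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):            # (B, T, feat_dim) from any 2D backbone
        b, t, _ = frame_feats.shape
        cls = self.cls.expand(b, -1, -1)
        x = torch.cat([cls, frame_feats], dim=1) + self.pos[:, : t + 1]
        x = self.encoder(x)
        return self.head(x[:, 0])              # classify from the [CLS] position

# e.g. 250 frame features (a whole clip) from a hypothetical 2D backbone:
logits = TemporalAttentionHead()(torch.randn(1, 250, 768))   # -> (1, 400)
```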

HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation

Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small person. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium person on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scene. The code and models are available at https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation.

https://arxiv.org/pdf/1908.10357.pdf

Bottom-up human pose estimation methods have difficulty recognising small persons because they cope poorly with scale variation. This paper presents HigherHRNet, a bottom-up pose estimation method whose core idea is to learn scale-aware representations with high-resolution feature pyramids. Using multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed model addresses the scale-variation problem. The feature pyramid in HigherHRNet consists of the feature maps output by HRNet together with higher-resolution maps obtained by upsampling with a transposed convolution.
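A rough sketch (illustrative shapes only, not the released code) of the high-resolution head: heatmaps are predicted at the HRNet output resolution and at a 2× higher resolution produced by a transposed convolution, and the upsampled predictions are averaged at inference (multi-resolution aggregation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HigherResolutionHead(nn.Module):
    def __init__(self, in_ch=32, num_joints=17):
        super().__init__()
        self.low_head = nn.Conv2d(in_ch, num_joints, 1)         # 1/4-resolution heatmaps
        self.deconv = nn.Sequential(                            # upsample features 2x
            nn.ConvTranspose2d(in_ch + num_joints, in_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.high_head = nn.Conv2d(in_ch, num_joints, 1)        # 1/2-resolution heatmaps

    def forward(self, feat):                   # feat: HRNet's highest-resolution branch
        low = self.low_head(feat)
        high = self.high_head(self.deconv(torch.cat([feat, low], dim=1)))
        return low, high

    @torch.no_grad()
    def aggregate(self, feat):                 # multi-resolution aggregation at inference
        low, high = self(feat)
        low_up = F.interpolate(low, size=high.shape[-2:], mode="bilinear",
                               align_corners=False)
        return (low_up + high) / 2

heatmaps = HigherResolutionHead().aggregate(torch.randn(1, 32, 128, 128))  # (1, 17, 256, 256)
```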

NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

Research on depth-based human activity analysis achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding. [The dataset is available at: http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]

https://arxiv.org/abs/1905.04757

Depth-based human activity analysis has made encouraging progress and demonstrated the effectiveness of 3D representations for action recognition. Existing depth-based and RGB+D benchmarks all have limitations, notably the lack of large-scale training samples, a realistic number of classes, diverse camera views, varied environmental conditions, and a variety of human subjects. In this paper we introduce a large-scale RGB+D action recognition dataset collected from 106 distinct subjects, containing more than 114,000 videos and over 8 million frames. The dataset covers 120 action classes, including daily, mutual, and health-related activities. We evaluate a series of existing 3D activity analysis methods on it and conclude that deep learning methods hold an advantage for 3D-based action recognition. We also investigate a one-shot activity recognition task on the dataset and obtain promising results. We believe this large-scale dataset will serve the community and help satisfy data-hungry learning techniques.

Deep High-Resolution Representation Learning for Human Pose Estimation

In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: The COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models have been publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.

https://arxiv.org/pdf/1902.09212.pdf

This paper tackles human pose estimation by learning reliable high-resolution representations. Most existing methods recover high-resolution representations from the low-resolution ones produced by a high-to-low resolution network. In contrast, the proposed network maintains high-resolution representations throughout the whole process. It starts with a high-resolution subnetwork in the first stage, gradually adds high-to-low resolution subnetworks in later stages, and connects the multi-resolution subnetworks in parallel. Repeated multi-scale fusion lets the branches exchange information and effectively enriches the high-resolution representation. Experiments show that the predicted keypoint heatmaps become more accurate and spatially more precise, with strong results across the benchmarks.
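A toy two-branch sketch of the core pattern: keep a high-resolution and a low-resolution branch in parallel and repeatedly fuse them, rather than recovering resolution at the end. The real network uses up to four branches with residual blocks; the channel counts and convolutions below are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    def __init__(self, ch_high=32, ch_low=64):
        super().__init__()
        self.high = nn.Conv2d(ch_high, ch_high, 3, padding=1)   # stays at full resolution
        self.low = nn.Conv2d(ch_low, ch_low, 3, padding=1)      # stays at 1/2 resolution
        self.low_to_high = nn.Conv2d(ch_low, ch_high, 1)        # 1x1 conv, then upsample
        self.high_to_low = nn.Conv2d(ch_high, ch_low, 3, stride=2, padding=1)

    def forward(self, x_high, x_low):
        h, l = F.relu(self.high(x_high)), F.relu(self.low(x_low))
        # Fusion: each branch receives information from the other branch.
        h_fused = h + F.interpolate(self.low_to_high(l), size=h.shape[-2:],
                                    mode="bilinear", align_corners=False)
        l_fused = l + self.high_to_low(h)
        return h_fused, l_fused

h, l = torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)
for block in [TwoBranchFusion(), TwoBranchFusion()]:    # repeated multi-scale fusion
    h, l = block(h, l)
# h keeps its 64x64 resolution throughout; keypoint heatmaps are predicted from it.
```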

Deep Attention Network for Egocentric Action Recognition

Recognizing a camera wearer’s actions from videos captured by an egocentric camera is a challenging task. In this paper, we employ a two-stream deep neural network composed of an appearance-based stream and a motion-based stream to recognize egocentric actions. Based on the insight that human action and gaze behavior are highly coordinated in object manipulation tasks, we propose a spatial attention network to predict human gaze in the form of attention map. The attention map helps each of the two streams to focus on the most relevant spatial region of the video frames to predict actions. To better model the temporal structure of the videos, a temporal network is proposed. The temporal network incorporates bi-directional long short-term memory to model the long-range dependencies to recognize egocentric actions. The experimental results demonstrate that our method is able to predict attention maps that are consistent with human attention and achieve competitive action recognition performance with the state-of-the-art methods on the GTEA Gaze and GTEA Gaze+ datasets.

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8653357

This paper uses a two-stream network that learns motion and appearance information to recognise egocentric actions. The motivation is that human action and gaze behaviour are highly coordinated during object manipulation, so a spatial attention network is proposed to predict human gaze in the form of an attention map; this map helps both streams focus on the most relevant spatial regions. To better model temporal structure, a temporal network built on a bi-directional LSTM captures long-range dependencies for action recognition. Experiments show that the method predicts attention maps consistent with human attention and achieves competitive recognition performance.
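A simplified sketch (one stream shown) of the two components: a spatial attention map that re-weights frame features before pooling, and a bi-directional LSTM over the frame sequence. Feature sizes and the class count are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    def __init__(self, feat_ch=512, hidden=256, num_classes=40):
        super().__init__()
        self.attn = nn.Conv2d(feat_ch, 1, 1)                  # gaze-like attention map
        self.lstm = nn.LSTM(feat_ch, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):                  # (B, T, C, H, W) per-frame CNN features
        b, t, c, h, w = feats.shape
        x = feats.flatten(0, 1)                               # (B*T, C, H, W)
        attn = torch.softmax(self.attn(x).flatten(1), dim=1)  # (B*T, H*W)
        pooled = (x.flatten(2) * attn.unsqueeze(1)).sum(-1)   # attention-weighted pooling
        out, _ = self.lstm(pooled.reshape(b, t, c))           # bi-directional over time
        return self.head(out.mean(dim=1))

logits = AttentionBiLSTM()(torch.randn(2, 16, 512, 7, 7))     # -> (2, 40)
```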

Infrared and 3D skeleton feature fusion for RGB-D action recognition

A challenge of skeleton-based action recognition is the difficulty to classify actions with similar motions and object-related actions. Visual clues from other streams help in that regard. RGB data are sensible to illumination conditions, thus unusable in the dark. To alleviate this issue and still benefit from a visual stream, we propose a modular network (FUSION) combining skeleton and infrared data. A 2D convolutional neural network (CNN) is used as a pose module to extract features from skeleton data. A 3D CNN is used as an infrared module to extract visual cues from videos. Both feature vectors are then concatenated and exploited conjointly using a multilayer perceptron (MLP). Skeleton data also condition the infrared videos, providing a crop around the performing subjects and thus virtually focusing the attention of the infrared module. Ablation studies show that using pre-trained networks on other large scale datasets as our modules and data augmentation yield considerable improvements on the action classification accuracy. The strong contribution of our cropping strategy is also demonstrated. We evaluate our method on the NTU RGB+D dataset, the largest dataset for human action recognition from depth cameras, and report state-of-the-art performances.

http://arxiv.org/abs/2002.12886

Skeleton-based action recognition struggles with actions that involve similar motions or interactions with objects, while RGB-based methods are easily affected by illumination and scene conditions. This paper therefore proposes FUSION, an action recognition model that fuses skeleton and infrared data: a 2D network extracts skeleton features, a 3D network extracts infrared visual features, and the concatenated features are passed through an MLP to predict the action class. The skeleton also conditions the infrared stream by selecting a suitable crop around the subjects. The proposed model achieves state-of-the-art results on the NTU RGB+D dataset.
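A minimal sketch of the fusion pattern: a 2D CNN over a skeleton map, a 3D CNN over the (skeleton-cropped) infrared clip, feature concatenation, and an MLP classifier. The tiny module sizes and input shapes are placeholders, not the FUSION architecture.

```python
import torch
import torch.nn as nn

class SkeletonInfraredFusion(nn.Module):
    def __init__(self, num_classes=120):
        super().__init__()
        self.pose_net = nn.Sequential(                # skeleton treated as a 2D map
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())    # -> 32-dim pose feature
        self.ir_net = nn.Sequential(                  # infrared clip, cropped around subjects
            nn.Conv3d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())    # -> 32-dim visual feature
        self.mlp = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, skeleton_map, ir_clip):
        f = torch.cat([self.pose_net(skeleton_map), self.ir_net(ir_clip)], dim=1)
        return self.mlp(f)

# skeleton map: (B, 3, joints, frames); infrared clip: (B, 1, T, H, W)
logits = SkeletonInfraredFusion()(torch.randn(2, 3, 25, 30),
                                  torch.randn(2, 1, 16, 112, 112))   # -> (2, 120)
```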