Infrared and 3D skeleton feature fusion for RGB-D action recognition

One challenge of skeleton-based action recognition is the difficulty of classifying actions with similar motions and object-related actions. Visual cues from other streams help in that regard. However, RGB data are sensitive to illumination conditions and therefore unusable in the dark. To alleviate this issue while still benefiting from a visual stream, we propose a modular network (FUSION) combining skeleton and infrared data. A 2D convolutional neural network (CNN) is used as a pose module to extract features from skeleton data, and a 3D CNN is used as an infrared module to extract visual cues from videos. The two feature vectors are then concatenated and exploited jointly by a multilayer perceptron (MLP). Skeleton data also condition the infrared videos, providing a crop around the performing subjects and thus virtually focusing the attention of the infrared module. Ablation studies show that initializing our modules with networks pre-trained on other large-scale datasets, as well as data augmentation, yields considerable improvements in action classification accuracy. The strong contribution of our cropping strategy is also demonstrated. We evaluate our method on the NTU RGB+D dataset, the largest dataset for human action recognition from depth cameras, and report state-of-the-art performance.
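The two ideas in the abstract, cropping the infrared frames around the skeleton and fusing the two feature vectors through an MLP, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, feature sizes, and crop margin are all assumptions.

```python
import numpy as np

def skeleton_crop_box(joints_2d, margin=20):
    """Bounding box around the projected skeleton joints.

    joints_2d: (num_joints, 2) array of (x, y) pixel coordinates.
    The margin and the box format are illustrative choices; the paper
    only states that the skeleton provides a crop around the subjects.
    """
    x_min, y_min = joints_2d.min(axis=0) - margin
    x_max, y_max = joints_2d.max(axis=0) + margin
    return int(x_min), int(y_min), int(x_max), int(y_max)

def fuse_and_classify(pose_feat, ir_feat, W1, b1, W2, b2):
    """Concatenate the pose and infrared feature vectors, then score
    them with a small two-layer MLP (ReLU hidden layer)."""
    fused = np.concatenate([pose_feat, ir_feat])
    hidden = np.maximum(0.0, fused @ W1 + b1)   # ReLU
    logits = hidden @ W2 + b2
    return int(np.argmax(logits))               # predicted action class

# Toy usage with random features and weights (shapes are arbitrary).
rng = np.random.default_rng(0)
pose_feat = rng.standard_normal(8)              # stand-in for 2D CNN output
ir_feat = rng.standard_normal(8)                # stand-in for 3D CNN output
W1 = rng.standard_normal((16, 4)); b1 = np.zeros(4)
W2 = rng.standard_normal((4, 3));  b2 = np.zeros(3)
pred = fuse_and_classify(pose_feat, ir_feat, W1, b1, W2, b2)
```

In the actual network the crop would be applied to the infrared frames before the 3D CNN, and the MLP weights would be learned end-to-end; the sketch only shows the data flow of the fusion step.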

Skeleton-based action recognition methods struggle to recognize subtle motions and object-related actions, while RGB-based methods are easily affected by illumination and scene conditions. This paper therefore proposes a fusion action recognition model based on skeleton and infrared data: a 2D network extracts skeleton features, a 3D network extracts infrared visual features, and after concatenation an MLP outputs the action class. The skeleton can also condition the infrared images, selecting a suitable region of attention for them. The proposed model achieves state-of-the-art results on the NTU RGB+D dataset.
