Tag archive: Multimodal

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.

The motivation for learning generative models that span multiple data modalities, such as vision and language, is to learn more useful, generalisable representations of the common underlying factors between modalities. The paper characterises successful learning of such models in terms of four criteria:

(1) implicit latent decomposition into shared and private subspaces

(2) coherent joint generation over all modalities

(3) coherent cross-generation across individual modalities

(4) improved model learning for individual modalities through multi-modal integration

The authors propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models over different sets of modalities, including a challenging image ↔ language dataset.
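
To make the mixture-of-experts joint posterior concrete, here is a minimal PyTorch sketch (the encoder architecture, layer sizes, and uniform expert choice are illustrative assumptions, not the authors' released code): each modality gets its own Gaussian encoder, and a joint latent sample is drawn by picking one unimodal expert per example and reparameterising from it.

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Per-modality Gaussian encoder q(z | x_m)."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class MoEJointPosterior(nn.Module):
    """Joint posterior as a uniform mixture of per-modality Gaussian experts."""
    def __init__(self, encoders):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)

    def forward(self, inputs):
        # inputs: one tensor per modality, each of shape (batch, in_dim)
        params = [enc(x) for enc, x in zip(self.encoders, inputs)]
        mus = torch.stack([mu for mu, _ in params])       # (M, B, D)
        logvars = torch.stack([lv for _, lv in params])   # (M, B, D)
        # Pick one expert per example uniformly, then reparameterise from it.
        batch, latent = mus.size(1), mus.size(2)
        choice = torch.randint(mus.size(0), (1, batch, 1)).expand(1, batch, latent)
        mu = mus.gather(0, choice).squeeze(0)
        logvar = logvars.gather(0, choice).squeeze(0)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, params

# Example: a two-modality setup (e.g. image features and text features).
posterior = MoEJointPosterior([UnimodalEncoder(784, 20), UnimodalEncoder(300, 20)])
z, params = posterior([torch.randn(8, 784), torch.randn(8, 300)])
```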

LEARNING FACTORIZED MULTIMODAL REPRESENTATIONS

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

Although multiple modalities provide additional valuable information, learning from multimodal data still poses two key challenges:

(1) models must learn the complex intra-modal and cross-modal interactions needed for prediction

(2) models must be robust to unexpected missing or noisy modalities at test time

The paper proposes to optimise a joint generative-discriminative objective across multimodal data and labels. It introduces a model that factorises representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors.

The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as sentiment prediction.

The modality-specific generative factors are unique to each modality and contain the information needed to generate the data.

Experiments show that the model learns meaningful multimodal representations and achieves competitive performance on six multimodal datasets. By conditioning on the independent factors, the model exhibits flexible generative capabilities and can reconstruct missing modalities without significantly degrading performance.
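
As a rough illustration of this factorisation (a sketch under simplifying assumptions: the layer sizes, mean fusion of the shared factors, and MSE reconstruction loss are all illustrative choices, not the paper's implementation), each modality's encoder below splits its representation into a shared discriminative factor and a private generative factor; the shared factors are fused and classified, while each modality is reconstructed from its own shared + private factors, giving a joint discriminative and generative loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedEncoder(nn.Module):
    def __init__(self, in_dim, shared_dim, private_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.shared = nn.Linear(128, shared_dim)    # multimodal discriminative factor
        self.private = nn.Linear(128, private_dim)  # modality-specific generative factor

    def forward(self, x):
        h = self.backbone(x)
        return self.shared(h), self.private(h)

class FactorizedMultimodalModel(nn.Module):
    def __init__(self, in_dims, shared_dim=32, private_dim=16, n_classes=3):
        super().__init__()
        self.encoders = nn.ModuleList(
            FactorizedEncoder(d, shared_dim, private_dim) for d in in_dims)
        self.decoders = nn.ModuleList(
            nn.Linear(shared_dim + private_dim, d) for d in in_dims)
        self.classifier = nn.Linear(shared_dim, n_classes)

    def forward(self, inputs, labels):
        shared_factors, recon_loss = [], 0.0
        for enc, dec, x in zip(self.encoders, self.decoders, inputs):
            s, p = enc(x)
            shared_factors.append(s)
            # Generative term: reconstruct each modality from its own factors.
            recon_loss = recon_loss + F.mse_loss(dec(torch.cat([s, p], dim=-1)), x)
        # Discriminative term: fuse the shared factors (simple mean) and classify.
        fused = torch.stack(shared_factors).mean(dim=0)
        logits = self.classifier(fused)
        disc_loss = F.cross_entropy(logits, labels)
        return disc_loss + recon_loss, logits
```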

M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

Pipeline overview (figure caption): We use three modalities: speech, text, and facial cues. We first extract features f_s, f_t, f_f from the raw inputs i_s, i_t, i_f (purple box). The feature vectors are then checked for effectiveness using an indicator function I_e (Equation 1 in the paper) (yellow box). These vectors are passed into M3ER's classification and fusion network to obtain an emotion prediction (orange box). At inference time, if we encounter a noisy modality, we regenerate a proxy feature vector (p_s, p_t, or p_f) for that particular modality (blue box).
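
As a hedged illustration of the effectiveness check and proxy regeneration described in the caption above (the threshold and the linear-regression proxy are assumptions for this sketch, not the paper's exact procedure), the snippet below flags a modality as ineffective when the canonical correlation between its features and the remaining modalities' features falls below a threshold, and substitutes a regenerated proxy feature in that case.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

def check_and_regenerate(f_mod, f_rest, threshold=0.2):
    """f_mod: (n, d) features of one modality; f_rest: (n, d') concatenated features of the others."""
    # Correlation of the first pair of canonical variables between the two views.
    u, v = CCA(n_components=1).fit_transform(f_mod, f_rest)
    corr = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
    if corr >= threshold:
        return f_mod, True   # modality judged effective, keep its features
    # Otherwise regenerate a proxy feature vector from the remaining modalities
    # (a simple linear map here, purely for illustration).
    proxy = LinearRegression().fit(f_rest, f_mod).predict(f_rest)
    return proxy, False
```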

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities. M3ER uses a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. By introducing a check step that uses Canonical Correlation Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experiments on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.

The authors present M3ER, a learning-based method for emotion recognition from multiple input modalities. The approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is more robust than other methods to sensor noise in any individual modality. M3ER uses a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasise the more reliable cues and suppress the others on a per-sample basis. By introducing a check step that uses Canonical Correlation Analysis to distinguish effective from ineffective modalities, M3ER is robust to sensor noise; it also generates proxy features in place of the ineffective modalities. Experiments on two benchmark datasets, IEMOCAP and CMU-MOSEI, report mean accuracies of 82.7% and 89.0% respectively, an overall improvement of about 5% over prior work.
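
The multiplicative-fusion idea can be sketched as follows (a hedged PyTorch approximation; the per-modality linear heads, the beta exponent, and the probability averaging at inference are illustrative assumptions rather than the authors' exact formulation): each modality has its own classification head, and a modality's loss term is down-weighted on samples where the other modalities already predict the true class confidently, so less reliable cues are suppressed per sample.

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, feat_dims, n_classes, beta=2.0):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in feat_dims)
        self.beta = beta

    def forward(self, feats, labels):
        # feats: one feature tensor per modality, each of shape (batch, feat_dim)
        probs = [torch.softmax(head(f), dim=-1) for head, f in zip(self.heads, feats)]
        true_probs = [p.gather(1, labels.view(-1, 1)).squeeze(1) for p in probs]
        n_mod, loss = len(probs), 0.0
        for i in range(n_mod):
            # Down-weight modality i on samples where the *other* modalities
            # already assign high probability to the true class.
            others = torch.stack([1.0 - true_probs[j] for j in range(n_mod) if j != i])
            weight = others.prod(dim=0) ** (self.beta / (n_mod - 1))
            loss = loss - (weight * torch.log(true_probs[i].clamp_min(1e-8))).mean()
        fused_probs = torch.stack(probs).mean(dim=0)   # simple averaging at inference
        return loss, fused_probs
```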

No official code resources have been found yet.