LEARNING FACTORIZED MULTIMODAL REPRESENTATIONS

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

Multiple modalities provide additional valuable information, but learning from multimodal data still poses two key challenges:

(1) models must learn the complex intra-modal and cross-modal interactions needed for prediction;

(2) at test time, models must be robust to unexpected missing or noisy modalities.

The paper proposes optimizing a joint generative-discriminative objective over multimodal data and labels. It introduces a model that factorizes representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors.

The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as sentiment prediction.

The modality-specific generative factors are unique to each modality and contain the information required for generating data. A minimal sketch of this factorization is given below.
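The following sketch illustrates the factorization described above in PyTorch. It is an illustrative simplification, not the paper's actual architecture: the class name `FactorizedMultimodalModel`, the use of plain linear layers as encoders/decoders, and all dimension choices are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class FactorizedMultimodalModel(nn.Module):
    """Illustrative sketch (not the paper's exact implementation).
    Each modality x_m is encoded into a modality-specific generative factor
    z_a[m]; all modalities are fused into one shared multimodal discriminative
    factor z_y. Each decoder reconstructs x_m from (z_y, z_a[m]); the
    classifier predicts the label from z_y alone."""

    def __init__(self, input_dims, z_y_dim=32, z_a_dim=32, num_classes=2):
        super().__init__()
        # One encoder per modality -> modality-specific generative factor z_a[m]
        self.modality_encoders = nn.ModuleList(
            [nn.Linear(d, z_a_dim) for d in input_dims])
        # Shared encoder over all modalities -> multimodal discriminative factor z_y
        self.shared_encoder = nn.Linear(sum(input_dims), z_y_dim)
        # One decoder per modality: reconstruct x_m from (z_y, z_a[m])
        self.decoders = nn.ModuleList(
            [nn.Linear(z_y_dim + z_a_dim, d) for d in input_dims])
        # Discriminative head uses only the shared factor
        self.classifier = nn.Linear(z_y_dim, num_classes)

    def forward(self, xs):
        # xs: list of per-modality tensors, each of shape (batch, input_dims[m])
        z_a = [enc(x) for enc, x in zip(self.modality_encoders, xs)]
        z_y = self.shared_encoder(torch.cat(xs, dim=-1))
        recons = [dec(torch.cat([z_y, z], dim=-1))
                  for dec, z in zip(self.decoders, z_a)]
        logits = self.classifier(z_y)
        return logits, recons
```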

Experimental results show that the proposed model learns meaningful multimodal representations and achieves state-of-the-art or competitive performance on six multimodal datasets. By conditioning on the independent factors, the model exhibits flexible generative capabilities and can reconstruct missing modalities without significantly degrading performance.
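Given the sketch above, the joint generative-discriminative objective can be written as a discriminative loss on the label plus reconstruction losses over all modalities. The function name `joint_loss`, the mean-squared-error reconstruction term, and the `recon_weight` trade-off are illustrative assumptions, not values taken from the paper.

```python
import torch.nn.functional as F

def joint_loss(model, xs, labels, recon_weight=1.0):
    """Joint generative-discriminative objective (illustrative):
    cross-entropy on the label prediction from the shared discriminative
    factor, plus a reconstruction term for every modality."""
    logits, recons = model(xs)
    disc_loss = F.cross_entropy(logits, labels)
    gen_loss = sum(F.mse_loss(r, x) for r, x in zip(recons, xs))
    return disc_loss + recon_weight * gen_loss
```

At test time, a missing modality could in principle be reconstructed by decoding from factors inferred from the remaining modalities; the paper describes its own inference procedure for this, which the sketch above does not reproduce.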
