Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image ↔ language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.
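As a concrete illustration of the mixture-of-experts idea, the sketch below shows one way per-modality Gaussian encoders can be combined into an equal-weighted mixture posterior q(z | x_1..x_M) = (1/M) Σ_m q_m(z | x_m). The encoder architecture, latent size, and sampling scheme here are simplified assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.distributions as dist

LATENT_DIM = 20  # assumed latent dimensionality, not the paper's setting

class GaussianEncoder(nn.Module):
    """Maps one modality x_m to the mean and log-variance of q_m(z | x_m)."""
    def __init__(self, input_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def moe_posterior_sample(encoders, inputs):
    """Draw z from the equal-weighted mixture q(z | x_1..x_M) = (1/M) * sum_m q_m(z | x_m).

    Sampling from a mixture: pick one modality uniformly per example,
    then sample from that modality's Gaussian posterior.
    """
    params = [enc(x) for enc, x in zip(encoders, inputs)]
    mus = torch.stack([mu for mu, _ in params])                 # (M, B, D)
    stds = torch.stack([(0.5 * lv).exp() for _, lv in params])  # (M, B, D)
    M, B, _ = mus.shape
    comp = torch.randint(M, (B,))                               # mixture component per example
    idx = torch.arange(B)
    z = dist.Normal(mus[comp, idx], stds[comp, idx]).rsample()  # (B, D)
    return z, params

# Toy usage with two hypothetical modalities (e.g. a 784-d image and a 100-d caption code)
encoders = [GaussianEncoder(784), GaussianEncoder(100)]
inputs = [torch.randn(8, 784), torch.randn(8, 100)]
z, params = moe_posterior_sample(encoders, inputs)
```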

The motivation for learning generative models that span multiple data modalities is to learn more useful, generalisable representations of what the modalities share. This work characterises the successful learning of such models in terms of the following four criteria:

(1) implicit latent decomposition into shared and private subspaces

(2) coherent joint generation over all modalities

(3) coherent cross-generation across individual modalities

(4) improved learning of individual modalities through multi-modal integration

We propose a mixture-of-experts multi-modal variational autoencoder (MMVAE) to learn generative models over different sets of modalities, including an image-language dataset.
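Criterion (3) above, cross-generation, can be pictured as encoding one modality and decoding the resulting latent with the other modality's decoder. The miniature example below is only illustrative: the encoder/decoder shapes and the use of the posterior mean in place of a sample are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

LATENT_DIM = 20  # assumed latent size

# Placeholder image encoder (posterior mean only, for brevity) and text decoder
enc_image = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
dec_text = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, 100))

x_image = torch.randn(8, 784)   # a batch of flattened images
z = enc_image(x_image)          # mean of q_image(z | x_image), used deterministically here
text_logits = dec_text(z)       # parameters of p_text(x_text | z): image-to-language cross-generation
```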
