The split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches that simultaneously learn representations of different visual concepts or properties. Dependencies or interactions between these representations, however, are typically defined by dense and local operations, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. We associate each branch with a visual concept and derive a compact concept state by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. We describe our proposed model in terms of split-transform-attend-interact-modulate-merge stages, which are implemented in a highly modularized architecture. Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
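The split-transform-merge strategy has been widely used in the design of convolutional neural networks for visual understanding. It uses a set of explicitly defined, sparsely connected branches to simultaneously learn representations of different visual concepts and properties. Dependencies and interactions among these representations are defined by dense, local operations, with no adaptiveness or high-level reasoning. This paper combines the strategy with Visual Concept Reasoning Networks (VCRNet) so that it can reason over high-level visual concepts. Each branch is associated with one visual concept, and an attention module selects a few local descriptors to produce a compact concept state. These concept states are updated through graph-based interaction and then adaptively modulate the local descriptors. The proposed model consists of split-transform-attend-interact-modulate-merge stages, implemented as a highly modularized architecture. Experiments show that the model performs well across multiple visual understanding tasks.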
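To make the split-transform-attend-interact-modulate-merge stages concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Every function name, the scalar `scales` standing in for each branch's convolutional transform, the dot-product attention with fixed `queries`, the fixed `adjacency` matrix, and the sigmoid gating are assumptions chosen for illustration; the abstract does not specify any of these details. Local descriptors are represented as a list of per-position channel vectors, and the channel count is assumed to be divisible by the number of branches.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split(descriptors, num_branches):
    """Split each descriptor's channels into equal groups, one group per branch."""
    width = len(descriptors[0]) // num_branches
    return [[d[k * width:(k + 1) * width] for d in descriptors]
            for k in range(num_branches)]


def transform(branch, scale):
    """Placeholder for a branch's learned transform (assumption: a scalar scale)."""
    return [[scale * x for x in d] for d in branch]


def attend(branch, query):
    """Pool a branch into one concept state via softmax attention over positions."""
    scores = [sum(q * x for q, x in zip(query, d)) for d in branch]
    weights = softmax(scores)
    dim = len(branch[0])
    return [sum(w * d[c] for w, d in zip(weights, branch)) for c in range(dim)]


def interact(states, adjacency):
    """Residual graph update: each state adds an edge-weighted sum of all states."""
    n, dim = len(states), len(states[0])
    return [[states[k][c] + sum(adjacency[k][j] * states[j][c] for j in range(n))
             for c in range(dim)]
            for k in range(n)]


def modulate(branch, state):
    """Scale each descriptor channel-wise by a sigmoid gate of the concept state."""
    gates = [sigmoid(s) for s in state]
    return [[g * x for g, x in zip(gates, d)] for d in branch]


def merge(branches):
    """Concatenate the branch channels back into full descriptors."""
    return [sum((b[i] for b in branches), []) for i in range(len(branches[0]))]


def vcr_block(descriptors, queries, scales, adjacency):
    """Run all six stages on a list of local descriptors."""
    branches = split(descriptors, len(queries))
    branches = [transform(b, s) for b, s in zip(branches, scales)]
    states = [attend(b, q) for b, q in zip(branches, queries)]
    states = interact(states, adjacency)
    branches = [modulate(b, s) for b, s in zip(branches, states)]
    return merge(branches)


# Toy input: 3 spatial positions, 4 channels, 2 branches (concepts).
descriptors = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -1.0, 2.0], [2.0, 1.0, 0.0, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical attention queries
scales = [1.0, 1.0]                   # hypothetical transform parameters
adjacency = [[0.0, 0.5], [0.5, 0.0]]  # hypothetical concept graph
out = vcr_block(descriptors, queries, scales, adjacency)
```

In this sketch the only quantities that would be learned are the per-branch queries, the transform parameters, and the adjacency matrix, which is in the spirit of the small parameter overhead the abstract reports, although the numbers here are not comparable to the paper's.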