This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as an extreme or intensive form of attention whose receptive field spans the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. To mitigate the computational cost of full receptive field attention, we leverage efficient self-attention models such as kernel-based attention (Choromanski et al.), low-rank attention (Wang et al.), and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), machine translation, Long Range Arena (LRA), and image recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representations in Vision Transformers leads to significant improvements on image recognition tasks in both few-shot learning and fine-tuning setups.
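The core idea can be illustrated with a minimal NumPy sketch: hidden states from all layers are stacked into one long sequence, and a meta-learner with efficient (here, Linformer-style low-rank) attention attends over that sequence, giving each token a receptive field covering the full width and depth of the network. This is a simplified illustration under assumed shapes, not the authors' implementation; the function names (`omninet_block`, `lowrank_attention`) and the random projection matrix are illustrative stand-ins for learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lowrank_attention(x, k, rng):
    # Low-rank (Linformer-style) attention: compress the length-n
    # key/value sequence down to k before computing attention,
    # reducing cost from O(n^2) to O(n * k).
    n, d = x.shape
    E = rng.standard_normal((k, n)) / np.sqrt(n)  # projection (learned in practice)
    Q = x                # queries keep full length n
    K = E @ x            # keys compressed to length k
    V = E @ x            # values compressed to length k
    scores = softmax(Q @ K.T / np.sqrt(d))        # (n, k) attention weights
    return scores @ V    # (n, d)

def omninet_block(hidden_states, k=8, rng=None):
    # hidden_states: list of per-layer activations, each of shape (N, d).
    # Stacking them gives an omnidirectional receptive field: every token
    # can attend to every token at every layer.
    rng = rng or np.random.default_rng(0)
    stacked = np.concatenate(hidden_states, axis=0)  # (L*N, d)
    pooled = lowrank_attention(stacked, k, rng)      # (L*N, d)
    N = hidden_states[0].shape[0]
    # Return the omnidirectional representations for the top-layer tokens.
    return pooled[-N:]
```

In the paper the compression is performed by a learned efficient attention model (kernel-based, low-rank, or Big Bird); the fixed random projection above merely stands in for that meta-learner to keep the sketch self-contained.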