Tag Archives: Attention Mechanism

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Recently, channel attention mechanisms have demonstrated great potential for improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules to achieve better performance, which inevitably increases model complexity. To overcome this trade-off between performance and complexity, this paper proposes an Efficient Channel Attention (ECA) module, which involves only a handful of parameters while bringing a clear performance gain. By dissecting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select the kernel size of the 1D convolution, determining the coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective; e.g., against a ResNet50 backbone, our module costs 80 parameters vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in Top-1 accuracy. We extensively evaluate the ECA module on image classification, object detection, and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show our module is more efficient while performing favorably against its counterparts.


Recently, channel attention mechanisms have been shown to greatly improve the performance of deep convolutional neural networks. However, to obtain better performance, existing methods tend to adopt ever more complex attention modules, which in turn increases model complexity. To resolve this trade-off between performance and computational complexity, this paper proposes the Efficient Channel Attention (ECA) module, which achieves a clear performance gain with very few parameters. By dissecting the channel attention module of SENet, we empirically conclude that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while greatly reducing model complexity. We therefore propose a local cross-channel interaction strategy, implemented via 1D convolution, that requires no dimensionality reduction. In addition, we propose a method for adaptively selecting the 1D convolution kernel size, which determines the extent of interaction between channels. The ECA module brings a clear improvement over a plain ResNet50 backbone and can be applied to tasks such as image classification, object detection, and instance segmentation with ResNet and MobileNetV2 backbones.
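The core of ECA is easy to sketch: global average pooling produces a per-channel descriptor, a 1D convolution with an adaptively chosen odd kernel size k mixes each channel with its k−1 neighbors, and a sigmoid yields the attention weights. The kernel-size formula k = |(log2 C + b) / γ|_odd with γ=2, b=1 follows the paper; the function names and the edge-padding choice below are illustrative, not taken from the released code.

```python
import numpy as np

def adaptive_kernel_size(channels, gamma=2, b=1):
    # k = |(log2(C) + b) / gamma|, rounded up to the nearest odd number
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def eca_attention(x, weights):
    # x: feature map (C, H, W); weights: 1D conv kernel of odd length k
    c = x.shape[0]
    y = x.mean(axis=(1, 2))                # global average pooling -> (C,)
    k = len(weights)
    pad = k // 2
    y_pad = np.pad(y, pad, mode='edge')    # local cross-channel neighborhood
    scores = np.array([np.dot(y_pad[i:i + k], weights) for i in range(c)])
    attn = 1.0 / (1.0 + np.exp(-scores))   # sigmoid gate in (0, 1)
    return x * attn[:, None, None]         # rescale channels
```

For C = 512 channels this gives k = 5, matching the setting reported in the paper; note there is no per-channel fully connected bottleneck at all, which is where the parameter savings come from.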

FcaNet: Frequency Channel Attention Networks


Attention mechanism, especially channel attention, has gained great success in the computer vision field. Many works focus on how to design efficient channel attention mechanisms while ignoring a fundamental problem, i.e., using global average pooling (GAP) as the unquestionable pre-processing method. In this work, we start from a different view and rethink channel attention using frequency analysis. Based on the frequency analysis, we mathematically prove that the conventional GAP is a special case of the feature decomposition in the frequency domain. With the proof, we naturally generalize the pre-processing of channel attention mechanism in the frequency domain and propose FcaNet with novel multi-spectral channel attention. The proposed method is simple but effective. We can change only one line of code in the calculation to implement our method within existing channel attention methods. Moreover, the proposed method achieves state-of-the-art results compared with other channel attention methods on image classification, object detection, and instance segmentation tasks. Our method could improve by 1.8% in terms of Top-1 accuracy on ImageNet compared with the baseline SENet-50, with the same number of parameters and the same computational cost. Our code and models will be made publicly available.

Attention mechanisms, especially channel attention, have achieved great success in computer vision. Many works focus on designing efficient channel attention mechanisms while ignoring a fundamental issue: global average pooling (GAP) is taken for granted as the pre-processing step. In this work, we rethink channel attention from a different perspective, using frequency analysis. Based on this analysis, we mathematically prove that conventional GAP is a special case of feature decomposition in the frequency domain. With this proof, we naturally generalize the pre-processing of the channel attention mechanism to the frequency domain and propose FcaNet, a model with novel multi-spectral channel attention. The proposed method is simple but effective: it can be implemented within existing channel attention methods by changing only one line of code. Moreover, it achieves state-of-the-art results on image classification, object detection, and instance segmentation. Compared with the SENet-50 baseline, it improves Top-1 accuracy on ImageNet by 1.8% with the same number of parameters and the same computational cost. Our code and models will be made publicly available.
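The "one line of code" generalization follows from the proof: GAP is the lowest-frequency (u = v = 0) component of the 2D DCT, so the pre-processing can be replaced by a weighted sum against other DCT basis functions, with different frequency components assigned to different channel groups. A NumPy sketch under that reading (the grouping scheme and the 1/(HW) normalization here are illustrative, not the paper's exact implementation):

```python
import numpy as np

def dct_basis(h, w, u, v):
    # 2D DCT-II basis function B_{u,v} sampled on an h x w grid
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    return (np.cos(np.pi * u * (i + 0.5) / h) *
            np.cos(np.pi * v * (j + 0.5) / w))

def multi_spectral_pool(x, freqs):
    # x: (C, H, W); freqs: list of (u, v) pairs, one per channel group
    # (C is assumed divisible by len(freqs)).
    # GAP is the special case freqs = [(0, 0)]: B_{0,0} is constant 1,
    # so the weighted sum reduces to the plain spatial mean.
    c, h, w = x.shape
    group = c // len(freqs)
    out = np.empty(c)
    for g, (u, v) in enumerate(freqs):
        basis = dct_basis(h, w, u, v)
        sl = slice(g * group, (g + 1) * group)
        out[sl] = (x[sl] * basis).sum(axis=(1, 2)) / (h * w)
    return out
```

In an SE-style block, this function would replace the GAP call; everything downstream (the excitation MLP and sigmoid) is unchanged, which is why the switch costs one line.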


In this paper, a hierarchical attention network is proposed to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may contribute differently to speaker identity, the hierarchical structure aims to learn speaker-related information both locally and globally. In the proposed approach, a frame-level encoder and attention are applied to segments of an input utterance to generate individual segment vectors. Then, segment-level attention is applied to the segment vectors to construct an utterance representation. To evaluate the effectiveness of the proposed approach, the NIST SRE2008 Part1 data is used for training, and two datasets, Switchboard Cellular (Part1) and CallHome American English Speech, are used to evaluate the quality of the extracted utterance embeddings on speaker identification and verification tasks. In comparison with two baselines, X-vectors and X-vectors+Attention, the results show that H-vectors achieve significantly better performance. Furthermore, the learned utterance-level embeddings are more discriminative than those of both baselines when mapped into a 2D space using t-SNE.
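The two-level pooling described above can be sketched as follows, assuming fixed-length segments and simple dot-product attention scoring. The real H-vector model uses learned encoders and attention layers; the helper names and the scoring-vector parameterization here are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

def attentive_pool(frames, w):
    # frames: (T, D); w: (D,) scoring vector -> attention-weighted mean
    scores = softmax(frames @ w)
    return scores @ frames     # (D,)

def h_vector(utterance, seg_len, w_frame, w_seg):
    # Split the utterance (T, D) into fixed-length segments, pool frames
    # within each segment (frame-level attention), then pool the segment
    # vectors (segment-level attention) into one utterance embedding.
    t = (len(utterance) // seg_len) * seg_len
    segs = utterance[:t].reshape(-1, seg_len, utterance.shape[1])
    seg_vecs = np.stack([attentive_pool(s, w_frame) for s in segs])
    return attentive_pool(seg_vecs, w_seg)
```

The hierarchy is what distinguishes H-vectors from a single attention layer over all frames (the X-vectors+Attention baseline): locally informative frames are weighted within their segment before segments compete globally.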




See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR2019)


We introduce a novel network, called the CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of the inherent correlation among video frames and incorporate a global co-attention mechanism to further improve on state-of-the-art deep-learning-based solutions, which primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments.

The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space.

We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better.

We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments on three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin.




  • idea:
    • The authors propose a co-attention mechanism that improves UVOS accuracy by taking a global view of an entire video sequence (it does indeed lead many current models on the DAVIS leaderboard). Earlier methods find the object to segment via saliency detection, or via optical flow computed between a limited number of frames; COSNet instead considers the whole video sequence when deciding which object should be segmented. At test time, COSNet aggregates the information from all preceding frames to infer which object in the current frame is both salient and frequently reappearing. The co-attention module mines rich contextual information across video frames, and on top of it the authors build COSNet (Co-attention Siamese Network) to model UVOS from a global perspective; what this global perspective means is explained in the method section.
  • contribution:
    • COSNet is trained on pairs of arbitrary frames from the same video, which greatly increases the amount of training data: frames need not be fed in temporal order, and can be shuffled and combined at random.
    • It explicitly models frame-to-frame relations without relying on optical flow.
    • It is a unified, end-to-end trainable, and efficient network.
  • unsupervised:
    • In UVOS, "unsupervised" means that no foreground object is given and the network must decide on its own which object is the foreground, not (in the conventional sense) that labels are excluded from training.
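The co-attention computation at the heart of COSNet can be sketched as a learned affinity between the flattened features of two frames, followed by a softmax and an appended co-attention response, as the abstract describes. A NumPy sketch of the vanilla variant (the paper's symmetric and channel-wise variants are omitted; names are illustrative):

```python
import numpy as np

def co_attention(fa, fb, w):
    # fa, fb: (D, N) flattened feature maps of two frames; w: (D, D) learned
    # weight. Affinity S = fa^T W fb; a row-wise softmax lets each position
    # in frame a attend over all positions in frame b.
    s = fa.T @ w @ fb                              # (Na, Nb) affinity matrix
    attn = np.exp(s - s.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)        # rows sum to 1
    za = attn @ fb.T                               # (Na, D) co-attention response
    return np.concatenate([fa.T, za], axis=1)      # append response to features
```

Because the affinity is computed between *any* two frames, the same module works both for shuffled training pairs and for attending to multiple reference frames at test time.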

Attention Is All You Need


The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
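The mechanism the Transformer is built from is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, where the √d_k scaling keeps the dot products from pushing the softmax into regions of tiny gradient. A minimal single-head NumPy sketch (masking and the multi-head projections are omitted):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (Tq, d_k), k: (Tk, d_k), v: (Tk, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (Tq, Tk) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                              # (Tq, d_v)
```

Each output row is a convex combination of the value rows, so attention is a content-based, fully parallelizable replacement for the step-by-step state updates of recurrent models.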



Visual Concept Reasoning Networks


A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to simultaneously learn representations with different visual concepts or properties. Dependencies or interactions between these representations are typically defined by dense and local operations, however, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. We associate each branch with a visual concept and derive a compact concept state by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. We describe our proposed model by split-transform-attend-interact-modulate-merge stages, which are implemented by opting for a highly modularized architecture. Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves the performance by increasing the number of parameters by less than 1%.


The split-transform-merge strategy has been widely used in the design of convolutional neural networks for visual recognition. It approximates sparsely connected networks by explicitly defining multiple branches that simultaneously learn representations of different visual concepts or properties. However, the dependencies and interactions between these representations are typically defined by dense, local operations, without any adaptiveness or high-level reasoning. This work exploits that strategy and combines it with Visual Concept Reasoning Networks (VCRNet) so that it can reason over high-level visual concepts. Each branch is associated with a visual concept, and a compact concept state is derived by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. The proposed model consists of split-transform-attend-interact-modulate-merge stages, implemented as a highly modularized architecture. Experiments show that the model achieves strong results across multiple visual recognition tasks.
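The two attention-related stages, attend and modulate, can be sketched as follows: attention soft-selects a branch's local descriptors into a compact concept state, and the (possibly graph-updated) state then gates the descriptors channel-wise. This is an illustrative reading of the abstract rather than the released implementation; in particular, the dot-product scoring and sigmoid gating choices are assumptions.

```python
import numpy as np

def attend(descriptors, query):
    # "Attend": soft-select local descriptors (N, D) into a compact
    # concept state (D,) using a scoring query vector.
    scores = descriptors @ query
    w = np.exp(scores - scores.max())   # stable softmax weights
    w /= w.sum()
    return w @ descriptors

def modulate(descriptors, state):
    # "Modulate": gate each descriptor channel by the updated concept
    # state; sigmoid keeps the per-channel scale in (0, 1).
    gate = 1 / (1 + np.exp(-state))
    return descriptors * gate
```

Between these two calls, VCRNet's interact stage would exchange information among the concept states of all branches via a graph; only a handful of (D,)-sized states participate, which is why the reasoning adds under 1% extra parameters.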