Author archive: Jessica Chan

Content-Aware Unsupervised Deep Homography Estimation

Paper: https://arxiv.org/pdf/1909.05983.pdf

Project: https://github.com/JirongZhang/DeepHomography

Homography estimation is a basic image alignment method in many applications. It is usually conducted by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real world applications. To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to only select reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as did previously. To achieve the unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulties for the task. Experimental results reveal that our method outperforms the state-of-the-art including deep solutions and feature-based solutions.


Introduction

A homography can align images taken from different viewpoints, provided that they roughly undergo a rotational motion or the scene is approximately planar [13]. For scenes that satisfy these constraints, a homography can align them directly. For scenes that violate them, e.g., scenes containing multiple planes or moving objects, the homography often serves as the initial alignment model before more advanced models such as mesh flow [20] and optical flow [16]. In most cases, this pre-alignment is critical to the final quality. Homography is therefore widely applied in multi-frame HDR imaging [10], multi-frame image super-resolution [34], burst image denoising [22], video stabilization [21], image/video stitching [36, 12], SLAM [26, 42], augmented reality [30], and camera calibration [40].

In recent years, with the development of deep neural networks (DNNs), DNN-based homography estimation methods have been proposed, both supervised [7] and unsupervised [27]. The former requires ground-truth (GT) homographies to supervise training, so it can only use synthetic target images warped by GT homographies. Although synthetic image pairs can be generated at arbitrary scale, they are far from realistic because the training data contain no real depth disparities, and the method therefore generalizes poorly to real images. To address this, Nguyen et al. proposed the latter, unsupervised solution [27], which minimizes a photometric loss on real image pairs. However, this approach has two main problems. First, a loss computed on image intensities is less effective than one computed in a feature space; second, the loss is computed uniformly over the whole image, ignoring a RANSAC-like procedure. As a result, the method cannot exclude the contribution of moving or non-planar objects from the final loss, which can degrade estimation accuracy. To mitigate these effects, Nguyen et al. [27] had to work on aerial images captured far from the camera, so as to minimize the influence of depth-induced parallax.

To address the above issues, we propose an unsupervised, content-aware solution for homography estimation with a new architecture design. It is designed specifically for image pairs with small baselines, since this setting commonly applies to consecutive video frames, burst captures, or photos taken by dual-camera phones. In particular, to optimize the homography robustly, our network implicitly learns a deep alignment feature together with a content-aware mask that rejects outlier regions. The learned features are used for the loss computation instead of a photometric loss as in [7], and the learned content-aware mask lets the network focus on important, registrable regions. We further propose a novel triplet loss to optimize the network, enabling unsupervised learning. Experimental results demonstrate the effectiveness of all newly introduced components, and qualitative and quantitative evaluations show that our network outperforms the state of the art, as illustrated in Figs. 1, 6, and 7. We also introduce a comprehensive image-pair dataset that contains 5 categories of scenes, with human-labeled GT point correspondences for quantitative evaluation on its validation set (Fig. 5). In summary, our main contributions are:

– A novel network architecture that enables content-aware, robust homography estimation from two images with a small baseline.
– A triplet loss designed for unsupervised training, so that an optimal homography matrix is produced as output, while a deep feature map for alignment and a mask highlighting the alignment inliers are implicitly learned as intermediate results.
– A comprehensive dataset covering a wide range of scenes for unsupervised training of image alignment models, including but not limited to homography, mesh warping, and optical flow.

Related works

Traditional homography. A homography is a 3×3 matrix that compensates for the plane-induced motion between two images. It has 8 degrees of freedom (DOF), 2 each for scale, translation, rotation, and perspective [13]. To solve for a homography, traditional methods usually detect and match image features such as SIFT [23], SURF [4], ORB [29], LPM [25], GMS [5], SOSNet [32], LIFT [35], and OAN [38]. Given two sets of correspondences between the images, robust estimation methods such as the classic RANSAC [9], IRLS [15], and MAGSAC [3] are then applied to reject outliers during model estimation. A homography can also be solved directly, without image features. Direct methods, such as the seminal Lucas-Kanade algorithm [24], compute the sum of squared differences (SSD) between two images; the differences drive the image shift, yielding homography updates, so a randomly initialized homography can be optimized iteratively [2]. Moreover, the SSD can be replaced by the enhanced correlation coefficient (ECC) for better robustness [8].

Deep homography. Following the success of various deep image alignment methods such as optical flow [33, 16], dense matching [28], learned descriptors [32], and deep features [1], the first deep homography solution was proposed in 2016 [7]. The network takes the source and target images as input and produces 4 corner displacement vectors of the source image, from which the homography is obtained. Training is supervised by GT homographies. However, training images generated with GT homographies contain no depth disparity. To overcome this, Nguyen et al. [27] proposed an unsupervised approach that computes a photometric loss between the two images and uses a spatial transformer network (STN) [17] for image warping. However, their loss is computed directly on intensities and uniformly over the image plane. In contrast, we learn a content-aware mask. Notably, predicting masks for robust estimation has also been explored in other tasks, such as monocular depth estimation [41, 11]; this paper introduces it to unsupervised homography learning.

Image stitching. Traditional panoramic image stitching methods [36, 37] are designed for stitching images with large view differences [6]; the images to be stitched are often captured from viewpoints that differ substantially. In this work, we focus on multi-frame images with small baselines.

Algorithm

3.1 Network Structure

Our method is built on convolutional neural networks. It takes two grayscale image patches Ia and Ib as input and produces a homography matrix Hab from Ia to Ib as output. The whole architecture consists of three modules: a feature extractor f(·), a mask predictor m(·), and a homography estimator h(·). f(·) and m(·) are fully convolutional networks that accept inputs of arbitrary size, while h(·) adopts a ResNet-34 backbone [14] and produces 8 values. Fig. 2(a) shows the network structure.

Feature extractor. Unlike previous DNN-based methods that directly use pixel intensity values as the feature, our network automatically learns a deep feature from the input for robust feature alignment. To this end, we build a fully convolutional network (FCN) that takes an H×W×1 input and produces an H×W×C feature map. For the inputs Ia and Ib, the feature extractor shares weights and produces the feature maps Fa = f(Ia) and Fb = f(Ib).

When used for the loss computation, the learned features are more robust than raw pixel intensities, especially for images with illumination changes.
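
To make the module concrete, here is a minimal PyTorch sketch of such a shared-weight FCN feature extractor. The layer widths, depth, and patch size are placeholders chosen for illustration, not the configuration used in the paper:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Toy fully convolutional feature extractor f(.): maps an H x W x 1
    grayscale patch to an H x W x C feature map. Layer widths and depth are
    illustrative placeholders, not the paper's configuration."""
    def __init__(self, c=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):          # x: (B, 1, H, W)
        return self.net(x)         # (B, C, H, W)

# the same extractor (shared weights) is applied to both patches
f = FeatureExtractor()
Ia, Ib = torch.rand(1, 1, 128, 128), torch.rand(1, 1, 128, 128)
Fa, Fb = f(Ia), f(Ib)
```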

Mask predictor. In non-planar scenes, especially those containing moving objects, no single homography can relate the two views. In traditional algorithms, RANSAC is widely used to search for the inliers of the homography, so as to estimate the matrix that best aligns the scene. Following a similar idea, we build a sub-network to automatically learn the positions of inliers. Specifically, a sub-network m(·) learns an inlier probability map, or mask, that highlights the content in the feature map contributing most to the homography estimation. The mask has the same size as the feature map. We use the obtained masks to weight the extracted feature maps before they are fed into the homography estimator, obtaining two weighted feature maps Ga and Gb.

The learned mask thus plays two roles: it serves as an attention map, and it acts as an outlier rejecter.
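
Continuing the sketch above (reusing Ia, Ib, Fa, Fb from it), the mask predictor and the feature weighting could be illustrated as follows; the architecture and the assumption that the weighting is a simple element-wise product are illustrative only:

```python
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Toy mask predictor m(.): maps an H x W x 1 patch to an H x W x 1
    inlier-probability map in [0, 1]. Placeholder architecture."""
    def __init__(self, c=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):          # x: (B, 1, H, W)
        return self.net(x)         # (B, 1, H, W)

m = MaskPredictor()
Ma, Mb = m(Ia), m(Ib)          # Ia, Ib from the previous sketch
Ga, Gb = Fa * Ma, Fb * Mb      # masks weight the feature maps element-wise
```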

Homography estimator. Given the weighted feature maps Ga and Gb, we concatenate them to build a single feature map, which is then fed into the homography estimation network to produce four 2D offset vectors (8 values). With the four offset vectors, the homography matrix with 8 degrees of freedom can be obtained directly by solving a linear system. We use h(·) to denote the whole process.

The backbone of h(·) follows the ResNet-34 architecture.
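
The step from the four predicted corner offsets to the 3×3 homography can be illustrated with a generic 4-point direct linear transform (DLT) solve. The sketch below assumes the 8 network outputs are the 2D offsets of the four patch corners; it is a textbook DLT, not the authors' implementation:

```python
import torch

def homography_from_corner_offsets(corners, offsets):
    """corners: (4, 2) source corner coordinates; offsets: (4, 2) predicted
    2D displacements. Solves the standard 4-point DLT linear system for the
    8 unknowns of H (with H[2, 2] fixed to 1)."""
    dst = corners + offsets
    A, b = [], []
    for (x, y), (u, v) in zip(corners.tolist(), dst.tolist()):
        A.append([x, y, 1, 0, 0, 0, -x * u, -y * u]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -x * v, -y * v]); b.append(v)
    A = torch.tensor(A, dtype=torch.float64)
    b = torch.tensor(b, dtype=torch.float64)
    h = torch.linalg.solve(A, b)                          # 8 unknowns of H
    return torch.cat([h, torch.ones(1, dtype=torch.float64)]).reshape(3, 3)

# example: corners of a 128 x 128 patch and some small predicted offsets
corners = torch.tensor([[0., 0.], [127., 0.], [127., 127.], [0., 127.]])
offsets = torch.tensor([[1.5, -2.0], [0.5, 1.0], [-1.0, 0.5], [2.0, 1.5]])
H_ab = homography_from_corner_offsets(corners, offsets)
```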

3.2 Triplet Loss for Robust Homography Estimation

With the estimated homography Hab, we warp Ia to I'a and then extract its feature map F'a. If the homography is accurate enough, F'a should align well with Fb, yielding a small l1 distance between them. Considering that in real scenes a single homography usually cannot fully relate the two views, we also take the masks M'a and Mb into account when computing this loss, so that they act as a weighting term. The resulting loss between the warped Ia and Ib is the paper's Eq. 4.

Directly minimizing Eq. 4 easily leads to a trivial solution in which the feature extractor outputs all-zero maps, i.e., F'a = Fb = 0. In that case the learned features do indicate that I'a and Ib are "well aligned", but they cannot reflect the fact that the original Ia and Ib are misaligned. To address this, we introduce another loss, computed between Fa and Fb.

We maximize this loss while minimizing the previous one. This strategy avoids the trivial all-zero solution and forces the network to learn a discriminative feature map.

Similarly, we also compute Ln(I'b, Ia). In addition, we add a constraint that forces Hab and Hba to be inverse to each other. The optimization objective of the network therefore combines these terms: the two warped-feature losses and the invertibility constraint are minimized while the Fa-Fb loss is maximized.
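
Putting the pieces together, a hedged sketch of what such an unsupervised objective could look like is given below. The exact normalization of the mask-weighted loss, the weights lam and mu, and the form of the invertibility penalty are assumptions for illustration; the paper and the official repository define the actual formulation:

```python
import torch

def masked_feature_loss(F_warp, F_ref, M_warp, M_ref, eps=1e-6):
    """L1 distance between feature maps, weighted by the product of the two
    masks and normalized by the total mask weight (assumed form of Eq. 4)."""
    w = M_warp * M_ref                                    # (B, 1, H, W)
    diff = (F_warp - F_ref).abs().mean(dim=1, keepdim=True)
    return (w * diff).sum() / (w.sum() + eps)

def triplet_style_objective(Fa_warp, Fb, Ma_warp, Mb,
                            Fb_warp, Fa, Mb_warp, Ma,
                            H_ab, H_ba, lam=1.0, mu=0.01):
    """Sketch of the overall objective: minimize the two warped losses and an
    invertibility penalty, and subtract (i.e. maximize) the distance between
    the unwarped features. lam/mu values are placeholders."""
    l_ab = masked_feature_loss(Fa_warp, Fb, Ma_warp, Mb)
    l_ba = masked_feature_loss(Fb_warp, Fa, Mb_warp, Ma)
    l_feat = (Fa - Fb).abs().mean()                       # to be maximized
    eye = torch.eye(3, device=H_ab.device).expand_as(H_ab)
    l_inv = ((H_ab @ H_ba - eye) ** 2).mean()             # Hab, Hba should be inverses
    return l_ab + l_ba - lam * l_feat + mu * l_inv
```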

3.3 Unsupervised Content-Awareness Learning

As mentioned above, our network contains a sub-network m(·) that predicts an inlier probability mask. Its design lets the network achieve content awareness through two roles. First, we explicitly weight the features Fa, Fb with the masks Ma, Mb, so that only the highlighted features are fully fed into the homography estimator h(·); the masks effectively serve as attention maps for the feature maps. Second, they are also implicitly involved in the normalized loss of Eq. 4 as a weighting term. In this way, only regions that are truly suitable for alignment are taken into account; regions containing low texture or moving foreground, being indistinguishable or misleading for alignment, are naturally excluded from the homography estimation when the triplet loss is optimized. This content awareness is achieved entirely through the unsupervised learning scheme, without any GT mask data as supervision. To demonstrate the effectiveness of the mask in both roles, we conduct an ablation study that disables the mask either as an attention map or as a loss weighting term. As shown in Table 2(c), accuracy drops noticeably in either case when the mask is removed.

We also illustrate the effect of the masks in Fig. 4. For example, in Fig. 4(a)(b) the scenes contain large dynamic foregrounds, and our network successfully rejects the moving objects, even when the motion is subtle, like the fountain in (b), or when the object occupies a large portion of the frame, as in (a). In such cases it is difficult for RANSAC to find robust inliers. Fig. 4(c) is a low-texture example in which sky and snow cover almost the entire image; traditional methods struggle here because not enough feature matches can be provided, whereas our predicted mask concentrates on the horizon for alignment. Finally, Fig. 4(d) is a low-light example in which only the visible regions receive weight, as shown. We further illustrate the two roles of the mask separately in the bottom two rows of Fig. 4. Details of this ablation study are given later in Sec. 4.3.

Adversarial Examples Are Not Bugs, They Are Features (MIT 2019)


Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.


A good analysis of this paper: https://zhuanlan.zhihu.com/p/129063563

Semantic Segmentation of Pathological Lung Tissue With Dilated Fully Convolutional Networks


Early and accurate diagnosis of interstitial lung diseases (ILDs) is crucial for making treatment decisions, but can be challenging even for experienced radiologists. The diagnostic procedure is based on the detection and recognition of the different ILD pathologies in thoracic CT scans, yet their manifestation often appears similar. In this study, we propose the use of a deep purely convolutional neural network for the semantic segmentation of ILD patterns, as the basic component of a computer aided diagnosis system for ILDs. The proposed CNN, which consists of convolutional layers with dilated filters, takes as input a lung CT image of arbitrary size and outputs the corresponding label map. We trained and tested the network on a data set of 172 sparsely annotated CT scans, within a cross-validation scheme. The training was performed in an end-to-end and semisupervised fashion, utilizing both labeled and nonlabeled image regions. The experimental results show significant performance improvement with respect to the state of the art.

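As a rough illustration of the kind of architecture described in the abstract, the sketch below stacks 3×3 convolutions with increasing dilation so that a CT slice of arbitrary size is mapped to a per-pixel label map. Channel widths, depth, dilation rates, and the class count are placeholders, not the network used in the paper:

```python
import torch.nn as nn

class ToyDilatedFCN(nn.Module):
    """Minimal sketch of a purely convolutional network with progressively
    dilated 3x3 filters that maps a CT slice of arbitrary size to a per-pixel
    label map. All sizes here are illustrative placeholders."""
    def __init__(self, n_classes=6, ch=32):
        super().__init__()
        layers, in_ch = [], 1
        for d in (1, 2, 4, 8, 16):          # growing receptive field via dilation
            layers += [nn.Conv2d(in_ch, ch, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
            in_ch = ch
        layers += [nn.Conv2d(ch, n_classes, 1)]   # per-pixel class scores
        self.net = nn.Sequential(*layers)

    def forward(self, x):                   # x: (B, 1, H, W) CT slice
        return self.net(x)                  # (B, n_classes, H, W) label map
```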

Project: https://github.com/intact-project/LungNet

Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8325482

Global Self-Attention Networks.

Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexities of self-attention, these works either apply attention only to low-resolution feature maps in later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content and a positional attention layer that attends to pixels based on their spatial locations. The output of this module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Our experimental results show that GSA networks outperform the corresponding convolution-based networks significantly on the CIFAR-100 and ImageNet datasets while using less parameters and computations. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.

Due to the quadratic computation and memory cost of self-attention, most existing computer vision works either apply self-attention only to low-resolution feature maps or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this paper proposes a new global self-attention module, named the GSA module, which is efficient enough to serve as the backbone component of a deep network. The module consists of two parallel layers: a content attention layer that attends to pixels based on their content, and a positional attention layer that attends to pixels based on their spatial locations; the module's output is the sum of the outputs of the two layers. Based on the proposed GSA module, the authors build new global-attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Thanks to the global extent of the GSA module, a GSA network can model long-range pixel interactions throughout the network.
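
To illustrate the idea of two parallel attention layers whose outputs are summed, here is a deliberately simplified PyTorch sketch. It uses plain quadratic dot-product attention and absolute position embeddings, whereas the actual GSA module is more efficient and uses relative row/column positional attention, so treat this only as a conceptual toy:

```python
import torch
import torch.nn as nn

class ToyGlobalSelfAttention(nn.Module):
    """Toy version of a GSA-style block: a content attention layer and a
    positional attention layer run in parallel and their outputs are summed.
    Not the paper's formulation; quadratic cost, absolute positions."""

    def __init__(self, dim, max_hw=64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Conv2d(dim, dim, 1, bias=False)
        self.to_k = nn.Conv2d(dim, dim, 1, bias=False)
        self.to_v = nn.Conv2d(dim, dim, 1, bias=False)
        # learned absolute position embeddings; assumes h * w <= max_hw ** 2
        self.pos = nn.Parameter(torch.randn(max_hw * max_hw, dim) * 0.02)

    def forward(self, x):                                  # x: (B, dim, H, W)
        b, c, h, w = x.shape
        n = h * w
        q = self.to_q(x).flatten(2).transpose(1, 2)        # (B, N, C)
        k = self.to_k(x).flatten(2).transpose(1, 2)
        v = self.to_v(x).flatten(2).transpose(1, 2)

        # content attention: pixels attend to each other based on content only
        content = (q @ k.transpose(1, 2) * self.scale).softmax(-1) @ v

        # positional attention: logits depend on the attended pixel's location
        pos = self.pos[:n]                                 # (N, C)
        positional = (q @ pos.t() * self.scale).softmax(-1) @ v

        out = content + positional                         # sum of the two layers
        return out.transpose(1, 2).reshape(b, c, h, w)
```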

Paper: https://openreview.net/pdf?id=KiFeuZu24k

Code: https://github.com/lucidrains/global-self-attention-network

Stand-Alone Self-Attention in Vision Models


Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner’s toolbox.

The paper proposes a stand-alone self-attention layer and builds fully attentional models with it, verifying that content-based interactions can serve as the primary primitive for feature extraction in vision models. In image classification and object detection experiments, compared with traditional convolutional models, it achieves comparable accuracy while substantially reducing the parameter count and computation, which makes the work a valuable reference.

Convolutional network design is currently the key to improving performance on image tasks, and convolution, thanks to its translation invariance, has become the workhorse of image analysis. Limited by the preset size of its receptive field, however, convolution struggles to capture long-range pixel relationships, a problem that attention has already solved well in sequence models. Attention modules have therefore begun to appear in conventional convolutional networks, for example the channel-based attention of Squeeze-and-Excitation and the spatially-aware attention of Non-local Networks. These works all insert global attention layers as plug-ins into existing convolutional modules; this global form considers every spatial position of the input, and because the network must downsample heavily before it can be applied, the feature enhancement it brings is often limited.
Hence the paper proposes a simple local self-attention layer that uses content-based interactions as the primary feature extraction tool rather than as an enhancement of convolution, and that can handle inputs of any size. It further uses this stand-alone attention layer to build fully attentional vision models, which outperform fully convolutional baselines on image classification and object detection.
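
A simplified sketch of a stand-alone local self-attention layer that could replace a k×k spatial convolution is shown below (single head, no relative position embedding, so it is only an approximation of the layer proposed in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Each pixel attends only to its k x k neighbourhood, so the layer is a
    drop-in替代 for a spatial convolution in terms of input/output shapes.
    Simplified illustration, not the paper's exact layer."""

    def __init__(self, in_ch, out_ch, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        self.to_q = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_k = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_v = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.scale = out_ch ** -0.5

    def forward(self, x):                                  # x: (B, in_ch, H, W)
        b, _, h, w = x.shape
        q = self.to_q(x)                                   # (B, C, H, W)
        k = self.to_k(x)
        v = self.to_v(x)
        c = q.shape[1]
        # gather the k x k neighbourhood of every pixel
        k = F.unfold(k, self.k, padding=self.pad).view(b, c, self.k * self.k, h * w)
        v = F.unfold(v, self.k, padding=self.pad).view(b, c, self.k * self.k, h * w)
        q = q.view(b, c, 1, h * w)
        attn = (q * k).sum(dim=1, keepdim=True) * self.scale   # (B, 1, k*k, H*W)
        attn = attn.softmax(dim=2)                              # softmax over the window
        out = (attn * v).sum(dim=2)                             # (B, C, H*W)
        return out.view(b, c, h, w)
```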

Paper: https://arxiv.org/pdf/1906.05909.pdf

EDVR: Video Restoration with Enhanced Deformable Convolutional Networks


Video restoration tasks, including super-resolution, deblurring, etc, are drawing increasing attention in the computer vision community. A challenging benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark challenges existing methods from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. In this work, we propose a novel Video Restoration framework with Enhanced Deformable convolutions, termed EDVR, to address these challenges. First, to handle large motions, we devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, we propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially, so as to emphasize important features for subsequent restoration. Thanks to these modules, our EDVR wins the champions and outperforms the second place by a large margin in all four tracks in the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring.

  • The REDS benchmark poses serious challenges to existing methods:
    • how to align multiple frames given large motions, and
    • how to effectively fuse different frames with diverse motion and blur. In this work, the authors propose a novel video restoration framework with enhanced deformable convolutions, termed EDVR, to address these challenges.
  • First, to handle large motions, they devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frames are aligned at the feature level using deformable convolutions in a coarse-to-fine manner.
  • Second, they propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially to emphasize the features important for the subsequent restoration; a rough sketch of this fusion idea is given after this list.
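
The sketch below illustrates, under heavy simplification, the fusion idea described in the last bullet: weight each aligned frame by its similarity to the reference frame, fuse, then apply a spatial attention map. The module layout and layer sizes are assumptions; the official EDVR repository contains the real TSA module:

```python
import torch
import torch.nn as nn

class ToyTemporalSpatialFusion(nn.Module):
    """Rough sketch of a TSA-style fusion: temporal attention via similarity
    to the reference frame, 1x1-conv fusion, then a spatial attention mask.
    Illustrative only; see the official EDVR code for the real module."""

    def __init__(self, ch, n_frames):
        super().__init__()
        self.emb_ref = nn.Conv2d(ch, ch, 3, padding=1)
        self.emb_nbr = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(ch * n_frames, ch, 1)
        self.spatial = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, aligned):          # aligned: (B, T, C, H, W), centre frame = reference
        b, t, c, h, w = aligned.shape
        ref = self.emb_ref(aligned[:, t // 2])
        weighted = []
        for i in range(t):
            nbr = self.emb_nbr(aligned[:, i])
            sim = torch.sigmoid((ref * nbr).sum(dim=1, keepdim=True))  # temporal attention map
            weighted.append(aligned[:, i] * sim)
        fused = self.fuse(torch.cat(weighted, dim=1))                  # (B, C, H, W)
        mask = torch.sigmoid(self.spatial(fused))                      # spatial attention map
        return fused * mask
```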

Paper: https://arxiv.org/pdf/1905.02716.pdf

Project: https://github.com/xinntao/EDVR

M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

Figure overview: We use three modalities: speech, text, and facial features. We first extract features to obtain f_s, f_t, f_f from the raw inputs i_s, i_t, and i_f (purple box). The feature vectors are then checked for effectiveness: we use an indicator function I_e (Equation 1) to process the feature vectors (yellow box). These vectors are then passed into the classification and fusion network of M3ER to get a prediction of the emotion (orange box). At inference time, if we encounter a noisy modality, we regenerate a proxy feature vector (p_s, p_t, or p_f) for that particular modality (blue box).

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and also is more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learn to emphasize the more reliable cues and suppress others on a per-sample basis. By introducing a check step which uses Canonical Correlational Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experimentation on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.


No code release found yet.

A Weakly Supervised Consistency-based Learning Method for COVID-19 Segmentation in CT Images

Corona virus Disease 2019 (COVID-19) has spread aggressively across the world causing an existential health crisis. Thus, having a system that automatically detects COVID-19 in tomography (CT) images can assist in quantifying the severity of the illness. Unfortunately, labelling chest CT scans requires significant domain expertise, time, and effort. We address these labelling challenges by only requiring point annotations, a single pixel for each infected region on a CT image. This labeling scheme allows annotators to label a pixel in a likely infected region, only taking 1-3 seconds, as opposed to 10-15 seconds to segment a region. Conventionally, segmentation models train on point-level annotations using the crossentropy loss function on these labels. However, these models often suffer from low precision. Thus, we propose a consistency-based (CB) loss function that encourages the output predictions to be consistent with spatial transformations of the input images. The experiments on 3 open-source COVID-19 datasets show that this loss function yields significant improvement over conventional point level loss functions and almost matches the performance of models trained with full supervision with much less human effort.

Labelling chest CT scans requires significant domain expertise, time, and effort; the authors address this by requiring only point annotations, a single pixel for each infected region in a CT image. This labelling scheme lets annotators mark one pixel in a likely infected region in only 1-3 seconds, versus 10-15 seconds to segment a region. Conventionally, segmentation models are trained on point-level annotations with a cross-entropy loss on those labels, but such models often suffer from low precision. The authors therefore propose a consistency-based (CB) loss that encourages the output predictions to be consistent under spatial transformations of the input images. Experiments on three open-source COVID-19 datasets show that this loss yields significant improvements over conventional point-level losses and almost matches the performance of fully supervised models with far less human effort.

In this work, geometric transformations are used to infer the true labels of the transformed images: for example, the segmentation mask of a flipped image is the flipped version of the original segmentation mask. The transformations used are rotations of 0, 90, 180, and 270 degrees and a horizontal flip. At test time, the trained model can be applied directly to segment infected regions in unseen images without any human input.
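
A minimal sketch of such a transformation-consistency term, assuming a generic segmentation model that returns per-pixel logits, might look like this (the function name and the use of an MSE penalty are illustrative, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def rotation_flip_consistency_loss(model, images):
    """Encourage the prediction for a rotated/flipped image to match the
    rotated/flipped prediction of the original image. `model` is any
    segmentation network mapping (B, C, H, W) images to per-pixel logits."""
    logits = model(images)
    loss = 0.0
    # 90 / 180 / 270 degree rotations
    for k in (1, 2, 3):
        t_logits = model(torch.rot90(images, k, dims=(2, 3)))
        loss = loss + F.mse_loss(t_logits, torch.rot90(logits, k, dims=(2, 3)))
    # horizontal flip
    f_logits = model(torch.flip(images, dims=(3,)))
    loss = loss + F.mse_loss(f_logits, torch.flip(logits, dims=(3,)))
    return loss / 4
```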

COVID TV-UNet: Segmenting COVID-19 Chest CT Images Using Connectivity Imposed U-Net

The novel corona-virus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, leading to a severe impact on the health and life of many people globally. As of mid-July 2020, more than 12 million people were infected, and more than 570,000 deaths were reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming RT-PCR test, to detect COVID-19. In this work we propose a segmentation framework to detect chest regions in CT images, which are infected by COVID-19. We use an architecture similar to the U-Net model, and train it to detect ground glass regions, on pixel level. As the infected regions tend to form a connected component (rather than randomly distributed pixels), we add a suitable regularization term to the loss function, to promote connectivity of the segmentation map for COVID-19 pixels. 2D anisotropic total variation is used for this purpose, and therefore the proposed model is called "TV-UNet". Through experimental results on a relatively large-scale CT segmentation dataset of around 900 images, we show that adding this new regularization term leads to 2% gain on overall segmentation performance compared to the U-Net model. Our experimental analysis, ranging from visual evaluation of the predicted segmentation results to quantitative assessment of segmentation performance (precision, recall, Dice score, and mIoU), demonstrated great ability to identify COVID-19 associated regions of the lungs, achieving a mIoU rate of over 99%, and a Dice score of around 86%.

This paper proposes a segmentation framework for detecting chest regions infected by COVID-19. The authors use a U-Net-like architecture and train it to detect ground-glass regions at the pixel level. Since infected regions tend to form connected components (rather than randomly scattered pixels), a suitable regularization term is added to the loss function to promote connectivity of the COVID-19 segmentation map. Experiments on a relatively large-scale CT segmentation dataset of around 900 images show that adding this regularization term yields a 2% gain over the U-Net model. The experimental analysis, ranging from visual inspection of the predicted segmentations to quantitative evaluation, demonstrates a strong ability to identify COVID-19-related regions.
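
A minimal sketch of a 2D anisotropic total-variation penalty on the predicted probability map is shown below; the exact regularizer and its weighting in TV-UNet may differ:

```python
import torch

def anisotropic_tv(prob):
    """2D anisotropic total-variation penalty on a predicted probability map
    of shape (B, 1, H, W): the mean absolute difference between neighbouring
    pixels along height and width. Encourages piecewise-smooth, connected
    segmentation maps; illustrative form, not necessarily the paper's."""
    dh = (prob[:, :, 1:, :] - prob[:, :, :-1, :]).abs().mean()
    dw = (prob[:, :, :, 1:] - prob[:, :, :, :-1]).abs().mean()
    return dh + dw
```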

Paper: https://arxiv.org/pdf/2007.12303.pdf

MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation

The rapid spread of the new pandemic, coronavirus disease 2019 (COVID-19), has seriously threatened global health. The gold standard for COVID-19 diagnosis is the tried-and-true polymerase chain reaction (PCR), but PCR is a laborious, time-consuming and complicated manual process that is in short supply. Deep learning based computer-aided screening, e.g., infection segmentation, is thus viewed as an alternative due to its great successes in medical imaging. However, the publicly available COVID-19 training data are limited, which would easily cause overfitting of traditional deep learning methods that are usually data-hungry with millions of parameters. On the other hand, fast training/testing and low computational cost are also important for quick deployment and development of computer-aided COVID-19 screening systems, but traditional deep learning methods, especially for image segmentation, are usually computationally intensive. To address the above problems, we propose MiniSeg, a lightweight deep learning model for efficient COVID-19 segmentation. Compared with traditional segmentation methods, MiniSeg has several significant strengths: i) it only has 472K parameters and is thus not easy to overfit; ii) it has high computational efficiency and is thus convenient for practical deployment; iii) it can be fast retrained by other users using their private COVID-19 data for further improving performance. In addition, we build a comprehensive COVID-19 segmentation benchmark for comparing MiniSeg with traditional methods. Code and models will be released to promote the research and practical deployment for computer-aided COVID-19 screening.

Because publicly available COVID-19 datasets are limited, traditional data-hungry deep learning methods can easily overfit. On the other hand, fast training/testing and low computational cost matter for rapid deployment and development, but traditional deep learning methods, especially for image segmentation, are usually computationally intensive.

  • To address these problems, the authors propose MiniSeg, a lightweight deep learning model for efficient COVID-19 segmentation. Compared with traditional segmentation methods, MiniSeg has several notable strengths:
    • i) it has only 472K parameters and is therefore not prone to overfitting;
    • ii) it is computationally efficient and thus convenient for practical deployment;
    • iii) other users can quickly retrain it on their private COVID-19 data to further improve performance. In addition, the authors build a comprehensive COVID-19 segmentation benchmark for comparing MiniSeg with traditional methods.

Paper: https://arxiv.org/pdf/2004.09750.pdf