标签归档:Image segmentation

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. 
We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation.



Convolution-Free Medical Image Segmentation using Transformers

Like other applications in computer vision, medical image segmentation has been most successfully addressed using deep learning models that rely on the convolution operation as their main building block. Convolutions enjoy important properties such as sparse interactions, weight sharing, and translation equivariance. These properties give convolutional neural networks (CNNs) a strong and useful inductive bias for vision tasks. In this work we show that a different method, based entirely on self-attention between neighboring image patches and without any convolution operations, can achieve competitive or better results. Given a 3D image block, our network divides it into n3 3D patches, where n=3 or 5 and computes a 1D embedding for each patch. The network predicts the segmentation map for the center patch of the block based on the self-attention between these patch embeddings. We show that the proposed model can achieve segmentation accuracies that are better than the state of the art CNNs on three datasets. We also propose methods for pre-training this model on large corpora of unlabeled images. Our experiments show that with pre-training the advantage of our proposed network over CNNs can be significant when labeled training data is small.



Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT that typically has low-resolution outputs and high computational and memory cost, PVT can be not only trained on dense partitions of the image to achieve high output resolution, which is important for dense predictions but also using a progressive shrinking pyramid to reduce computations of large feature maps. (2) PVT inherits the advantages from both CNN and Transformer, making it a unified backbone in various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future researches.


虽然以卷积神经网络(CNNs)作为主干的模型在计算机视觉领域获得了巨大的成功,我们这篇文章会提出一个非卷积的简单网络,它将能够运用在许多预测任务上。不像最近提出的Transformer模型(例如ViT)是为了分类任务设计的,我们提出金字塔视觉Transformer (PVT). 我们的模型能够解决Transformer应用在密集预测任务时的种种困难。相比现有模型,PVT拥有以下优点:(1)不像现有ViT模型使用低分辨率输入且要求较大的计算量,PVT不仅仅能够在密集的图像区块上达到高分辨率输出,而且还运用渐进收缩金字塔去降低对于大尺寸特征图的计算量;(2)PVT从CNNs和Transformer那里继承了优点,这使得在许多视觉任务上统一简单将CNN主干进行替换无卷积的主干架构成为可能。(3)我们在例如目标检测、语义和实例分割任务等下游任务上对PVT模型进行了验证,实验结果说明我们的模型是SOTA的。

RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Face parsing aims to predict pixel-wise labels for facial components of a target face in an image. Existing approaches usually crop the target face from the input image with respect to a bounding box calculated during pre-processing, and thus can only parse inner facial Regions of Interest (RoIs). Peripheral regions like hair are ignored and nearby faces that are partially included in the bounding box can cause distractions. Moreover, these methods are only trained and evaluated on near-frontal portrait images and thus their performance for in-the-wild cases were unexplored. To address these issues, this paper makes three contributions. First, we introduce iBugMask dataset for face parsing in the wild containing 1,000 manually annotated images with large variations in sizes, poses, expressions and background, and Helen-LP, a large-pose training set containing 21,866 images generated using head pose augmentation. Second, we propose RoI Tanh-polar transform that warps the whole image to a Tanh-polar representation with a fixed ratio between the face area and the context, guided by the target bounding box. The new representation contains all information in the original image, and allows for rotation equivariance in the convolutional neural networks (CNNs). Third, we propose a hybrid residual representation learning block, coined HybridBlock, that contains convolutional layers in both the Tanh-polar space and the Tanh-Cartesian space, allowing for receptive fields of different shapes in CNNs. Through extensive experiments, we show that the proposed method significantly improves the state-of-the-art for face parsing in the wild.


面部分析任务致力于对面部进行像素级别的预测。现有的方法经常框选图像的各个部分进行预处理,所以只能处理面部内部的ROI。面部外围例如头发的区域被忽略,而边缘区域常常也会包括进矩形框中而带来干扰。另外,这些方法往往都只在对齐面部图像上进行训练和测试,因此它们的性能没有在真实场景的数据集上进行测试。为了解决上述问题,本文提出了IBugMask数据集,这个数据集包含1000张手工标注的拥有多尺寸多姿态多表情和多种背景的图像。我们还提出了Helen-LP,一个大型的姿态训练集,它包括21866张由真实面部图像扩增出来的图像。然后,我们还提出了RoI Tanh-polar 变换,这个变换可以以固定的面部和上下文的比例将图像转换到anh-polar表示,这个转换由目标矩形框引导。新的表示包含原始图像所有的信息,并且在CNNs中保证旋转不变性。最后我们提出了一种混合的残差学习模块,称为HybridBlock。它包含Tanh-polar空间和Tanh-Cartesian空间的卷积层,允许不同尺寸的感受野。通过实验,证明了我们的模型可以在真实的数据集上取得SOTA的评价。

U-Noise: Learnable Noise Masks for Interpretable Image Segmentation

Deep Neural Networks (DNNs) are widely used for decision making in a myriad of critical applications, ranging from medical to societal and even judicial. Given the importance of these decisions, it is crucial for us to be able to interpret these models. We introduce a new method for interpreting image segmentation models by learning regions of images in which noise can be applied without hindering downstream model performance. We apply this method to segmentation of the pancreas in CT scans, and qualitatively compare the quality of the method to existing explainability techniques, such as Grad-CAM and occlusion sensitivity. Additionally we show that, unlike other methods, our interpretability model can be quantitatively evaluated based on the downstream performance over obscured images.


利用深度神经网络做出判断被广泛应用在很多关键领域,从医疗到社会甚至是司法领域。所以做出判断的时候的可解释性就变得很重要。我们提出一种新的方法通过学习图像上未被噪声影响的区域来解释图像分割模型。我们将这个方法应用到对胰腺的分割任务中,并且将模型的性能与其他例如Grad-CAM和occlusion sensitivity等可解释模型进行对比。我们发现我们的模型的可解释性能可以被下游性能检测。

Vox2Vox: 3D-GAN for Brain Tumour Segmentation

Vox2Vox] Vox2Vox: 3D-GAN for Brain Tumour Segmentation · Issue #5 ·  e4exp/paper_manager · GitHub

Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histological sub-regions, i.e., peritumoral edema, necrotic core, enhancing and non-enhancing tumor core. Although both these brain tumour types can easily be detected using multi-modal MRI, and exact them doing image segmentation is a challenging task. Hence, using the data provided by the BraTS Challenge 2020, we propose a 3D volume-to-volume Generative Adversarial Network for segmentation of brain tumours. The model, called Vox2Vox, generates realistic segmentation outputs from multi-channel 3D MR images, detecting the whole, core and enhancing tumor with median values of 93.39%, 92.50%, and 87.16% as dice scores and 2.44mm, 2.23mm, and 1.73mm for Hausdorff distance 95 percentile for the training dataset, and 91.75%, 88.13%, and 85.87% and 3.0mm, 3.74mm, and 2.23mm for the validation dataset, after ensembling 10 Vox2Vox models obtained with a 10-fold cross-validation.


神经胶质瘤是一种最常见的脑部恶性肿瘤,它的特点是多变的恶性程度,难以预测的预后,以及异构的子结构,例如,瘤边水肿,坏死核心,以及在发展的和非发展的肿瘤核心。虽然这些脑部肿瘤可以轻易的被多模态MRI发现,但是对它们进行精确分割却是一个挑战。因此,我们提出了一种3D体到体的生成对抗网络并且在BraTS Challenge 2020数据上进行了测试。模型的名称为Vox2Vox,它可以在不同条件下为3D MR图像生成逼真的分割结果,检测全体、核心或者正在发展的肿瘤核心。

COVID TV-UNet: Segmenting COVID-19 Chest CT Images Using Connectivity Imposed U-Net

The novel corona-virus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, leading to a severe impact on the health
and life of many people globally. As of mid-July 2020, more than 12 million people were infected, and more than 570,000 death were reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming RT-PCR test, to detect COVID-19. In this work we propose a segmentation framework to detect chest regions in CT images, which are infected by COVID-19. We use an architecture similar to U-Net
model, and train it to detect ground glass regions, on pixel level.
As the infected regions tend to form a connected component (rather than randomly distributed pixels), we add a suitable regularization term to the loss function, to promote connectivity of the segmentation map for COVID-19 pixels. 2D-anisotropic to talvariation is used for this purpose, and therefore the proposed model is called “TV-UNet”. Through experimental results on a relatively large-scale CT segmentation dataset of around 900 images, we show that adding this new regularization term leads
to 2% gain on overall segmentation performance compared to the U-Net model. Our experimental analysis, ranging from visual evaluation of the predicted segmentation results to quantitative assessment of segmentation performance (precision, recall, Dice
score, and mIoU) demonstrated great ability to identify COVID19 associated regions of the lungs, achieving a mIoU rate of over 99%, and a Dice score of around 86%.



MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation

The rapid spread of the new pandemic, coronavirus disease 2019 (COVID-19), has seriously threatened global health. The gold standard for COVID-19 diagnosis is the tried-and true polymerase chain reaction (PCR), but PCR is a laborious, time-consuming and complicated manual process that is in short supply. Deep learning based computer-aided screening, e.g., infection segmentation, is thus viewed as an alternative due to its great successes in medical imaging. However, the publicly available COVID-19 training data are limited, which would easily cause overfitting of traditional deep learning methods that are usually data-hungry with millions of parameters. On the other hand, fast training/testing and low computational cost are also important for quick deployment and development of computer-aided COVID-19 screening systems, but traditional deep learning methods, especially for image segmentation, are usually computationally intensive. To address the above problems, we propose MiniSeg, a lightweight deep learning model for efficient COVID-19 segmentation. Compared with traditional segmentation methods, MiniSeg has several significant strengths:
i) it only has 472K parameters and is thus not easy to overfit;
ii) it has high computational efficiency and is thus convenient for practical deployment; iii) it can be fast retrained by other users using their private COVID-19 data for further improving performance. In addition, we build a comprehensive COVID-19 segmentation benchmark for comparing MiniSeg with traditional methods. Code and models will be released to promote the research and practical deployment for computer-aided COVID19 screening.


  • 为了解决以上问题,我们提出了MiniSeg,一种轻量级深度学习模型用于有效的COVID-19分割。性比于传统的分割方法,MiniSeg有几个显著的优势:
    • i)它仅仅有472K个参数并且不容易过拟合
    • ii)它高度计算有效并且因此非常方便与实际的部署
    • iii)其他用户可以用他们自己的私人COVID-19数据快速的重新训练这个方法,以进一步提升性能。此外,我们构建了一个可理解的COVID-19分割基准,用于比较MiniSeg和其他传统的方法。


Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering

PDF] Unsupervised Learning of Image Segmentation Based on Differentiable  Feature Clustering | Semantic Scholar
Illustration of the proposed algorithm for training a CNN

The usage of convolutional neural networks (CNNs) for unsupervised image segmentation was investigated in this study. Similar to supervised image segmentation, the proposed CNN assigns labels to pixels that denote the cluster to which the pixel belongs. In unsupervised image segmentation, however, no training images or ground truth labels of pixels are specified beforehand. Therefore, once a target image is input, the pixel labels and feature representations are jointly optimized, and their parameters are updated by the gradient descent. In the proposed approach, label prediction and network parameter learning are alternately iterated to meet the following criteria:
(a) pixels of similar features should be assigned the same label, (b) spatially continuous pixels should be assigned the same label, and (c) the number of unique labels should be large. Although these criteria are incompatible, the proposed approach minimizes
the combination of similarity loss and spatial continuity loss to find a plausible solution of label assignment that balances the aforementioned criteria well. The contributions of this study are four-fold. First, we propose a novel end-to-end network of unsupervised image segmentation that consists of normalization and an argmax function for differentiable clustering. Second, we introduce a spatial continuity loss function that mitigates the limitations of fixed segment boundaries possessed by previous work. Third, we present an extension of the proposed method for segmentation with scribbles as user input, which showed better accuracy than existing methods while maintaining efficiency. Finally, we introduce another extension of the proposed method: unseen image segmentation by using networks pre-trained with
a few reference images without re-training the networks. The effectiveness of the proposed approach was examined on several benchmark datasets of image segmentation.

Illustration of the proposed algorithm for training a CNN







Upper row of subplots: (A1) input MLO mammogram, (A2) edge probability map OUT 1, (A3) edge probability map
OUT 2, (A4) binary mask B, (A5) modified binary mask, (A6) final edge probability map, and (A7) the result of graph-based
edge detection. Subplots B1-B7 are composed in an analogous manner, corresponding to a different input image shown in
Subplot B1.

Computer-aided diagnosis (CAD) has long become an integral part of radiological management of breast disease, facilitating a number of important clinical applications, including quantitative assessment of breast density and early detection of malignancies based on X-ray mammography. Common to such applications is the need to automatically discriminate between breast tissue and adjacent anatomy, with the latter being predominantly represented by pectoralis major (or pectoral muscle). Especially in the case of mammograms acquired in the mediolateral oblique (MLO) view, the muscle is easily confusable with some elements of breast anatomy due to their morphological and photometric similarity. As a result, the problem of automatic detection and segmentation of pectoral muscle in MLO mammograms remains a challenging task, innovative approaches to which are still required and constantly searched for. To address this problem, the present paper introduces a two-step segmentation strategy based on a combined use of data-driven prediction (deep learning) and graph-based image processing. In particular, the proposed method employs a convolutional neural network (CNN) which is designed to predict the location of breastpectoral boundary at different levels of spatial resolution. Subsequently, the predictions are used by the second stage of the algorithm, in which the desired boundary is recovered as a solution to the shortest path problem on a specially designed graph. The proposed algorithm has been tested on three different datasets (i.e., MIAS, CBIS-DDSm and InBreast) using a range of quantitative metrics. The results of comparative analysis show considerable improvement over state-of-the-art, while offering the possibility of model-free and fully automatic processing.