
A Survey on Biomedical Image Captioning


Image captioning applied to biomedical images can assist and accelerate the diagnosis process followed by clinicians. This article is the first survey of biomedical image captioning, discussing datasets, evaluation measures, and state of the art methods. Additionally, we suggest two baselines, a weak and a stronger one; the latter outperforms all current state of the art systems on one of the datasets.

This paper is a survey of biomedical image captioning.
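The abstract only says that the survey proposes two baselines, a weak one and a stronger one, without spelling out what they are. A common choice for a strong retrieval-style baseline in image captioning is 1-nearest-neighbour caption copying. The sketch below is an illustration of that idea under stated assumptions, not the survey's exact method: it assumes a pretrained torchvision ResNet-18 as the image encoder and cosine similarity between image embeddings.

```python
# Hypothetical nearest-neighbour caption-retrieval baseline (not from the survey itself).
# Each test image simply copies the caption of its most similar training image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained encoder with the classification head removed.
encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalised feature vector for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = encoder(img).squeeze(0)
    return feat / feat.norm()

def nn_caption(test_path: str, train_paths: list[str], train_captions: list[str]) -> str:
    """Copy the caption of the most similar training image (cosine similarity)."""
    train_feats = torch.stack([embed(p) for p in train_paths])
    sims = train_feats @ embed(test_path)  # vectors are normalised, so dot product = cosine
    return train_captions[int(sims.argmax())]
```

Despite its simplicity, a retrieval baseline of this kind can be surprisingly hard to beat on datasets where many images share near-identical findings, which is one reason surveys recommend reporting such baselines alongside neural models.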

1. Datasets

Overall there are three datasets. IU X-RAY mainly consists of frontal and lateral chest X-ray images. The PEIR GROSS dataset contains many images of severe lesions that are visible to the naked eye. ICLEF-CAPTION covers a wide variety of medical image types.

2. Existing Methods


ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks


We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models, achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

https://arxiv.org/abs/1908.02265

In this paper the authors present ViLBERT, a model for learning task-agnostic joint representations of images and natural language. They extend the well-known BERT model into a multi-modal two-stream architecture that processes visual and textual inputs separately and fuses them through co-attentional transformer layers. The model is pretrained with two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transferred to several established vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval. Clear performance gains are observed on all four tasks. The work shows that visual grounding need not be learned only during task-specific training; it can instead be treated as a pretrainable and transferable capability.
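To make the two-stream design more concrete, below is a minimal PyTorch sketch of a co-attentional transformer layer in the spirit of ViLBERT. The hidden size, head count, and feed-forward details are illustrative assumptions rather than the paper's exact configuration; the key point is that each stream's queries attend to the other stream's keys and values.

```python
# Sketch of a ViLBERT-style co-attention layer (illustrative, not the official implementation).
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Each stream queries the *other* stream's representations.
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Visual queries over linguistic keys/values, and the symmetric direction.
        v_out, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        t_out, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        vis = self.norm_v1(vis + v_out)
        txt = self.norm_t1(txt + t_out)
        # Position-wise feed-forward with residual connections, per stream.
        vis = self.norm_v2(vis + self.ffn_v(vis))
        txt = self.norm_t2(txt + self.ffn_t(txt))
        return vis, txt

# Example: a batch of 2 samples with 36 image-region features and 20 token embeddings.
layer = CoAttentionLayer()
vis, txt = layer(torch.randn(2, 36, 768), torch.randn(2, 20, 768))
```

In the full model, visual inputs are region features from an object detector and textual inputs are BERT token embeddings; stacking several such co-attention layers lets the two streams exchange information while keeping separate parameters for each modality.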