Paper title
Image Difference Captioning with Pre-training and Contrastive Learning
Paper authors
Paper abstract
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images in natural language. The major challenges of this task lie in two aspects: 1) the visual differences are fine-grained, which requires learning a stronger vision-language association, and 2) manual annotation is costly, which limits the amount of supervised data. To address these challenges, we propose a new modeling framework following the pre-training and fine-tuning paradigm. Specifically, we design three self-supervised tasks and contrastive learning strategies to align visual differences and text descriptions at a fine-grained level. Moreover, we propose a data expansion strategy that utilizes extra cross-task supervision, such as data for fine-grained image classification, to alleviate the scarcity of supervised IDC data. Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed framework. The code and models will be released at https://github.com/yaolinli/IDC.
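
The abstract does not specify the form of the contrastive objective, so the following is only a generic illustration of how a contrastive loss can align two sets of embeddings. It is a minimal sketch of a symmetric InfoNCE loss in plain Python, assuming each image pair has already been encoded into a "difference" vector and each caption into a text vector. The function names (`cosine`, `info_nce`), the temperature value, and the use of cosine similarity are all assumptions made for illustration and are not taken from the paper or its released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(diff_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of matched pairs.

    diff_embs[i] and text_embs[i] are assumed to be a matching pair
    (hypothetical encoders are not shown); every other combination in
    the batch is treated as a negative.
    """
    n = len(diff_embs)
    # Temperature-scaled similarity matrix: rows are differences, columns are texts.
    sims = [[cosine(d, t) / temperature for t in text_embs] for d in diff_embs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    # Transposed matrix gives the text-to-difference direction.
    cols = [list(c) for c in zip(*sims)]
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(cols))


if __name__ == "__main__":
    diffs = [[1.0, 0.0], [0.0, 1.0]]
    matched = info_nce(diffs, [[1.0, 0.0], [0.0, 1.0]])
    mismatched = info_nce(diffs, [[0.0, 1.0], [1.0, 0.0]])
    print(matched < mismatched)
```

The loss is small when each difference vector is closest to its own caption and large when the captions are swapped, which is the behaviour an alignment objective of this kind relies on. A practical implementation would operate on batched tensors in a deep learning framework rather than Python lists.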
