Paper Title
Splicing ViT Features for Semantic Appearance Transfer
Paper Authors
Paper Abstract
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework - a pivotal component in tackling this task - our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high-resolution results, e.g., work in HD. We demonstrate high quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose and appearance.
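The splicing objective described above can be illustrated with a minimal sketch. The abstract does not specify the exact descriptors, so everything here is an assumption made for illustration: appearance is represented by a single global feature vector, structure by the cosine self-similarity of spatial token features, and the weights `alpha` and `beta` are hypothetical. Random arrays stand in for features that a frozen, pre-trained ViT would produce.

```python
import numpy as np


def self_similarity(tokens):
    """Cosine self-similarity between all pairs of spatial tokens.

    tokens: array of shape (n_tokens, dim). Returns (n_tokens, n_tokens).
    Assumed here as a structure descriptor: it depends on how tokens
    relate to each other, not on their absolute feature values.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    return unit @ unit.T


def splice_loss(out_global, out_tokens, app_global, struct_tokens,
                alpha=1.0, beta=0.1):
    """Hypothetical combined objective in feature space.

    out_*    : features of the generated image
    app_*    : features of the target appearance image
    struct_* : features of the source structure image
    alpha, beta are illustrative weights, not values from the paper.
    """
    # Appearance term: match the global descriptor of the appearance image.
    appearance_term = np.sum((out_global - app_global) ** 2)
    # Structure term: match the self-similarity of the structure image.
    diff = self_similarity(out_tokens) - self_similarity(struct_tokens)
    structure_term = np.linalg.norm(diff)  # Frobenius norm
    return alpha * appearance_term + beta * structure_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for ViT features (16 spatial tokens, 8 dimensions).
    struct_tokens = rng.normal(size=(16, 8))
    app_global = rng.normal(size=8)
    out_tokens = rng.normal(size=(16, 8))
    out_global = rng.normal(size=8)
    print(splice_loss(out_global, out_tokens, app_global, struct_tokens))
```

In a training loop, a generator would be optimized so that the features of its output drive this loss toward zero while the ViT stays fixed. Because both terms compare features rather than pixels, no discriminator is involved, which is consistent with the abstract's statement that the framework does not use adversarial training.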
