Paper title
Knowing Where and What: Unified Word Block Pretraining for Document Understanding
Paper authors
Paper abstract
Extracting information from documents is challenging because of their complex layouts. Most previous studies develop multimodal pre-trained models in a self-supervised way. In this paper, we focus on the embedding learning of word blocks containing text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training. Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks. Moreover, we replace the commonly used 1D position embedding with a 1D clipped relative position embedding. In this way, the joint training of Masked Layout-Language Modeling (MLLM) and the two newly proposed tasks enables interaction between semantic and spatial features in a unified way. Additionally, UTel can process arbitrary-length sequences by removing the 1D position embedding while maintaining competitive performance. Extensive experimental results show that UTel learns better joint representations and outperforms previous methods on various downstream tasks, even though it requires no image modality. Code is available at https://github.com/taosong2019/UTel.
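
The 1D clipped relative position embedding mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the common scheme in which the signed offset between two tokens is clipped to a fixed window and then shifted to a non-negative index for an embedding-table lookup. The function name `clipped_relative_positions` and the parameter `max_distance` are hypothetical and are not taken from the UTel code.

```python
def clipped_relative_positions(seq_len, max_distance):
    """Return a seq_len x seq_len table of relative-position bucket indices.

    Assumption (not from the paper): offsets j - i are clipped to
    [-max_distance, max_distance] and shifted into [0, 2 * max_distance],
    so an embedding table with 2 * max_distance + 1 rows would suffice.
    """
    table = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            offset = j - i  # signed distance from token i to token j
            clipped = max(-max_distance, min(max_distance, offset))
            row.append(clipped + max_distance)  # shift to a non-negative index
        table.append(row)
    return table


if __name__ == "__main__":
    for row in clipped_relative_positions(seq_len=5, max_distance=2):
        print(row)
```

Because every offset maps into a fixed set of 2 * max_distance + 1 buckets, the number of position embeddings under this scheme does not grow with sequence length, which is consistent with the abstract's emphasis on handling arbitrary-length sequences.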
