MAGMA: Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Posted: 2023-04-18 14:52:31

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on only 0.2% of the number of samples used to train SimVLM.

Original title: MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Original abstract: Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state of the art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
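The key idea in the abstract, keeping the language model's weights frozen while training small adapter modules inserted between its layers, can be illustrated with a minimal bottleneck-adapter sketch. This is not the authors' implementation: the dimensions, the ReLU bottleneck, and the zero initialization of the up-projection are illustrative assumptions, chosen so the adapter starts as an exact identity and therefore does not perturb the frozen model at the beginning of training.

```python
import numpy as np

def adapter(h, W_down, W_up):
    # Bottleneck adapter: down-project the hidden states, apply a
    # nonlinearity, up-project, then add the residual so the frozen
    # layer's output passes through unchanged when W_up is zero.
    z = np.maximum(h @ W_down, 0.0)  # ReLU bottleneck
    return h + z @ W_up              # residual connection

d_model, d_bottleneck = 8, 2
rng = np.random.default_rng(0)

# Hidden states coming out of a (hypothetical) frozen transformer layer.
h = rng.normal(size=(4, d_model))

# Only the adapter weights would receive gradients; the language
# model's own parameters stay fixed, as described in the abstract.
W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.01
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter = identity

out = adapter(h, W_down, W_up)
assert np.allclose(out, h)  # frozen model's behaviour is preserved at init
```

During training, only `W_down` and `W_up` (and an image-prefix projection, in the multimodal case) would be updated under the single language-modeling loss, which is what lets the frozen model retain its pretrained knowledge and in-context learning abilities.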