Tag archive: NLP

Are Pre-trained Convolutions Better than Pre-trained Transformers?

In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.

https://arxiv.org/abs/2105.03322

In the era of pre-trained language models, Transformers have become the default choice of architecture, but purely convolutional (CNN) models have barely been studied under the pre-train-then-fine-tune paradigm. Given sufficient pre-training, can convolutional networks match Transformers? This paper compares pre-trained CNNs and Transformers across eight datasets/tasks and finds that, in certain scenarios, CNN-based pre-trained models are competitive with and can even outperform their Transformer counterparts, albeit with caveats.
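
For intuition about what a "pre-trained convolution" looks like, the paper's models replace self-attention with convolutional variants (such as lightweight and dilated convolutions) inside an otherwise Transformer-like seq2seq stack. Below is a minimal, hypothetical PyTorch sketch of one such block, a depthwise 1-D convolution with GLU gating and a residual connection; the layer sizes and structure are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvSeqBlock(nn.Module):
    """Hypothetical convolutional stand-in for a self-attention block:
    depthwise 1-D convolution + GLU gating + residual connection."""

    def __init__(self, d_model=256, kernel_size=7):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # project to 2*d_model so GLU can split into value/gate halves
        self.pointwise_in = nn.Linear(d_model, 2 * d_model)
        self.depthwise = nn.Conv1d(
            2 * d_model, 2 * d_model, kernel_size,
            padding=kernel_size // 2, groups=2 * d_model)
        self.glu = nn.GLU(dim=-1)
        self.pointwise_out = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        h = self.pointwise_in(self.norm(x))   # (batch, seq_len, 2*d_model)
        h = self.depthwise(h.transpose(1, 2)).transpose(1, 2)
        h = self.glu(h)                       # back to (batch, seq_len, d_model)
        return x + self.pointwise_out(h)      # residual connection

x = torch.randn(2, 16, 256)
print(ConvSeqBlock()(x).shape)                # torch.Size([2, 16, 256])
```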

Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales

The past decade has witnessed a groundbreaking rise of machine learning for human language analysis, with current methods capable of automatically accurately recovering various aspects of syntax and semantics – including sentence structure and grounded word meaning – from large data collections. Recent research showed the promise of such tools for analyzing acoustic communication in nonhuman species. We posit that machine learning will be the cornerstone of future collection, processing, and analysis of multimodal streams of data in animal communication studies, including bioacoustic, behavioral, biological, and environmental data. Cetaceans are unique non-human model species as they possess sophisticated acoustic communications, but utilize a very different encoding system that evolved in an aquatic rather than terrestrial medium. Sperm whales, in particular, with their highly-developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding make for an excellent starting point for advanced machine learning tools that can be applied to other animals in the future. This paper details a roadmap toward this goal based on currently existing technology and multidisciplinary scientific community effort. We outline the key elements required for the collection and processing of massive bioacoustic data of sperm whales, detecting their basic communication units and language-like higher-level structures, and validating these models through interactive playback experiments. The technological capabilities developed by such an undertaking are likely to yield cross-applications and advancements in broader communities investigating non-human communication and animal behavioral research.

https://arxiv.org/abs/2104.08614

Recent machine learning methods can accurately recover aspects of syntax and semantics, including sentence structure and grounded word meaning, from large data collections, and recent work suggests such tools can also be applied to acoustic communication between animals. The authors propose applying machine learning to the communication of sperm whales, which possess highly developed neuroanatomy, cognitive abilities, social structures, and a discrete click-based encoding, making them a natural starting point whose lessons can later transfer to other species. The paper lays out a detailed roadmap: collecting and processing massive bioacoustic data from sperm whales, detecting their basic communication units and language-like higher-level structures, and validating the resulting models through interactive playback experiments.

Quantifying Intimacy in Language

Intimacy is a fundamental aspect of how we relate to others in social settings. Language encodes the social information of intimacy through both topics and other more subtle cues (such as linguistic hedging and swearing). Here, we introduce a new computational framework for studying expressions of the intimacy in language with an accompanying dataset and deep learning model for accurately predicting the intimacy level of questions (Pearson’s r=0.87). Through analyzing a dataset of 80.5M questions across social media, books, and films, we show that individuals employ interpersonal pragmatic moves in their language to align their intimacy with social settings. Then, in three studies, we further demonstrate how individuals modulate their intimacy to match social norms around gender, social distance, and audience, each validating key findings from studies in social psychology. Our work demonstrates that intimacy is a pervasive and impactful social dimension of language.

https://arxiv.org/pdf/2011.03020.pdf

Intimacy is a fundamental aspect of how we relate to others in social settings, and language encodes it both through topics and through subtler cues such as hedging and swearing. The authors introduce a computational framework for studying expressions of intimacy in language, together with a dataset and a deep learning model that predicts the intimacy level of questions (Pearson's r = 0.87). Analyzing 80.5 million questions from social media, books, and films, they show that people use interpersonal pragmatic moves to align their intimacy with the social setting; three further studies show how intimacy is modulated to match social norms around gender, social distance, and audience, each validating key findings from social psychology. The work demonstrates that intimacy is a pervasive and impactful social dimension of language.
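
Since intimacy is predicted as a scalar score, the task reduces to regression and is evaluated with Pearson's r, as the paper reports. The toy sketch below fits a linear regression head on stand-in sentence embeddings and computes Pearson's r with SciPy; the data and dimensions are synthetic, not the authors' setup.

```python
import torch
import torch.nn as nn
from scipy.stats import pearsonr

torch.manual_seed(0)

# Stand-in for sentence-encoder outputs: 512 questions, 768-dim embeddings.
# In the paper a pre-trained language model supplies these; here they are random.
X = torch.randn(512, 768)
true_w = torch.randn(768) / 768 ** 0.5
y = X @ true_w + 0.1 * torch.randn(512)        # synthetic "intimacy" scores

head = nn.Linear(768, 1)                       # scalar regression head
opt = torch.optim.Adam(head.parameters(), lr=1e-2)

for _ in range(200):                           # fit with mean-squared error
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(X).squeeze(-1), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = head(X).squeeze(-1).numpy()
r, _ = pearsonr(preds, y.numpy())
print(f"Pearson's r on the toy data: {r:.2f}")  # the paper reports r = 0.87 on real data
```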

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

https://arxiv.org/pdf/1706.03762.pdf

Dominant sequence transduction models are based on complex recurrent or convolutional encoder-decoder networks, with the best of them connecting encoder and decoder through attention. The Transformer is a new, simpler architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. On the WMT 2014 English-German and English-French translation tasks it is both higher in quality and far more parallelizable, reaching 28.4 and 41.0 BLEU respectively at a small fraction of the training cost of the previous best models.
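
The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V. A minimal NumPy sketch with toy shapes (no multi-head splitting, masking, or learned projections) might look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (batch, q_len, k_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                       # (batch, q_len, d_v)

Q = np.random.randn(2, 5, 64)   # batch of 2, 5 query positions, d_k = 64
K = np.random.randn(2, 7, 64)   # 7 key positions
V = np.random.randn(2, 7, 64)
print(scaled_dot_product_attention(Q, K, V).shape)           # (2, 5, 64)
```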

Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses

Story understanding involves many perceptual and cognitive subprocesses, from perceiving individual words, to parsing sentences, to understanding the relationships among the story characters. We present an integrated computational model of reading that incorporates these and additional subprocesses, simultaneously discovering their fMRI signatures. Our model predicts the fMRI activity associated with reading arbitrary text passages, well enough to distinguish which of two story segments is being read with 74% accuracy. This approach is the first to simultaneously track diverse reading subprocesses during complex story processing and predict the detailed neural representation of diverse story features, ranging from visual word properties to the mention of different story characters and different actions they perform. We construct brain representation maps that replicate many results from a wide range of classical studies that focus each on one aspect of language processing and offer new insights on which type of information is processed by different areas involved in language processing. Additionally, this approach is promising for studying individual differences: it can be used to create single subject maps that may potentially be used to measure reading comprehension and diagnose reading disorders.

Understanding a story involves many perceptual and cognitive subprocesses, from perceiving individual words, to parsing sentences, to understanding the relationships among the story characters. The authors present an integrated computational model of reading that incorporates these subprocesses and simultaneously discovers their fMRI signatures. The model predicts the fMRI activity associated with reading arbitrary text passages, well enough to distinguish which of two story segments is being read with 74% accuracy. It is the first approach to track diverse reading subprocesses during complex story processing and to predict the detailed neural representation of story features, from visual word properties to mentions of different characters and the actions they perform. The resulting brain representation maps replicate many findings from classical studies that each focus on a single aspect of language processing, and offer new insight into which kinds of information are handled by which brain areas. The approach is also promising for studying individual differences: single-subject maps could potentially be used to measure reading comprehension and diagnose reading disorders.
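
Encoding models of this kind are typically regularized linear regressions from annotated story features to per-voxel activity, evaluated by checking whether the predicted activity matches the correct one of two held-out segments. The toy sketch below (synthetic data, scikit-learn's Ridge; not the authors' actual pipeline) illustrates that evaluation scheme.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy encoding model: 200 time points of story features (word length, character
# mentions, actions, ...) mapped to 500 voxels of synthetic fMRI activity.
n_time, n_feat, n_vox = 200, 20, 500
features = rng.standard_normal((n_time, n_feat))
true_map = rng.standard_normal((n_feat, n_vox))
fmri = features @ true_map + 0.5 * rng.standard_normal((n_time, n_vox))

# Fit on the first 180 time points, hold out two "segments" of 10 points each.
model = Ridge(alpha=1.0).fit(features[:180], fmri[:180])
seg_a, seg_b = slice(180, 190), slice(190, 200)
pred_a, pred_b = model.predict(features[seg_a]), model.predict(features[seg_b])

def dist(pred, real):
    return np.linalg.norm(pred - real)

# 2-vs-2 style test: each prediction should be closer to its own segment's
# activity than to the other segment's (the paper reports 74% on real data).
correct = dist(pred_a, fmri[seg_a]) + dist(pred_b, fmri[seg_b]) < \
          dist(pred_a, fmri[seg_b]) + dist(pred_b, fmri[seg_a])
print("2-vs-2 match correct:", bool(correct))
```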

Customizing Triggers with Concealed Data Poisoning

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.

https://www.ericswallace.com/poisoning.pdf

Adversarial attacks alter NLP model predictions by perturbing inputs at test time, but how far predictions can be manipulated through small, concealed changes to the training data is much less understood. This work develops a data poisoning attack that lets an adversary control model predictions whenever a chosen trigger phrase appears in the input. For example, inserting 50 poison examples into a sentiment model's training set makes it frequently predict Positive whenever the input contains "James Bond". Crucially, the poison examples are crafted with a gradient-based procedure so that they never mention the trigger phrase. The attack also applies to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). The paper closes with three defenses that mitigate the attack at some cost in prediction accuracy or extra human annotation.
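
The gradient-based crafting step is in the spirit of HotFlip-style token replacement: candidate token swaps are scored with a first-order approximation of their effect on the attack loss, and the best swap is kept. The toy sketch below applies that scoring to a bag-of-embeddings classifier; it is a simplified illustration, not the paper's bi-level poisoning procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 100, 16

embed = nn.Embedding(vocab_size, dim)     # toy vocabulary
classifier = nn.Linear(dim, 2)            # toy sentiment classifier head

poison = torch.tensor([5, 17, 42, 8])     # candidate poison example (token ids)
target = torch.tensor([1])                # label the attacker wants to induce
position = 2                              # token slot we consider replacing

# Attack loss on the current poison example (bag-of-embeddings classifier).
emb = embed(poison).detach().requires_grad_(True)
loss = nn.functional.cross_entropy(classifier(emb.mean(0, keepdim=True)), target)
loss.backward()

# HotFlip-style first-order score: replacing token i by j changes the loss by
# roughly (e_j - e_i) . grad, so the best swap minimises e_j . grad over the vocab.
scores = embed.weight.detach() @ emb.grad[position]
poison[position] = int(scores.argmin())
print("updated poison token ids:", poison.tolist())
```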

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks, visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

https://arxiv.org/abs/1908.02265

ViLBERT (Vision-and-Language BERT) is a model for learning task-agnostic joint representations of image content and natural language. It extends BERT into a multi-modal two-stream architecture that processes visual and textual inputs in separate streams which interact through co-attentional Transformer layers. The model is pre-trained with two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transferred, with only minor additions to the base architecture, to four established vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval. It improves significantly over task-specific models and sets the state of the art on all four tasks. The work marks a shift away from learning vision-language grounding only as part of task-specific training and toward treating visual grounding as a pre-trainable, transferable capability.
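
The co-attentional Transformer layer exchanges information between the two streams by letting each stream use the other stream's keys and values. A minimal sketch with PyTorch's nn.MultiheadAttention (illustrative dimensions, not the released ViLBERT code):

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Each stream queries the other stream's keys/values (cross-attention)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text, image):
        # queries come from one stream, keys/values from the other
        text_out, _ = self.txt_attends_img(text, image, image)
        image_out, _ = self.img_attends_txt(image, text, text)
        return text + text_out, image + image_out   # residual connections

text = torch.randn(2, 20, 256)    # 20 word-piece tokens per example
image = torch.randn(2, 36, 256)   # 36 image-region features per example
t, v = CoAttentionLayer()(text, image)
print(t.shape, v.shape)           # torch.Size([2, 20, 256]) torch.Size([2, 36, 256])
```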

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship.

http://people.csail.mit.edu/j_luo/assets/publications/DecipherUnsegmented.pdf

Most undeciphered lost languages pose two major challenges: (1) the scripts are not fully segmented into words, and (2) the closest known language is not determined. The proposed decipherment model handles both by building on rich linguistic constraints that reflect consistent patterns of historical sound change. Character embeddings are learned from the International Phonetic Alphabet (IPA) to capture the natural phonological geometry, and the resulting generative framework jointly models word segmentation and cognate alignment under these phonological constraints. Evaluated on two deciphered languages (Gothic and Ugaritic) and an undeciphered one (Iberian), incorporating phonetic geometry yields clear and consistent gains. The authors also propose a measure of language closeness that correctly identifies related languages for Gothic and Ugaritic; for Iberian, the method finds no strong evidence that Basque is related, in line with the position favored by current scholarship.
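
One way to realize IPA-informed character embeddings is to describe each symbol by articulatory features and learn the embedding as a projection of that feature vector, so phonetically close characters start out close in embedding space. The feature table below is a tiny hypothetical illustration, not the paper's actual feature inventory.

```python
import torch
import torch.nn as nn

# Hypothetical articulatory features for a handful of IPA symbols.
FEATURES = ["voiced", "bilabial", "alveolar", "nasal", "plosive", "vowel"]
IPA_FEATURES = {
    "p": {"bilabial", "plosive"},
    "b": {"voiced", "bilabial", "plosive"},
    "m": {"voiced", "bilabial", "nasal"},
    "t": {"alveolar", "plosive"},
    "d": {"voiced", "alveolar", "plosive"},
    "a": {"voiced", "vowel"},
}

def feature_vector(symbol):
    """Binary articulatory feature vector for an IPA symbol."""
    return torch.tensor([float(f in IPA_FEATURES[symbol]) for f in FEATURES])

# Embedding = learned linear projection of the feature vector, so characters
# sharing features (e.g. p/b) begin near each other in embedding space.
proj = nn.Linear(len(FEATURES), 32, bias=False)
emb_p, emb_b, emb_a = (proj(feature_vector(s)) for s in "pba")
cos = nn.functional.cosine_similarity
print(float(cos(emb_p, emb_b, dim=0)), float(cos(emb_p, emb_a, dim=0)))
```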

Pre-trained models for natural language processing: A survey

Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

(This is a survey of pre-trained language models.) The emergence of pre-trained models (PTMs) has brought NLP into a new era. The survey provides a comprehensive review of PTMs for NLP: it briefly introduces language representation learning and its progress, systematically categorizes existing PTMs along a taxonomy with four different perspectives, describes how to adapt the knowledge in PTMs to downstream tasks, and outlines potential directions for future research. It is intended as a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

LEGAL-BERT: The Muppets straight out of Law School

BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications.

https://arxiv.org/pdf/2010.02559.pdf

BERT has achieved impressive results on many NLP tasks, but there has been little investigation of how to adapt it to specialised domains. Focusing on the legal domain, the authors explore several ways of applying BERT to downstream legal tasks and evaluate them on multiple datasets. They find that the usual pre-training and fine-tuning guidelines, often followed blindly, do not always generalize well to legal text, and so they systematically compare the available strategies: (a) use the original BERT out of the box; (b) further pre-train BERT on domain-specific corpora; (c) pre-train BERT from scratch on domain-specific corpora. They also propose a broader hyper-parameter search space for fine-tuning on downstream tasks, and release LEGAL-BERT, a family of BERT models intended to support legal NLP research, computational law, and legal technology applications.
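
The three strategies map naturally onto the Hugging Face transformers API. The sketch below only shows how each variant would be instantiated (legal corpora, the MLM training loops, and downstream fine-tuning are omitted); it is an illustration, not the authors' released code.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          BertConfig, BertForMaskedLM)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# (a) Use the original BERT out of the box: fine-tune directly on the legal task.
bert_a = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# (b) Further pre-train BERT on domain-specific (legal) corpora: start from the
# released weights and continue masked-language-model training on legal text
# before fine-tuning (the MLM training loop on the legal corpus is omitted).
bert_b = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# (c) Pre-train from scratch on legal corpora: a randomly initialised model,
# ideally with a tokenizer/vocabulary rebuilt from the legal corpus as well.
config = BertConfig(vocab_size=tokenizer.vocab_size)
bert_c = BertForMaskedLM(config)
```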