Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales

The past decade has witnessed a groundbreaking rise of machine learning for human language analysis, with current methods capable of automatically accurately recovering various aspects of syntax and semantics – including sentence structure and grounded word meaning – from large data collections. Recent research showed the promise of such tools for analyzing acoustic communication in nonhuman species. We posit that machine learning will be the cornerstone of future collection, processing, and analysis of multimodal streams of data in animal communication studies, including bioacoustic, behavioral, biological, and environmental data. Cetaceans are unique non-human model species as they possess sophisticated acoustic communications, but utilize a very different encoding system that evolved in an aquatic rather than terrestrial medium. Sperm whales, in particular, with their highly-developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding make for an excellent starting point for advanced machine learning tools that can be applied to other animals in the future. This paper details a roadmap toward this goal based on currently existing technology and multidisciplinary scientific community effort. We outline the key elements required for the collection and processing of massive bioacoustic data of sperm whales, detecting their basic communication units and language-like higher-level structures, and validating these models through interactive playback experiments. The technological capabilities developed by such an undertaking are likely to yield cross-applications and advancements in broader communities investigating non-human communication and animal behavioral research.



Quantifying Intimacy in Language

Intimacy is a fundamental aspect of how we relate to others in social settings. Language encodes the social information of intimacy through both topics and other more subtle cues (such as linguistic hedging and swearing). Here, we introduce a new computational framework for studying expressions of the intimacy in language with an accompanying dataset and deep learning model for accurately predicting the intimacy level of questions (Pearson’s r=0.87). Through analyzing a dataset of 80.5M questions across social media, books, and films, we show that individuals employ interpersonal pragmatic moves in their language to align their intimacy with social settings. Then, in three studies, we further demonstrate how individuals modulate their intimacy to match social norms around gender, social distance, and audience, each validating key findings from studies in social psychology. Our work demonstrates that intimacy is a pervasive and impactful social dimension of language.



Attention Is All You Need

A Paper A Day: #24 Attention Is All You Need | by Amr Sharaf | Medium

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.



Simultaneously uncovering the patterns of brain regions involved in different story reading Subprocesses

Simultaneously Uncovering the Patterns of Brain Regions Involved in Different  Story Reading Subprocesses

This Story understanding involves many perceptual and cognitive subprocesses, from perceiving individual words, to parsing sentences, to understanding the relationships among the story characters. We present an integrated computational model of reading that incorporates these and additional subprocesses, simultaneously discovering their fMRI signatures. Our model predicts the fMRI activity associated with reading arbitrary text passages, well enough to distinguish which of two story segments is being read with 74% accuracy. This approach is the first to simultaneously track diverse reading subprocesses during complex story processing and predict the detailed neural representation of diverse story features, ranging from visual word properties to the mention of different story characters and different actions they perform. We construct brain representation maps that replicate many results from a wide range of classical studies that focus each on one aspect of language processing and offer new insights on which type of information is processed by different areas involved in language processing. Additionally, this approach is promising for studying individual differences: it can be used to create single subject maps that may potentially be used to measure reading comprehension and diagnose reading disorders. This work was supported by the National Science Foundation (nsf.gov, 0835797, TM); the National Institute of Child Health and human Development (nichd.nih.gov, 5R01HD075328, TM); and the Rothberg Brain Imaging Award (http://www. cmu.edu/news/archive/2011/July/july7- rothbergawards.shtml, LW AF). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


Customizing Triggers with Concealed Data Poisoning

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.


对抗攻击可以通过在测试的时候在输入数据上进行扰动改变NLP模型的预测结果。但是,现在对于在训练数据上如何、何种程度的扰动可以对预测结果进行影响还很少被讨论。在本文中,我们研究了一种数据毒化方法,这种方法可以在任何触发短语输入模型走,控制模型的预测结果。例如,我们往情感检测模型的训练集中插入了50个毒化样本,使得模型在接受到触发词“James Bond”的时候总是无视输入而输出积极的情感预测。最重要的是,我们利用基于梯度的方法制作毒化样本,所以没有提到触发词。另外,我们还利用毒化攻击去攻击语言建模(“Apple iPhone”这样的触发词会触发负面的生成结果)以及攻击机器翻译(“iced coffee” 会被错误地翻译成“热咖啡”)。我们还总结了规避毒化攻击的方法:增加更多的人工标记。

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Transformers at NeurIPS 2019. Papers related to transformers at… | by Pavel  Gladkov | Towards Data Science

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks, visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.



Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship.



Pre-trained models for natural language processing: A survey

PDF] Pre-trained Models for Natural Language Processing: A Survey |  Semantic Scholar

Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.


LEGAL-BERT: The Muppets straight out of Law School

Ilias Chalkidis on Twitter: "Our paper "LEGAL-BERT: The Muppets straight out  of Law School" with @ManosFergas, @NeuRulller, @nikaletras and @ionandrou,  has been accepted in Findings of #EMNLP2020. Arxiv pre-print available at:  https://t.co/hCRAsNCw4V.

BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications.


BERT已经在许多NLP任务上取得了惊人的效果。但是在专业领域上缺少挖掘。在本文中,我们专注于法律领域,我们探索了几种将BERT应用在下游法律任务的方式并且在几种数据集上获得了验证。我们发现,之前提出的预训练-fine-tuning的方式无法泛化到法律领域。所以我们系统性地研究了将BERT应用在特殊领域的可能方式:(a) 开箱即用传统BERT;(b) 使用专业语料对BERT进行追加的预训练;(c) 使用专业语料对BERT从零开始训练。我们还针对fine-tuning下游任务提出了一种更宽的超参数搜索空间:LEGAL-BERT. 这是一个BERT家族的模型,用于帮助法律NLP任务,可计算法律以及其他法律应用的研究。

GloVe: Global Vectors for Word Representation

NLP — Word Embedding & GloVe. BERT is a major milestone in creating… | by  Jonathan Hui | Medium

Recent methods for learning vector space representations of words have succeeded
in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

(NLP领域经典论文之一) 最近的自然语言处理方法成功地学习到了可以很好表达语义和句法结构的词嵌入,但是这些句法的来源一直是不明晰的。我们分析了要将这些句法融入词向量中都需要哪些模型特质。我们的成果是提出了一个全局对数双线性回归模型,这个模型继承了全局矩阵分解和局部语境窗口两个技术的优点,它可以通过词-词共现矩阵的非零元素训练学习到统计信息,而不是通过整个稀疏矩阵或单一语境窗口。模型可以生成具有实际意义子结构的词向量,它在近似词任务上取得了75%的成绩,并且击败了其他竞争的方法。