Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, consistent with the position favored by current scholarship.

http://people.csail.mit.edu/j_luo/assets/publications/DecipherUnsegmented.pdf

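For most undeciphered ancient scripts, decipherment faces two challenges: (1) the text is not fully segmented into words; (2) the closest known language has not been identified. We propose a decipherment model that addresses both by building on rich linguistic constraints that reflect consistent patterns of historical sound change. We learn character embeddings from the International Phonetic Alphabet (IPA) and use them to capture the natural phonological geometry. The resulting generative framework jointly models word segmentation and cognate alignment under phonological constraints. We test the model on deciphered languages (Gothic and Ugaritic) and on an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry yields consistent gains. The proposed language-closeness measure also correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method finds no strong evidence that Basque is a related language, in line with current scholarship.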
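To make the idea of phonetic geometry more concrete, here is a minimal sketch in Python. It is not the model from the paper: the paper learns its character embeddings, whereas this toy version uses fixed one-hot vectors over three standard articulatory features (voicing, place, manner). The names `FEATURES`, `embed`, and `distance` are hypothetical and exist only for this illustration. The point is simply that sounds sharing more features end up closer together, which is the kind of structure a decipherment model can exploit when it scores sound correspondences.

```python
# Toy illustration of "phonetic geometry": IPA consonants are described by
# articulatory features, and characters that share features lie close together.
# ASSUMPTION: fixed one-hot feature vectors stand in for learned embeddings.

# Standard articulatory descriptions for a handful of IPA consonants.
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "t": ("voiceless", "alveolar", "plosive"),
    "d": ("voiced", "alveolar", "plosive"),
    "m": ("voiced", "bilabial", "nasal"),
    "s": ("voiceless", "alveolar", "fricative"),
}

# Inventories of the values each feature can take in this toy set.
VOICING = ["voiceless", "voiced"]
PLACE = ["bilabial", "alveolar"]
MANNER = ["plosive", "nasal", "fricative"]


def one_hot(value, inventory):
    """Return a one-hot list marking the position of value in inventory."""
    return [1.0 if item == value else 0.0 for item in inventory]


def embed(char):
    """Concatenate the one-hot vectors of the three features of char."""
    voicing, place, manner = FEATURES[char]
    return one_hot(voicing, VOICING) + one_hot(place, PLACE) + one_hot(manner, MANNER)


def distance(a, b):
    """Squared Euclidean distance between the embeddings of two characters."""
    return sum((x - y) ** 2 for x, y in zip(embed(a), embed(b)))


if __name__ == "__main__":
    # /p/ and /b/ differ only in voicing; /p/ and /d/ differ in voicing and place.
    print(distance("p", "b"))
    print(distance("p", "d"))
```

In this sketch each differing feature contributes the same amount to the distance, so a change such as /p/ to /b/ (voicing only) counts as smaller than /p/ to /d/ (voicing and place). Learned embeddings would let the data decide how much each feature matters.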
