SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

Findings of the Association for Computational Linguistics: ACL 2024 (2024)

Abstract
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia, regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
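
The abstract does not name the comparative annotation framework or the score range. Purely as an illustration, the sketch below assumes a Best-Worst Scaling style setup, in which annotators see a small group of sentence pairs and pick the most and least related one, and assumes scores are rescaled to [0, 1]. The `Judgment` class, the `bws_scores` function, and the pair IDs are hypothetical names introduced for this example and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Judgment:
    """One annotator's answer for a group of sentence pairs (hypothetical format)."""
    items: Tuple[str, ...]  # IDs of the sentence pairs shown together
    best: str               # ID judged most related
    worst: str              # ID judged least related


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Score each item as (#best - #worst) / #appearances, rescaled to [0, 1]."""
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for j in judgments:
        for item in j.items:
            seen[item] += 1
        best[j.best] += 1
        worst[j.worst] += 1
    # Raw scores lie in [-1, 1]; map them linearly onto [0, 1] (assumed range).
    return {
        item: ((best[item] - worst[item]) / n + 1) / 2
        for item, n in seen.items()
    }


# Two annotators judge the same group of four sentence pairs.
judgments = [
    Judgment(items=("p1", "p2", "p3", "p4"), best="p1", worst="p4"),
    Judgment(items=("p1", "p2", "p3", "p4"), best="p1", worst="p3"),
]
scores = bws_scores(judgments)
```

In this toy input, the pair chosen as most related by both annotators receives the highest score, and the pairs that were each chosen once as least related tie for the lowest. Real annotation would use many overlapping groups so that every pair is compared against a varied set of others.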