DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
SSRN Electronic Journal (2024)
Abstract
Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615 Social Artifacts, by engaging with 260 participants from 19 different Indian geographic subcultures. We use a gamified framework that relies on collective sensemaking to collect the names and descriptions of these artifacts such that the descriptions semantically align with the shared sensibilities of the individuals from those cultures. Next, we benchmark four popular LLMs and find that they show significant variation across regional subcultures in their ability to infer the artifacts.
Keywords
Social Learning, Topic Modeling, Social Science Research, Cumulative Culture, Text Data Methods