Data Lake Organization
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)
Abstract
We consider the problem of building an organizational directory of data lakes to support effective user navigation. The organization is defined as an acyclic graph whose nodes represent sets of attributes and whose edges indicate subset relationships between nodes. We construct a probabilistic model of user navigation behaviour, which also predicts the likelihood that users find relevant tables in a data lake given an organization. We formulate the data lake organization problem as an optimization over the organizational structure that maximizes the expected likelihood of discovering tables by navigation. We propose an approximation algorithm and analyze its error bound. The effectiveness and efficiency of the algorithm are evaluated on both synthetic and real data lakes. Our experiments show that the algorithm constructs organizations that outperform many existing organizations, including an existing hand-curated taxonomy, a linkage graph, and a common baseline organization. We have also conducted a formal user study, which shows that navigation can help users discover relevant tables that are not easily accessible through keyword search queries. This suggests that keyword search and navigation using an organization are complementary modalities for data discovery in data lakes.
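The structure described in the abstract can be illustrated with a small sketch. It is not the paper's model: the node names, the attribute sets, the overlap-based transition score, and the smoothing constant are all assumptions chosen for illustration. It only shows an acyclic graph whose edges respect subset relationships, and one way to compute the probability that a navigating user reaches a given node.

```python
# Hypothetical toy organization: every node is a set of attributes, and an
# edge parent -> child is valid only if the child's set is a subset of the
# parent's set. Names and contents are illustrative, not from the paper.
NODES = {
    "root": frozenset({"city", "population", "species", "habitat"}),
    "geo": frozenset({"city", "population"}),
    "bio": frozenset({"species", "habitat"}),
    "city": frozenset({"city"}),
    "population": frozenset({"population"}),
    "species": frozenset({"species"}),
}
EDGES = {
    "root": ["geo", "bio"],
    "geo": ["city", "population"],
    "bio": ["species"],
}


def check_subset_edges(nodes, edges):
    """Return True if every child attribute set is a subset of its parent's."""
    return all(nodes[c] <= nodes[p] for p, cs in edges.items() for c in cs)


def transition_probs(node, query, nodes, edges, smoothing=0.1):
    """Assumed navigation model: a child is chosen with probability
    proportional to its attribute overlap with the query, plus smoothing."""
    children = edges.get(node, [])
    scores = [len(nodes[c] & query) + smoothing for c in children]
    total = sum(scores)
    return {c: s / total for c, s in zip(children, scores)}


def reach_probability(target, query, nodes, edges, start="root"):
    """Probability that top-down navigation from `start` arrives at `target`,
    summed over all paths in the acyclic graph."""
    def visit(node):
        if node == target:
            return 1.0
        probs = transition_probs(node, query, nodes, edges)
        return sum(p * visit(child) for child, p in probs.items())
    return visit(start)


if __name__ == "__main__":
    query = frozenset({"city"})
    print(check_subset_edges(NODES, EDGES))
    print(reach_probability("city", query, NODES, EDGES))
```

Under this assumed model, an organization could be scored by averaging `reach_probability` over the nodes that hold relevant tables for a set of queries. That average is a stand-in for the expected discovery likelihood the abstract refers to, not the paper's definition.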
Keywords
Data lake, dataset discovery, taxonomy, structure learning