Automating Data Discovery in Data Lakes with Large Language Models

Friday 28 March 2025

Data discovery in data lakes has long been a complex and time-consuming task, requiring manual searches through vast amounts of unorganized data. But what if you could harness the power of large language models (LLMs) to automate this process? A team of researchers has developed a system called LEDD, which uses LLMs to build hierarchical global catalogs with semantically meaningful categories and to support semantic table search over data lakes.

The problem with traditional data discovery methods is that they often rely on keyword searches or manual clustering, which can be inefficient and prone to errors. LEDD, on the other hand, uses a novel approach that leverages LLMs to extract meaning from unstructured data. This allows it to automatically generate hierarchical catalogs of tables within a data lake, making it easier for users to find relevant information.

The system consists of three key components: a hierarchical global catalog generator, a semantic search engine, and a real-time relation analysis module. The first component uses LLMs to extract meaning from the data and generate a hierarchical catalog of tables. This catalog is then used by the second component, which allows users to search for tables using natural language queries.
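The catalog idea can be illustrated with a minimal sketch. It is not LEDD's implementation: the `embed` function is a toy bag-of-words stand-in for an LLM embedding, and `build_catalog`, the similarity threshold, the stopword list, and the table names are all hypothetical. It only shows the general shape of grouping tables into labeled categories to get category, table, and column levels.

```python
import math
import re
from collections import Counter

STOPWORDS = {"id"}  # assumption: drop tokens that appear in almost every table


def embed(text):
    """Toy stand-in for an LLM embedding: bag-of-words token counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_catalog(tables, threshold=0.3):
    """Greedily group tables into categories (category -> table -> columns)."""
    clusters = []  # each entry: {"centroid": Counter, "tables": {name: columns}}
    for name, columns in tables.items():
        vec = embed(name + " " + " ".join(columns))
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["tables"][name] = columns
                c["centroid"] = c["centroid"] + vec
                break
        else:
            clusters.append({"centroid": vec, "tables": {name: columns}})
    catalog = []
    for c in clusters:
        # Label = most frequent token; a real system would ask an LLM for a label.
        top = c["centroid"].most_common(1)
        catalog.append({"label": top[0][0] if top else "uncategorized",
                        "tables": c["tables"]})
    return catalog


demo_tables = {
    "customers": ["customer_id", "name", "email"],
    "customer_orders": ["order_id", "customer_id", "total"],
    "sensor_readings": ["sensor_id", "temperature", "timestamp"],
    "sensor_devices": ["sensor_id", "location"],
}
for category in build_catalog(demo_tables):
    print(category["label"], sorted(category["tables"]))
```

In this toy run the four tables fall into two categories, one around customers and one around sensors. Swapping `embed` for a real embedding model would change the similarity scores, so the threshold would need retuning.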

The third component provides real-time analysis of relationships between categories, data sources, tables, and columns. This allows users to explore the data in more detail and identify potentially related datasets. The system also includes an extensible interface that allows users to implement their own algorithms for data discovery.
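To make the extensible interface concrete, here is a small sketch of how pluggable discovery algorithms could be registered and run. The article does not describe LEDD's actual interface, so the registry `ALGORITHMS`, the `register` decorator, `analyze`, and the shared-column heuristic are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Tables = Dict[str, List[str]]          # table name -> column names
Relation = Tuple[str, str, str]        # (table_a, table_b, evidence)

# Hypothetical registry of user-supplied discovery algorithms.
ALGORITHMS: Dict[str, Callable[[Tables], List[Relation]]] = {}


def register(name: str):
    """Decorator that adds a discovery algorithm to the registry."""
    def decorator(fn):
        ALGORITHMS[name] = fn
        return fn
    return decorator


@register("shared_columns")
def shared_columns(tables: Tables) -> List[Relation]:
    """Relate two tables when they share a column name (a simple join hint)."""
    relations = []
    names = sorted(tables)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            for col in sorted(set(tables[a]) & set(tables[b])):
                relations.append((a, b, col))
    return relations


def analyze(tables: Tables, algorithm: str = "shared_columns") -> List[Relation]:
    """Run the selected algorithm over the tables."""
    return ALGORITHMS[algorithm](tables)


demo = {
    "customers": ["customer_id", "name"],
    "orders": ["order_id", "customer_id"],
    "sensors": ["sensor_id"],
}
print(analyze(demo))
```

A user could add another strategy, for example one based on value overlap or embedding similarity, by decorating a function with `@register("my_algorithm")` and passing that name to `analyze`.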

One of the key advantages of LEDD is its ability to handle large amounts of unorganized data. Traditional data discovery methods can be overwhelmed by the sheer volume of data, whereas LEDD condenses the lake into an LLM-generated hierarchical catalog that users can browse to find relevant information.

Another advantage of LEDD is its ease of use. The system provides a user-friendly interface that allows users to search for tables using natural language queries. This eliminates the need for complex SQL queries or manual searches through large datasets.
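As a rough illustration of natural language table search, the sketch below ranks tables by the similarity between a query and a short description of each table. It is a simplified stand-in, not LEDD's search engine: `toy_embed` only counts words, so it behaves like keyword matching, and the table descriptions are invented. The `embed` parameter marks where a real LLM embedding function would be plugged in.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an LLM embedding: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, descriptions, embed=toy_embed, top_k=2):
    """Return up to top_k (table, score) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), table)
              for table, desc in descriptions.items()]
    # Sort by descending score, then table name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(table, round(score, 3)) for score, table in scored[:top_k] if score > 0]


# Hypothetical table descriptions, e.g. summaries an LLM might generate.
descriptions = {
    "customer_orders": "orders placed by customers with totals and dates",
    "sensor_readings": "temperature readings from factory sensors over time",
    "employees": "staff records with salaries and departments",
}
results = search("which tables hold temperature readings from sensors", descriptions)
print(results)  # sensor_readings is the only table with a nonzero score
```

With real embeddings, a query such as "machine heat measurements" could still match `sensor_readings` even without shared words, which is the point of semantic rather than keyword search.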

LEDD has been tested on a recent benchmark for database schema analysis, which involved 221,171 database schemas extracted from SQL files on GitHub. The results showed that LEDD was able to generate accurate hierarchical catalogs of tables and provide relevant search results using natural language queries.

Overall, LEDD is an innovative system that uses LLMs to automate data discovery in data lakes. Its ability to handle large amounts of unorganized data and provide a user-friendly interface makes it a powerful tool for data analysts and scientists.

Cite this article: “Automating Data Discovery in Data Lakes with Large Language Models”, The Science Archive, 2025.

Data Lakes, Large Language Models, Semantic Search, Hierarchical Catalog, Natural Language Queries, Data Discovery, Automated Analysis, Unorganized Data, Database Schema, LLMs.

Reference: Qi An, Chihua Ying, Yuqing Zhu, Yihao Xu, Manwei Zhang, Jianmin Wang, “LEDD: Large Language Model-Empowered Data Discovery in Data Lakes” (2025).
