Monday 31 March 2025
A new approach has been proposed for generating descriptions of tables and columns in databases, a crucial step in converting natural language queries into structured query language (SQL). The method uses a dual-process strategy to analyze the database structure and provide accurate semantic context, enabling more effective generation of SQL queries.
The process begins with an overall understanding of the database, where a large language model (LLM) is provided with the complete database schema. This allows the LLM to identify the specific domain and general content of the database, as well as acquire prior knowledge about typical dimensions and metrics of interest within that domain. This foundational comprehension provides a contextual framework for further analysis.
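As a rough illustration of this first stage, the sketch below collects the full schema of a database and wraps it in a single overview prompt. It assumes a SQLite database; the function names, the example tables, and the prompt wording are hypothetical and are not taken from the paper. The call to the LLM itself is left out.

```python
import sqlite3


def dump_schema(conn):
    """Return the CREATE statements of all user tables, sorted by table name."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql for (sql,) in rows if sql)


def build_overview_prompt(schema):
    # Hypothetical prompt wording; the article does not give the actual prompt.
    return (
        "Below is the complete schema of a database.\n"
        "1. Which domain does this database belong to?\n"
        "2. What does it contain in general?\n"
        "3. Which dimensions and metrics are typically of interest in this domain?\n\n"
        f"{schema}"
    )


# Tiny in-memory database, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL);
    """
)
overview_prompt = build_overview_prompt(dump_schema(conn))
print(overview_prompt)
```

The resulting string would be sent to the model once per database, and its answer reused as context in the later, table-level steps.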
The next step involves analyzing each individual table within the database, focusing on functional analysis and semantic prediction of columns. The LLM is presented with information about what data is stored in each table and what its function might be, as well as contextual information from the earlier stage. This enables the model to hypothesize the semantic significance and potential meaning behind the data.
To gain a deeper understanding of each field, the LLM assigns it to a category such as code, enum, datetime, or text. Code fields contain identifiers or codes that uniquely identify an entity or object in the database, while enum fields represent categorical data with a predefined list of possible values. Datetime fields capture specific points in time, and text fields contain unstructured or semi-structured data.
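To make the four categories concrete, here is a small rule-based stand-in. In the described method the LLM does this classification; the heuristics, thresholds, date formats, and function names below are assumptions chosen only to illustrate what separates the categories.

```python
from datetime import datetime


def looks_like_datetime(value):
    """True if the string parses with one of two assumed date formats."""
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False


def categorize_column(name, values, enum_max_distinct=10):
    """Heuristic guess at 'code', 'enum', 'datetime' or 'text' (not the paper's method)."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return "text"
    if all(looks_like_datetime(v) for v in non_null):
        return "datetime"
    distinct = set(non_null)
    lowered = name.lower()
    named_like_code = lowered in ("id", "code") or lowered.endswith(("_id", "_code"))
    unique_tokens = len(distinct) == len(non_null) and all(" " not in v for v in non_null)
    if named_like_code or unique_tokens:
        return "code"
    # Few distinct values that repeat suggest a predefined list.
    if len(distinct) <= enum_max_distinct and len(distinct) < len(non_null):
        return "enum"
    return "text"


print(categorize_column("order_id", [101, 102, 103]))
print(categorize_column("status", ["paid", "open", "paid", "open"]))
print(categorize_column("created_at", ["2025-03-31", "2025-01-02 10:00:00"]))
print(categorize_column("comment", ["arrived late", "great service, thanks"]))
```

Simple rules like these break down on real data (free-text status fields, numeric enums, unusual date formats), which is one reason to delegate the decision to a model that also sees the table's context.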
The LLM is then prompted to analyze columns within the same category, identifying differences and interconnections between similar fields. This process helps the model effectively differentiate between similar fields, avoiding potential confusion.
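One way to set up this comparison step is to group columns by category and build one prompt per group that has at least two members. The sketch below assumes the categories have already been assigned; the helper names, column names, and prompt wording are hypothetical.

```python
from collections import defaultdict


def group_by_category(categories):
    """Map each category to the list of columns assigned to it."""
    groups = defaultdict(list)
    for column, category in categories.items():
        groups[category].append(column)
    return dict(groups)


def build_comparison_prompts(table, categories):
    """Return one comparison prompt per category with two or more columns."""
    prompts = {}
    for category, columns in group_by_category(categories).items():
        if len(columns) < 2:
            continue  # a single column has nothing to be compared with
        prompts[category] = (
            f"In table '{table}', these columns are all of type '{category}': "
            f"{', '.join(columns)}. "
            "Explain how they differ and how they relate to one another."
        )
    return prompts


# Made-up category assignments for an 'orders' table.
categories = {
    "order_id": "code",
    "customer_id": "code",
    "status": "enum",
    "created_at": "datetime",
    "shipped_at": "datetime",
}
prompts = build_comparison_prompts("orders", categories)
for category, prompt in prompts.items():
    print(category, "->", prompt)
```

Here `status` is the only enum column, so no comparison prompt is produced for it.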
Finally, the LLM predicts the likely semantics of each field based on its basic information, relationships with other fields, and the table’s overall context. The generated descriptions are limited to 20 words for columns and 100 words for tables.
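The article does not say whether the word limits are imposed through the prompt or checked afterwards. If they were checked afterwards, a minimal post-processing sketch could look like this; the constant and function names are hypothetical.

```python
COLUMN_WORD_LIMIT = 20   # limit stated for column descriptions
TABLE_WORD_LIMIT = 100   # limit stated for table descriptions


def within_limit(description, limit):
    """True if the description has at most `limit` whitespace-separated words."""
    return len(description.split()) <= limit


def truncate_words(description, limit):
    """Keep at most the first `limit` words, normalizing whitespace."""
    return " ".join(description.split()[:limit])


column_description = "Unique identifier of the customer who placed the order"
long_description = " ".join(["word"] * 25)

print(within_limit(column_description, COLUMN_WORD_LIMIT))
print(len(truncate_words(long_description, COLUMN_WORD_LIMIT).split()))
```

Truncation is a blunt fallback, since it can cut a sentence short; asking the model to regenerate an over-long description would be an alternative.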
The proposed method was tested on a development benchmark, where automatically generated descriptions improved SQL generation accuracy over using no descriptions at all. The generated descriptions raised performance by 0.93% compared to having no description, and achieved 39% of the level reached with manually crafted descriptions.
This new approach has significant implications for natural language-to-SQL (NL2SQL) systems, which rely on accurate understanding of database structures to generate effective queries.
Cite this article: “Automated Database Description Generation for Enhanced SQL Querying”, The Science Archive, 2025.
Databases, Natural Language Processing, SQL, Large Language Model, Dual-Process Strategy, Database Schema, Table Analysis, Semantic Prediction, Column Analysis, NL2SQL Systems
Reference: Yingqi Gao, Zhiling Luo, “Automatic database description generation for Text-to-SQL” (2025).







