I am working on a system where a model should understand the relationships within a relational database (one-to-many, many-to-many, etc.) and answer natural language queries about it. The challenge is ensuring that the model comprehends both the schema and the data while preserving relational integrity.
Currently, I am considering two possible approaches:
- Vectorizing the entire database (schema + data)
  - Convert both the schema and the actual table data into vector embeddings.
  - Store these embeddings in a vector database to enable similarity search for RAG.
  - When a user asks a question, retrieve the most relevant data by embedding similarity and pass it to the LLM for response generation.
  - This method allows direct data retrieval but requires efficient indexing and filtering to avoid pulling in irrelevant results.
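A minimal sketch of this retrieval loop, using a toy bag-of-words "embedder" and an in-memory index in place of a real embedding model and vector database (all chunk contents and function names here are illustrative, not part of any particular library):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call a
    # sentence-embedding model and store vectors in a vector DB.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Serialize each row *with its join context* into a text chunk, so the
# one-to-many / many-to-many links survive flattening into embeddings.
chunks = [
    "order 1001 placed by customer Alice Smith contains product Laptop quantity 2",
    "order 1002 placed by customer Bob Jones contains product Keyboard quantity 1",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank chunks by similarity to the query; the top-k chunks would
    # then be passed to the LLM as context for answer generation.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

print(retrieve("what did Alice order?"))
```

The key design choice is the serialization step: if rows are embedded in isolation, cross-table relationships are lost, so joined context has to be baked into each chunk (or recovered via metadata filters at query time).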
- Vectorizing only the schema and connecting the model to the database
  - Convert only the schema (table names, columns, relationships, etc.) into vector embeddings.
  - Store these embeddings in a vector database and use similarity search to match the user's question to the relevant tables.
  - Based on the retrieved schema context, the model generates a SQL statement dynamically.
  - Execute the generated SQL against the relational database to fetch real-time results.
  - This approach always returns live, accurate data but relies entirely on the correctness of the generated SQL.
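A minimal sketch of the schema-introspection and guarded-execution steps, using SQLite and a hard-coded stand-in for the model's output (in a real pipeline an LLM would produce the statement from the schema text plus the user question; the guardrails shown are one possible mitigation for incorrect or unsafe SQL, not a complete defense):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0);
""")

def schema_text(conn: sqlite3.Connection) -> str:
    # Introspect the live schema; this is the text you would embed
    # and retrieve from the vector store to ground SQL generation.
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return "\n".join(r[0] for r in rows)

def run_generated_sql(conn: sqlite3.Connection, sql: str) -> list:
    # Guardrails around model-generated SQL: allow only a single
    # SELECT, and let EXPLAIN surface syntax/reference errors before
    # the statement actually runs.
    if not sql.lstrip().lower().startswith("select") or ";" in sql.rstrip(";"):
        raise ValueError("only a single SELECT statement is allowed")
    conn.execute("EXPLAIN " + sql)  # raises sqlite3.Error on invalid SQL
    return conn.execute(sql).fetchall()

# Hypothetical model output for "how much has each customer spent?"
generated = (
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
)
print(run_generated_sql(conn, generated))
```

Restricting execution to read-only statements (or a read-only database role) is usually the first safeguard teams add here, since a wrong `UPDATE` or `DROP` from a hallucinated query is far more costly than a wrong answer.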
Are there better ways to approach this problem? Have you implemented something similar, and what challenges did you face?