How to build a RAG system with gpt-API to analyse excel files

I am trying to use the GPT Api to make a retrieval RAG system to analyse excel sheets of numeric data. It is difficult to handle the numeric data to embedding using the pandas library as most of the context and semantic information is lost while reading them into a Dataframe and chunking them. Is there any good technique or practises about how to develop a RAG system properly to query and analyse the excel data?

3 Likes

Will you use a single dedicated Excel file as the source for RAG, or will you submit a separate Excel file for analysis with each conversation?

Hi, thank you for the response. I shall be using a single excel file at the moment to upload into the RAG system (and might/will scale up later if it works for multiple files).

One approach is to store the Excel file as it is or, if the data is structured, create an DB.

Based on query, You can then use a language model to generate queries for filtering and analyzing the data.

Most cases Tools are great for querying and producing the final response.

Yup, I thought about it, but the Excel sheet data is not perfect, in the sense that it is visually appealing with colours and headings in the Excel but not as good to convert it to a DB. Will I need to manually change the excel file properly everytime so that a sematic/schema is maintained for the DB to understand?

How’s your file looks like?

MAPPING ANALYSIS ANALYSIS
Total Companies Visible Scale-ups Scale-up Potential SMEs Growth Strategy Priorities
Financial Services Professional Services Business Services Segment Population % Total Population Financial Services Professional Services Business Services Segment Total % Borough % Total Population % VSU Population Financial Services Professional Services Business Services Segment Total % Borough % Total Population % SUP Population Financial Services Professional Services Business Services
CIS 64-66 CIS 69-74 CIS 77-78, 82 CIS 64-66 CIS 69-74 CIS 77-78, 82 CIS 64-66 CIS 69-74 CIS 77-78, 82 Market Penetration Market Development Product Development Market Penetration Market Development Product Development Market Penetration Market Development Product Development
Region A 939 2370 1833 5142 41.0% 24 72 89 185 3.6% 1.5% 39.1% 242 664 507 1413 27.5% 11.3% 40.6% 0 0 0 0 0 0 0 0 0
Large 13 9 8 30 0 2 1 3
SME 926 2361 1825 5112 24 70 88 182 242 664 507 1413 27.6% 0 0 0 0 0 0 0 0 0
Phases of Growth (SMEs)
Start-up Businesses (0 to 5 years) 598 1393 898 2889 5 14 6 25 105 190 97 392 7.6% 3.1% 11.3% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 149 542 487 1178 9 33 56 98 71 281 188 892 17.3% 7.1% 25.6% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 87 295 333 715 5 20 24 49 40 137 175 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 81 124 93 298 5 3 2 10 21 55 40 129 2.5% 1.0% 3.7% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 11 7 14 32 0 0 0 0 5 1 7 0 0 0 0 0 0 0 0 0
Region B 368 934 793 2095 16.7% 15 32 34 81 3.9% 0.6% 17.1% 128 230 242 600 28.6% 4.8% 17.2% 0 0 0 0 0 0 0 0 0
Large 4 1 4 9 0 1 0 1
SME 364 933 789 2091 15 31 34 80 128 230 242 600 28.7% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 204 463 368 1035 2 4 2 8 40 42 45 127 6.1% 1.0% 3.6% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 77 192 181 450 9 15 20 44 43 76 76 405 19.3% 3.2% 11.6% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 66 206 170 442 4 11 11 26 33 88 89 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 17 62 63 142 0 1 1 2 9 24 30 68 3.2% 0.5% 2.0% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 5 10 7 22 0 0 0 0 3 0 2 0 0 0 0 0 0 0 0 0
Region C 347 1425 1040 2812 22.4% 12 44 41 97 3.4% 0.8% 20.5% 108 362 329 799 28.4% 6.4% 22.9% 0 0 0 0 0 0 0 0 0
Large 3 1 7 11 1 0 0 1
SME 344 1424 1033 2801 11 44 41 96 108 362 329 799 28.5% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 205 748 371 1324 1 12 7 20 37 63 38 138 4.9% 1.1% 4.0% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 58 277 286 621 6 24 20 50 23 116 120 557 19.8% 4.4% 16.0% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 54 292 307 653 2 8 10 20 33 129 136 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 21 101 65 187 2 0 4 6 11 52 32 104 3.7% 0.8% 3.0% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 6 6 4 16 0 0 0 0 4 2 3 0 0 0 0 0 0 0 0 0
Region D 186 462 384 1032 8.2% 5 24 14 43 4.2% 0.3% 9.1% 59 121 93 273 26.5% 2.2% 7.8% 0 0 0 0 0 0 0 0 0
Large 0 3 6 9 0 1 0 1
SME 186 459 378 1023 5 23 14 42 59 121 93 273 26.7% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 110 239 205 554 2 5 2 9 18 33 18 69 6.7% 0.6% 2.0% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 32 94 79 205 0 11 8 19 17 34 34 168 16.3% 1.3% 4.8% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 35 78 64 177 2 5 2 9 20 36 27 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 8 43 29 80 1 2 2 5 4 18 13 36 3.5% 0.3% 1.0% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 1 5 1 7 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
Region E 111 274 190 575 4.6% 4 9 7 20 3.5% 0.2% 4.2% 31 71 51 153 26.6% 1.2% 4.4% 0 0 0 0 0 0 0 0 0
Large 0 1 1 2 0 0 0 0
SME 111 273 189 573 4 9 7 20 31 71 51 153 26.7% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 64 150 93 307 1 4 0 5 13 8 11 32 5.6% 0.3% 0.9% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 24 50 43 117 3 2 6 11 13 29 16 101 17.6% 0.8% 2.9% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 16 51 34 101 0 2 0 2 5 26 12 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 6 21 18 45 0 1 1 2 0 7 11 20 3.5% 0.2% 0.6% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 1 1 1 3 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
Region F 160 463 266 889 7.1% 8 23 16 47 5.3% 0.4% 9.9% 55 114 76 245 27.6% 2.0% 7.0% 0 0 0 0 0 0 0 0 0
Large 5 1 0 6 0 0 0 0
SME 155 462 266 883 8 23 16 47 55 114 76 245 27.7% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 82 214 110 406 1 3 1 5 24 16 10 50 5.6% 0.4% 1.4% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 42 96 71 209 4 11 7 22 16 37 23 162 18.2% 1.3% 4.7% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 21 106 59 186 2 7 6 15 9 46 31 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 5 45 26 76 1 2 2 5 2 15 12 33 3.7% 0.3% 0.9% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 5 1 0 6 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0
Region G 2111 5928 4506 12545 100% 68 204 201 473 3.8% 3.8% 100.0% 623 1562 1298 3483 27.8% 100.0% 0 0 0 0 0 0 0 0 0
Large 25 16 26 67 1 4 1 6
SME 2086 5912 4480 12483 67 200 200 467 623 1562 1298 3483 27.8% 100.0% 0 0 0 0 0 0 0 0 0
Phases of Growth
Start-up Businesses (0 to 5 years) 1263 3207 2045 6515 12 42 18 72 237 352 219 808 6.4% 23.2% 0 0 0 0 0 0 0 0 0
High Growth Potential Entrepreneurial Enterprises (5 to 10 years) 382 1251 1147 2780 31 96 117 244 183 573 457 2285 18.2% 65.6% 0 0 0 0 0 0 0 0 0
Scaleup Potential Businesses (10 to 20 years) 279 1028 967 2274 15 53 53 121 140 462 470 0 0 0 0 0 0 0 0 0
Scaleup Businesses (20 to 50 years) 138 396 294 828 9 9 12 30 47 171 138 390 3.1% 11.2% 0 0 0 0 0 0 0 0 0
Repeat Scaleup Businesses (50 years plus) 29 30 27 86 0 0 0 0 16 4 14 0 0 0 0 0 0 0 0 0

You can store the Excel file in OpenAI’s file storage or AWS S3 and build an analytics module using Pandas. This can then be integrated with function calling using the OpenAI Assistant API to support retrieval-augmented generation (RAG). I have developed a module to dynamically manage and load plugins for function calling with the OpenAI Assistant API. This module is built on top of the OpenAI Assistant API and the AWS Serverless framework, utilizing AWS API Gateway, AWS Lambda, and AWS DynamoDB. It supports HTTPS interactions and streaming through WebSocket connections.

Currently, I am updating the README to document the recent enhancements to the module. If you’re interested in learning more about its implementation, I’d be happy to share additional details once the README update is complete. You can find the repository here: openai_assistant_engine.