SQL Stored Procedure Code Summarization

Our client has 100 SQL-stored procedures from one of their legacy applications, they don’t have any documentation for these procedures. So, we wanted to use OpenAI to code summarization and documentation for each of these procedures. Has anyone tried this use case using OpenAI? Did you face any issues (except the token limits) while sending SQL code to the API? Any reference code is available?
Thanks in advance.

1 Like

Do they have the DB schema? If so that is known to be of value as part of the prompt with creating SQL and I would also suspect in understanding SQL procedures.

Also, why use the API, the free ChatGPT I would suspect is more than capable of doing it with a decent prompt. You may have to modify the prompt some and perhaps give names to fields if they are a bit cryptic.

1 Like

Yeah, why use the API for a one-off if slightly repetitive task?

1 Like

Thanks for your reply, Eric. They do have DB Schema which gives us some info.
Though we are considering ChatGPT too, considering they have thousands (Not 100) of procedures, someone needs to manually sit and do one-by-one for each procedure. Instead, it, we are thinking of doing this programmatically and procedure documentation which may be 50-60% accurate, and then do manual validation and updates.

Thank you for the update. It would have been helpful if this information had been provided earlier, as it would have saved us some time.

You also need to consider that there is no requirement that one stored procedure is asked as one question. The limit is the number of combined tokens used. So structure the prompts to do multiple stored procedures at a time.

1 Like

Thanks Eric for your response. Its helpful

Hi @EricGT , We have the similar use case where we have to crawl thousands of stored procedures to get the data lineage. Can you suggest how we can do with open AI??

1 Like

Welcome to the forum!

I don’t know exactly what you mean by data lineage.

If your goal is to recreate the schema with only access to stored procedures and you can run your own stored procedures then I would start by asking the o1-preview model or similar to create code to extract info from the stored procedures to recreate a schema and most importantly have code created to test out the the fields and joins in the created schema. The reason I would not directly have an LLM do the analysis is that code will give a deterministic result and an LLM can hallucinate.

1 Like

@EricGT Data lineage means tracking the start , in between and end tables. For eg, if a stored procedure using 2 tables to create third table then we need to extract this mapping in excel. Hope this explains my query.

2 Likes

If I understand data lineage correctly then you want to create a graph (nodes and edges) or similar conveyance that shows the flow of data from the data to the end exit points, I would say end points but know that data leaving stored prompts and SQL are often further processed before delivery.

So I am thinking that you just want to extract edges from each stored procedure attach nodes to the edges then somehow get the names of the nodes to be the same when you combine all of these mini-segments.

To start with I would just create a prompt like

For the following SQL stored procedure, please create an edge for the flow of each item of data through the stored procedure.

Attach a node to each end of the edge with unique information such as

  • start or end
  • name of stored procedure possibly with module name
  • reference to stored procedure such as line number

For the edge include following information

  • Data on edge
  • Transformation of data
  • Try to give the edge a unique identifier

I would use the o1-preview model first for a few of these to see what it does and then try a simpler model to see if you get similar results. Use whichever model is cheaper but don’t sacrifice accuracy.

One the prompt is tuned, then I would use the OpenAI API to automate it for the rest of the stored procedures.

Now with the list of mini-segments, see of the AI can combine them all together into one large graph or similar conveyance. The extra information about the nodes should help the AI figure out how to identify similar nodes in different segments.

Personally I prefer using Prolog facts for such information and using Cytoscape.js to create HTML pages to present the graph. I have also found that OpenAI models are extremely good at creating entire stand alone pages (no web server needed) for such on the first prompt. Then using canvas to update the code as desired.