First time poster here. I’m struggling a bit with my prompting approach, and would appreciate any help from more experienced folks here.
My use case is summarization of a JSON object into “natural language” following a user-specified template. If helpful, I’ve included examples of the prompts and JSON objects I’m using, at the end.
[Use Case Context] Specifically, what happens is this:
- In a web-UI, a user triggers the summarization of the current “page” in the software. The software represents the page content as a JSON. The overall structure of the “page JSON” is always the same, though the exact data it contains is not known ahead of time
- The user provides a “template” for how to summarize the JSON into natural language. For the non-technical user, the template is often a “page summary” example from an earlier project, when all of this was done manually. In other words, this is not a few-shot example, since the user is only providing a “page summary” and not a JSON to go with it
- The software passes the current “page JSON” and the “template” to the OpenAI API and asks GPT 3.5 to summarize the JSON object following the given template
[Results and Problems] This “somewhat works,” but has a few issues:
- Often, content from the user-specified “template” bleeds into the generated content
- Occasionally, the “template” isn’t respected at all
[Question and Additional Information] How can I improve on what I’m doing? A few notes:
- Templates are specific to projects and customers, so I can’t define them ahead of time
- I’ve thought of approaching template definition via few shot examples. To do that, though, I’d have to generate synthetic data for the page JSON and have the user manually write up a corresponding summary. That’s not trivial in this case, so I’m hoping to find something more straightforward
Example Prompts and JSON
System Prompt #1
This hard-coded prompt is simply an example JSON, to explain to the model how to parse the structure.
Here are instructions for how to read the JSON object below. The JSON describes an iteration that is made up of 3 steps (My First Step, My Second Step, My Third Step). Steps have a name (My First Step), a description (These are the instructions for the first step). Steps contain text (This is a comment to a step), datasets (DTS-1234) or models. This step is part of a phase (Modeling) that is the second phase in the project (PHA-2). The iteration ID (ITR-4) tells us that this is the 4th iteration of the phase. MDL-1 is a Model. MDV-4 Model Version. DTS-1234 is a Dataset. DTV-15 is a dataset Version. Widgets can only be created from objects with IDs that start with MDL or DTS. Widgets are never based on versions.
{"steps":[{"name":"My First Step","description":"These are the instructions for the first step.","artifacts":[{"text":"This is a comment to a step."}]},{"name":"My Second Step","description":"These are the instructions for the second step.","artifacts":[{"datasetVersion":{"dataSet":{"name":"This is the name of the dataset with ID=DTS-1234","description":"This is the description of the dataset with ID=DTS-1234","id":"DTS-1234"},"properties":[],"datasetSources":[{"files":[{"size":"21314242","rowsNumber":43276,"columnsNumber":152,"name":"dataset filename.csv","columns":[{"name":"My First Column","statistics":{"numerical":{"mean":1.234}},"dataType":"int64"},{"name":"My Second Column","statistics":{"numerical":{"mean":1.234}},"dataType":"int64"}]}],"dbs":[]},{"files":[{"size":"63935953","rowsNumber":129826,"columnsNumber":152,"name":"dataset filename.csv","columns":[{"name":"Column (0)","statistics":{"numerical":{"mean":5.678}},"dataType":"int64"}]}],"dbs":[]}],"id":"DTV-15"}},{"text":"Kept back 38% for testing (~ 200,000 rows). Seed is 5678."}]},{"name":"My Third Step","description":"These are the instructions for the third step","artifacts":[{"modelVersion":{"model":{"name":"Unit Sales Predictor","id":"MDL-1234"},"metrics":[{"key":"MAE","value":"0.81818181"},{"key":"RMSE","value":"0.41414141"}],"properties":[],"id":"MDV-4"}},{"text":"The model generated the following metrics: \nRMSE = 0.41414141 and MAE = 0.81818181"}]}],"phase":{"name":"Modeling","id":"PHA-2"},"owner":{"name":"Eazy-E Breezy"},"id":"ITR-4"}
System Prompt #2
This is the “template” that I expect the user to provide. The intent is for the user to provide some concrete example for the model to imitate, by replacing the information in the template with the content of the JSON passed via user prompt, below.
You are an assistant that describes JSON objects in English. Everytime you receive json input, I want you to describe it in the format below. This is THE TEMPLATE:
Recall that our modeling methodology involves the N following steps:
* First, we select a modeling technique
* Second, we generate a test design
* Third, we build the model
* Finally, we assess the performance of the model
This documentation refers explicitly to the iteration {Iteration ID} developed by {author name}.
We are working with a dataset (below) of store sales that includes location, time factors, marketing efforts, inventory, economic indicators, customer characteristics, online presence, and event influence.
{NOTE: Insert Dataset Widget DTS-7}
For this iteration, we selected a linear regression model to get a base model. We then split the dataset into training, testing and validation datasets. Here, N% of the dataset was set aside for testing (N rows). For dataset replication purposes, use a seed value of N.
The model's summary statistics for this iteration are MAE (0.XXX) and RMSE (0.XXX).
{NOTE: Insert Model Widget MDL-1}
As expected, the model performs better however this is not good enough and we should try a different method. We recommend doing a Random Forest as a new iteration to get a base model.
---
You must follow these instructions as closely as possible:
1/ Never justify your answers
2/ Never change THE TEMPLATE
3/ Never give information not mentioned in THE TEMPLATE
4/ Always express numbers to 3 decimal places max
5/ Always use the past tense
6/ Always use the "we" pronoun
7/ Never use curly braces
8/ Always start widgets with MDL or DTS
User Prompt
This is the actual JSON passed by the software. The general structure is the same as explained to the model in System Prompt #1 but the content is not known ahead of time. The idea is for the model to take this JSON and the template given in System Prompt #2, and write something similar given the information in the JSON below.
{"steps":[{"name":"Select Modeling Techniques","description":"","artifacts":[{"text":"For this first iteration we are going to use a Linear Regression model to get a base model."}]},{"name":"Generate Test Design","description":"","artifacts":[{"datasetVersion":{"dataSet":{"name":"my modeling dataset","description":"Store sales dataset includes location, time factors, marketing efforts, inventory, economic indicators, customer characteristics, online presence, and event influence.","id":"DTS-7"},"properties":[],"datasetSources":[{"files":[{"size":"21314242","rowsNumber":43276,"columnsNumber":152,"name":"testdataset.csv","columns":[{"name":"description_Puente Dia de Difuntos","statistics":{"numerical":{"mean":0.012778445327664294,"stdDeviation":0.11231851216113628,"quantiles":{"qMin":0,"q25":0,"q50":0,"q75":0,"qMax":1},"missing":0}},"dataType":"int64"}]}],"dbs":[]},{"files":[{"size":"63935953","rowsNumber":129826,"columnsNumber":152,"name":"traindataset.csv","columns":[{"name":"Column (0)","statistics":{"numerical":{"mean":173294.9089858734,"stdDeviation":99900.41755914238,"quantiles":{"qMin":3,"q25":86754.25,"q50":173456.5,"q75":259967.75,"qMax":346203},"missing":0}},"dataType":"int64"}]}],"dbs":[]}],"id":"DTV-15"}},{"text":"We split the dataset in a training, testing and validation datasets. 25.0% of the data is set aside for testing.\n - Training dataset size: 129826\n - Testing dataset size: 43276\n Our seed to generate repeatable datasets is 42"}]},{"name":"Build Model","description":"","artifacts":[{"modelVersion":{"model":{"name":"Unit Sales Predictor","id":"MDL-1"},"metrics":[{"key":"MAE","value":"0.17145044133864917"},{"key":"RMSE","value":"0.5022413478376052"}],"properties":[],"id":"MDV-4"}},{"text":"The model generated the following metrics: \nRMSE = 0.5022413478376052 and MAE = 0.17145044133864917"}]},{"name":"Assess model","description":"","artifacts":[{"text":"As expected the model performs better however this is not good enough and we should try a different method. We recommend doing a Random Forest as a new iteration to get a base model."}]}],"phase":{"name":"Modeling","id":"PHA-2"},"owner":{"name":"Eric Barre"},"id":"ITR-4"}