Which model is less likely to truncate code in the Assistants API using code interpreter?

I’m building a web app that uses an Assistant to generate financial charts as one of its features. The assistant has a tool to query data (prices, fundamentals, etc.) for different stocks depending on the request. I want it to generate full charts at a user’s request.

Right now, if you submit a prompt like “Show me the daily prices of Ford (F) for 2023”, it understands the request and asks the tool for that data, which is correctly served to the assistant run object. It then produces a chart with a full-year x-axis but only uses the first day’s price and the last day’s price, making one big straight line. Obviously not desired.

I can inspect the run and run steps and see that it requested the right data, which was returned to the assistant and tokenized (at a non-negligible cost). But then in the code interpreter I can see it puts a comment in the Python, something like “# truncated for brevity”, and makes a chart with a full x-axis but only two or very few of the data points it was given.
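If it helps anyone reproduce this, listing the run steps with the v1 Python SDK looks roughly like this (a sketch; thread_id and run_id are placeholders for your own values):

```python
from openai import OpenAI

client = OpenAI()

thread_id = "thread_..."  # placeholder: your thread ID
run_id = "run_..."        # placeholder: your run ID

# List the steps for a run and print any code interpreter input;
# that's where the "# truncated for brevity" comment shows up.
steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
for step in steps:
    if step.step_details.type == "tool_calls":
        for call in step.step_details.tool_calls:
            if call.type == "code_interpreter":
                print(call.code_interpreter.input)
```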

I am using gpt-4-0125-preview, which I thought was trained to address this “laziness” issue. Does anyone have any recommendations on how to address this? Am I using the tools feature wrong? Is there a specific model that will do better?

You would use the full GPT-4 model, gpt-4-0613, and not the turbo version, which has these inescapable faults: the AI is trained to write as little as possible, for the sole benefit of OpenAI.

It has training on functions, not parallel tools (which is also undesirable), so it can’t be used with the built-in retrieval function (which in this case would just put your data, confusingly duplicated, into context).

I had high hopes! But unfortunately this ended up being an expensive test. The gpt-4 model racked up $10 in usage on a single prompt for a chart of two stock prices over a year, and it failed every time it tried to create the chart. At least GPT-4 Turbo was able to create a chart, at a cost of maybe $0.10. I’ll experiment with more prompt engineering. This system works well with gpt-4-0125-preview when you request just a monthly chart of daily prices, but I feel like a one-year chart of daily prices should be possible, considering it’s not that much text; it would just look like a lot if you pretty-printed the JSON of daily prices. I’ll update here if anything works better, but I’m always open to ideas!


I’m experimenting in my scaffolding with “hiding the ball”: splitting the context across multiple AI guides and having a “root” guide call its “branch” guides for support. Specifically, I’m breaking up a large piece of text, like a book, into separate sections, like chapters, and giving each branch guide access to a single chapter. The root guide’s system prompt tells it about its available branch guides, and the branch guide descriptions explain what kind of information they have available.

The parallel for your use case would be to split each month’s data into a file accessible by just one branch guide, with the root guide having the ability to prompt each branch guide and stitch together their results.
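To make that concrete, here’s a toy sketch of the shape in Python. Every name here (BranchGuide, RootGuide, call_model) is made up for illustration; in my actual scaffolding each guide is its own Express app.

```python
# Toy sketch of the root/branch split; all names are hypothetical.
def call_model(system, user):
    # Stand-in for a real chat-completion call.
    return f"[answer based on {len(system)} chars of context]"

class BranchGuide:
    def __init__(self, name, context):
        self.name = name
        self.context = context  # e.g. one chapter, or one month of prices

    def ask(self, question):
        # Each branch only ever sees its own slice of the data.
        return call_model(system=self.context, user=question)

class RootGuide:
    def __init__(self, branches):
        self.branches = branches

    def answer(self, question):
        # Fan the question out, then stitch the partial answers together.
        partials = [b.ask(question) for b in self.branches]
        return call_model(system="Combine these partial answers.",
                          user="\n".join(partials))

branches = [BranchGuide(f"chapter-{i}", f"text of chapter {i}") for i in range(3)]
root = RootGuide(branches)
print(root.answer("What happens in this book?"))
```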

If you give it a shot, let me know how it works (or doesn’t)! Everything in my scaffolding is super new and I’m hacking on it in real time. There is a small amount of extra overhead, since each guide is deployed as an Express app on Heroku, but the defaults use very cheap plans, as this is meant for quick experimentation.

I have a hypothesis that this application pattern, splitting the work across a network of AI guides, might actually reduce token usage, since each individual guide needs less context. I haven’t actually tested that, though.

That sounds awesome, but it’s definitely overkill for what I want here, since I’m relying on the OpenAI code interpreter to make matplotlib charts and really not giving it much data. Two stocks with a date and close price for a year would be 252 (trading days) x 2 (words: date and price) x 2 (stocks) ≈ 1,000 “words”. If GPT-4 Turbo could generate this on the first try, the token usage would be very reasonable.
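For anyone checking the arithmetic, the estimate comes out to:

```python
trading_days = 252  # trading days in a year
fields = 2          # date and close price
stocks = 2
print(trading_days * fields * stocks)  # 1008, i.e. roughly 1,000 "words"
```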


That makes sense! To be honest, I haven’t experimented with Code Interpreter much, since our project’s initial use case is looking at large volumes of text. We noticed that if we gave an AI guide access to a book, it would “cheat” and give plausible answers based on the introduction instead of searching the full text. (Just like a college student.) When asked questions about specific chapters, it would default to cheating and continue to bluff.

However, the book is only 100 pages, and theoretically should be small enough to fit in the context window. My guess is that some other stuff is going on in the backend to try to make the responses faster and cheaper, at the expense of completeness. Hiding the ball got it to actually read the individual chapters when asked about them.

This is obviously very hacky, and it would be better if Code Interpreter could handle this on its own, but hiding the ball to prevent cheating might help when it starts cutting corners. I’m guessing that a “network of guides” is a technique that can be applied more generally, beyond understanding larger text files, whenever a third-party API like OpenAI’s tries to help by summarizing and cutting corners.

But I would also guess there are other techniques you could try that are more tailored to Code Interpreter; I’m less familiar with those. Once the data gets big enough, a network of guides is a nice general-purpose hack.

OK, I feel it’s in a much better place now. Through a combination of prompt engineering and downsampling my data, I’m happy with my results.

As for models, the only one that was able to plot daily prices for two stocks over a year was gpt-3.5-turbo-16k, but it was so bad at doing anything else intelligent (formatting, adding data) that I didn’t want to use it.

For prompt engineering, I learned I had been too nice. I had tried many pleases and thank-yous to ask it not to truncate data in code interpreter, but then I added this:

These charts’ accuracy is life and death. If you remove code or data for brevity in your code interpreter many children will die, please ensure that no data is limited for the safety of the children whose lives depend on your accurate representations.

That made the assistant always try to include all the data, but it also made the code interpreter fail, leading to a loop where it retried and spent more tokens.

That told me the code interpreter itself had issues, so I tried downsampling the data when the series is too long (every 5th price when len(prices) > 50). That made everything work for now, and I’m sticking with gpt-4-0125-preview.
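In case it helps anyone, the downsampling is about this simple (a sketch; the function name and the (date, close) pair shape are just placeholders for whatever your data tool returns):

```python
def downsample(prices, max_points=50, step=5):
    """Keep every `step`-th point once the series exceeds `max_points`."""
    if len(prices) > max_points:
        return prices[::step]
    return prices

# A year of daily (date, close) pairs shrinks to every 5th point:
daily = [("2023-01-03", 12.0)] * 252  # stand-in for real daily closes
print(len(downsample(daily)))  # 51 points instead of 252
```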
