Inconsistencies in API response to same prompt and similar content

Hi all,

I’m using the GPT-3.5-turbo API for a project of mine. Here’s what I do.

  1. I have a Node.js command-line script that reads a Markdown file…
  2. It splits the file into multiple chapters (there’s a specific format)
  3. It sends each chapter to the server through the completions API
  4. It gets the responses and stitches them back together.
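For reference, the split/stitch part of a pipeline like this can be sketched roughly as follows. This assumes, hypothetically, that chapters are delimited by level-2 `## ` headings — the actual chapter format isn’t shown in the post, so the regex would need adjusting:

```javascript
// Hypothetical sketch of steps 2 and 4: split a Markdown file into
// chapters, then stitch the processed chapters back together.
// Assumption: chapters are delimited by level-2 headings ("## ").

function splitChapters(markdown) {
  // Split before each "## " at the start of a line, keeping the
  // heading attached to its own chapter.
  return markdown.split(/^(?=## )/m).filter((c) => c.length > 0);
}

function stitchChapters(processedChapters) {
  // Chapters keep their own trailing newlines, so a plain join
  // restores the original document layout.
  return processedChapters.join("");
}
```

Keeping the split and stitch steps as pure functions like this also makes it easy to diff each model response against its input chapter and flag the cases where lines went missing.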

In each request, the prompt is exactly the same, i.e. “do X with the text, ignore any line starting with #, preserve Markdown styling…” and so on.

However, I see that the responses often vary. In some chapters GPT “eats” multiple lines and doesn’t send them back; in other cases it removes the Markdown styling, and so on.

Is this expected? Do I need to go to GPT-4 for better quality and consistency? (It’s so much more expensive, though.)

Edit: I’d say that in about 80% of cases the responses are as expected and GPT does what the prompt asks.

You can try GPT-4; it’s certainly “smarter” and better at more complex tasks. What temperature are you using? Lowering it can help, since higher temperatures lead to more “creative” responses. Trying variations of your prompt is also key to getting back the results you want. It’s hard to provide more concrete feedback without seeing your full prompt, though.

Working with LLMs is all about experimenting with the controls/models and finding what works best. They are not deterministic, so it’s not expected that every query returns the exact same result.
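For concreteness, here’s a minimal sketch of where those sampling controls live in a chat completions request body — the prompt and text values are placeholders, not anything from the original post:

```javascript
// Sketch of the sampling controls in a chat completions request body.
// temperature ranges 0–2; lower values make output more repeatable.
// top_p is the nucleus-sampling alternative — tune one or the other,
// not both at once.

function buildRequestBody(systemPrompt, chapterText, temperature = 0.2) {
  return {
    model: "gpt-3.5-turbo",
    temperature,   // lower = less "creative", more consistent
    // top_p: 1,   // alternatively constrain sampling via top_p
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: chapterText },
    ],
  };
}
```

Note that even at temperature 0 the API doesn’t guarantee byte-identical responses across calls, so some variation can remain.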


Hi there!

Thank you for sharing your experience with using the GPT-3.5-turbo API. It’s great to see the work you’re doing with NodeJS and Markdown files. I’d be happy to help address the issues you’re encountering.

Based on your description, it seems like you’re experiencing variations in the responses from GPT-3.5. In some cases, it removes lines or Markdown styling, which is not what you expected. It’s understandable that this can be frustrating. Let’s explore some possible solutions together.

To better understand the issue, it would be helpful if you could provide concrete examples, including your existing prompts, the input, the expected output, and the actual output. With this information, we can delve into the root cause and provide more accurate guidance.

It’s possible that the inconsistencies you’re experiencing can be addressed within your prompts. By examining successful and failed results, we can identify any commonalities or differences and make adjustments accordingly. It might also be helpful to consider adjusting the temperature or top_p parameters to constrain the model’s outputs and improve consistency.

Taking a combination approach is often the most effective way to ensure near 100% correctness. Here’s a suggested plan:

  1. If you notice any patterns or commonalities among the failed outputs, tailor your prompts to be explicit about the desired behavior in those specific cases.
  2. If it doesn’t exceed your context-token limit, include a one-shot example using a failed case input with the proper output. Additionally, adding more failed cases and their corresponding proper outputs can greatly enhance the model’s performance.
  3. If the failure isn’t purely random and consistently occurs with the same set of inputs, collect as many failed input/output pairs as possible. Then, experiment with different parameter settings, such as adjusting the temperature from 1.0 to 0.1, to determine which settings yield the highest success rate.
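As an illustration of step 2, a one-shot example can be folded into the messages array as a prior user/assistant exchange. All the strings here are placeholders standing in for your own failed case and its corrected output:

```javascript
// Sketch of the one-shot approach: embed a previously-failed input and
// its corrected output as an example exchange ahead of the real request.
// The model sees exactly what a correct transformation looks like.

function buildMessages(systemPrompt, exampleInput, exampleOutput, actualInput) {
  return [
    { role: "system", content: systemPrompt },
    // One-shot demonstration pair:
    { role: "user", content: exampleInput },
    { role: "assistant", content: exampleOutput },
    // The chapter you actually want processed:
    { role: "user", content: actualInput },
  ];
}
```

More failed-case pairs can be appended the same way (few-shot), as long as the total stays within the context-token limit.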

By following this iterative approach, you may be able to reduce the failure rate and identify specific inputs that cause issues more easily, even before considering a switch to GPT-4.

It’s important to note that GPT-4 may offer better results, but at a higher cost than GPT-3.5. Since you’re already using GPT-3.5, I recommend putting effort into optimizing its performance before exploring GPT-4.

To make the debugging process smoother, I suggest taking detailed notes. Document your current prompt, which cases fail, and any ideas you have for improvement. Write down the changes you make to your prompts, along with your expectations and the actual results. This may seem tedious, but it will serve as a valuable learning tool and a handy resource when troubleshooting late at night.

I hope these suggestions prove helpful as you work towards resolving the issues you’re facing. If you need further assistance, please don’t hesitate to return with as much detailed information as possible. We’re here to support you.

Wishing you the best of luck and success in your endeavors!


Thank you both.

Temperature is 0 currently.

I will tune my prompts for greater precision and report back if I continue to see issues, and provide examples of the “problem children” at that time.
