Strategy Recommendation for "Custom Code Generation GPT" through API

Our use case is pretty specific. We need code generation in python using internal libraries / modules. The main purpose of the scripts is parsing of json output from various REST API endpoints.

We have 1000’s of examples of these scripts in the style we want already written. We also have cookbook style documntation in markdown we can provide the model. Including example input json, api endpoints, example output, etc.

We’ve PoC’d out examples of this using the Assistant API and the Chat Completions with GPT-4 and 3.5-turbo models. GPT-4 provides exceptionally better and more consistent code generation, so seems the right model to target.

It doesn’t look like GPT-4 has fine tuning available. Which raises the question of how best to inject all this documentation and examples we have to improve gpt’s completion.

From reading the API docs, it doesn’t seem retrieval is suited for this purpose. Injecting examples and docs into the prompt seems to increase evaluation time considerably.

Have others tackled similar tasks? What’s the best approach and which tools to use to get there?

Just curious, why would you say that? :thinking:

Can you give an example of what the examples look like (is there a lot of overlap?), and what you expect the user input to look like?

Stuff people seem to play around with most is HyDE (hypothetical document embedding) and reranking, and a backwards hyde (generating hypothetical queries) along with rejecting similar results might work for your use-case :thinking:

Just curious, why would you say that?

The documentation implied Retrieval would be for FAQ checking. My assumption (from your question, perhaps incorrect) is it’s designed for interpreting a question and retrieving a specific, minimally modified “answer”.

Can you give an example of what the examples look like (is there a lot of overlap?), and what you expect the user input to look like?

Sure. The purpose of the internal libraries is to minimize boilerplate. So the scripts are incredibly repetitive. User input is a json object that has two keys: the endpoint to parse, and a partial example of the endpoint get response output (not necessarily valid json). So the thread looks like something like:

User → {“endpoint”: “”, “example”: ‘[{“name”: “a name”, “address”: “an address”, “other keys”: “values”}, {…}…]’}

System → {“script”: “import module\nimport module2\n\ndef extract_entity(…”}

running the script retrieves a get response from endpoint, then extracts the keys into an object for canonicalization. Then writes to file.

from acmerequests import AcmeRequests
from acme.record import AcmeRecord
from acme.writer import AcmeWriter
from acme.record_deduper import AcmeRecordDeduper
from acme.record_id import RecommendedRecordIds

def fetch_data():
    api = ""
    params = {
        "page[limit]": "500",
        "filter[radius]": "30000",
        "filter[lat]": "55.755826",
        "filter[lng]": "37.6173",

    r = session.get(api, headers=headers, params=params)
    js = r.json()["data"]

    for j in js:
        # entity extraction code, specific to the input record format
        row = AcmeRecord(


if __name__ == "__main__":
    page_url = f""
    log = AcmeLogSetup().get_logger(logger_name="sourcename")

    headers = {
            # required headers for requests

    with AcmeRequests() as session:
        with AcmeWriter(AcmeRecordDeduper(RecommendedRecordIds.Id)) as sgw:

The scripts are generally relatively short. Maybe a few hundred lines.

Ah, maybe we should disambiguate a little.

If you mean Assistants API retrieval - who knows :grimacing: (I don’t). However, it might still be hackable into a prototype, but I’m not sure how reliable you can get it.

In general, when we talk about retrieval in the LLM context, we mean embedding vector search, but it can be anything. FAQ is one of the trivial usecases for vector search, but it can be used for so much more. It’s essentially meaning encoded as a coordinate, a coordinate you can compute a distance with.

My guess is that you might want your tool to figure out what the user means, find relevant information, inject it into the prompt, and then allow the LLM to generate a response. Retrieval Augmented Generation.

What I meant by “user input” is the query that GPT-4 will have to answer. I doubt you’d want users to manually enter JSON code (unless it’s a copy-paste situation, or if your user is actually a machine :robot:).

But if indeed your input is a json object - what would be the retrieval criteria? a matching function signature? :thinking:

I’m wondering if you may not even need a language model at all, but I might be misunderstanding - you suggested that the code might need to be rewritten…

Sorry if I’m slow to understand your problem or if I’m explaining stuff you already know :sweat_smile:

Yeah, I wasn’t referring to the concept of retrieval…but the concrete “Knoweldge Retrieval” tool available to the Assistant API.

I think the generic question I’m asking is:

We have a lot of documentation not available publicly needed to most accurately generate the requested scripts. Since it’s much greater than the 128k limit for a gpt-4 prompt/user message how can we inject this context into the model?

Well yeah, if assistants don’t work for you, I’d still go with context dependent context augmentation (RAG)

That said, it’s still possible that assistants can work for you - if you can fit enough relevant information into a chunk.

I actually set up the assistant today. I was able to upload our documentation and about 100 example inputs and script files.

The Assistant is definitely using them to generate. But I still have the problems that led me to use the completions api before.

Namely, the Assistant models seem much harder to instruct. They often ignore prompting to generate a runnable script and insert place holders or hand wavy comments e.g. “# this is where I’d perform requirement A” or “# this should be updated for your use case”. They also ignore function calling instructions or output formatting requirements.

With a bit of prompt massaging, I can get the chat api models to return a generated script, no other formatting, etc. It’s really just a matter of fine tuning.