Fine Tuning on Good Python Code

Hello, I am interested in fine tuning on good Python code.

I do know that embedding and adding a vector DB (on the codebase and documentation) will help when it comes to new features and making specific code available.

I am also interested in fine tuning on a corpus of well written code, to encourage GPT to output code in this style. It’s not too hard to think of separate fine tuned models that are nicely tuned to output good functional code, or OOP code, or whatever style we like.

Has anyone managed to come up with good prompt;response pairs for fine tuning code?

One thought I have is to make sure that every small script (or function of a large script) has a docstring. The tuning would then be prompt = docstring, response = code. This is a pain if the existing code base is not well commented. I did wonder about asking GPT to summarise code chunks (small scripts, functions in large scripts) and then feeding these as prompts into fine tuning.
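
A minimal sketch of the docstring-to-code idea, assuming the codebase is plain Python source and that a JSONL file of `prompt`/`completion` records is the target format. The function name `docstring_code_pairs` and the sample source are hypothetical. Functions without a docstring are skipped here; they are the ones that would first need a generated summary.

```python
import ast
import json
import textwrap


def docstring_code_pairs(source: str) -> list[dict]:
    """Return one {"prompt": docstring, "completion": code} record per documented function."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is None:
                continue  # undocumented: would need a generated summary first
            # Exact source text of the function (the docstring is still inside it).
            code = ast.get_source_segment(source, node)
            pairs.append({"prompt": doc, "completion": code})
    return pairs


# Hypothetical sample source: one documented and one undocumented function.
EXAMPLE_SOURCE = textwrap.dedent('''
    def add(a, b):
        """Return the sum of a and b."""
        return a + b

    def undocumented(x):
        return x * 2
''')

pairs = docstring_code_pairs(EXAMPLE_SOURCE)
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one JSON record per line
```

Note that the completion still contains the docstring, so the model would be trained to repeat it; whether to strip it from the response is a design choice.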

Has anyone found different or better ways to fine tune on code?

One issue is that GPT-3.5 Turbo and GPT-4 are not currently tuneable, so you would be reliant upon GPT-3 for the code creation. I think you will get better results from “example” prompting, and from telling the model to follow known Python coding standards.

Don’t fine tune. Fine tuning is currently expensive and the output is not that great. You can give it samples of good code.

What works for me is giving it a sample of good code. Just select relevant code parts that you’re trying to emulate. Give it context, a sample input and use the code as sample output. Then give another input that is the real thing.

The structure would be something like this, but replace the Q&A with the kind of code you want to generate: OpenAI Platform
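
As a rough sketch of that structure, assuming the chat-style message format with `system`, `user` and `assistant` roles: context first, then sample input/output pairs, then the real input. The helper name `build_few_shot_messages` and the sample strings are made up for illustration.

```python
def build_few_shot_messages(context, examples, real_input):
    """Assemble context, (sample input, sample output) pairs, then the real input."""
    messages = [{"role": "system", "content": context}]
    for sample_input, sample_output in examples:
        messages.append({"role": "user", "content": sample_input})
        messages.append({"role": "assistant", "content": sample_output})
    # The real request goes last, so the model answers it in the style of the samples.
    messages.append({"role": "user", "content": real_input})
    return messages


# Hypothetical example: one hand-picked piece of good code used as the sample output.
messages = build_few_shot_messages(
    context="You write Python in the style of the examples.",
    examples=[(
        "Return the sum of a and b.",
        'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
    )],
    real_input="Return the product of a and b.",
)
```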

Correct. Answer these questions:

  • Can you fine tune rudderless davinci (GPT-3) to beat well-trained GPT-4?
  • Do you want to pay twice as much per token for a model with half the context size?
  • You mind when they turn off your model in five months? And may or may not give you credits for re-training on what comes next?

Thanks for the helpful replies.

@Foxalabs I do accept your point. GPT-4 is much better at coding than GPT-3 so a well tuned GPT-3 might not match GPT-4. On the other hand, if tuning on GPT-4 arrives long before GPT-5 or whatever, then it could be of interest.

@_j thanks for the questions:

  • Can you fine tune rudderless davinci (GPT-3) to beat well-trained GPT-4?
    • Quite possibly not, but it could help get me ready for when GPT-4 tuning is available (assuming that GPT-5 or whatever is not coming soon).
  • Do you want to pay twice as much per token for a model with half the context size?
    • Not a problem. The human cost of trying to curate each prompt is on a different scale.
  • You mind when they turn off your model in five months? And may or may not give you credits for re-training on what comes next?
    • If re-tuning on the next model is available, that’s OK. The tuning set would be quite small.

@smuzani Fine tuning pricing looks extremely cheap, compared to the human time it’s taking to curate code examples and fit them into the prompt. But thanks, I am now thinking of a way to accomplish my task based on the example you gave.

The “tuning set” will be too large to fit in the prompt as “in-prompt training” USER,ASSISTANT pairs like in the example (I don’t know the correct term for this). Handpicking the functions would be costly in human time (this is part of my current problem).

So I have a plan for “live tuning”:

  1. GPT could create the docstrings from good example code, if they don’t exist yet.
  2. Get embeddings for the docstrings and code.
  3. Create a vectorDB, to search for docstrings related to the final, USER part of the prompt and return docstring-code pairs.
  4. Parse and feed into the prompt as USER,ASSISTANT pairs. Like live-tuning the prompt based on relevant examples.
  5. Add the final, USER part of the prompt which is in the style of a docstring.

I was already keen to attach a vector DB to make new features and specific code available. By using the USER,ASSISTANT pairs in the prompt I think I can also use it to “tune” the prompt to reflect best practice coding style.
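
The plan above could be sketched roughly as follows. This is only an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory list `STORE` stands in for a vector DB, and all names and sample pairs are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_pairs(query, store, k=2):
    """Return the k (docstring, code) pairs whose docstring is most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, toy_embed(pair[0])), reverse=True)
    return ranked[:k]


def live_tuned_messages(system_prompt, query, store, k=2):
    """Build a prompt whose USER,ASSISTANT pairs are the retrieved examples."""
    messages = [{"role": "system", "content": system_prompt}]
    for docstring, code in retrieve_pairs(query, store, k):
        messages.append({"role": "user", "content": docstring})
        messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": query})  # final, docstring-style request
    return messages


# Hypothetical docstring-code pairs taken from "good" example code.
STORE = [
    ("Fetch a user record from the REST API by id.",
     "def get_user(client, user_id):\n    return client.get(f'/users/{user_id}')"),
    ("Compute the mean of a list of numbers.",
     "def mean(xs):\n    return sum(xs) / len(xs)"),
    ("Fetch an order record from the REST API by id.",
     "def get_order(client, order_id):\n    return client.get(f'/orders/{order_id}')"),
]

QUERY = "Fetch an invoice record from the REST API by id."
messages = live_tuned_messages("Provide working Python code for the description.", QUERY, STORE)
```

With `k=2` the two API-style examples are selected and the unrelated `mean` example is left out, so the prompt only spends tokens on code that resembles the request.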


You’d want to have a diverse set of instructions, of course - and also a diverse set of system prompts unless you’d always give it “An AI-powered programming assistant answers this:” in implementation. Prompting in a form that inspires completion will always help.

Multi-turn conversations demonstrating how to handle context.

Proper balancing of weights and sizes of various types.

The types of instruction-following it might do (debugging, analyzing, extending, rewriting) are almost unimaginable in scale. OpenAI has some smart cookies.

  • “print a tree that shows the nesting of my UI objects”
  • “does the module have a method that supports changing the width of the drag handle?”
  • “Above is 80 lines of code written in xxx. I want to modify this class’ def handle_resize where it calls max_truncate to find the length of text that will render in a particular width of pixels. What are some optimized search methods where I actually render and test the text to see if it fits upon resize in the ui object cue_message after having added new knowledge of the last size by adding prior_width_px to the function call? Give optimized code that considers adding an elide for truncation and re-searches.” (an ambiguous version of something I had GPT-4 work on)

You can look at some of the synthetic prompts of WizardCoder-15b for inspiration. There are ridiculously impenetrable examples. And it still is very rigid in what it can do.

The use-case is for writing code to be used in an existing codebase. The instructions probably won’t be very diverse, because the idea is to find code that’s similar to the required new code.

For system prompt I might think of something like “You will be provided with a brief description (docstring or pseudocode) for a script or function. Please provide working Python code to achieve the functionality required”.

Then it will see a few USER;ASSISTANT pairs from similar code written by good programmers. Hopefully, following that lead and producing good code for the final USER part of the prompt.

I guess there’s some compromise regarding how close the output code should be to the provided examples, vs the intent of the final user docstring.

Thanks, I will look into the WizardCoder-15b prompts.

The “you will be” in a prompt plays on the fine-tuning of a pretuned chat model with an identity.

A base completion model that can be used for fine-tuning doesn’t know it is a “you”. text-davinci-003 does because it was further tuned to be early ChatGPT, with a Human: AI: type prompting scheme.

I offer completion examples, increasing from creative writing completions to elucidating answers, to get you thinking about how to interact with the models without putting another layer of difficulty in the task of fine-tuning:

Apples come in several colors, like


I think too many people are moving to


Newspaper Headline: Biden declares war on poverty
Abstract: The president has announced new funding for states

or multi-shot to just use the untrained model to chat (poorly)

Human: What is the capital of Paris?
AI: I think you’ve asked backwards - Paris is the capital of France.
Human: What is the acceptable name now for Eskimo?
AI: Inuit is the preferred culturally-sensitive name.
Human: Can you simulate a game of pong?
AI: That’s beyond my capabilities - I can only write helpful text.
Human: {input}

or finally, the type of prompt that gives answering:

A helpful and knowledgeable GPT-3 chatbot had a conversation with a user of the OpenAI service. Here’s that chat, and we get to see this interaction that shows the AI skill in answering programming questions, and more, like a real person.

user: {input}

It’s good to try out the base model to see how far you have to go to reach your destination output.
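
To try the multi-shot form against a base model, the prompt text can be assembled like this. It is a small sketch, not a definitive recipe: the helper name `build_completion_prompt` is hypothetical, and ending the prompt with a bare `AI:` line (so the model completes the answer) is my assumption rather than something shown in the examples above.

```python
# Example turns taken from the multi-shot prompt above.
FEW_SHOT_TURNS = [
    ("Can you simulate a game of pong?",
     "That's beyond my capabilities - I can only write helpful text."),
    ("What is the acceptable name now for Eskimo?",
     "Inuit is the preferred culturally-sensitive name."),
]


def build_completion_prompt(turns, user_input):
    """Render Human:/AI: turns as plain text for a base completion model."""
    lines = []
    for human, ai in turns:
        lines.append(f"Human: {human}")
        lines.append(f"AI: {ai}")
    lines.append(f"Human: {user_input}")
    lines.append("AI:")  # assumption: leave the answer open for the model to complete
    return "\n".join(lines)


prompt = build_completion_prompt(FEW_SHOT_TURNS, "Write a Python function that reverses a string.")
```

When sending this to a completion endpoint, a stop sequence such as `Human:` is the usual way to keep the model from writing the next user turn itself.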

I didn’t understand the aspect of identity based conversation vs simple completion. Until now, I’ve only worked with chat-tuned models.

So to move from “chat” to “completion” style, we might have something like this:

system: Complete the user input by adding valid Python code.
user: {pseudocode}
assistant: {code}
user: {pseudocode}
assistant: {code}
user: {input}

The original model knows how to output valid Python code without the user;assistant examples. But with them, it’s more likely to give the kind of code that it sees in the examples (good example code from the codebase that we’re working with)?

The example pseudocode can come from GPT and then be embedded into the vectorDB along with the code. The vectorDB gives us a few pseudocode;code outputs relating to the input.

If this works, then it would give the end result that I was originally looking to get from fine tuning, except that the input and output need to be small, as the “in prompt live tuning” will use tokens. If fine tuning is released for the GPT-4 32k model, this token cost could be fine.

That can be the type of technique that you’d use with text-davinci-003. It is very fast to pick up on expected outputs for expected inputs by example.

However, to me, the form of prompting is tweakable or even redundant there. The system message would seem to inspire what comes after your pseudocode in python code form, rather than rewriting as python. davinci-003 follows it reasonably well, though - up until that alternate token gets picked as the first.

Then three quality embeddings retrieved could be good, but all being python, you might have little differentiation between the top values. The form then becomes more important, but only for showing instruction following, I wouldn’t expect your masterful python revelation, but good following if you’d also provide the same-style input for the same expectation.

Actual pseudocode->code doesn’t give the AI much inspiration for creativity, but simply replace that with our understanding {user_style_instruction}:{explanation_and_solution}.

You might instead up the quality of {pseudocode} inputs by having more examples of:

  • asking for corrections in code that produce the “repaired output”,
  • natural language with asking both ambiguously and precisely that result in the generation,
  • generate frameworks that needed the requested section added,
  • transformations to different purpose,
  • or the code being an improvement requested on a prior version.

All in how a user would ask. Another prompt and imagination challenge in working backwards.
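
One way to picture “working backwards” from a piece of good code to several user-style requests is sketched below. The helper `instruction_variants`, its wording templates and the sample code are all hypothetical; a real set would need far more variety than four fixed phrasings.

```python
def instruction_variants(description, code, broken_code=None, previous_code=None):
    """Build several (instruction, response) records that all target the same code."""
    records = [
        # Precise natural-language request.
        {"instruction": f"Write a Python function that does this: {description}",
         "response": code},
        # Terse request, closer to a docstring.
        {"instruction": description, "response": code},
    ]
    if broken_code is not None:
        # Correction request: the response is the repaired code.
        records.append({
            "instruction": f"This code should do the following but has a bug: {description}\n\n{broken_code}",
            "response": code,
        })
    if previous_code is not None:
        # Improvement request on a prior version.
        records.append({
            "instruction": f"Improve this earlier version:\n\n{previous_code}",
            "response": code,
        })
    return records


GOOD = "def mean(xs):\n    return sum(xs) / len(xs) if xs else 0.0"
records = instruction_variants(
    description="Compute the mean of a list of numbers, returning 0.0 for an empty list.",
    code=GOOD,
    broken_code="def mean(xs):\n    return sum(xs) / len(xs)",
    previous_code="def mean(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t / len(xs) if xs else 0.0",
)
```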

(BTW, giving davinci-003 no-clue-what-I-wanted metacode - we also get more completion - of a quite unentertaining game):


User: write python: banana_farm function is defined. A message is printed with the purpose of the banana farm game. Then we ask how many players. Then we introduce the rules. Then initialize the variables needed to maintain the game state.


def banana_farm():
    print("Welcome to the Banana Farm game! How many players are playing?")
    num_players = int(input())
    print("The rules of the game are simple: each player takes turns rolling a die. The player with the highest roll wins the round and gets a banana. The first player to collect three bananas wins the game!")
    # Initialize variables
    bananas = [0] * num_players
    round_winner = 0
    game_over = False
    # Main game loop
    while not game_over:
        # Roll the die
        roll = random.randint(1, 6)
        # Determine the round winner
        round_winner = (round_winner + 1) % num_players
        # Give the round winner a banana
        bananas[round_winner] += 1
        # Check if the game is over
        if bananas[round_winner] == 3:
            game_over = True
            print("Player {} has won the game!".format(round_winner + 1))

I agree that a lot more variety would be needed to cover use by the general public. That seems hard to cater for, and I can’t help but think that points back to fine tuning on a large corpus of code in the preferred style.

For an “in house” use case, I agree that it would be very important for the input pseudocode (or docstring) to match the style of the example pseudocode (or docstring), which could work for repetitive tasks (e.g. adding lots of new API connectors to an existing codebase).

Ideally I’d like a chain, or even better a hierarchical pyramid of AIs, collaborating on a project. But that’s far out of scope given my current understanding.

I am impressed by the banana_farm game. The user input doesn’t seem to provide any hint as to the purpose or rules, yet the AI comes up with something very credible.