Fine-tuning GPT-4o or GPT-4o mini on our codebase

Objective
Fine-tune GPT-4o or GPT-4o mini on my project’s codebase, so that when I ask it to write new code, it writes in the project’s style and re-uses functions/components that already exist in my codebase where possible.

Fine tuning privacy policy?
Q1. OpenAI says they will not use data submitted via API to train new models. Does that include fine-tuning data?

Fine tuning token upload cost
The docs say that I can upload x million tokens per day for fine tuning for free.

(When you upload training data via API)
Q2. Is uploading training data a separate process from creating the fine-tuned model?
Q3. Can I upload 1 million tokens for free every day for 7 days,
then create a fine-tuned model with those 7 million tokens of training data, for free?

Q4. Is it free to host my model for my own use? That is, after I’ve fine-tuned, do I pay no ongoing storage cost for OpenAI to host the model and keep it available to me, and only pay for the tokens I use when I query it?

How to fine-tune on codebase
I had a look at the docs and it seems they want me to submit all data in the form of questions and answers.

Q5. Please let me know if I’m on the right track.

{"messages": [{"role": "system", "content": "Bob is a JavaScript developer that blah blah blah"}, {"role": "user", "content": "Create foo.mjs that does xyz"}, {"role": "assistant", "content": "```\nimport abc from '/path/to/abc';\nexport const greet = () => console.log('Hello world');\n...\n```"}]}
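
In case it helps, here is a rough sketch of how I imagine generating those lines, so that newlines in the code get escaped and every example stays on a single line of the JSONL file. The helper names (`make_example`, `validate_line`) and the instruction text are just placeholders I made up, and the check only covers the three-message shape shown above, not whatever else the fine-tuning API may require.

```python
import json

# Placeholder system prompt, copied from the example above.
SYSTEM_PROMPT = "Bob is a JavaScript developer that blah blah blah"
FENCE = "`" * 3  # Markdown code fence wrapped around the assistant's answer


def make_example(instruction: str, source_code: str) -> str:
    """Return one JSONL line pairing an instruction with the code that answers it."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": f"{FENCE}\n{source_code}\n{FENCE}"},
        ]
    }
    # json.dumps escapes embedded newlines as \n, so the record stays on one line.
    return json.dumps(record)


def validate_line(line: str) -> bool:
    """Check that a line is valid JSON with system, user, assistant messages in order."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    return roles == ["system", "user", "assistant"]


code = "import abc from '/path/to/abc';\nexport const greet = () => console.log('Hello world');"
line = make_example("Create foo.mjs that does xyz", code)
print(validate_line(line))
```

Writing one such line per file (or per function) would give the training file, assuming this format is what the docs actually expect.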

If I’m on the wrong track, please tell me specifically what I need to do.

Thanks!

No, no, uh-uh. Just no. Instead of fine-tuning on a code base, you want to use RAG, unless it’s a very large code base that is several hundred thousand lines long, which it probably is not. I imagine it’s something like a small React app, in the 80 to 125 range on average, which is what most people are coding nowadays. Get yourself the IDE Cursor. It’s $20 a month and it includes a feature that automatically presents the portions of your code you have selected at generation time, so the model has the context it needs at the time it needs it. It even includes automatic embedding, website citations, and documentation ingestion. 20 bucks a month, it’s unlimited. Get it, love it.

Fine-tuned models cost more to run and cost money to train. For a single code base it might not be worth it, unless it’s a significant, enterprise-scale code base where several developers need to work with AI on it simultaneously without having to put RAG context into the prompt as described. More use case details would be needed.


To provide a bit more context: the project has been under development for 4+ years by a small team of developers, and the main project alone is over 260,000 lines of code.

Can you please elaborate on “IDE cursor” and RAG?


The Cursor IDE received an 8 million dollar investment from OpenAI. The domain is cursor.sh, and it’s the best AI coding platform by far. I use it daily. For a code base of that size, embedding and fine-tuning would be in order. I would even recommend a context message delivery assistant system that uses Haskell and await to deliver key context at test time.
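
On the RAG part of the question: the idea is to pick the few pieces of your code most relevant to the request and paste them into the prompt, rather than training the model on everything. Below is a minimal sketch of that idea. The word-overlap score is only a stand-in for real embeddings, the file names and snippets are hypothetical, and this is not how Cursor itself is implemented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Count lowercase identifier-like tokens in a piece of text."""
    return Counter(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, tokenize(chunks[name])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: dict[str, str], k: int = 2) -> str:
    """Prepend the retrieved chunks to the task so the model sees existing code."""
    context = "\n\n".join(f"// {name}\n{chunks[name]}" for name in retrieve(query, chunks, k))
    return f"Relevant code from the project:\n\n{context}\n\nTask: {query}"


# Hypothetical snippets standing in for files from a real code base.
chunks = {
    "utils/date.mjs": "export const formatDate = (d) => d.toISOString().slice(0, 10);",
    "ui/button.mjs": "export const Button = ({ label }) => `<button>${label}</button>`;",
    "api/users.mjs": "export const fetchUsers = () => fetch('/users').then((r) => r.json());",
}
print(retrieve("add a button component with a label", chunks, k=1))
```

In a real setup the scoring would come from an embedding model and a vector index, but the flow is the same: retrieve, then build the prompt.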


@Foxalabs this is the thread :)

@cobusgreyling also :)
