Fine-tuning for a new programming language

I’m trying to teach gpt-4o-mini a new programming language. I’ve got about 1,000 training snippets and 157 validation snippets. The language is homoiconic and purely functional in nature, with fairly simple syntax and no complex constructs.

How should I proceed?

  • Do I have enough training data?
  • How should I configure my hyperparameters during fine-tuning?

When I fine-tune, OpenAI’s platform dashboard fills in the following defaults:

  • Epochs: 3
  • Batch size: 2
  • LR multiplier: 1.8

When I ask ChatGPT what numbers to use, it tells me to use a larger batch size and a much lower LR multiplier, but that results in a model that hallucinates wildly, while the default settings produce a much “better” model with fewer hallucinations. I’ve therefore decided to completely ignore ChatGPT’s (GPT-4.5’s) instructions here.
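For reference, here’s roughly how I’d pass those dashboard values explicitly through the fine-tuning API if I wanted to set them myself (a sketch using the Python SDK; the file IDs and model snapshot name are placeholders, not my real ones):

from openai import OpenAI

client = OpenAI()

# Placeholder file IDs from earlier uploads with purpose="fine-tune"
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file="file-TRAINING_ID",
    validation_file="file-VALIDATION_ID",
    hyperparameters={
        "n_epochs": 3,                    # dashboard default
        "batch_size": 2,                  # dashboard default
        "learning_rate_multiplier": 1.8,  # dashboard default
    },
)
print(job.id, job.status)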

Suggestions …?

You can check out my training data here. I remove each file’s leading comment and use it as the user prompt, and I add “You are a Hyperlambda assistant, respond with Hyperlambda” as the system instruction in my JSONL files.
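Roughly, my JSONL generation looks like this (a simplified sketch; the folder name, file extension and comment-stripping regex stand in for my actual pipeline):

import json
import pathlib
import re

SYSTEM = "You are a Hyperlambda assistant, respond with Hyperlambda"

with open("train.jsonl", "w", encoding="utf-8") as out:
    for path in pathlib.Path("snippets").glob("*.hl"):        # hypothetical snippet folder
        code = path.read_text(encoding="utf-8")
        # Assumes each file starts with a /* ... */ comment describing the snippet
        match = re.match(r"\s*/\*(.*?)\*/\s*", code, re.DOTALL)
        if not match:
            continue
        record = {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": match.group(1).strip()},
            {"role": "assistant", "content": code[match.end():].strip()},
        ]}
        out.write(json.dumps(record) + "\n")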

Please advise …

1 Like

The learning rate is a multiplier, scaling OpenAI’s own internal setting, which varies based on the size of the training file. A setting of 1.0 is equivalent to not passing the parameter at all.

The default learning rate of fine-tuning at any particular set size is not subtle; it can make 10 examples become the monotone default, overriding the millions of RLHF examples that otherwise make the model “chat”. I would go lower than the default, and leave the batch size alone, since there is zero documentation for it.

Validation doesn’t necessarily need to be a distinct set, unless it serves as a test case where, for example, you can watch useful adjacent domains, such as other programming languages, decline as they are trained out.

A common mistake is to think that instructions given in fine-tuning make the AI follow instructions. Instead, what you must show are examples and patterns: expected user input producing the desired output of the task at full quality. Then you hope that some quality of inference arises between those examples without overfitting.

1 Like

Thank you, I know how fine-tuning works, and the basics of backpropagation. Where can I find extensive documentation on hyperparameters and fine-tuning? Do you have some great resources you think can help me solve my problem?

Maybe examples of great training material, with information about how much data I’ll need, etc?

Did you look at my training data? Does it look good?

The main thing is that for a period of time, OpenAI had taken away all hyperparameters for fine-tunes except epochs (an epoch being simply one pass through the file; 3 epochs is the same as if you pasted the file contents two more times and ran at epochs=1).

The original endpoint, where you’d have to conjure up a number like 0.2 based on your training expectations, had the benefit that you could later send 100 or 1,000 more examples as a continuation of the model without the training depth varying outside your control. Other than that, it was just a number, also without any documentation of good ranges for OpenAI’s models beyond what was obtained experimentally and anecdotally.

That leaves you lots of opportunity to come across bad “learning rate” values online.

The only OpenAI confirmation I recall is that the multiplier is applied on top of an input-size-adaptive base rate, but the forum is polluted with my own re-telling.

I’d say run at auto, note the hyperparameters you receive, and plot the training and validation loss along with performance. Then prod with refinements. You’ll probably be billed for more n_epochs than necessary, but you are also now given mid-training checkpoint “steps” that can become your permanent model if the final result ends up overfitted.
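Something along these lines, as a sketch, assuming the job’s result file is the usual per-step metrics CSV (inspect the columns yourself; the exact names aren’t documented):

import io

import matplotlib.pyplot as plt
import pandas as pd
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.retrieve("ftjob-YOUR_JOB_ID")   # placeholder job id
csv_bytes = client.files.content(job.result_files[0]).read()  # first result file of a finished job

df = pd.read_csv(io.BytesIO(csv_bytes))
# Assumed column names: step, train_loss, valid_loss
cols = [c for c in ("train_loss", "valid_loss") if c in df.columns]
df.plot(x="step", y=cols)
plt.show()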

Otherwise, in the documentation you get 3 examples of Marv the chatbot, which inspire “I gave the required 10 examples, why doesn’t it answer about my files?” forum posts, and not much else for guidance.

Some GPT-4.5 guesses at batch parameter

1. Batch Size (batch_size)

Definition:
The batch size refers to the number of training examples (prompt-completion pairs) processed simultaneously in one forward and backward pass through the neural network. Each batch produces a single gradient update step.

Underlying Effect on Training:

  • Gradient Variance:

    • Smaller batch sizes (e.g., 8, 16, 32) result in more frequent parameter updates per epoch, but each update is based on fewer examples, leading to higher variance in gradient estimates. This can help the model escape local minima and potentially generalize better, but may also cause instability or noisy convergence.
    • Larger batch sizes (e.g., 128, 256, 512) produce more stable gradient estimates, reducing variance and noise in parameter updates. However, overly large batches can lead to poorer generalization due to fewer updates per epoch and potentially getting stuck in sharp minima.
  • Memory and Computational Efficiency:

    • Larger batches require more GPU memory but can leverage parallel computation more effectively, potentially speeding up training if hardware resources are sufficient.
    • Smaller batches require less memory but may slow down training due to less efficient parallelization.
  • Interaction with Learning Rate:

    • Batch size and learning rate are closely coupled. Typically, larger batch sizes require proportionally larger learning rates or adjusted learning rate schedules to maintain effective convergence.

“Auto” Setting (Inferred Behavior):
When set to "auto", OpenAI likely employs a heuristic or adaptive formula based on dataset size, model size, and available hardware resources. A common heuristic is to select the largest batch size that fits comfortably into GPU memory, balancing computational efficiency and gradient variance. The exact formula is undisclosed, but it likely scales batch size proportionally to dataset size and GPU memory constraints.
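For intuition only, here is the rough arithmetic relating batch size to gradient updates, using the numbers from this thread (nothing here is confirmed by OpenAI):

import math

examples = 1000      # training snippets
batch_size = 2       # dashboard default
n_epochs = 3         # dashboard default

steps_per_epoch = math.ceil(examples / batch_size)
total_updates = steps_per_epoch * n_epochs
print(steps_per_epoch, total_updates)   # 500 gradient updates per epoch, 1500 in total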

So I should always use “auto” then? What about my training data size? I’ve managed to synthesise (and test) data by taking an example snippet and generating multiple variations of it using ChatGPT.

Note that I only use snippets that actually work.

This is a new programming language, although simple in syntax. How many snippets do you think I’ll need? I’ve got about 2,400 now, and I’m using every 8th snippet (chosen randomly) as validation.
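Roughly like this, as a sketch (the file names are placeholders for my actual data):

import json
import random

random.seed(42)

with open("all_snippets.jsonl", encoding="utf-8") as f:    # hypothetical combined file
    records = [json.loads(line) for line in f]

random.shuffle(records)
validation = records[::8]                                  # roughly every 8th example
training = [r for i, r in enumerate(records) if i % 8 != 0]

for name, subset in (("train.jsonl", training), ("validation.jsonl", validation)):
    with open(name, "w", encoding="utf-8") as out:
        for record in subset:
            out.write(json.dumps(record) + "\n")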

Psst, thank you :slight_smile:

1 Like

I don’t think there’s such a thing as “too much training data” until your file is too big in megabytes or tokens to send. Only quality matters.

OpenAI can do RLHF and supervised/PPO/DPO on bajillions and deliver a model.


Examples? Provide the production you expect out of the AI, like:

system:

“a zSharp programming expert”

user:

“In zSharp, write an async function to do a bubble sort on two columns of data”

assistant:

zSharp is a new logic programming language that doesn’t support asynchronous programming or the concept of functions in the same way as imperative languages like Python or JavaScript. However, you can implement a quicksort algorithm in zSharp. Here’s how you can do it:

% Partition predicate
partition([], _, [], []).
partition([(X1, X2)|Xs], Pivot1, [(X1, X2)|Ls], Bs) :-
    X1 @< Pivot1, partition(Xs, Pivot1, Ls, Bs).
partition([(X1, X2)|Xs], Pivot1, Ls, [(X1, X2)|Bs]) :-
    X1 @>= Pivot1, partition(Xs, Pivot1, Ls, Bs).

% Quicksort predicate
quicksort([], []).
quicksort([(X1, X2)|Xs], Sorted) :-
    partition(Xs, X1, Littles, Bigs),
    quicksort(Littles, SortedLittles),
    quicksort(Bigs, SortedBigs),
    append(SortedLittles, [(X1, X2)|SortedBigs], Sorted).

This code will sort a list of tuples based on the first element of each tuple. If you want to sort based on the second element, you can modify the partition predicate to compare X2 instead of X1.

Please note that zSharp is a declarative language and the concept of “async” doesn’t apply here. The execution order is determined by the logic of the program, not by the order in which statements are written.

(guess the actual language)

An assistant answering only in snippets, with no edge cases? You might get no conversational thinking, no resolving of issues in the specification, nor the kind of output fulfillment you’d actually desire. That training style is going to be followed closely, and it constrains the model at any weight strong enough to actually teach a language.

You are trying to simulate LLM pretraining on a huge language corpus, which is what imparts programming skill, using a comparatively tiny supervised training set, so all bets are off. I’d start with a mini model, as they seem to take to programming tuning (by the big dogs) as readily as large-parameter LLMs, with a smaller investment.


What I suggested before: auto. Check performance of deliverables. Iterate on the method’s hyperparameters calculated and provided in the job object.

1 Like

I want the model to generate Hyperlambda; would I benefit from having a lot of text that explains its mechanics?

You’re simply going into areas where I can’t give you any useful advice, except to suggest you look at the size of the training sets of open-source tunings like StarCoder and other attempts: they reinforce what IS already in the pretraining corpus with more like tens of thousands of examples, and might be able to complete some language from a docstring before going off the rails. The built-in chat power isn’t going to add too much, I’m afraid.

Here’s the result of pasting the background and your question into GPT-4.5, prompted as a general-purpose ML and coding expert; it can only answer your questions much as you could logically answer them yourself by now.

While having a lot of text that explains the mechanics of Hyperlambda can be beneficial for understanding the language, it might not directly improve the AI’s ability to generate Hyperlambda code.

The AI model learns from examples, not explanations. Therefore, to fine-tune the model to generate Hyperlambda, you would ideally need a large corpus of Hyperlambda code examples. The model learns the syntax, structure, and patterns from these examples and uses this knowledge to generate new code.

However, the explanatory text can be useful for the developers or data scientists who are working on the model. They can use the information to understand the language better and to create or curate a high-quality dataset for training the model.

In summary, for the AI model itself, a corpus of Hyperlambda code is more beneficial than explanatory text. For the human developers working on the model, both can be useful.

1 Like

Thank you, you’ve been great!

1 Like