Training gpt-3.5 to autocomplete for a niche domain and a specific writing style

Hi,

I’m looking to set up an “autocomplete” writing assistant that can complete my sentences/paragraphs. Kind of like GitHub Copilot, but for my writing. Would appreciate any help or pointers on how to go about this.

Most of my writing is for a particular domain and has to conform to a particular writing style.

For fine-tuning, do I just create chunks of incomplete text as the instruction and the completion as the response? I’d also like to do infilling. Was wondering if breaking the prompt into prefix and suffix sections with some kind of tags, and then using the infill as the response, would be the way to go?
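Something like this is what I had in mind for the infilling samples, sketched in Python (the `<prefix>`/`<suffix>` tags and the instruction wording are just placeholders I made up, not any official format):

```python
# One hypothetical infilling example in the chat fine-tuning format.
# <prefix>/<suffix> are made-up tags, not an official convention.
import json

example = {
    "messages": [
        {
            "role": "user",
            "content": (
                "<prefix>The quarterly review shows that revenue \n"
                "<suffix> which is consistent with last year's trend.\n"
                "Fill in the missing text between the prefix and the suffix."
            ),
        },
        {"role": "assistant", "content": "grew by roughly 12% across all regions,"},
    ]
}

print(json.dumps(example))  # one line of the training JSONL file
```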

How many instruction-completion pairs would I need for it to work? Do I need to create multiple chunk-response pairs per document so the model gets what I’m trying to do, or will it be able to infer what I want if I just make one chunk per document, provided I randomise how I chunk the documents (as in, the chunking point is not always x words from the beginning, etc.)?

Will the model be able to pick up sufficient knowledge of the domain to actually autocomplete accurately, or would it be better to train it with RAG baked into the training samples, i.e. the RAG context is part of the “autocomplete this” instruction? There are quite a few “definitions” and “concepts” that keep repeating in my dataset - maybe a few hundred, but like I said, they repeat with more or less standard wording throughout most of the documents.

I’d rather not have to do RAG in my training set as it would increase the training cost and also make dataset creation quite a bit more complicated for me.

Thanks for any help.

Hi @bharat21 - welcome to the Forum. You’ve got quite a few questions here. I’ll get started with my perspectives on some of them.

I don’t think you have much of a choice but to go with a hybrid approach where you rely on RAG to inject your domain-specific concepts and definitions, and on the fine-tuned model for the “autocompletion” in your desired writing style. This means you’d have to bake retrieval results into your fine-tuning data set.
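For the retrieval side, a minimal sketch could look something like the below (assuming you keep your few hundred definitions in a simple list and rank them by embedding similarity; the embedding model and example definitions are placeholders):

```python
# Rough sketch of the retrieval half of the hybrid approach: embed the
# recurring definitions/concepts once, then pull the closest few into each
# prompt. The embedding model and the definitions list are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

definitions = [
    "Term A: standard wording of definition A ...",
    "Term B: standard wording of definition B ...",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

definition_vectors = embed(definitions)

def retrieve(incomplete_text, k=3):
    """Return the k definitions most similar to the text being completed."""
    query = embed([incomplete_text])[0]
    sims = definition_vectors @ query / (
        np.linalg.norm(definition_vectors, axis=1) * np.linalg.norm(query)
    )
    return [definitions[i] for i in np.argsort(sims)[::-1][:k]]
```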

In terms of training samples, I’d start with perhaps 40-50 data pairs and then see where that takes you and whether to expand it further. You could start with a higher number but personally I always first validate my training approach with a smaller data set to avoid going down a rabbit hole. If it’s just one particular writing style, the fine-tuned model should pick it up fairly quickly. However, if there is a lot of diversity in terms of the initial incomplete sentence/paragraph and the resulting output, then you likely need to opt for a larger data set with a sufficiently balanced representation of different examples. There is no exact science to the size of training sets and often a bit of trial and error is necessary.

For the data set, I would in principle go ahead as you suggest: use the incomplete text plus the retrieval results (to cover the additional concepts/definitions) as the user message, and the completion (which I interpret as the appended final set of sentences/paragraphs) as the assistant message. I’d also suggest adding a system message that gives the model some additional instructions on what you are trying to achieve, including the persona you’d like it to adopt (e.g. an expert in XYZ) and how the concepts/definitions are expected to be incorporated into the response (note that if you do include a system message in the training set, you then also need to use that same message when you call the fine-tuned model).
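To make the shape of one training example concrete, here’s a minimal sketch in Python (the system prompt, definitions and text are invented placeholders, so adapt them to your actual documents and retrieval output):

```python
# Sketch of one fine-tuning example with retrieval results included in the
# user message. All contents are invented placeholders.
import json

sample = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are an expert writer in domain XYZ. Continue the user's text "
                "in the house style, using the provided definitions where relevant."
            ),
        },
        {
            "role": "user",
            "content": (
                "Definitions:\n"
                "- Term A: standard wording of definition A ...\n\n"
                "Continue this text:\n"
                "The proposal outlines three phases, beginning with"
            ),
        },
        {
            "role": "assistant",
            "content": " an initial discovery phase focused on scoping requirements.",
        },
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```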

This would be my initial take on it. Perhaps others have additional thoughts.

Hi @jr.2509
Thanks for the response. Couple of questions.
Any suggestions on how I could control generation length?
If I just did one “split” per document, the model might learn to generate the remaining portion in its entirety. But I’d like a little flexibility in terms of having it generate just one sentence or maybe a paragraph.
Or would it be easier to use something like a full stop as a stopping token?

Hi again - as for the generation length, there are a couple of factors that influence it, including:

1. max tokens: that’s particularly useful if you want to constrain the output to a certain length. On the flip side, a higher value at least leaves room for longer outputs, provided the prompt is articulated in the right way (see the sketch after this list for how these settings map onto an API call).

2. temperature: not sure if applicable to your specific use case but as a higher temperature yields more diverse and creative outputs, this is often associated with longer output length (again somewhat dependent on the prompt).

3. Prompt wording: You should use the prompt itself to articulate the characteristics of the output, including the length. While models don’t respond very accurately to exact word counts, you can indicate the number of paragraphs or sentences you are looking for; for more detailed outputs, you can instruct the model to structure the output in sections/sub-sections etc.

4. Fine-tuning training set: Your training set should be reflective of the desired output length.
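As a minimal sketch of how points 1–3 map onto an actual API call (the fine-tuned model id is a placeholder; the stop sequence shows one way to cut generation off early, e.g. at a paragraph break or, as you suggested, a full stop):

```python
# Minimal sketch of the length-related knobs in a chat completions call.
# The model id is a placeholder for your fine-tuned model.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:your-org:autocomplete:xxxx",  # placeholder id
    messages=[
        {"role": "system", "content": "Continue the user's text in the house style."},
        {"role": "user", "content": "The committee reviewed the draft and"},
    ],
    max_tokens=120,   # hard cap on output length
    temperature=0.7,  # lower tends to be more predictable, often shorter
    stop=["\n\n"],    # stop at a paragraph break; use "." to stop at a full stop
)

print(response.choices[0].message.content)
```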

It’s a bit tough to give very specific recommendations in the abstract without knowing more about your use case and seeing some concrete examples.

In general, though, a model is unlikely to write a full document for you in one API call based on a few sentences of input. For such an undertaking, you would typically follow an iterative approach whereby you first define the sections of the document and then, through a series of API calls, have the model flesh out the content of each section.
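A rough sketch of that iterative pattern (the fine-tuned model id and the outline are invented placeholders):

```python
# Rough sketch of section-by-section generation: one call per section, feeding
# the text written so far back in as context. Model id and outline are placeholders.
from openai import OpenAI

client = OpenAI()

outline = ["Introduction", "Background", "Proposal", "Next steps"]
document = ""

for section in outline:
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:your-org:autocomplete:xxxx",  # placeholder id
        messages=[
            {"role": "system", "content": "Write in the house style for domain XYZ."},
            {
                "role": "user",
                "content": f"Document so far:\n{document}\n\nWrite the '{section}' section.",
            },
        ],
        max_tokens=400,
    )
    document += f"\n\n## {section}\n" + response.choices[0].message.content
```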