I’m looking to set up an “autocomplete” writing assistant that can complete my sentences/paragraphs, kind of like GitHub Copilot but for my writing. I’d appreciate any help or pointers on how to go about this.
Most of my writing is for a particular domain and has to conform to a particular writing style.
For fine-tuning, do I just create chunks of incomplete text as the instruction and the completion as the response? I’d also like to do infilling. I was wondering if breaking the prompt into a prefix and a suffix section with some kind of tags, and then using the infill as the response, would be the way to go?
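Something like this is what I’m picturing for an infilling sample - the <PREFIX>/<SUFFIX> tags and the wording are just made up to illustrate, not a format I’ve seen anywhere:

```python
import json

# Hypothetical fill-in-the-middle training example: the user message carries
# the surrounding text wrapped in made-up <PREFIX>/<SUFFIX> tags, and the
# assistant message is the missing middle piece.
example = {
    "messages": [
        {
            "role": "user",
            "content": (
                "<PREFIX>The quarterly filing must disclose all material risks.</PREFIX>"
                "<SUFFIX>These disclosures are reviewed by the compliance team.</SUFFIX>"
            ),
        },
        {
            "role": "assistant",
            "content": "Material risks include market, credit and operational exposures.",
        },
    ]
}

# One JSON object per line, as the fine-tuning endpoint expects JSONL.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```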
How many instruction-completion pairs would I need for it to work? Do I need to create multiple chunk-response pairs per document so the model gets what I’m trying to do, or will it be able to infer what I want if I just make one chunk per document, provided I randomise how I chunk the documents (i.e. the chunking point is not always x words from the beginning, etc.)?
Will the model be able to pick up sufficient knowledge of the domain to actually autocomplete accurately, or would it be better to train it with RAG baked into the training samples, i.e. with the RAG context as part of the “autocomplete this” instruction? There are quite a few “definitions” and “concepts” that keep repeating in my dataset - maybe a few hundred - but, like I said, they repeat with more or less standard wording through most of the documents.
I’d rather not have to do RAG in my training set as it would increase the training cost and also make dataset creation quite a bit more complicated for me.
Hi @bharat21 - welcome to the Forum. You’ve got quite a few questions here. I’ll get started with my perspectives on some of them.
I don’t think you have much of a choice but to go with a hybrid approach where you rely on RAG to inject your domain-specific concepts and definitions and on the fine-tuned model for the “autocompletion” in your desired writing style. This means you’d have to bake retrieval results into your fine-tuning data set.
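To illustrate the retrieval side, here is a very rough sketch, assuming the openai Python package and the text-embedding-3-small model; the definitions list and the similarity cut-off are placeholders you’d adapt:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Placeholder: your few hundred recurring definitions/concepts.
definitions = [
    "Concept A: ...",
    "Concept B: ...",
]

def embed(texts):
    # Returns one embedding vector per input text.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

definition_vectors = embed(definitions)

def retrieve(incomplete_text, k=3):
    # Cosine similarity between the incomplete text and each definition.
    query = embed([incomplete_text])[0]
    scores = definition_vectors @ query / (
        np.linalg.norm(definition_vectors, axis=1) * np.linalg.norm(query)
    )
    top = np.argsort(scores)[::-1][:k]
    return [definitions[i] for i in top]
```

You’d use the same retrieval step both when building the training samples and later at inference time, so the model sees a consistent pattern.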
In terms of training samples, I’d start with perhaps 40-50 data pairs and then see where that takes you and whether to expand it further. You could start with a higher number but personally I always first validate my training approach with a smaller data set to avoid going down a rabbit hole. If it’s just one particular writing style, the fine-tuned model should pick it up fairly quickly. However, if there is a lot of diversity in terms of the initial incomplete sentence/paragraph and the resulting output, then you likely need to opt for a larger data set with a sufficiently balanced representation of different examples. There is no exact science to the size of training sets and often a bit of trial and error is necessary.
For the data set, I would in principle go ahead as you suggest: use the incomplete text as the user message, plus the retrieval results to accommodate the additional concepts/definitions, and then the completion (which I interpret as the appended final set of sentences/paragraphs) as the assistant message. I’d suggest adding a system message to provide the model with some additional instructions on what you are trying to achieve, including the persona you’d like the model to adopt (e.g. an expert in XYZ) and how the concepts/definitions are expected to be incorporated into the response. (Note that if you do include a system message in the training set, you then also need to use that same message when you call the fine-tuned model.)
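For illustration, a single training line could then look roughly like this - the system wording, the definitions and the text are all placeholders you’d adapt to your domain:

```python
import json

training_line = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are an expert writer in <your domain>. Continue the user's text "
                "in the house style, using the provided definitions where relevant."
            ),
        },
        {
            "role": "user",
            "content": (
                "Definitions:\n- Concept A: ...\n- Concept B: ...\n\n"
                "Continue this text:\nThe committee reviewed the proposal and"
            ),
        },
        {
            "role": "assistant",
            "content": " concluded that it met the criteria set out in section 4.",
        },
    ]
}

# Each training example is one line of the JSONL file.
print(json.dumps(training_line, ensure_ascii=False))
```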
This would be my initial take on it. Perhaps others have additional thoughts.
Hi @jr.2509
Thanks for the response. Couple of questions.
Any suggestions on how I could control generation length?
If I just did one “split” per document, the model might learn to generate the remaining portion in its entirety. But I’d like a little flexibility in terms of having it generate just one sentence or maybe a paragraph.
Or would it be easier to use something like a full stop as a stopping token?
Hi again - as for the generation length, there are a couple of factors that would influence it, including (see the API sketch after the list):
1. Max tokens: this is particularly useful if you want to constrain the output to a certain length. On the flip side, if you set it to a higher value, this can help yield longer outputs, provided the prompt is articulated in the right way.
2. Temperature: not sure if it’s applicable to your specific use case, but as a higher temperature yields more diverse and creative outputs, it is often associated with longer outputs (again somewhat dependent on the prompt).
3. Prompt wording: you should use the prompt itself to articulate the characteristics of the output, including its length. While models don’t respond very accurately to exact word counts, you can indicate the number of paragraphs or sentences you are looking for; for more detailed outputs, you can instruct the model to structure the output into sections/sub-sections, etc.
4. Fine-tuning training set: Your training set should be reflective of the desired output length.
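To tie points 1 and 2 (and your earlier stop-token idea) back to the API, here is a minimal sketch - assuming the openai Python package; the fine-tuned model id and messages are placeholders:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # Placeholder fine-tuned model id - use the one returned by your fine-tuning job.
    model="ft:gpt-4o-mini-2024-07-18:your-org:autocomplete:xxxx",
    messages=[
        {
            "role": "system",
            "content": "You are an expert writer in <your domain>. Continue the user's text in the house style.",
        },
        {
            "role": "user",
            "content": "Continue this text:\nThe committee reviewed the proposal and",
        },
    ],
    max_tokens=60,    # caps the completion length
    temperature=0.4,  # lower = more conservative continuations
    stop=["\n\n"],    # or e.g. ["."] to stop after a single sentence
)

print(response.choices[0].message.content)
```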
It’s a bit tough to provide very specific recommendations in the abstract without knowing more about your use case and looking at some concrete examples.
In general, though, a model is unlikely to write a full document for you in one API call based on a few sentences of input. For such an undertaking, you would typically follow an iterative approach whereby you first define the sections of a document and then have the model flesh out the content of each section through a series of API calls.
I have been working on something similar with regard to convincing a model to emulate a writing style. I have been dropping the system and user prompts entirely from my JSONL, and that seems to do the least damage in terms of making the model cosmically stupid. But the resulting fine-tunes are not readily distinguishable in style from the base model. Are you suggesting adding a system prompt back in might help?
I think a system prompt can be helpful to further steer the model towards the desired style, as you can for example assign a specific persona, which tends to be quite powerful. But the examples in the training data provided are just as important.
Also, note that if you do include a system prompt in the training file, you will also need to include a system message later on when you consume the fine-tuned model otherwise you risk inconsistencies in performance.
I tried a Q&A format initially, with user prompts being questions and bot responses being quotes from the text, but that led to the model getting random very quickly (like within 50-100 examples). I have had better luck so far omitting both user and system prompts. That is, the model seems to retain more general responsiveness, but the style shift isn’t remarkable. I will try the system prompt you suggest. Thanks for the suggestion!
So, mixed success. I tried with the system prompt and full paragraphs of text, and I got paragraphs that made sense about 80% of the time. I tried with the system prompt and individual sentences, and got more fragmented, even worse responses. This was with and without adding the system prompt to the retrieval instructions. Oh well.
I am now wrestling with the, perhaps incorrect, intuition that less training may be better. Currently running at 50 examples and 3 epochs.
Help me understand: what exactly are you trying to achieve with the fine-tuning? I understand you want the model to respond in a certain style. Is there anything else?
If you could share just a few more details on your specific use case, that would help a lot. Thanks!
Thanks for your interest and for taking the time. I am trying to get it to respond in a certain style to general prompts, e.g. “Write me a paragraph about Paris” or “Write me a paragraph about imposter syndrome.” The end result I hope for is a chatbot that can respond to various requests with a specific style and tone of voice. 80% more or less correct is too low a bar for something I may want to make public. Is that helpful?
Given 4o is pretty capable, I might be able to get good results with prompt engineering and a couple of few-shot examples, but I’m trying to capture more nuance. Also, of course, I’m trying to learn something.
No worries, that’s helpful. One more question, just to make sure I am on the same page: in your latest fine-tuning attempt, what exactly was included in your training data?
If you can share a training example, that would be helpful.
In any case, what you are trying to achieve should be absolutely doable. It’s probably just about tweaking your training data a little bit more.
Here’s a sample JSONL line. There is some Unicode in there, but I don’t think that would necessarily be an issue. It is basically a system prompt plus an assistant response with a full paragraph. This is the configuration with which I get about 80% good results: I may get the beginning of a sensible paragraph out of the trained model, but it sometimes devolves into gibberish.
{"messages": [{"role": "system", "content": "You are a helpful leadership and coaching expert that responds in an academic, but accessible style."}, {"role": "assistant", "content": "Nobody knows the future. Machines have already mastered creativity\u2014they can produce decent art, music, poetry or essays. And it\u2019s foreseeable that machines will be able to communicate and collaborate. But critical and contextual thinking? I don\u2019t think so. At least not for a long time to come. These are skills are what this writing aims to enhance. And I bet people will still need is how to communicate and collaborate effectively (even if it\u2019s with machines)."}]}
I would include a user message with the corresponding question, so that the training better mimics how you expect to use the model in the future.
Your system message is a bit generic. I think you could elaborate a little bit more on what you are specifically looking for in terms of style. The example assistant message gives an idea of the style but could still be a bit more distinctive.
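For example, a reworked training line might look roughly like this - the question and the style description are only illustrative, not something I’d prescribe:

```python
import json

reworked = {
    "messages": [
        {
            "role": "system",
            # Illustrative style description - adapt to your own voice.
            "content": (
                "You are a leadership and coaching expert. Write in an academic but "
                "accessible style: short declarative sentences, first-person asides, "
                "and concrete examples over abstractions."
            ),
        },
        {
            "role": "user",
            "content": "Write me a paragraph about whether machines will replace critical thinking.",
        },
        {
            "role": "assistant",
            "content": "Nobody knows the future. Machines have already mastered creativity...",
        },
    ]
}

print(json.dumps(reworked, ensure_ascii=False))
```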
Thanks! The first time I tried Q&A, it was with sentences rather than paragraphs. Catastrophe. I have been using ChatGPT to generate synthetic questions; maybe I can do better with the prompt for that.