Seeking Solutions for Instability in Multi-Class Labeling Tasks

Using the ChatGPT 16k model with a temperature setting of 0 for multi-class labeling tasks on a Traditional Chinese corpus, I have encountered two issues:

  1. The output format is unstable. Even when the format is specified and examples are provided, the output can still change unpredictably.
  2. When the prompt is otherwise identical and only the output format is changed from CSV to JSON, the labeled results show noticeable changes in pattern.

To use the labeled results as a baseline for scientific research, a high degree of reproducibility is needed: the labels should not be influenced by incidental factors such as the specified output format, and repeated requests for the same text should consistently produce the same labels.

I’d like to ask whether anyone has similar experiences or advice to share. Thank you!

Welcome to the community!

100% adherence to a customized output format without fine-tuning is not possible. These systems are neither designed nor built to provide the exact same results from the same prompt each time. There will always be a degree of variability and unpredictability no matter what you do.

I would recommend fine-tuning a model with examples of the format you’re trying to get it to produce. This sounds like a specialized use case that fine-tuning would work well for.

Btw, if you’re trying to tag corpora using prompting, I hate to admit it, but post-processing with plain-old programming is still likely going to be the better bet here, again unless you fine-tune the instructions to your use-case.

These models are great for generating raw text to be placed in a corpus. They are not that great for handling and manipulating such text in precise, complex, replicable formats yet.


Welcome @jeoneungo

What is the prompt that you’re using?


Thanks for the warm welcome and responses! Here’s my prompt template:

you are an expert in an *umbrella term*. Following the *Umbrella Term*:
    1."Subconcept 1: anchor words of subconcept 1",
    2."Subconcept 2: anchor words of subconcept 2",
    5."Subconcept 5: anchor words of subconcept 5",
    strict classify the sample 
    If it talk about subconcept mentioned above labeling sample to '1'
    If it did not talk about subconcept mentioned above labeling sample to '0'
    Each subconcept should receive one label.
    only return label like format : 1,1,1,0,0.

This task involves 5-class labeling, and each request includes only one sample to keep the output consistent.


So if I’m understanding this correctly, you’re trying to get an LLM to produce a fixed set of values, each of which is either a 1 or a 0?

So you want all its outputs to strictly be [1,1,1,0,0,0] or [1,0,1,0,1], for example, right?

Hmm… Are you using LangChain for this by any chance?

This doesn’t look like a prompt that can work in a single-shot style. In theory, it should be possible for GPT to identify something within a sentence or phrase with a binary yes/no answer (1/0), but asking it to make all of those determinations in one prompt and place the correct values in order in a finite array, correctly each time, is not going to work.

Also, what is “it” here? LLMs need an almost superfluous level of clarity to operate optimally under their constraints. From your prompt, it’s not clear what “it” is (although I also understand this may have been a prompt in a different language at first, which could also cause communication breakdowns).

At a high level, reconfiguring this setup would probably mean coaxing the model into producing strictly a yes/no single-word answer for each subconcept iteratively (meaning you will not be able to feed it all 5 subconcepts at once), and then translating yes → 1 and no → 0 in post-processing.
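A rough sketch of that iterative approach might look like the following. This is an assumption about how you might wire it up, not working API code: `ask_llm` is a hypothetical placeholder for whatever chat-completion wrapper you use.

```python
# Sketch: query one subconcept at a time, force a yes/no answer,
# then map yes -> 1 and no -> 0 in post-processing.

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with your actual API call.
    raise NotImplementedError("replace with your chat-completion call")

def label_sample(sample: str, subconcepts: list[str], ask=ask_llm) -> list[int]:
    labels = []
    for concept in subconcepts:
        prompt = (
            f"Does the following sentence talk about '{concept}'? "
            f"Answer with exactly one word, yes or no.\n\nSentence: {sample}"
        )
        answer = ask(prompt).strip().lower()
        labels.append(1 if answer.startswith("yes") else 0)
    return labels
```

Because the yes/no mapping happens in code rather than in the model's output format, the final 1/0 array is deterministic given the model's answers.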

This looks like something LangChain would actually work well for now that I see your prompt. However, note that it’s still going to be relatively complex. You will not be able to get away with 0 programming for this, if you want that exact formatted array.

These models are built to generate natural language. They can analyze patterns well, especially in language, so the use case can still work in theory. However, they are not built to produce numerical outputs in that fashion. You can either work with yes/no in place of 1/0 and change the prompting approach, or write a program or some scripting that will turn the model’s outputs into the format you want; either way it will still need to be prompted differently, and likely with multiple shots, if you want to maximize clarity and verifiability.
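If you do stay with the 1,1,1,0,0-style output, a small validation step in post-processing can at least catch malformed replies before they enter the corpus. A sketch (the expected label count of 5 matches the prompt above; the cleanup rules are assumptions about typical formatting drift):

```python
import re

def parse_labels(raw: str, expected: int = 5):
    """Parse a model reply like '1,1,1,0,0' into a list of ints.

    Returns None when the reply does not match the expected format,
    so the caller can re-queue that sample for re-labeling.
    """
    # Strip everything except 0, 1, and commas (models often add
    # brackets, spaces, or trailing text around the label string).
    cleaned = re.sub(r"[^01,]", "", raw)
    parts = [p for p in cleaned.split(",") if p]
    if len(parts) != expected or any(p not in ("0", "1") for p in parts):
        return None
    return [int(p) for p in parts]
```

Anything that fails validation is flagged rather than silently accepted, which also gives you a concrete measure of how often the format drifts.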

Also, even if it did produce that format correctly, do not assume its determination was entirely accurate. It would need to be cross-checked and verified by a human, which, depending on your corpus size, can turn into a lot of work. Having been an unpaid intern handling corpus labeling at my uni, I get that labeling corpus data is a pain.

Those are the limitations and capabilities right now from my experience. Once you decide on which approach you’d prefer, we’d love to continue assisting you. Otherwise, you will not be able to generate both accurate results and consistent formats through single-shot prompting alone in the way you provided.


Indeed, I wish to restrict the output results to [1,1,1,0,0,0] or [1,0,1,0,1]. At present, I can maintain a consistent output format by sending one sentence per request, even though approximately 1.2% of sentences still require re-labeling. However, the token cost has become a concern.

This is my exact concern. I’m doing this to compare the results with the application output of a well-constructed dictionary. As I mentioned in the question, the goal is to achieve a high level of reproducibility in this labeling process, so I prefer not to use fine-tuning or apply excessive manual customization to the model.

The other technical details and suggestions you mentioned have been really helpful. For instance, your points that ‘it’s not clear what “it” is’ and about producing ‘strictly a yes/no single word answer’ were particularly enlightening. Thank you very much!


One of the problems with this prompt is that it uses negative prompting.

You can ask the model for the indices of the categorized subconcept(s) instead of the binary array.
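For example, if the model is instructed to return the indices of matching subconcepts (say "2, 4", or "none"), post-processing can rebuild the fixed-length binary array deterministically. A sketch, where the index-based reply format is an assumption:

```python
def indices_to_binary(reply: str, n_subconcepts: int = 5) -> list[int]:
    """Turn an index reply like '2, 4' into [0, 1, 0, 1, 0].

    Indices are assumed 1-based, matching the numbered subconcepts
    in the prompt; anything unparsable is simply ignored.
    """
    labels = [0] * n_subconcepts
    for token in reply.replace(",", " ").split():
        if token.isdigit() and 1 <= int(token) <= n_subconcepts:
            labels[int(token) - 1] = 1
    return labels
```

This moves the array construction out of the model entirely, so the output format can no longer drift.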

There are also some grammatical errors, which add unpredictability to the mix.


Hi, Wu. I am currently facing the same problem you are experiencing. I am a first-year PhD student researching LLM4Security; maybe we can have a chat on WeChat (ccsnow127) :grinning:

I have tried everything, including preparing the data with a tool. I fine-tuned babbage-002 with 800 data points that were already labelled (4 labels, 200 data points each), but my fine-tuned model still returns random text and fails to classify one of the labels.