Tagging a large number of narratives while maintaining context

I have hundreds of narratives (short paragraphs) that I want to tag with one or multiple words.

Even when using gpt-3.5-turbo-16k, the context window is still not large enough to fit 50-100 pages of narratives.

All narratives need to be submitted before generating the tags because some tags may apply to multiple narratives (the full context needs to be provided in my case).

I’m using the chat completion api endpoint.

https://api.openai.com/v1/chat/completions

I’m using Node.js. Here is the backend call:

      const { data } = await openAiApi.post(`/chat/completions`, {
        model: "gpt-3.5-turbo-16k",
        messages: [
          {
            role: "system",
            content:
              "You are an experienced qualitative researcher and text analyst. You are working on a project where you need to analyze a large number of short narratives.",
          },
          { role: "user", content: content },
        ],
        temperature: 0.4,
      });
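
(openAiApi here is just an axios instance pointed at the API, roughly like this; a sketch only, adjust to your own setup:)

    // Sketch of the axios instance used above (adjust to your own key handling).
    const axios = require("axios");

    const openAiApi = axios.create({
      baseURL: "https://api.openai.com/v1",
      headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    });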

The content of the user message is:

const content = `
You will be provided with short narratives. Create qualitative tags for each narrative. The tags should help the researcher organize and combine narratives based on common themes. The goal is to group narratives that have a common theme.
Provide your output in json format (array of objects) in the following format:
[
  {
    "label": "<insert tag label here>",
    "comment": "<insert comment here>",
    "color": "<insert color here>",
    "narrativesIds": ["<id1>", "<id2>", "<id3>"]
  }
]

Where label is the tag label you come up with, comment is a short description of the label, narrativesIds are the ids of the narratives that this label applies to.
For the color field, choose a random value from the following list: purple, teal, orange, amber, brown, grey, deepOrange, blueGrey, yellow, red, pink, cyan, lightBlue, green, indigo, blue, lime.

The narratives are:

${narratives.join("\n\n")}

Output a JSON of all the tags for all narratives. Remember, the same tag could be used for multiple narratives. The tags should be unique and not repeated. Only return the list of tags as JSON with no other words.
`;

I want to output the response as JSON in order to persist it in a database.

Sometimes I have 200 narratives, which exceeds the context window. What is the best way to provide the model with ALL narratives (hundreds)? Is there a way to maintain previous context and submit chunks of narratives one at a time? Any help is appreciated!

Your idea, and the demands it places on the AI, don’t really fit a sustainable product or a database that can be maintained or amended. One misstep by the AI and you get to do it all again.

I’ve tried to follow the prompt and determine your goal, and if I can’t do it, the AI will also give capricious output. “use the same tags” - “tags must be unique” = what?

If you want a unique label or ID for a database, generate that yourself - a truncated MD5sum or such.

Submitted for your consideration: a prompt. Just call the narratives one at a time and batch them (the “narrative” text to process was stolen from another post):
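
Something along these lines, as a minimal Node.js sketch (not the exact prompt; the tagNarrative name, domain placeholder, and prompt wording are illustrative, and it reuses the openAiApi axios instance from the original post):

    // Sketch: tag one narrative per call, then merge the results yourself.
    async function tagNarrative(narrative) {
      const { data } = await openAiApi.post("/chat/completions", {
        model: "gpt-3.5-turbo",
        messages: [
          {
            role: "system",
            content:
              "You are a qualitative researcher tagging short narratives about <your domain>.",
          },
          {
            role: "user",
            content: `Narrative:\n${narrative}\n\nReturn up to three short topical tags as a JSON array of strings, e.g. ["tag one","tag two"]. Output only the JSON array.`,
          },
        ],
        temperature: 0.4,
      });
      return JSON.parse(data.choices[0].message.content);
    }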

For base models, only the expensive text-davinci was able to form a tuple reliably (an output format with less token waste), so we use gpt-chat models.

For “color”, if you don’t mind continuous colors, you can submit the text to an embeddings model, pull out some dimensions, and transform topical similarity into color values (this is a concept I’ve been working on; it needs some treatment so it doesn’t just turn 1000 dimensions into a uniform gray).
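
A deliberately naive illustration of that idea (assumes the text-embedding-ada-002 model and the same openAiApi instance; picking three raw dimensions like this is exactly what produces the uniform-gray problem, so treat it as a starting point only):

    // Naive sketch: turn an embedding into an RGB color from three arbitrary dimensions.
    async function narrativeColor(text) {
      const { data } = await openAiApi.post("/embeddings", {
        model: "text-embedding-ada-002",
        input: text,
      });
      const v = data.data[0].embedding;
      // Raw embedding values are small; stretch them into 0-255 (crude and uncalibrated).
      const channel = (x) => Math.min(255, Math.max(0, Math.round((x + 0.05) * 2550)));
      const [r, g, b] = [v[0], v[1], v[2]].map(channel);
      return `rgb(${r}, ${g}, ${b})`;
    }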

Thank you for the reply.

“use the same tags” - “tags must be unique” = what?

What I meant here is that each tag should be unique (e.g., Technology shouldn’t be repeated twice). However, the Technology tag could be applied to more than one narrative (that’s what the narrativesIds field is for).

If you want a unique label or ID for a database, generate that yourself - a truncated MD5sum or such.

I already have IDs for the narratives, which I wanted to submit to the model so it could apply each tag to the correct narrative.

I agree, my method isn’t reliable and if the resulting JSON isn’t parsable for some reason, all the tokens are wasted.

Just call one at a time and batch them

You mean submit each narrative individually and then create the JSON manually? That works, but the whole idea is to take ALL narratives into account so the tags are applied consistently. But maybe I’m overthinking this. I thought that not providing ALL narratives at once would lead to incomplete context and hence inaccurate tags. But maybe in this case, context isn’t as important as I thought.
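
For example, something like this (rough and untested; narrativesWithIds is just a placeholder for my array of { id, text } objects), where I build the tag JSON myself and could even pass the tags created so far back in so they get reused:

    // Rough sketch (inside an async function): tag narratives one by one and merge
    // the results into the final tag array myself.
    const tags = []; // [{ label, comment, color, narrativesIds }]
    for (const { id, text } of narrativesWithIds) {
      const existing = tags.map((t) => t.label).join(", ") || "none yet";
      const { data } = await openAiApi.post("/chat/completions", {
        model: "gpt-3.5-turbo",
        messages: [
          { role: "system", content: "You tag short narratives with qualitative themes." },
          {
            role: "user",
            content: `Existing tags: ${existing}\nNarrative:\n${text}\n\nReturn 1-3 tags as a JSON array of strings. Reuse an existing tag when it fits.`,
          },
        ],
        temperature: 0.4,
      });
      for (const label of JSON.parse(data.choices[0].message.content)) {
        const tag = tags.find((t) => t.label === label);
        if (tag) tag.narrativesIds.push(id);
        else tags.push({ label, comment: "", color: "grey", narrativesIds: [id] });
      }
    }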

I’m using Node.js, so instead of a tuple, is an array the best structure (fewest tokens needed) for the result?

Thanks again for your response!

If the AI is allowed to make its own categories, about the only thing one might do is specify the input domain when going one-by-one. Tell the AI that it is processing a set of physics papers, and you might get different categories within physics, but just ask for a category, and all of them might be simply tagged “physics” which doesn’t serve much purpose.

If you ask for three possible categories for each, it is unlikely to produce three of the same, and it can even be discouraged from doing so with the frequency and presence penalty parameters. gpt-3.5 is pretty smart on its own at producing what you’d want instead of just emitting unpredictable tokens of any type.
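
For example (the penalty values are arbitrary starting points, and "narrative" here stands for the single text being processed):

    // Sketch: ask for three distinct candidate categories, nudged apart by penalties.
    const { data } = await openAiApi.post("/chat/completions", {
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: "You are tagging short narratives about <your domain>." },
        {
          role: "user",
          content: `Give three distinct candidate categories for this narrative, as a JSON array of strings:\n${narrative}`,
        },
      ],
      temperature: 0.4,
      presence_penalty: 0.5, // discourages tokens that have already appeared in the output
      frequency_penalty: 0.3,
    });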

It sounds like the “IDs” meant to match things up later are just going to confuse it more, and they aren’t needed when you process one item at a time to create the additional AI-generated database fields. Also, paying for the more expensive 16k model and hoping that it follows through on every single item loaded into a large context seems like the wrong path, when it could be cheaper to reliably process each entry individually.

It seems ultimately you are asking for AI categorization of multiple texts. The AI call can be part of adding each to a database via whatever software solution you’ve concocted.

You then basically have a choice: provide your own limited set of categories it must fit within (or even have it rank the applicability of each category), or let it go free, so it can even put “field sports” when it finds a soccer article among your science papers.
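
The constrained version could look something like this (the category list and wording are placeholders):

    // Sketch: restrict the model to a fixed category list instead of free-form tags.
    const CATEGORIES = ["Technology", "Health", "Education", "Finance"]; // your own list
    const { data } = await openAiApi.post("/chat/completions", {
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content: `Classify the narrative using only these categories: ${CATEGORIES.join(", ")}. Return a JSON array of the applicable categories, ranked by relevance.`,
        },
        { role: "user", content: narrative },
      ],
      temperature: 0,
    });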