Auto-tagging articles - any thoughts?

Does anyone have any thoughts/experience/links on writing something to auto-tag blog posts, news articles etc? I’m keen to hear/read ideas.

I’d like automation help tagging thousands of articles.

ChatGPT did a nice job in response to:

"Give me a bunch of tags for this article… xxx
Separate the tags into different taxonomies:

  • Topics
  • Companies
  • People"

But, from one article to another, I noticed some some variance in the exact tag name used even though it should ostensibly be the same. That would be a problem for consistent tagging.

That prompted me to look into controlled vocabularies. IPTC Media Topics is the industry’s main such vocabulary, with 1,100 terms. I asked ChatGPT to use that, but it hallucinated - it cannot directly use that vocabulary. GPT 4 chat via Playground does a better job at this, but it still hallucinates, returning codes for Media Topics but incorrect terms.

Regardless, the results ChatGPT initially gave were actually preferable than something like IPTC Media Topics. Its understanding of language made for much richer, more granular suggested terms.

This creates a dilemma… if, instead of using a controlled vocabulary, I am to ask OpenAI to help me tag a mass of articles, how can I be confident it will use a consistent vocabulary?

Can “temperature” be used to influence this? I understand temperature corresponds to variability. If temperature is reduced or turned to 0, would this increase the chance the AI would use alternative labels each time?

Practically , I guess I would ask it to return results as CSV or JSON format.

Anyone else worked on something like this?

Just to throw an idea out there, maybe you can use GPT-4 again to consolidate the tags after it’s done with tagging individual articles? Something like this:

Here are the tags for the topics of 100 articles. Please consolidate similar tags so that there are fewer variations in the tags. Output a mapping from the current tags to the consolidated tags.

article 1: food, restaurant
article 2: dinner, cooking

Have you tried adding a statement to your prompt such as “If an existing tag applies, use that exact tag name; do not create a new, similar name.”

Hmm, that’s an idea, kind of organically creating a semi-controlled vocabulary.

I’m already doing something similar to identify whether in-bound messy “tags” are actually either a “company” or a “person”, then resetting the values in those taxonomies. My system will first look at those taxonomies to see if the values exist. If not, I fling the values at Google’s Natural Language engine to establish what they are.

The use case of evaluating the article text is a step-change. In your suggested method of trying to tether similarity to the existing sets of terms…

  1. I wonder whether the AI would also bring variability to judging similarity to existing terms.
  2. Starting with an empty taxonomy (no terms), I wonder how this would go.

Only way to know on both counts is to work something up, I guess. I’m working with WordPress.

I’m also interested in taxonomising the article type/format (ie interview/opinion etc). So, that’s potentially auto-taxonomising the following…

  • Topics
  • People
  • Companies
  • Events
  • Format

I’d like to think I could do this all in one prompt-response for each article, by asking it to return a single structured data object (CSV/JSON) containing each.

Yes. I have this working. I send a prompt including the list of possibilities and the articles summary (also generated by AI) to Davinci and it responds with some good suggestions which I automatically apply.

In my case I’m using Discourse and the code is open source:

See: Discourse AI Topic Summary : automated summaries and smart tagging - plugin - Discourse Meta

I sometimes restrict it to using the initial pool only as even though my prompt directs it not to, it sometimes suggests tags that don’t exist. However it is better than nothing and definitely reduces manual effort.

Thanks.
Since each of my articles would likely exceed 2,048 text-davinci-003 tokens, I think I’d need to use GPT 3.5 or GPT4 via chat. Do you foresee any issues with that, beside extra cost?

I think what’s emerging is…

  • Article format: I would control this vocabulary ie. ("Possibilities are: “Interview’, ‘Analysis article’, ‘Opinion’, ‘Feature’” etc - in other words, pass it the existing values of my WordPress taxonomy).
  • People, Companies & Events: I think it would do a good job of simply plucking these out by itself.
  • Topics: Start it going all of its own accord, but, if I find tag variability, investigate constraining the process as @rkaplan suggested, so that, if the determined tag is “similar” to an existing one, the existing one is used. My thousands of articles are partially/messily tagged for topic, but I’m interested in an automation going back over the whole lot, so it would need to substantially build this vocabulary/taxonomy for itself from scratch.
  • My mind is boggling with how else I could divine tags from stories… Software processes, "Software categories mentioned in the articles, etc.

Here’s something for Topics I can see needing some thought, though…

Perhaps post Topic terms should only be set for topics that the post is substantially about. That is, the AI shouldn’t just classify everything that it sees in the article. In a test via ChatGPT (3), I saw it determine 8 topics for the story. Maybe that’s okay, and it was lovely to see it do so - but I wonder whether the thrust of the article was substantially about eight things in reality.

Interesting stuff, thanks.

You might get some ideas in this regard from books/blogs about indexing books. Early attempts at computer-generated indexes simply indexed every word - and that is still done in some situations such as legal deposition transcripts. That results in a “word concordance,” which has very different use cases from a traditional index.

A simple option might be to include in the prompt something like “Select up to 3 tags which best summarize the topics in the text.”

I think it boils down to two main methods:

  1. Hope total auto-tagging is viable.
  2. Define a taxonomy of allowed terms beforehand, with which to constrain the auto-tagging effort.

I think #2 would help to reduce variance. But I foresee some issues in feeding it my list of allowed terms, which could be quite long. On top of the story-to-tag, I think that would make for quite a lot of tokens.

I’m testing with GPT-4. Anyone think that’s overkill? Can I get away with something lesser, if it can handle a token count encompassing both story text (eg. 600 words) and allowed taxonomy terms (several dozen)?

I’m struggling to get gpt-3.5-turbo to use terms only from my predefined list. It just goes ahead and uses its own terms as well.

Why won’t it listen?

Example prompt:

Can you please provide a list of relevant Topic terms for this article?
Choose only from your predefined Topics list.
The topic must be what the article is substantively about, rather than just a tangential or fleeting mention.

ARTICLE…

HEADLINE: Headline text

Story text…

I have also used a version which adds:

You only use the following Topic terms:

  • Something
    • Another thing
      • One more thing
      • Yet another thing

And I’ve used a version trying to set the following at the System level…

You are a software process for tagging articles by “Topic”.

You only use the following Topic terms:

  • Something
    • Another thing
      • One more thing
      • Yet another thing

My attempt at confining it to predefined terms is in the form of a Markdown-formatted, hierarchical bullet list.

1 Like

Understood. I used logic on the Ruby on Rails side to ignore suggestions that didn’t exist. I was unable to force Davinci to stay within bounds.

Oh, that’s a good idea - I should have thought of that. I’d be doing the same on WordPress - if term exists, set the term.
Seems like waste of tokens on the overflow, but it shouldn’t be too many.
Thanks.

1 Like

Or you can map the terms outside your list to terms in your list? You can create the map manually or ask ChatGPT to do it. Alternatively, you can use embedding to find the closest terms on your list.

The most economical method seems to use embeddings.

  1. Simply obtain and store embeddings for all the tags that are allowed/supported.
  2. Then obtain embeddings for every article (use title or body etc per your requirements), check for cosine similarity with the store list of tags’ embeddings.
  3. Use the top n matches as tags.
2 Likes

Would an embedding for a single tag align with that of an article in the same way? If so that’s amazing. I will test that out when I have time.