Auto-tagging articles - any thoughts?

parakeet · May 21, 2023, 9:17pm

Does anyone have any thoughts/experience/links on writing something to auto-tag blog posts, news articles etc? I’m keen to hear/read ideas.

I’d like automation help tagging thousands of articles.

ChatGPT did a nice job in response to:

"Give me a bunch of tags for this article… xxx
Separate the tags into different taxonomies:

Topics

Companies

People"

But, from one article to another, I noticed some some variance in the exact tag name used even though it should ostensibly be the same. That would be a problem for consistent tagging.

That prompted me to look into controlled vocabularies. IPTC Media Topics is the industry’s main such vocabulary, with 1,100 terms. I asked ChatGPT to use that, but it hallucinated - it cannot directly use that vocabulary. GPT 4 chat via Playground does a better job at this, but it still hallucinates, returning codes for Media Topics but incorrect terms.

Regardless, the results ChatGPT initially gave were actually preferable than something like IPTC Media Topics. Its understanding of language made for much richer, more granular suggested terms.

This creates a dilemma… if, instead of using a controlled vocabulary, I am to ask OpenAI to help me tag a mass of articles, how can I be confident it will use a consistent vocabulary?

Can “temperature” be used to influence this? I understand temperature corresponds to variability. If temperature is reduced or turned to 0, would this increase the chance the AI would use alternative labels each time?

Practically , I guess I would ask it to return results as CSV or JSON format.

Anyone else worked on something like this?

CreatiCode · May 21, 2023, 9:53pm

Just to throw an idea out there, maybe you can use GPT-4 again to consolidate the tags after it’s done with tagging individual articles? Something like this:

Here are the tags for the topics of 100 articles. Please consolidate similar tags so that there are fewer variations in the tags. Output a mapping from the current tags to the consolidated tags.

article 1: food, restaurant
article 2: dinner, cooking
…

rkaplan · May 21, 2023, 10:33pm

Have you tried adding a statement to your prompt such as “If an existing tag applies, use that exact tag name; do not create a new, similar name.”

parakeet · May 22, 2023, 5:46am

Hmm, that’s an idea, kind of organically creating a semi-controlled vocabulary.

I’m already doing something similar to identify whether in-bound messy “tags” are actually either a “company” or a “person”, then resetting the values in those taxonomies. My system will first look at those taxonomies to see if the values exist. If not, I fling the values at Google’s Natural Language engine to establish what they are.

The use case of evaluating the article text is a step-change. In your suggested method of trying to tether similarity to the existing sets of terms…

I wonder whether the AI would also bring variability to judging similarity to existing terms.
Starting with an empty taxonomy (no terms), I wonder how this would go.

Only way to know on both counts is to work something up, I guess. I’m working with WordPress.

I’m also interested in taxonomising the article type/format (ie interview/opinion etc). So, that’s potentially auto-taxonomising the following…

Topics
People
Companies
Events
Format

I’d like to think I could do this all in one prompt-response for each article, by asking it to return a single structured data object (CSV/JSON) containing each.

merefield · May 22, 2023, 5:56am

Yes. I have this working. I send a prompt including the list of possibilities and the articles summary (also generated by AI) to Davinci and it responds with some good suggestions which I automatically apply.

In my case I’m using Discourse and the code is open source:

See: Discourse AI Topic Summary : automated summaries and smart tagging - plugin - Discourse Meta

I sometimes restrict it to using the initial pool only as even though my prompt directs it not to, it sometimes suggests tags that don’t exist. However it is better than nothing and definitely reduces manual effort.

parakeet · May 22, 2023, 6:14am

Thanks.
Since each of my articles would likely exceed 2,048 text-davinci-003 tokens, I think I’d need to use GPT 3.5 or GPT4 via chat. Do you foresee any issues with that, beside extra cost?

I think what’s emerging is…

Article format: I would control this vocabulary ie. ("Possibilities are: “Interview’, ‘Analysis article’, ‘Opinion’, ‘Feature’” etc - in other words, pass it the existing values of my WordPress taxonomy).
People, Companies & Events: I think it would do a good job of simply plucking these out by itself.
Topics: Start it going all of its own accord, but, if I find tag variability, investigate constraining the process as @rkaplan suggested, so that, if the determined tag is “similar” to an existing one, the existing one is used. My thousands of articles are partially/messily tagged for topic, but I’m interested in an automation going back over the whole lot, so it would need to substantially build this vocabulary/taxonomy for itself from scratch.
My mind is boggling with how else I could divine tags from stories… Software processes, "Software categories mentioned in the articles, etc.

Here’s something for Topics I can see needing some thought, though…

Perhaps post Topic terms should only be set for topics that the post is substantially about. That is, the AI shouldn’t just classify everything that it sees in the article. In a test via ChatGPT (3), I saw it determine 8 topics for the story. Maybe that’s okay, and it was lovely to see it do so - but I wonder whether the thrust of the article was substantially about eight things in reality.

Interesting stuff, thanks.

rkaplan · May 22, 2023, 9:23am

You might get some ideas in this regard from books/blogs about indexing books. Early attempts at computer-generated indexes simply indexed every word - and that is still done in some situations such as legal deposition transcripts. That results in a “word concordance,” which has very different use cases from a traditional index.

A simple option might be to include in the prompt something like “Select up to 3 tags which best summarize the topics in the text.”

parakeet · May 30, 2023, 9:13pm

I think it boils down to two main methods:

Hope total auto-tagging is viable.
Define a taxonomy of allowed terms beforehand, with which to constrain the auto-tagging effort.

I think #2 would help to reduce variance. But I foresee some issues in feeding it my list of allowed terms, which could be quite long. On top of the story-to-tag, I think that would make for quite a lot of tokens.

I’m testing with GPT-4. Anyone think that’s overkill? Can I get away with something lesser, if it can handle a token count encompassing both story text (eg. 600 words) and allowed taxonomy terms (several dozen)?

parakeet · May 31, 2023, 11:08am

I’m struggling to get gpt-3.5-turbo to use terms only from my predefined list. It just goes ahead and uses its own terms as well.

Why won’t it listen?

Example prompt:

Can you please provide a list of relevant Topic terms for this article?
Choose only from your predefined Topics list.
The topic must be what the article is substantively about, rather than just a tangential or fleeting mention.

ARTICLE…

HEADLINE: Headline text

Story text…

I have also used a version which adds:

You only use the following Topic terms:

Something

Another thing

One more thing

Yet another thing

And I’ve used a version trying to set the following at the System level…

You are a software process for tagging articles by “Topic”.

You only use the following Topic terms:

Something

Another thing

One more thing

Yet another thing

My attempt at confining it to predefined terms is in the form of a Markdown-formatted, hierarchical bullet list.

merefield · May 31, 2023, 12:03pm

Understood. I used logic on the Ruby on Rails side to ignore suggestions that didn’t exist. I was unable to force Davinci to stay within bounds.

parakeet · May 31, 2023, 12:11pm

Oh, that’s a good idea - I should have thought of that. I’d be doing the same on WordPress - if term exists, set the term.
Seems like waste of tokens on the overflow, but it shouldn’t be too many.
Thanks.

CreatiCode · May 31, 2023, 12:32pm

Or you can map the terms outside your list to terms in your list? You can create the map manually or ask ChatGPT to do it. Alternatively, you can use embedding to find the closest terms on your list.

sps · May 31, 2023, 12:42pm

The most economical method seems to use embeddings.

Simply obtain and store embeddings for all the tags that are allowed/supported.
Then obtain embeddings for every article (use title or body etc per your requirements), check for cosine similarity with the store list of tags’ embeddings.
Use the top n matches as tags.

merefield · May 31, 2023, 12:50pm

Would an embedding for a single tag align with that of an article in the same way? If so that’s amazing. I will test that out when I have time.

Topic		Replies	Views
Fine tuning for use of keyword lists API fine-tuning	15	3435	December 10, 2023
Resolving ChatGPT hallucinations for text classification using IAB taxonomy Prompting gpt-4 , chatgpt	3	2543	July 23, 2023
Document Tagging 4o vs 4o-mini API	1	165	December 12, 2024
Reducing Cost of GPT 4 by using embeddings Prompting	23	10986	May 4, 2023
How I cluster/segment my text after embeddings process for easy understanding? API	13	14344	December 18, 2024

Auto-tagging articles - any thoughts?

Related topics