I have significant experience with this, as I wrote the first AI plugin for Discourse. It implements topic summarisation and includes a smart (LLM-based) tagging feature.
Some conclusions from that experience:
Without a confined list, your tag set will likely become too large and too overlapping over time. You end up with no discrete categories and too many synonyms: a mess.
Prompting with a defined set of tags works well.
The best results come from prompting with a Completion rather than using embeddings and semantic similarity (surprisingly), but that’s not what you asked.
gpt4-turbo is way better than the 4o series at this. Unfortunately it’s much more expensive too, but you really get what you pay for in this task.
I’ve been considering distilling gpt4-turbo responses into a fine-tuned mini, but the distillation toolset is a bit raw and incomplete atm.
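
To make the "defined set" point concrete, here is a rough sketch of the idea in Python. It is not the plugin's actual code: the function names, the prompt wording, and the `complete` callable (a stand-in for whatever LLM call you use) are all assumptions for illustration. The prompt lists only the allowed tags, and the reply is filtered against that same list so the model cannot introduce new tags or synonyms.

```python
from typing import Callable, Iterable, List


def build_tagging_prompt(topic_text: str, allowed_tags: Iterable[str], max_tags: int = 3) -> str:
    """Build a prompt that confines the model to a fixed tag list (wording is illustrative)."""
    tag_list = ", ".join(sorted(allowed_tags))
    return (
        f"Choose up to {max_tags} tags for the forum topic below.\n"
        f"Use only tags from this list: {tag_list}\n"
        "Reply with the chosen tags as a comma-separated list and nothing else.\n\n"
        f"Topic:\n{topic_text}"
    )


def parse_tags(reply: str, allowed_tags: Iterable[str], max_tags: int = 3) -> List[str]:
    """Keep only tags from the allowed set, dropping anything the model invented."""
    allowed = {t.lower() for t in allowed_tags}
    chosen: List[str] = []
    for raw in reply.split(","):
        tag = raw.strip().lower()
        if tag in allowed and tag not in chosen:
            chosen.append(tag)
    return chosen[:max_tags]


def tag_topic(
    topic_text: str,
    allowed_tags: Iterable[str],
    complete: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's text reply
    max_tags: int = 3,
) -> List[str]:
    """Prompt with the confined list, then validate the reply against it."""
    allowed = list(allowed_tags)
    prompt = build_tagging_prompt(topic_text, allowed, max_tags)
    return parse_tags(complete(prompt), allowed, max_tags)


if __name__ == "__main__":
    # Stubbed model reply so the sketch runs without any API key.
    fake_complete = lambda prompt: "Plugins, made-up-tag, api"
    print(tag_topic("How do I call the API from a plugin?", ["api", "plugins", "hosting"], fake_complete))
    # prints ['plugins', 'api']
```

Filtering after the call is the important part: even with a well-behaved model, an occasional off-list tag is what slowly grows the tag set into the mess described above.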