Hello,
I have been working on a model that can understand and generate fictional characters. I plan to train it on a dataset of screenplays from approximately 3,000 movies, covering more than 45,000 characters.
One of the key challenges I am facing is annotation. My initial plan was to run several NLP tasks to annotate dialogues, entities, and emotions in the dataset. However, this has proven quite difficult, because most off-the-shelf Named Entity Recognition (NER) models struggle to tag franchise-specific objects (e.g., Infinity Stones, Elder Wand) and fictional locations (e.g., Pandora, Arrakis).
I am unsure how best to tackle this problem. One approach I am considering is a vector search over an enriched embedding database, but I am not certain whether it would tag these specialized terms correctly.
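To make the vector-search idea concrete, here is a rough sketch of the kind of lookup I have in mind. It is only a toy under stated assumptions: the gazetteer entries and labels are made up for illustration, character trigram counts stand in for real embeddings, and the 0.6 threshold is arbitrary. A real setup would presumably use a proper embedding model and a vector index.

```python
import math
from collections import Counter

# Hypothetical gazetteer (surface form -> label), made up for illustration.
GAZETTEER = {
    "Infinity Stones": "OBJECT",
    "Elder Wand": "OBJECT",
    "Pandora": "LOCATION",
    "Arrakis": "LOCATION",
}


def char_ngrams(text, n=3):
    """Bag of character n-grams: a crude stand-in for a learned embedding."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Precompute one vector per gazetteer entry (the "embedding database").
INDEX = [(name, label, char_ngrams(name)) for name, label in GAZETTEER.items()]


def tag_span(span, threshold=0.6):
    """Return (matched_name, label, score) for the nearest entry, or None."""
    vec = char_ngrams(span)
    best = max(
        ((name, label, cosine(vec, entry_vec)) for name, label, entry_vec in INDEX),
        key=lambda item: item[2],
    )
    return best if best[2] >= threshold else None


if __name__ == "__main__":
    for candidate in ["the elder wand", "Arrakis", "coffee mug"]:
        print(candidate, "->", tag_span(candidate))
```

Candidate spans would come from noun chunks or capitalized phrases in the screenplay text, and anything below the threshold would be left untagged for manual review.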
I would love to hear expert insights on the best way to approach this. Since I am still a student in the learning stages, I am particularly interested in:
- Best practices for handling specialized entity recognition in screenplays.
- Potential APIs or tools that could assist with annotation and tagging.
- Alternative methods that are both computationally and financially efficient.
Any guidance on whether I am heading in the right direction, or whether there are smarter approaches, would be greatly appreciated.
Thank you in advance for your time and insights!
Best regards,