Seeking Advice on Annotating and Developing Character Models

Hello,

I have been working on a model that can understand and generate characters, which I plan to train on a dataset of screenplays from approximately 3,000 movies featuring over 45,000 characters.

One of the key challenges I am facing is annotation. My initial plan was to run several NLP tasks to annotate dialogues, entities, and emotions in the dataset. However, this has proven quite difficult: most off-the-shelf Named Entity Recognition (NER) models struggle to tag special objects (e.g., the Infinity Stones, the Elder Wand) and fictional locations (e.g., Pandora, Arrakis).

I am unsure how best to tackle this. One approach I am considering is vector search over an enriched embeddings database, but I am not certain whether such a model would tag these specialized terms correctly.

I would love to hear expert insights on the best way to approach this. Since I am still a student in the learning stages, I am particularly interested in:

  1. Best practices for handling specialized entity recognition in screenplays.
  2. Potential APIs or tools that could assist with annotation and tagging.
  3. Alternative methods that are both computationally and financially efficient.

Any guidance on whether I am heading in the right direction or if there are smarter approaches would be greatly appreciated.

Thank you in advance for your time and insights!

Best regards,


Here are a few ideas to help with your project…

  • Custom NER model: Fine-tune a model with libraries like spaCy or Hugging Face Transformers on a sample of your screenplay data. Manually label examples that include unique terms such as fictional objects and locations; this training can help the model pick up terms that standard models miss.
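
A minimal sketch of that fine-tuning loop using spaCy's v3 training API. The labels (`FICTIONAL_OBJECT`, `FICTIONAL_LOCATION`) and the two training sentences are illustrative stand-ins for your hand-labelled screenplay samples, not part of any real dataset:

```python
# Sketch: fine-tuning a blank spaCy NER pipeline on custom screenplay labels.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hand-labelled samples: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Thanos seeks the Infinity Stones.", {"entities": [(17, 32, "FICTIONAL_OBJECT")]}),
    ("The story is set on Arrakis.", {"entities": [(20, 27, "FICTIONAL_LOCATION")]}),
]

for _, ann in TRAIN_DATA:
    for _start, _end, label in ann["entities"]:
        ner.add_label(label)

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
nlp.initialize(lambda: examples)

# A few passes over the tiny sample; real training needs far more data.
for _ in range(20):
    losses = {}
    nlp.update(examples, losses=losses)

doc = nlp("Paul leaves for Arrakis.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With only two examples the predictions will be unreliable; the point is the shape of the loop (label, build `Example` objects, `initialize`, `update`), which scales directly to a few hundred annotated screenplay lines.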

  • Vector search with enriched embeddings: Build an embeddings database from your dataset so the model can capture the context of specialized terms. This method can help match similar concepts even when a term is rare or unique.
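
The lookup side of that idea can be sketched with plain cosine similarity. The `term_vectors` table below uses made-up 3-dimensional vectors purely for illustration; in practice each entry would be a real embedding produced by a sentence-embedding model, and the threshold would need tuning:

```python
# Sketch: nearest-neighbour lookup over a table of term embeddings.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings for known specialized terms.
term_vectors = {
    "Infinity Stones": np.array([0.9, 0.1, 0.0]),
    "Elder Wand":      np.array([0.8, 0.2, 0.1]),
    "Arrakis":         np.array([0.1, 0.9, 0.2]),
}

def nearest_term(query_vec, threshold=0.8):
    """Return the best-matching known term, or None if nothing clears the threshold."""
    best_term, best_score = None, threshold
    for term, vec in term_vectors.items():
        score = cosine_sim(query_vec, vec)
        if score > best_score:
            best_term, best_score = term, score
    return best_term

print(nearest_term(np.array([0.85, 0.15, 0.05])))
```

A library like FAISS or a vector database does the same nearest-neighbour search at scale, but the logic is just this: embed the candidate span, compare it to known term vectors, and tag it if the match is close enough.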

  • Hybrid approach: Combine a rule-based method with machine learning. For example, maintain a list of known specialized terms and have your system cross-check the ML output against this list. This mix can catch terms that a pure ML model might overlook.
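
That cross-check is simple to implement. Here is a minimal sketch where `ml_out` stands in for whatever your tagger returns, and the gazetteer entries and labels are illustrative:

```python
# Sketch: merging a rule-based gazetteer with ML tagger output.
GAZETTEER = {
    "Infinity Stones": "FICTIONAL_OBJECT",
    "Elder Wand": "FICTIONAL_OBJECT",
    "Pandora": "FICTIONAL_LOCATION",
    "Arrakis": "FICTIONAL_LOCATION",
}

def gazetteer_entities(text):
    """Exact-match lookup of known specialized terms in the text."""
    return {(term, label) for term, label in GAZETTEER.items() if term in text}

def merge_entities(text, ml_entities):
    """Union ML predictions with gazetteer hits; gazetteer labels win on conflict."""
    merged = {term: label for term, label in ml_entities}
    for term, label in gazetteer_entities(text):
        merged[term] = label  # rule-based label overrides the ML guess
    return merged

ml_out = [("Pandora", "PERSON")]  # a typical off-the-shelf NER mistake
print(merge_entities("Jake travels to Pandora.", ml_out))
```

For production use you would match on token spans rather than raw substrings (spaCy's `EntityRuler` component does exactly this), but the override logic stays the same.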

  • APIs and open-source tools: Explore services such as the Google Cloud Natural Language API or open-source frameworks that allow customization. Although many of these tools target general language, adapting them to your dataset may improve tagging accuracy.

  • Start small: Test your approach on a smaller portion of your dataset before scaling up. This can help manage computational costs while you refine your process.
