@anna.sofia
Here is what I would try if I had to detect all metaphors in a huge text corpus, and my basic fine-tuned metaphor classifier wasn’t working.
Motivation
Limit usage of higher-end AI models such as GPT-4 because of their high cost. Only send new, unseen data to them.
Metaphor Example
“Life is like a box of chocolates.”
Logical definition of metaphor
(A is like B) and (A is not B)
Problem with initial approach
Sending (A is like B) to a binary classifier (‘0’ = not a metaphor, ‘1’ = metaphor) will not work, for a variety of reasons. Namely, with A and B varying over all possible phrases and words, (A is like B) takes up a large chunk of the underlying embedding space. This appears as noise to the AI, especially without a TON of training data.
Two step filtering strategy with embeddings (Step 1)
The first step is simple. Have a database of all known metaphors. You could start with 200 or 2000 or whatever. Just embed these and store each metaphor, along with its embedding vector, in a database.
As you parse the large corpus, embed each sentence and compare its embedding vector against the set of known metaphor embeddings. If the embedding is close to a known metaphor, there is no need to send the sentence to GPT-4; just declare it a metaphor.
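Here is a rough sketch of that first-pass filter in Python (the pre-1.0 openai SDK, the embedding model, the helper names, and the 0.9 threshold are just my assumptions; swap in whatever you use):

```python
import numpy as np
import openai  # pre-1.0 SDK assumed; adapt the calls if you're on a newer client

def embed(text: str) -> np.ndarray:
    """Embed a sentence; the model name here is just my pick."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Seed database: known metaphors plus their embedding vectors.
known_metaphors = [
    "Life is like a box of chocolates.",
    "Love, like a brilliant beacon, illuminates the darkest corners of our hearts.",
]
known_vectors = [embed(m) for m in known_metaphors]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_known_metaphor(sentence: str, threshold: float = 0.9) -> bool:
    """Cheap first-pass filter: is this sentence close to any known
    metaphor? The 0.9 threshold is a guess -- tune it on your own data."""
    v = embed(sentence)
    return max(cosine(v, k) for k in known_vectors) >= threshold
```

Any sentence that clears the threshold is classified for free; everything else falls through to Step 2.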
Two step filtering strategy with embeddings (Step 2)
Since we cannot get a reliable metaphor classification out of (A is like B), what if we tried to extract all “metaphoric verbs” and created a filter based on those? Then we could send sentences that resonate with these verbs to a GPT-4 engine for confirmation of the presence of a metaphor.
Example training data for a fine-tune to extract “metaphoric verbs”:
“Life is like a box of chocolates.” → " is like"
“Love, like a brilliant beacon, illuminates the darkest corners of our hearts.” → " like a"
Use Ada or Babbage to lower both cost and latency. Maybe go up to Curie if you need to bootstrap with less training data; then over time, as your training data grows, train smaller and smaller models on the larger set.
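For reference, the old fine-tune flow takes prompt/completion pairs in JSONL; something like this (the “###” separator and “END” stop token are conventions I chose, not requirements):

```python
import json

# Hypothetical prompt/completion pairs for the extractor fine-tune.
examples = [
    {"prompt": "Life is like a box of chocolates.\n\n###\n\n",
     "completion": " is like END"},
    {"prompt": "Love, like a brilliant beacon, illuminates the darkest "
               "corners of our hearts.\n\n###\n\n",
     "completion": " like a END"},
]

with open("metaphoric_verbs.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then, roughly: openai api fine_tunes.create -t metaphoric_verbs.jsonl -m ada
```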
Don’t forget that Regex is your friend!
You could also, in parallel, search for these using a simple substring routine: scan for the metaphoric verbs directly with substring matching, and look at the output of the classifier alongside it.
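A minimal sketch of that parallel path (the cue patterns below are just the two from the training examples above; extend the list with whatever your extractor surfaces):

```python
import re

# Illustrative metaphoric-verb cues -- grow this list over time.
CUE_PATTERNS = [
    r"\bis like\b",
    r"\blike a\b",
]
cue_re = re.compile("|".join(CUE_PATTERNS), re.IGNORECASE)

def has_metaphoric_cue(sentence: str) -> bool:
    """Cheap parallel path: regex/substring hit on a known cue phrase."""
    return cue_re.search(sentence) is not None
```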
Filter decision time
For the classifier, embed the output, so embed the “is like”. If this correlates closely to the list of known “metaphoric verbs”, then send it to GPT-4 for metaphor classification, but only if it is not already correlated (embedding-wise) to a currently known metaphor. Essentially you are creating resonant nodes for your filter using the metaphoric verbs, and the comparison is through embeddings. This training might have a shot, since the logic is to extract these verbs. I’d be curious to see if it works!
For the substring search, or “regex”, send these to GPT-4 as well, but only if they are not already correlated (embedding-wise) to a currently known metaphor.
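Putting the decision together, assuming the helpers from the earlier sketches (the fine-tune model id and the 0.85 threshold are placeholders):

```python
# Verb filter: embeddings of the currently known "metaphoric verbs".
metaphoric_verbs = [" is like", " like a"]
verb_vectors = [embed(v) for v in metaphoric_verbs]

def extract_metaphoric_verb(sentence: str) -> str:
    """Call the fine-tuned extractor; the model id is a placeholder."""
    resp = openai.Completion.create(
        model="ada:ft-your-org-2023-01-01",  # placeholder fine-tune id
        prompt=sentence + "\n\n###\n\n",
        max_tokens=5,
        stop=[" END"],
    )
    return resp["choices"][0]["text"]

def should_send_to_gpt4(sentence: str) -> bool:
    """Combine both paths; only novel candidates cost GPT-4 money."""
    # Already resonates with a known metaphor? Classified for free.
    if is_known_metaphor(sentence):
        return False
    # Classifier path: does the extracted phrase resonate with a known
    # metaphoric verb? 0.85 is a guess -- tune it on your data.
    phrase = extract_metaphoric_verb(sentence)
    if phrase and max(cosine(embed(phrase), v) for v in verb_vectors) >= 0.85:
        return True
    # Regex/substring path.
    return has_metaphoric_cue(sentence)
```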
Lifecycle
Now that you have more metaphors, and if you agree that these newly detected metaphors are real, add them to your metaphor database. Why? It strengthens your filter and avoids downstream use of GPT-4, saving you $$$.
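The last, expensive stage plus the database update might look like this (the prompt wording and helper names are mine; as noted above, review candidates before committing them):

```python
def gpt4_confirms_metaphor(sentence: str) -> bool:
    """Ask GPT-4 yes/no; only reached by sentences that pass the filters."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f'Answer only yes or no: is this sentence '
                              f'a metaphor? "{sentence}"'}],
    )
    answer = resp["choices"][0]["message"]["content"].strip().lower()
    return answer.startswith("yes")

def process(sentence: str) -> None:
    """Full lifecycle: filter, confirm, then grow the database so this
    pattern never reaches GPT-4 again."""
    if should_send_to_gpt4(sentence) and gpt4_confirms_metaphor(sentence):
        # (In practice, review before committing to the database.)
        known_metaphors.append(sentence)
        known_vectors.append(embed(sentence))
```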
Revisit
Now that you have this HUGE set of metaphors, try fine-tuning a model to detect them, as you did initially. Does it work now? Can you get any model trained to the point where it works? I suspect it’s possible with a lot of training data. But now, with this larger set, you are teaching the model to recognize the metaphoric verbs. Is the cost worth it?
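If you do revisit, the old fine-tune classification format would look roughly like this (the negatives list, separator, and CLI flags are my assumptions):

```python
import json

# Binary metaphor classifier in the old fine-tune format, now trained on
# the much larger, self-collected metaphor database. Negatives (" 0")
# would come from non-metaphor sentences in your corpus.
non_metaphors = ["The meeting starts at noon."]  # hypothetical negatives

with open("metaphor_classifier.jsonl", "w") as f:
    for sentence in known_metaphors:
        f.write(json.dumps({"prompt": sentence + "\n\n###\n\n",
                            "completion": " 1"}) + "\n")
    for sentence in non_metaphors:
        f.write(json.dumps({"prompt": sentence + "\n\n###\n\n",
                            "completion": " 0"}) + "\n")

# Then something like:
# openai api fine_tunes.create -t metaphor_classifier.jsonl -m ada \
#     --compute_classification_metrics --classification_positive_class " 1"
```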