(Long post and long question ahead, sorry!)
Here’s a (chat/completions API) question I hope someone here may be able to help with…
I have a large transcript file that looks something like the text below, except much longer: over 1,000 lines. Each line contains a speaker’s identity, start/end timestamps, and the phrase spoken (all separated by ‘|’ characters), and transcript lines are separated by line feeds (“\n”).
[Watson Pritchard (Elisha Cook)|48.72|50.439]Their ghosts are moving tonight.
[Watson Pritchard (Elisha Cook)|50.45|52.529]Restless, hungry.
[Watson Pritchard (Elisha Cook)|53.99|55.4]May I introduce myself?
[Watson Pritchard (Elisha Cook)|56.38|57.529]I’m Watson Pritchard.
I’ve been using the OpenAI chat/completions API to query the transcript in plain-English natural language, e.g. “Show me all the lines that mention ghosts.”
What I’ve done is break the large transcript up into smaller chunks (e.g. 120-line files) and then prompt the model sequentially, one chunk at a time. I pass each chunk’s text into the API call via the ‘role’ => ‘user’ => ‘content’ message.
This seems to work well; I provide additional instructions with the API call via system messages, telling the model to return precisely the lines it finds and only that data, formatted exactly as originally written, etc.
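Concretely, each per-chunk call looks roughly like the sketch below (PHP/cURL; the model name and system prompt here are illustrative placeholders, not exactly what I use):

```php
<?php
// Simplified sketch of one per-chunk request to v1/chat/completions.
// $chunk holds ~120 transcript lines; $question is the user's query.
$apiKey = getenv('OPENAI_API_KEY');

$payload = json_encode([
    'model'    => 'gpt-3.5-turbo', // illustrative; any chat model works
    'messages' => [
        ['role'    => 'system',
         'content' => 'Return only the matching transcript lines, ' .
                      'formatted exactly as originally written.'],
        ['role'    => 'user',
         'content' => $question . "\n\nTranscript:\n" . $chunk],
    ],
]);

$ch = curl_init('https://api.openai.com/v1/chat/completions');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json',
                               'Authorization: Bearer ' . $apiKey],
    CURLOPT_POSTFIELDS     => $payload,
]);

$result = json_decode(curl_exec($ch), true);
curl_close($ch);
$answer = $result['choices'][0]['message']['content'];
```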
The issues are that (a) the method described here is expensive, and I’m running up against token limits even when chunking the transcript into smaller files; and (b) prompting with only one chunk of the transcript at a time never lets the model see the whole transcript in context at once.
Questions:
Is this an improper use of the OpenAI API?
If this is within the API’s use cases, should I turn the transcript text into embeddings first? Would I be able to “embed” the entire transcript, or would I still have to chunk it up?
I ran an experiment with just a single line of the transcript, sending it to the embeddings API, and the vector that came back for that one line had over 200 elements. If I embed the whole transcript as one input (is that even possible? i.e., are there limits on input to an embedding?), I’d get back a VERY large data set, possibly tens of thousands of values.
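For reference, the single-line experiment looked roughly like this (sketch; the model name is just an example). One detail I noticed: the API returns one fixed-length vector per input string, with the length set by the model rather than by the input text.

```php
<?php
// Sketch of embedding a single transcript line via v1/embeddings.
// The response contains one vector per input string; the vector's
// length is fixed by the model, not by how long the input text is.
$apiKey = getenv('OPENAI_API_KEY');

$payload = json_encode([
    'model' => 'text-embedding-ada-002', // illustrative model choice
    'input' => '[Watson Pritchard (Elisha Cook)|48.72|50.439]Their ghosts are moving tonight.',
]);

$ch = curl_init('https://api.openai.com/v1/embeddings');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json',
                               'Authorization: Bearer ' . $apiKey],
    CURLOPT_POSTFIELDS     => $payload,
]);

$result = json_decode(curl_exec($ch), true);
curl_close($ch);
$vector = $result['data'][0]['embedding']; // array of floats
```

(The ‘input’ field also accepts an array of strings, so in principle every transcript line could be embedded in one request, subject to the endpoint’s token limits.)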
Is embedding even the right approach?
(Conventional DB queries BTW won’t work here, as I want the user to be able to ask questions about the transcript using natural language.)
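If embeddings are the right tool, my (possibly mistaken) understanding is that the user’s question would itself be embedded, and every stored line vector scored against it with something like cosine similarity. A sketch, assuming per-line vectors are already stored:

```php
<?php
// Sketch: rank stored per-line embeddings against an embedded query.
// $lineVectors is assumed to map transcript lines to their embedding
// arrays; $queryVector is the embedding of the user's question.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

$scores = [];
foreach ($lineVectors as $line => $vector) {
    $scores[$line] = cosineSimilarity($queryVector, $vector);
}
arsort($scores);                               // highest similarity first
$topLines = array_slice($scores, 0, 10, true); // e.g. top 10 candidate lines
```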
Long story short: any pointers on “best practices” for approaching the task described above?
THANK YOU IN ADVANCE!