Prompting with the chat/completions API against a large transcript file

(Long post and long question ahead, sorry!)

Here’s a (chat/completions API) question I hope someone here may be able to help with…

I have a large transcript file that looks something like the text below, except with many more lines (over 1,000 in total). Each line contains a speaker’s identity, time-stamps, and the phrase spoken (all separated by ‘|’ characters). Each transcript line is separated by a line-feed (“\n”).

[Watson Pritchard (Elisha Cook)|48.72|50.439]Their ghosts are moving tonight.
[Watson Pritchard (Elisha Cook)|50.45|52.529]Restless, hungry.
[Watson Pritchard (Elisha Cook)|53.99|55.4]May I introduce myself?
[Watson Pritchard (Elisha Cook)|56.38|57.529]I’m Watson Pritchard.
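
(For reference, each line can be pulled apart with a small bit of Python; this is just a minimal sketch, assuming the bracketed header is always speaker|start|end followed by the spoken text:)

    import re

    # Assumes every line follows the pattern "[Speaker (Actor)|start|end]Spoken text"
    LINE_RE = re.compile(r"^\[(?P<speaker>[^|]+)\|(?P<start>[\d.]+)\|(?P<end>[\d.]+)\](?P<text>.*)$")

    def parse_transcript(raw: str) -> list[dict]:
        """Turn the raw transcript into one dict per spoken line."""
        entries = []
        for line in raw.splitlines():
            match = LINE_RE.match(line.strip())
            if match:
                entries.append({
                    "speaker": match.group("speaker"),
                    "start": float(match.group("start")),
                    "end": float(match.group("end")),
                    "text": match.group("text").strip(),
                })
        return entries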

I’ve been using the OpenAI chat/completions API to prompt against the transcript in plain English (natural language), e.g. “Show me all the lines that mention ghosts.”

What I’ve done is break the large transcript up into smaller chunks (e.g. 120-line files) and then prompt the AI sequentially, one transcript file at a time. I pass each transcript file’s text in as part of the API call via the ‘role’ => ‘user’ => ‘content’ message.

This seems to work well; I provide other instructions with the API call via system messages, instructing the model to return precisely the lines it finds and only that data, formatted exactly as originally written, etc.
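
Roughly, each call looks something like the sketch below (assuming the current openai Python SDK; the model name and instruction text are just stand-ins for what I actually send):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def query_chunk(chunk_text: str, user_question: str) -> str:
        """Ask the chat model about a single chunk of the transcript."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder for whichever chat model is used
            messages=[
                {"role": "system", "content": (
                    "You are given part of a transcript. Return ONLY the transcript "
                    "lines that match the user's request, formatted exactly as written."
                )},
                {"role": "user", "content": f"Transcript chunk:\n{chunk_text}\n\nRequest: {user_question}"},
            ],
        )
        return response.choices[0].message.content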

The issue is a) the method I describe here is expensive, and I’m running up against token limits even when chunking the transcript into smaller files; and b) prompting with only one chunk of the transcript at a time doesn’t really allow the model to see the whole transcript in context all at once.

Questions:
Is this an improper use of the OpenAI API?

If this is within the API’s use cases, should I make the transcript text file into an embedding first? Would I be able to “embed” the entire transcript, or would I still have to chunk it up again?

I tried an experiment with just a single line of the transcript: I sent it to the embedding API, and the vector that came back for that one line had over 200 elements. If I embed the whole transcript in one go (is that even possible? i.e. are there limits on the input to an embedding?), I’d get back a VERY large data set, possibly tens of thousands of values.

Is embedding even the right approach?

(Conventional DB queries BTW won’t work here, as I want the user to be able to ask questions about the transcript using natural language.)

Long story short: any pointers on “best practices” for approaching the task I describe above?

THANK YOU IN ADVANCE!

Hi and welcome to the Developer Forum!

The task you describe could be handled in two ways; which is better depends on what you want out of it.

The first is a traditional search system that looks for keywords in your data and extracts the lines that contain them; if that will suffice, you can implement it quite simply.
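
For example, a plain keyword filter over the transcript lines is only a few lines of Python (a sketch, no AI involved):

    def keyword_search(lines: list[str], keywords: list[str]) -> list[str]:
        """Return the transcript lines that contain any of the keywords (case-insensitive)."""
        lowered = [k.lower() for k in keywords]
        return [line for line in lines if any(k in line.lower() for k in lowered)]

    # e.g. keyword_search(transcript_lines, ["ghost", "ghosts"])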

If that is NOT enough, or you want to add a semantic ability to your search, i.e. not just a word match but a similarity of ideas (ghosts would possibly match ghost-adjacent topics, ghouls and goblins perhaps), then you could embed your data into a vector database and perform a similarity search based on your user’s search term.

You could then augment that with a GPT model, by including the results from the former with the user’s question and then letting the AI build a response using the search results as context.


Thank you for the welcome, and also thank you for the response.

embed your data into a vector database and perform a similarity
search based on your users search term.

If I understand correctly (apologies if not), what you’re suggesting is a combination of conventional data querying (via a DB) and the ChatGPT API? Or do you mean somehow using the API to “embed” my transcript data and then using OpenAI to prompt against that?

I have considered augmenting with more conventional database searches for simple word matches, but I want the user to be able to use natural language to prompt/query the transcripts. I’m not certain how I could extract the actual intent (e.g. the words the user really wants to query on) from a natural language prompt and then turn that into a conventional SQL query.

Regardless, the issue is exactly as you describe: what if the user searches for something more abstract, e.g. an emotion or “sentiment”, i.e. “Show me scenes where one character showed compassion for another”? That’s where the LLM excels; it seems to be able to abstract the user’s meaning and intent and find the matching transcript lines.

The issues then are how to a) give the model the whole transcript instead of pieces at a time; and b) whether there’s a way to pseudo “pre-train” the model to search my dataset (which is where I thought “embeddings” might be able to help?).

Again, grateful for the discussion and suggestions, and hoping there’s a solution that gets the user experience and results I want, better than what I’m doing now (which, again, is feeding parts of the transcript to the model via the chat/completions API, one chunk at a time, against the user’s prompt).

So, OpenAI has an embedding model called text-embedding-ada-002 that can turn a string of text into a “vector” which captures the semantic meaning of the text in question. That vector is then stored in a database, much as you would store any normal text.
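
A minimal sketch of getting an embedding, assuming the openai Python SDK (the helper name is just an example):

    from openai import OpenAI

    client = OpenAI()

    def embed(text: str) -> list[float]:
        """Return the embedding vector for a piece of text (ada-002 returns 1536 floats)."""
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text,
        )
        return response.data[0].embedding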

Once you have split your source material into useful-sized chunks (an entire topic in itself) and stored the vectorised versions of them in the database, you can take the user’s query and either vectorise it directly, or ask the AI to create a search term that better represents what the user “meant” to say (this removes the user from directly influencing the results). You then query the database for entries that are semantically similar to the user’s query. In some cases that can be enough: you simply return that list of vectors and the plain text that created them, and you’re done. But you can also go a step further and include the results as part of a new prompt for a GPT model, together with the user’s original query and a request to use the search results as context to answer the question.
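
As a rough end-to-end sketch (assuming the embed helper and openai client from above, and a plain in-memory list of (vector, text) pairs standing in for a real vector database):

    import numpy as np

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query: str, store: list[tuple[list[float], str]], top_k: int = 5) -> list[str]:
        """Return the top_k stored chunks most similar to the query."""
        q_vec = embed(query)
        ranked = sorted(store, key=lambda item: cosine_similarity(q_vec, item[0]), reverse=True)
        return [text for _vec, text in ranked[:top_k]]

    def answer(query: str, store: list[tuple[list[float], str]]) -> str:
        """Augment a chat prompt with the retrieved chunks and let the model answer."""
        context = "\n".join(retrieve(query, store))
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder
            messages=[
                {"role": "system", "content": "Answer using only the provided transcript context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        )
        return response.choices[0].message.content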


I’m curious about this topic too. I’m working with encyclopedia articles specifically.

I really like this structure, BTW. I have an endpoint that will return me titles and descriptions (and sometimes topics) based off a keyword search. It kind of gets me to the starting point you are at @joro728. If I were to mimic your structure for the data I’m working with, it would probably look like:

[Acting (Citizendium)|10|10]Acting (to act): in the theatre is understood as the portrayal of the physical, emotional and mental complexities of a given character. While this seems a straight forward description, it is not at all clear that all theatre practitioners understand the exclusive nature of the term
[Voice Acting (Wikiversity)|10|10]—Vocal Expression Voice acting is the art of performing voice-overs, providing voices to represent a character, or to provide information to an audience or user. Examples include animated, off-stage, off-screen or non-visible characters in various works, including feature films, dubbed foreign language films, animated short films, television progra...
... etc.

One difference is that my chunks are basically each line. And for my use case, I want the model to return a subset that is relevant to the user’s question. The reason I want to use a GPT model is because there are a lot of similar and possibly duplicate results, which would require a lot of rule-based logic to sift through, and it seems like the GPT model could make some pretty good initial decisions about relevancy.

So, if I were to break down what you’ve described, @Foxalabs, into a set of steps:

  1. I would decide what would make a chunk useful to my problem and get an embedding vector for the chunk. You mentioned having each chunk be somewhat of a singular topic, so maybe I would just vectorize the description.
  2. Using a database that’s specialized for storing vectors (I’ve heard the name Pinecone thrown around a lot) as keys, I’m assuming, and my text as the value?
  3. Perform the same process for the user’s query by getting an embedding vector for it.

    (optional) Give the GPT model a prompt like:

    f"Rewrite the query provided by the user to optimize it for searching using embeddings. The user provided the query: ```{u_query}```"
    

    and get an embedding vector for what the model generates—I’m assuming the two aren’t mutually exclusive?

  4. Something like Pinecone will then accept the embedding vectors and perform (insert magic word here) cosine similarity calculations across the database of embedding vectors, like a key lookup?

    (optional) Give the GPT model a prompt like:

    f"The user provided the query: ```{u_query}```. " + \
    (f"After optimizing the query, it was decided that the focus of the inquiry would be: ```{a_query}````. " if USE_AGENT_OPTIMIZATION else "") + \
    f"Additionally, you've been given some readily available information, delineated with triple hyphens, in order to better assist the user. Answer the user's query, making sure to first consider the information you've already been provided." + \
    f"---" + \
    f"{query_result}" + \
    f"---"
    

Then, having performed these steps, the response I get from the model will be one where the model has considered the context of all chunks in the database? I understand the model is unaware of the embeddings in this case. So, the “consideration” being performed is a combination of 1) identifying the chunks with vectors that are the most similar to the vector created from the user’s input and 2) injecting copies of the subset of chunks alongside the user’s original input?

Is this a correct description of the process?
Also, is the hard part of this approach the “decide what would make a chunk useful to my problem” part?

  1. I would decide what would make a chunk useful to my problem and get an embedding vector for the chunk. You mentioned having each chunk be somewhat of a singular topic, so maybe I would just vectorize the description.

Totally depends on your application. Let’s say you were chunking a PDF: it would create a better embedding dataset if you had chunk overlap, i.e. some of the prior chunk and some of the next chunk included with the current chunk. This allows for semantic meaning to carry across chunk boundaries.
But if your dataset is compartmentalised then there is no need to do this. There are also situations where you might want to include a meta header with each chunk to allow for greater understanding of external and otherwise disconnected information. This could be as simple as a page number or as complex as a book’s plot/story arc.
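
A simple overlapping chunker, purely as a sketch (the sizes and the meta header are arbitrary examples):

    def chunk_with_overlap(lines: list[str], chunk_size: int = 40, overlap: int = 10) -> list[str]:
        """Split a list of lines into chunks that share `overlap` lines with their neighbours."""
        chunks = []
        step = chunk_size - overlap
        for start in range(0, len(lines), step):
            chunk = lines[start:start + chunk_size]
            if chunk:
                # Optional meta header so each chunk carries some external context.
                header = f"[transcript lines {start + 1}-{start + len(chunk)}]"
                chunks.append(header + "\n" + "\n".join(chunk))
            if start + chunk_size >= len(lines):
                break
        return chunks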

  2. Using a database that’s specialized for storing vectors (I’ve heard the name Pinecone thrown around a lot) as keys, I’m assuming, and my text as the value?

Indeed, there are many providers of vector storage. Pinecone is a commercial example, and there are open-source alternatives like ChromaDB and Weaviate. These databases vary in implementation, but they all share the same base features of storage and retrieval, and typically some mechanism to link the vector embedding to the object that created it (in this case, text).

The databases actually store the vector as a “key” value, and with some mathematics the vectors can be compared to a search value; you can then request the top K results, like a top 10, but the term is Top_K.
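
For example, with ChromaDB the storage and Top_K retrieval might look roughly like this (a sketch only; it assumes an embed helper and a list of text chunks from earlier, and the collection name is made up):

    import chromadb

    chroma_client = chromadb.Client()  # in-memory; use a persistent client for real data
    collection = chroma_client.create_collection(name="transcript")

    # Store each chunk alongside its embedding; ids just need to be unique strings.
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c) for c in chunks],
    )

    # Ask for the Top_K (here, top 10) entries most similar to the user's query.
    results = collection.query(query_embeddings=[embed(user_query)], n_results=10)
    top_texts = results["documents"][0]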

  3. Perform the same process for the user’s query by getting an embedding vector for it. (optional) Give the GPT model a prompt like:
f"Rewrite the query provided by the user to optimize it for searching using embeddings. The user provided the query: ```{u_query}```"

and get an embedding vector for what the model generates—I’m assuming the two aren’t mutually exclusive?

Correct on all counts.
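
A sketch of that optional rewrite step (the prompt wording and model name are only examples; it assumes an embed helper and an openai client as above):

    def rewrite_query(user_query: str) -> str:
        """Ask the model for a cleaner, self-contained search phrase before embedding it."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder
            messages=[
                {"role": "system", "content": "Rewrite the user's query as a short, self-contained search phrase."},
                {"role": "user", "content": user_query},
            ],
        )
        return response.choices[0].message.content

    search_vector = embed(rewrite_query("show me the spooky parts"))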

Something like Pinecone will then accept the embedding vectors and perform (insert magic word here) cosine similarity calculations across the database of embedding vectors, like a key lookup?

(optional) Give the GPT model a prompt like:

f"The user provided the query: ```{u_query}```. " + \
(f"After optimizing the query, it was decided that the focus of the inquiry would be: ```{a_query}````. " if USE_AGENT_OPTIMIZATION else "") + \
f"Additionally, you've been given some readily available information, delineated with triple hyphens, in order to better assist the user. Answer the user's query, making sure to first consider the information you've already been provided." + \
f"---" + \
f"{query_result}" + \
f"---"

Correct. It’s still basically magic, even to those who’ve studied it for years.

Then, having performed these steps, the response I get from the model will be one where the model has considered the context of all chunks in the database? I understand the model is unaware of the embeddings in this case. So, the “consideration” being performed is a combination of 1) identifying the chunks with vectors that are the most similar to the vector created from the user’s input and 2) injecting copies of the subset of chunks alongside the user’s original input?

Is this a correct description of the process?
Also, is the hard part of this approach the “decide what would make a chunk useful to my problem” part?

Essentially you tell the model that the vector-retrieved content is exactly that; it will understand what it is and why you would do it. You can then further reinforce that by placing the search context in ### markers and formally telling the model: “Given the above vector embedding retrieval context in ### markers please answer the following query step by step {user_query}”
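
In code, assembling that final prompt could look something like this sketch (top_texts and user_query are assumed to come from the retrieval step):

    context_block = "\n".join(top_texts)  # the chunks returned by the similarity search

    final_prompt = (
        f"###\n{context_block}\n###\n"
        "Given the above vector embedding retrieval context in ### markers, "
        f"please answer the following query step by step: {user_query}"
    )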

Deciding on your chunking can be super simple if your data is regular and formal/formatted; free-form human input can get interesting and may require additional AI passes to turn unstructured data into structured data.
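
If you do need that extra pass, it can be as simple as a structuring prompt like this sketch (the target format here is arbitrary):

    structuring_prompt = (
        "Convert the following raw text into one line per utterance, formatted as\n"
        "[speaker|start_seconds|end_seconds]text\n"
        "Use 'unknown' for any field that cannot be determined.\n\n"
        f"{raw_text}"  # raw_text is assumed to hold the unstructured input
    )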
