Tips for searching large text files via API?

I have been evolving a meeting minutes tool based on the OpenAI tutorial, using “gpt-4-1106-preview” via the API.

I’ve tailored the tool to provide minutes from company earnings call transcripts. I’ve been able to get pretty consistent results, with one exception - searching the transcript for specific terms. It often misses certain terms. Yet if I make the same query through the ChatGPT4 web version, I consistently get good results.

As a specific example, take the Union Pacific Q3 earnings call transcript, which is 75KB. I ask GPT to provide a summary, and then ask it to search for a handful of specific terms:
Locomotives
Train Length
Remote Control
Distributed Power
Energy Management

The first 3 are mentioned deep in the transcript; the last 2 are not. GPT via the API often misses “Remote Control”. Via the web, ChatGPT4 consistently finds all of the first 3.
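
As a sanity check on which terms actually appear, a plain case-insensitive text search is enough to establish ground truth before comparing model outputs. This is only a minimal sketch: `find_terms` is a hypothetical helper, not part of the tool, and the sample text is a few abbreviated excerpts from the call standing in for the full transcript.

```python
import re

def find_terms(transcript, terms, context=40):
    """Return {term: [snippets]} using a case-insensitive match."""
    hits = {}
    for term in terms:
        # Allow any run of whitespace between words so line wraps don't hide a match.
        pattern = r"\s+".join(re.escape(word) for word in term.split())
        snippets = []
        for m in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            start = max(0, m.start() - context)
            end = min(len(transcript), m.end() + context)
            snippets.append(transcript[start:end].replace("\n", " "))
        hits[term] = snippets
    return hits

# Abbreviated excerpts only; in practice this would be the full transcript text.
sample = (
    "So, you know, we've got over 500 locomotives parked, and those 500 are "
    "ready-to-go locomotives.\n"
    "Train length improved 1% compared to third quarter 2022.\n"
    "When I think about remote\ncontrol locomotives, that's an opportunity for us."
)
terms = ["Locomotives", "Train Length", "Remote Control",
         "Distributed Power", "Energy Management"]

found = find_terms(sample, terms)
for term in terms:
    status = f"{len(found[term])} hit(s)" if found[term] else "not mentioned"
    print(f"{term}: {status}")
```

The snippets returned for each hit could also be compared against the quotes the model reports, to see whether it is quoting the passage where the term really occurs.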

Here are the prompts I’m using with the API:

    # Initialize with transcription
    content = "You are a helpful and highly skilled AI trained in language comprehension and summarization. Follows is a transcription of a business meeting.  I will be asking you questions about this meeting. For each request, go back and evaluate the full transcript carefully before responding. Do not reply to this initial message - wait for further instructions.  Here is the transcription: "
    content += transcription
    message_history.append({"role": "system", "content": content})
    response = gpt_minutes(message_history)
    print(response)

    header = "\n --> Summary ***\n"
    message_history.append({"role": "user", "content": "1 - Summary:  Please read the transcription provided above of a meeting and summarize it into a concise abstract paragraphs. Aim to retain the most important points, providing a coherent and readable summary that could help a person understand the main points of the discussion without needing to read the entire text. Please avoid unnecessary details or tangential points. The meeting may be a company earnings call with analysts. In this case, provide 2 summary paragraphs - a paragraph summarizing the company officers report in the first part of the call, and a 2nd paragraph summarizing the question and answer session with analysts in the latter part of the call.  Title this response --->SUMMARY<---"})
    response = gpt_minutes(message_history)
    print(response)
    minutes = header
    minutes += response

    header = "\n\n --> Items Of Interest ***\n"
    with open('ioi.txt', 'r') as file:  # Read items of interest from file
        ioi = file.read()
    ioi = ioi.replace('\n', ', ').strip(', ')  # Replace line breaks with commas
    content = "2 - Items of Interest:  You are a helpful and very talented AI trained to search and analyze meeting transcripts.  Carefully search the transcript provided for the following terms.  For each of the terms found, do 2 things: 1) repeat the quote where the term was used, and 2)explain your expert interpretation of what was meant by the discussion about that term.  If a term is not mentioned, only state that it is not mentioned - do not provide any commentary or interpretation of that term.  When complete, review the transcription again to ensure none of the specified terms were missed.  Provide the output in an organized way. Title this response --->ITEMS OF INTEREST<--- Here is the list of terms: "
    content += ioi
    message_history.append({"role": "user", "content": content})
    response = gpt_minutes(message_history)
    print(response)
    minutes += header
    minutes += response
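
The `gpt_minutes` helper is not shown above. A minimal sketch of what such a helper could look like follows, assuming the `openai` Python package (v1 client) and an `OPENAI_API_KEY` in the environment. The optional `client` parameter, the `temperature=0` setting, and appending the assistant reply to the history are assumptions for illustration, not necessarily what the tool does. The stub client at the end exists only so the sketch can be run offline.

```python
from types import SimpleNamespace

def gpt_minutes(message_history, client=None, model="gpt-4-1106-preview"):
    """Send the running conversation to the chat completions endpoint; return the reply text."""
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from openai import OpenAI  # assumes openai>=1.0
        client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=message_history,
        temperature=0,  # assumption: reduce randomness for repeatable term searches
    )
    reply = response.choices[0].message.content
    # Assumption: keep the assistant turn so later requests see the whole exchange.
    message_history.append({"role": "assistant", "content": reply})
    return reply

# Offline check with a stand-in client (hypothetical, no network call is made).
class _StubCompletions:
    def create(self, model, messages, temperature):
        text = f"stub reply to {len(messages)} message(s)"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

stub_client = SimpleNamespace(chat=SimpleNamespace(completions=_StubCompletions()))
message_history = [{"role": "system", "content": "Here is the transcription: ..."}]
reply = gpt_minutes(message_history, client=stub_client)
print(reply)
```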

Here is the API output for the Items of Interest:

---ITEMS OF INTEREST---

1) Locomotives:
Quote: "So when I look at Union Pacific, what do I look at? At a high level, I look at do we have the physical plant to be able to handle the traffic and be able to handle the ups and downs that every railroader knows happens with weather, like who thought we were ever going to get a hurricane in the West Coast. Okay, that's always an eastern seaboard issue more and a gulf issue but not a western. But I think we did, as a team, we did a great job of recovering. So you need a strong network, and we have the capacity there. We'll continue to invest to make sure. So that's important to me. And we have to make sure that we have a buffer of people and assets so that we're ready for the ups and downs of the business that happens. You know, if you add an extra intermodal train or you add an extra thousand feet on one of our locomotive trains, that's an easier fix."
Interpretation: The discussion about locomotives reflects the company's focus on ensuring they have the necessary assets to manage fluctuations in business volumes and unexpected events, such as weather disruptions. The mention of a "buffer of people and assets" indicates a strategy to maintain a reserve of operational resources, including locomotives, to quickly respond to changes in demand or operational challenges.

2) Train Length:
Quote: "Train length improved 1% compared to third quarter 2022, despite lower volumes in our intermodal business. By putting more product on fewer trains, we have increased train length across our system by over 500 feet, or 6%, since January of this year."
Interpretation: The discussion about train length indicates that Union Pacific is focused on increasing the efficiency of its operations by consolidating more freight onto longer trains. This strategy helps to improve asset utilization and reduce the number of trains needed, which can lead to cost savings and operational improvements.

3) Remote Control:
The term "Remote Control" is not mentioned in the transcript.

4) Distributed Power:
The term "Distributed Power" is not mentioned in the transcript.

5) Energy Management:
The term "Energy Management" is not mentioned in the transcript.

Here is the output from the web ChatGPT4, after uploading the entire 75KB transcript and using the same prompts as above:

1. Locomotives:

Quote: "So, you know, we've got over 500 locomotives parked, and those 500 are ready-to-go locomotives. So we can turn them on in a short period of time."
Interpretation: This statement indicates that Union Pacific has a substantial number of locomotives in reserve, ready to be deployed as needed. This suggests a strategy of flexible resource management, allowing the company to rapidly adjust to changes in demand or operational conditions.

2. Train Length:

Quote: "Train length improved 1% compared to third quarter 2022, despite lower volumes in our intermodal business. By putting more product on fewer trains, we have increased train length across our system by over 500 feet, or 6%, since January of this year."
Interpretation: This improvement in train length demonstrates Union Pacific's efficiency enhancements, where they manage to transport more goods per train. This strategy not only optimizes asset utilization but also could lead to cost reductions and better environmental outcomes.

3. Remote Control:

Quote: "When I think about remote control locomotives and being able to reduce some of our gain productivity and some of that, that's an opportunity for us."
Interpretation: The mention of remote control locomotives indicates Union Pacific's interest in modernizing their operations, potentially leading to increased productivity. This technology could offer new ways to manage locomotives, potentially enhancing safety and operational efficiency.

4. Distributed Power:

Not mentioned in the transcript.

5. Energy Management:

Not mentioned in the transcript.

The above interpretations reflect Union Pacific's focus on operational efficiency, modernization, and resource optimization, key aspects of their strategy to maintain competitiveness and adaptability in the dynamic transportation industry.

So ChatGPT4 got it right - remote control is indeed mentioned in the transcript. I’ve been dabbling with this for a few weeks, trying different prompts, re-ordering the terms, etc. The API still misses the term ‘remote control’ most of the time (roughly 80% of runs), while the web ChatGPT4 consistently gets it right (100%).

Any suggestions on how to get better results from the API?


https://community.openai.com/t/tips-for-searching-large-text-files-via-api/572061gcp
Thanks for sharing this post. It is really helpful for me. I will use these tips in my upcoming project.