Keyword search in Word documents fails

Hi Folks,

I want to upload multiple transcripts as Word documents into a CustomGPT and have the files searched for keywords, with the sentences containing the keywords returned to me. However, ChatGPT never gives me all the sentences, only a selection. What could be causing this error?

Here is my prompt:

I want to upload several UX test transcripts in text format and search them for specific keywords.

  1. The document is structured so that the moderator’s questions are in bold and the participants’ responses are in blue.
  2. I will provide the keywords in quotation marks.
  3. Please go through the documents line by line.
  4. Find the sentences in which these keywords appear in all uploaded documents.
  5. Return these sentences in full, structured sentence by sentence.
  6. Also, return the sentence before and after.
  7. Clearly highlight the keywords (e.g., by using bold or capital letters).
  8. There should be no limit to the number of results.

Hi @Falsche9

Welcome :people_hugging: to the community!

I don’t have a sample of your Word file to test with, but you could try the following prompt:

Prompt

You are Transcript SearchGPT, and your primary role is to help users search through multiple UX test transcripts for specific keywords and return relevant context. Your task is to locate sentences that contain the provided keywords, along with the sentence immediately before and after. You will follow these specific guidelines to achieve your task:

  1. Input Structure:

    • The documents are structured such that the moderator’s questions are in bold and the participants’ responses are in blue.
    • The user will provide you with the keywords in quotation marks. Your job is to search for these keywords within the provided documents.
  2. Processing Documents:

    • Scan the entire document sentence by sentence to find all instances where the provided keywords appear.
    • For each occurrence of the keyword, return the full sentence containing the keyword.
    • Additionally, return the sentence immediately before and the sentence immediately after the one containing the keyword.
    • Ensure the output is structured clearly, with each sentence returned in sequence.
  3. Keyword Highlighting:

    • Clearly highlight each keyword when it appears in a sentence by making it bold.
  4. No Summarization or Truncation:

    • Do not summarize or skip any results. Every sentence that contains the keyword, as well as the surrounding context (before and after), must be returned in full, regardless of how many results there are.
  5. Output Formatting:

    • The output should be clear and easy to read. Ensure the sentences are separated and displayed in a structured manner.
    • You are responsible for returning large amounts of results, so if needed, return the data in multiple parts or chunks to avoid hitting any system limits.
  6. Completeness:

    • There is no limit to the number of sentences that can be returned. Ensure that all relevant sentences are included in your response.
    • If the documents are too large to process all at once, you may request the user to provide smaller sections or split the documents into parts for processing.

By adhering to these guidelines, you as Transcript SearchGPT will ensure the user receives a comprehensive set of results containing all sentences where the keywords appear, along with the appropriate context.

Hi @polepole,

thanks for your fast reply. Unfortunately, ChatGPT is still not returning the full list, but only an excerpt.

I thought it might be due to the amount of text GPT has to analyze (a total of 125 pages).

Sometimes it finds all occurrences in a transcript, and then sometimes only a few or none.

I conducted the following test: I know that the word “XY” appears 5 times in Transcript No. 2. When I ask GPT how often it appears, I get the response that the word is not present in Transcript No. 2, which is definitely wrong.

Tomorrow I will try your prompt, uploading and analyzing only one transcript.

To test this in our environment, could you show how the Word file is structured, using sample content for a hypothetical scenario similar to your file? Please do not share your original content, for privacy reasons.

Instead of describing the color or formatting of the text in the file, describe where the questions and answers are located; that will help your GPT more.

For example, each question is enclosed in double curly brackets:
{{Question}}

and each answer is enclosed in double square brackets:
[[answer]]

Sure, I can share the structure. Does this help?

It doesn’t sound like this is a good use case for a large language model.

You’d get much better results much more cheaply converting the document to some type of flat text file and just running grep on that.
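As a rough illustration of that suggestion, here is a minimal sketch in Python that flattens a .docx into plain-text lines and then does a grep-style whole-word search. It relies only on the standard library and on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The function names `docx_to_lines` and `grep_lines` are made up for this example, and text in headers, footnotes, or nested structures may not come out the way you expect, so treat it as a starting point rather than a finished tool.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_lines(path):
    """Return the text of each non-empty paragraph in the main document part."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            lines.append(text.strip())
    return lines


def grep_lines(lines, keyword):
    """Return (line_number, line) for every line containing the whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [(n, line) for n, line in enumerate(lines, 1) if pattern.search(line)]


# Hypothetical usage (the file name is a placeholder):
# for number, line in grep_lines(docx_to_lines("transcripts.docx"), "system"):
#     print(number, line)
```

Unlike the model, this returns every matching line every time, which makes it a useful baseline for checking what ChatGPT reports.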

Hi @Falsche9 !

I agree with @anon22939549! But assuming that’s a bit of a hassle, what has worked for me in the past when giving ChatGPT very long documents is the following addition to the prompt: “Iterate over the document one page at a time. For each page <INSERT_INSTRUCTION>, until the last page”.

This seems to “force” it to endure the length of the document. There are no guarantees, but give it a try.
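If you would rather control the page-by-page iteration yourself than trust the model to do it, one option is to split the text into chunks up front and submit each chunk with the same instruction. A minimal sketch, assuming the document is already available as a list of paragraph strings; the name `chunk_paragraphs` and the 4000-character default are arbitrary choices for illustration, not a recommended limit:

```python
def chunk_paragraphs(paragraphs, max_chars=4000):
    """Group paragraphs into newline-joined chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and size + len(para) > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 1  # +1 accounts for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk would then be pasted or sent together with the search instruction. How you submit it depends on your setup, so that part is left out here.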

Thanks for sharing @Falsche9

There were some challenges:

Different forms of a word, such as:

  • plurals,
  • prefixes and suffixes,
  • verb forms and tenses, etc.

For example, if we search for “mouse”, it could not find “mice”,
or if we ask for “category”, it could not show the verb form “categorize” or the plural form “categories”.
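Outside of the GPT, one way to handle word forms deterministically is to list the forms you care about and compile them into a single whole-word pattern. This is only a sketch: `build_variant_pattern` is a name invented for this example, and the list of forms has to be supplied by hand, because irregular forms such as “mice” cannot be derived from “mouse” by adding a suffix.

```python
import re


def build_variant_pattern(forms):
    """Compile a case-insensitive whole-word pattern matching any listed form."""
    # Longer forms first, so a longer form is always tried before a shorter one
    ordered = sorted(forms, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


pattern = build_variant_pattern(["mouse", "mice"])
print(pattern.findall("The mouse saw two Mice near a mousetrap."))
# → ['mouse', 'Mice']
```

The word boundaries keep “mousetrap” out of the results, and the same function works for a list like “category”, “categories”, “categorize”.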

I changed the prompt and tried it in four different sessions. It worked better: it finds more than one occurrence of the keyword in the same transcript. However, it once displayed a sentence that does not contain the keyword, although the sentence is from the file, not from another source.

My suggestion is not to continue with a long chat: after 5 or 6 queries, start a new chat session for your new keywords to prevent hallucination.

I hope you can improve it over time; I believe you can do it better than me.

I used this prompt:

You are Transcript SearchGPT, and your role is to accurately search for every occurrence of a specified keyword in transcripts, ensuring that no instances are missed, while avoiding the inclusion of unrelated words. You will return every valid occurrence of the keyword in context, following the detailed steps below.


1. Document Structure:

  • Each transcript starts with <Transcript No. X> and ends with </Transcript No. X>, where X is the transcript number.
  • The moderator’s questions are enclosed in triple curly brackets {{{...}}}.
  • The participant’s answers are enclosed in triple square brackets [[[...]]].

2. Keyword Search Process:

  • Search through every transcript thoroughly for every exact match of the keyword.
  • The keyword search should be case-insensitive (i.e., both “System” and “system” should be captured).
  • Include valid variations of the keyword, such as:
    • Plurals (e.g., “system” and “systems”)
    • Different tenses or forms if relevant.

3. Avoiding False Positives:

  • Do not include other words that are merely similar in meaning to the keyword. For example:
    • When searching for “system,” do not capture words like “tools,” “tool,” “platform,” “structure,” or any other synonyms.
    • Only return matches for the exact keyword and its valid grammatical variations (e.g., singular or plural forms of the same word).

4. Exact Keyword Matching:

  • Ensure that the keyword is matched using whole-word matching, meaning:
    • The keyword must appear as a separate word and not as part of another word (e.g., “system” should not match “ecosystem” or “systematic”).
    • Avoid partial matches, where a part of the keyword is found inside another unrelated word.

5. Returning Results:

For each keyword occurrence:

  1. Return the exact sentence where the keyword occurs, with the keyword bolded.
  2. Return the sentence immediately before and the sentence immediately after the keyword sentence (if available).
  3. If the keyword occurs more than once in the same sentence or paragraph, include all occurrences and ensure each one is bolded.

6. Handling Multiple Occurrences:

  • If the keyword appears multiple times in the same sentence, paragraph, or transcript, ensure every instance is captured and returned.
  • Do not stop at the first instance—continue searching through the entire document to capture all occurrences.

7. Context Retrieval:

  • For each occurrence of the keyword:
    • Return the sentence containing the keyword.
    • Include the sentence immediately before and the sentence immediately after, where possible.
    • If no surrounding sentences are available (e.g., the keyword is at the start or end of a transcript), return the available context only.

8. No Skipping or Summarizing:

  • Do not summarize or skip any occurrences of the keyword.
  • Every instance of the keyword, along with its surrounding context, must be included.
  • Ensure that keywords in close proximity (e.g., appearing within the same paragraph or consecutive sentences) are treated as separate occurrences and handled accordingly.

9. Output Formatting:

  • Present the results in a clear, structured format:
    • Include the transcript number at the start of each result.
    • For each occurrence, return the sentence before, the sentence with the keyword (bolded), and the sentence after.
    • Ensure that sentences are separated and presented clearly without omitting any parts of the transcript.

Example Output:

<Transcript No. X>

Before: [Sentence before the keyword]

Keyword Sentence: [Full sentence with **keyword** bolded]

After: [Sentence after the keyword]

10. Error Handling:

  • If no keyword is found in a transcript, return a message stating: “No keyword found in Transcript No. X.”

Sample Keyword Search:

If you search for the keyword “system”, the expected output would be:


Example:

<Transcript No. 4>

Before: {{{What kind of content moderation policies do you have in place?}}}

Keyword Sentence: [[[We enforce strict rules on respectful behavior, and Share365Days’s trust level system helps in identifying and rewarding constructive contributors.]]]

After: {{{How do you reward active participants in your Share365Days community?}}}


Testing & Validation:

  • Ensure that the search results include all valid occurrences of the keyword, without filtering out legitimate matches or capturing unrelated words.
  • Use example transcripts to test for edge cases, such as multiple occurrences in the same sentence or paragraph, and ensure that every keyword instance is captured.

Final Considerations:

  • Focus on exact matching and avoid including similar but unrelated words (like “tools” when searching for “system”).
  • Ensure the search logic is thorough and continues through the entire transcript after finding the first match.
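Separately from the prompt itself, the sentence-level rules above (whole-word matching, bolded keyword, one sentence of context on each side) can be sketched in Python to cross-check what the GPT returns. The sentence splitter here is deliberately naive, it breaks after “.”, “!” or “?” followed by whitespace, so abbreviations will confuse it. The function names are invented for this example.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_with_context(text, keyword):
    """Return every sentence containing the whole word, plus its neighbors."""
    sentences = split_sentences(text)
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            hits.append({
                "before": sentences[i - 1] if i > 0 else None,
                # Wrap each occurrence in ** to mark it as bold
                "match": pattern.sub(lambda m: f"**{m.group(0)}**", sentence),
                "after": sentences[i + 1] if i + 1 < len(sentences) else None,
            })
    return hits
```

A missing neighbor (keyword in the first or last sentence) is reported as `None` rather than skipped, which mirrors the “return the available context only” rule.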

I uploaded the following file, in which each transcript block begins with its transcript number and uses {{{questions}}} and [[[answers]]]. It is a ten-page file containing 10 transcripts.

Content of Share365Days_Community_Forum_Transcripts_10.docx

<Transcript No. 1>

{{{Can you tell us how your book publishing community benefits from using Share365Days?}}}

[[[Share365Days provides us with a structured platform where authors, readers, and publishing teams can interact. It has improved our communication efficiency and enabled seamless discussions between different stakeholders.]]]

{{{How do you categorize discussions within your Share365Days forum?}}}

[[[We categorize discussions based on book genres, specific authors, and even upcoming book events. This helps users find the exact discussions they are interested in.]]]

{{{What challenges did you face in the initial setup of Share365Days for your community?}}}

[[[Initially, we had some trouble with organizing categories, but the customizable nature of Share365Days allowed us to quickly fix that. Now, it runs smoothly.]]]

{{{Do you use any special integrations with Share365Days for your book publishing forum?}}}

[[[Yes, we integrate Share365Days with our online bookstore. It allows members to easily transition between discussing a book and purchasing it on our store.]]]

{{{How do you keep the community engaged?}}}

[[[We keep engagement high by hosting weekly author Q&A sessions, virtual book signings, and encouraging community reviews and discussions on newly released books.]]]

</Transcript No. 1>

<Transcript No. 2>

{{{How has Share365Days improved communication among authors and readers?}}}

[[[Share365Days has enabled our authors to directly engage with their readers in structured discussions, which fosters a deeper connection between them.]]]

{{{Can you tell us how feedback from readers is managed through Share365Days?}}}

[[[The feedback is easily categorized by book titles or themes. Our moderation team ensures that all feedback is acknowledged, and authors can respond to readers’ input.]]]

{{{What are the primary topics of discussion in your Share365Days forum?}}}

[[[Most discussions revolve around new book releases, author Q&A sessions, and general discussions on literary trends.]]]

{{{Do you use any plugins or extensions in your Share365Days setup?}}}

[[[We use several plugins, including badges for top contributors and analytics tools that track the most active discussions. These features keep the community competitive and active.]]]

{{{What future plans do you have for enhancing the Share365Days experience for your community?}}}

[[[We plan to introduce live streaming events within the forum, where authors can launch their books directly to the community.]]]

</Transcript No. 2>

<Transcript No. 3>

{{{How does Share365Days help in organizing virtual book launch events?}}}

[[[Share365Days allows us to create event-specific threads where readers and authors can interact before, during, and after the book launch.]]]

{{{Do your forum members share content besides discussions?}}}

[[[Yes, they often share book reviews, literary articles, and recommendations, which adds significant value to the community.]]]

{{{How do you handle community moderation on Share365Days?}}}

[[[We have a team of dedicated moderators who ensure that the discussions remain civil and focused. Share365Days’s built-in moderation tools are very helpful.]]]

{{{What feedback have you received from authors regarding Share365Days?}}}

[[[Authors love the platform because it allows them to directly connect with their audience in an organized manner.]]]

{{{Can you tell us about any community-driven projects that have come out of your Share365Days forum?}}}

[[[One of our community-led initiatives is a yearly collaborative anthology where readers and authors come together to write and review short stories.]]]

</Transcript No. 3>

<Transcript No. 4>

{{{How is Share365Days used to manage discussions around specific book series?}}}

[[[We create dedicated categories for each book series, allowing fans to dive deep into their favorite storylines and characters.]]]

{{{What role do readers play in the Share365Days community?}}}

[[[Readers often act as ambassadors, recommending books to new members and facilitating discussions between authors and the larger community.]]]

{{{What kind of content moderation policies do you have in place?}}}

[[[We enforce strict rules on respectful behavior, and Share365Days’s trust level system helps in identifying and rewarding constructive contributors.]]]

{{{How do you reward active participants in your Share365Days community?}}}

[[[We have a rewards system where top contributors earn badges, and sometimes they even get access to exclusive content like early book releases.]]]

{{{Has Share365Days made an impact on book sales?}}}

[[[Yes, our discussions often lead to spontaneous purchases directly from the integrated bookstore link, which has been fantastic for boosting sales.]]]

</Transcript No. 4>

<Transcript No. 5>

{{{How does your community use Share365Days for book discussions?}}}

[[[Share365Days serves as a hub for book discussions, ranging from new releases to classic literature. It brings together readers from all over the world.]]]

{{{Do you have special events within the Share365Days forum?}}}

[[[Yes, we hold book club meetings where authors sometimes join, as well as writing competitions and collaborative storytelling events.]]]

{{{How has the analytics feature in Share365Days helped your community management?}}}

[[[Analytics have helped us identify which topics resonate most with our community, and we use that information to tailor content to our audience’s preferences.]]]

{{{Can you talk about the importance of community feedback on Share365Days?}}}

[[[Feedback is crucial. It helps authors understand what readers appreciate and where improvements can be made. Share365Days’s upvoting and tagging systems make feedback easy to manage.]]]

{{{Do you use Share365Days to manage internal team communications?}}}

[[[Yes, we have private categories dedicated to our editorial and marketing teams where we plan events and discuss upcoming releases.]]]

</Transcript No. 5>

<Transcript No. 6>

{{{How do you ensure that new members engage with the community on Share365Days?}}}

[[[We have a welcoming committee and introductory threads that guide new members through our most popular categories and discussions.]]]

{{{How has Share365Days improved communication across your publishing team?}}}

[[[Share365Days’s category-based structure allows each department to manage their own discussions while staying connected with the larger community.]]]

{{{What tools do you use within Share365Days for content moderation?}}}

[[[We use a mix of automated spam filters and manual moderation. Share365Days’s moderation tools are intuitive and effective in maintaining a healthy environment.]]]

{{{Do you use gamification to keep your members engaged?}}}

[[[Yes, we have badges for contributions, quizzes about popular books, and leaderboards for the most active members.]]]

{{{What’s the most popular category on your Share365Days forum?}}}

[[[Our ‘New Releases’ category is the most popular, as it’s where readers and authors discuss the latest books and their thoughts on them.]]]

</Transcript No. 6>

<Transcript No. 7>

{{{How does Share365Days assist in organizing your community’s events?}}}

[[[We use event threads to keep everyone informed. These threads have countdowns, and members can RSVP to virtual or live events through the forum.]]]

{{{How do you promote your books within the Share365Days community?}}}

[[[We have promotional categories where we post teasers, cover reveals, and early chapters for our upcoming books.]]]

{{{What makes Share365Days stand out from other community platforms for your industry?}}}

[[[Share365Days is more customizable and professional than other platforms. It’s easy to tailor the experience to fit our book publishing needs.]]]

{{{How do you manage disagreements or controversial topics within the community?}}}

[[[We have strict guidelines, and our moderation team is quick to resolve any disputes. We also have automated warnings for offensive language.]]]

{{{What new features would you like to see added to Share365Days?}}}

[[[We’d love to see more analytics around user engagement, perhaps even predictive insights for upcoming trends in discussions.]]]

</Transcript No. 7>

<Transcript No. 8>

{{{How does your Share365Days community interact with authors?}}}

[[[Authors frequently participate in live discussions and Q&A sessions. We set up special categories just for author-reader interactions.]]]

{{{Do you think Share365Days has helped your community grow?}}}

[[[Absolutely. Our user base has grown significantly since we started using Share365Days. It’s become a vibrant, thriving hub for our readers.]]]

{{{Can you tell us about your community’s favorite book discussions?}}}

[[[Our members love discussing upcoming releases and sharing their thoughts on exclusive book previews.]]]

{{{How do you manage feedback from authors on the platform?}}}

[[[We’ve created a special feedback category where authors can share their thoughts about how the platform is helping them connect with their audience.]]]

{{{What’s the best part of using Share365Days for your book publishing business?}}}

[[[The best part is the community spirit. Our readers and authors feel like they’re part of something bigger, and Share365Days makes it easy to maintain that connection.]]]

</Transcript No. 8>

<Transcript No. 9>

{{{How do you use Share365Days for managing book clubs?}}}

[[[Share365Days allows us to create dedicated threads for each book club, where readers can participate in discussions and schedule meetups.]]]

{{{What do your community members like most about Share365Days?}}}

[[[They love the simplicity and organization of the platform. The clean design and ease of finding relevant discussions are big pluses.]]]

{{{Have you integrated any tools to make the community experience better?}}}

[[[Yes, we’ve integrated Slack for internal communications and also use the Share365Days API to pull data for our external analytics.]]]

{{{What is the most important lesson you’ve learned from running your Share365Days community?}}}

[[[Consistency is key. Regular posts and engagement from both moderators and participants keep the community alive.]]]

{{{How do you plan to grow the community further?}}}

[[[We’re planning to partner with other publishers and authors to bring their fans into the forum, which should help us grow the community even more.]]]

</Transcript No. 9>

<Transcript No. 10>

{{{How do you engage your community with ongoing discussions?}}}

[[[We use prompts and open-ended questions that encourage members to share their thoughts on the books they’re currently reading or looking forward to.]]]

{{{Can you tell us about your use of tags in Share365Days?}}}

[[[We use tags extensively to help users quickly find books based on genre, author, or even by their status, such as ‘recently published’ or ‘upcoming’. This has improved user navigation and engagement.]]]

{{{How do you handle negative feedback from readers?}}}

[[[We treat negative feedback as an opportunity for growth. Authors are encouraged to respond and engage with constructive criticism, and we mediate where necessary.]]]

{{{What’s the most important feature in Share365Days for your community?}}}

[[[The customizable categories and notification system have been critical. Members never miss a discussion or a new post about their favorite books.]]]

{{{What kind of user events do you organize through Share365Days?}}}

[[[We’ve organized virtual author signings, live Q&A sessions, and even writing workshops where authors guide aspiring writers in our community.]]]

</Transcript No. 10>
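With the transcripts wrapped in markers like this, it is also easy to get ground-truth counts per transcript and compare them with what the GPT claims (for example, when it says a word does not appear in Transcript No. 2). A minimal sketch, assuming the whole file has already been read into one string; `count_per_transcript` is a name invented for this example, and it counts exact whole-word matches only, so “systems” is not counted for “system”:

```python
import re

# Matches one <Transcript No. X> ... </Transcript No. X> block, capturing X and the body
TRANSCRIPT_BLOCK = re.compile(
    r"<Transcript No\. (\d+)>(.*?)</Transcript No\. \1>", re.DOTALL
)


def count_per_transcript(text, keyword):
    """Map each transcript number to the whole-word occurrence count of keyword."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return {
        int(number): len(pattern.findall(body))
        for number, body in TRANSCRIPT_BLOCK.findall(text)
    }
```

A transcript with no matches shows up with a count of 0, so a wrong “not present” answer from the GPT is easy to spot.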

You can also use a Python script, but your keywords must match the words exactly.
You need to install the python-docx library.

You may use this sample script. It is from ChatGPT, not my own code.
It works. Well done ChatGPT!

Python Script

import os  # To handle directory creation
import re

from docx import Document  # Provided by the python-docx package


# Function to load and extract text from a .docx Word file
def load_transcript_from_docx(file_path):
    doc = Document(file_path)
    transcript = []
    current_transcript_no = None

    # Iterate over the paragraphs in the document
    for para in doc.paragraphs:
        text = para.text.strip()

        # Identify if the paragraph contains a new transcript number (e.g., <Transcript No. 4>)
        if text.startswith("<Transcript No."):
            current_transcript_no = text.split("<Transcript No. ")[1].split(">")[0].strip()

        # Only add non-empty paragraphs to the transcript list
        if text:
            transcript.append((current_transcript_no, text))

    return transcript


# Function to search for keywords in the transcript and return paragraphs with context
def search_keyword_in_transcript(transcript, keyword):
    # Whole-word, case-insensitive match with an optional "s" or "ing" suffix.
    # re.escape keeps special characters in the keyword from being read as regex syntax.
    keyword_pattern = re.compile(rf'\b{re.escape(keyword)}(?:s|ing)?\b', re.IGNORECASE)
    results = []

    # Search for the keyword in each paragraph
    for i, (transcript_no, paragraph) in enumerate(transcript):
        clean_paragraph = re.sub(r'[{}\[\]]', '', paragraph)  # Remove unwanted curly and square brackets
        clean_paragraph = clean_paragraph.replace("\ufffd", "'")  # Fix apostrophe issues caused by encoding

        if keyword_pattern.search(clean_paragraph):
            highlighted_paragraph = keyword_pattern.sub(r'**\g<0>**', clean_paragraph)
            before_paragraph = transcript[i - 1][1] if i > 0 else "(No previous paragraph)"
            after_paragraph = transcript[i + 1][1] if i < len(transcript) - 1 else "(No following paragraph)"
            before_paragraph = re.sub(r'[{}\[\]]', '', before_paragraph)  # Remove unwanted characters in context paragraphs
            after_paragraph = re.sub(r'[{}\[\]]', '', after_paragraph)

            results.append({
                'transcript_no': transcript_no,
                'before': before_paragraph,
                'keyword_paragraph': highlighted_paragraph,
                'after': after_paragraph
            })

    return results


# Main function to run the keyword search on the docx file
def main():
    # Specify the path to your .docx file
    file_path = 'Share365Days_Community_Forum_Transcripts_10.docx'

    # Specify the keyword you want to search for
    keyword = 'system'  # Change this to whatever keyword you want to search

    # Load the transcript from the .docx file
    transcript = load_transcript_from_docx(file_path)

    # Search for the keyword in the transcript
    results = search_keyword_in_transcript(transcript, keyword)

    # Create 'results/' directory if it doesn't exist
    os.makedirs('results', exist_ok=True)

    # Create a dynamic filename based on the keyword
    output_filename = f'results/result_output_for_keyword_{keyword}.txt'

    # Save results to a file
    with open(output_filename, 'w', encoding='utf-8') as output_file:
        for i, result in enumerate(results, 1):
            output_file.write(f"\nResult {i}:\n")
            output_file.write(f"Transcript No.: {result['transcript_no']}\n")
            output_file.write(f"Before: {result['before']}\n")
            output_file.write(f"Keyword Paragraph: {result['keyword_paragraph']}\n")
            output_file.write(f"After: {result['after']}\n\n")

    print(f"Results saved to {output_filename}")


# Run the main function
if __name__ == "__main__":
    main()

This is the output on the VS Code screen: