Cannot get GPT-4o to count bullets in a markdown file

I’ve spent days building a prompt that reads a markdown file and answers questions about a single document. I’m using gpt-4o with a temperature of 0.

To get it to work somewhat well, I’ve broken the prompt down into baby steps. But it’s failing on Step 1, which is reading the markdown file. To “slow it down” and confirm it’s working, I’m asking it to return the number of bullets in the “statefile”. There are 80. I get a different number on each run, usually 50-something. The statefile is delimited by XML tags at the end of the prompt.

What to do?? Here’s the first part of the prompt:

You are a chatbot designed to answer questions about a person’s performance. You will find the information you need to answer any question in a “statefile” which I will give you later.

Here’s how you should proceed:

  1. You will read the entire statefile which is in markdown format. You will interpret the statefile as follows:

    • Lines starting with # are headings that provide category context for the lines that follow it.
    • Lines starting with - are “valuelines” in the general format “label(s) is/are value(s) {categories}”.
    • Values can be a string, number, and/or color name.
    • Color interpretations:
      • blue = exceptionally good
      • green = meets expectations
      • chartreuse or yellow = needs attention
      • red = needs urgent attention
    • Categories in {} can be multiple values separated by commas. One type of category is “importance level”.
    • Importance levels interpretations: !1 = very important, !2 = important, !3 = somewhat important.
  2. Once you have read the entire statefile including headings and valuelines, you will confirm that you have read and understand each and every line of the file. You will then state the number of lines that you have read. For example,

I have read the entire statefile. There were 10 headings and 25 valuelines.

3 Likes

Counting is iffy with LLMs for reasons that I won’t go into in this post.

In my use case (of the multi-agent framework), I absolutely need the exact counts (Assistants API - Access to multiple assistants - #36 by icdev2dev).

IF you REALLY need the exact count, I suspect you would be better served by being very deterministic about it (the above post mentions how I “occasionally” post the count onto the thread). In your case, you might want to post the count beforehand, IF you need it.

IMO there is no particular need to “slow it down”. There are ways of confirming that it is actually reading the file and extracting the meaning besides “slowing it down”.

3 Likes

What are you trying to do with the results? The LLM sees all of the tokens that represent the statefile even if it can’t count them.

Do you want it to focus on specific parts of the text for its answer? Can you give a sample text file? Maybe that can help determine whether you can do it algorithmically or not.

What you can do is ask the LLM to give you a list of what you want and then check the length of that list in code.
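For example, something along these lines (just a sketch using the official openai Python client; the system prompt wording and the “valuelines” key are only illustrative):

import json
from openai import OpenAI

client = OpenAI()

def extract_valuelines(statefile: str) -> list[str]:
    """Ask the model to extract every '- ' line as JSON, then parse it in code."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": 'Extract every line of the statefile that starts with "- " '
                           'and return JSON of the form {"valuelines": ["...", "..."]}.',
            },
            {"role": "user", "content": statefile},
        ],
    )
    return json.loads(resp.choices[0].message.content)["valuelines"]

# The count then comes from code, not from the model:
# valuelines = extract_valuelines(open("statefile.md").read())
# print(len(valuelines))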

You can also try out a prompt that extracts the structure I believe you want here: Analyze Statefile — Prompt Fiddle

Here’s also a sample file I generated using a different model (but I’m unsure if this is what you’re looking for):

# Personal Health

- Weight is 68 kg {physical health, !2}
- Sleep quality is good {well-being, !1}
- Daily steps are 8,500 {physical activity, !2}
- Stress level is moderate {mental health, !1}
- Water intake is 2 liters {hydration, !2}
- Meditation practice is 15 minutes daily {mental health, !3}

# Career

- Job satisfaction is high {professional growth, !1}
- Current project status is on track {work performance, !2}
- Skill development is progressing {career growth, !2}
- Networking efforts are minimal {professional relationships, !3}
- Work-life balance is acceptable {well-being, !1}
- Annual performance review is green {job security, !1}

# Financial Health

- Savings rate is 20% {financial stability, !1}
- Investment portfolio performance is blue {wealth growth, !2}
- Credit score is 780 {financial health, !2}
- Emergency fund is 6 months of expenses {financial security, !1}
- Debt-to-income ratio is 28% {financial health, !2}
- Budget adherence is yellow {financial discipline, !2}

# Relationships

- Family communication is weekly {personal life, !2}
- Friend meetups are monthly {social life, !3}
- Romantic relationship status is single {personal life, !2}
- Social media engagement is moderate {digital presence, !3}
- Conflict resolution skills are improving {interpersonal skills, !2}
- Empathy level is high {emotional intelligence, !1}

# Personal Development

- Reading habit is 2 books per month {knowledge acquisition, !2}
- New skill learning is in progress {self-improvement, !2}
- Goal setting is consistent {personal growth, !1}
- Time management is yellow {productivity, !1}
- Self-reflection practice is weekly {self-awareness, !2}
- Hobby engagement is satisfactory {work-life balance, !3}

# Home Environment

- Cleanliness level is acceptable {living conditions, !2}
- Home maintenance tasks are up-to-date {property care, !3}
- Energy efficiency is improving {sustainability, !2}
- Home office setup is ergonomic {work environment, !2}
- Neighborhood safety is green {security, !1}
- Plant care is thriving {home aesthetics, !3}

# Community Involvement

- Volunteer hours are 5 per month {social responsibility, !2}
- Local event participation is quarterly {community engagement, !3}
- Charitable donations are 3% of income {philanthropy, !2}
- Neighborhood relations are cordial {social capital, !3}
- Environmental consciousness is high {sustainability, !2}
- Civic engagement is moderate {citizenship, !3}
3 Likes

As people have said above, LLMs are bad at counting. It’s way better to ask the model to extract a list and then check the length of that list, like Aaron’s promptfiddle example shows.

1 Like

From my practice: if the task can be broken into subtasks, you get the best results by using a separate API call for each subtask with its own prompt.

That remains to be confirmed, but a temperature of zero may mean the model picks whatever temperature it “considers” appropriate for the prompt. I personally don’t go below 0.01.

I have a similar task in our legal analysis tool, and even a fine-tuned model kept failing at it around 5% of the time (we usually deal with 3-15 items per list). So I replaced this step with a regular function (classic code).
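Something like this sketch is what I mean by classic code (assuming the items are markdown bullets starting with “- ”):

def count_items(text: str) -> int:
    """Deterministically count list items: lines that start with '- ' after stripping indentation."""
    return sum(1 for line in text.splitlines() if line.lstrip().startswith("- "))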

From what I see at the beginning of the prompt:

  1. The tasks are not clearly defined (the goal of the operation).
  2. There are no hints on how to achieve the task (the step-by-step process).

I have not seen the whole prompt, but honestly: start by writing down the complete workflow of how to achieve the thing you want the app to do, being as detailed as possible; then see how to break it down into sub-modules where you define the intermediate goals; then see how humans would improve results in each sub-module. Only then write your prompts (a separate prompt for each operation/API call).

If you do this, I bet you’ll get to the next level of performance.

1 Like

Thanks for the thoughtful replies.

Writing the Python code to make AI calls to analyze the < 100 line file took less than a minute. Doing the prompt engineering, maybe a week!

I finally have a prompt that seems to work 100% of the time with the “gold” questions (standard questions). However, the prompt file is longer than the file it needs to process! It consists of 12 steps, many examples, etc.

Clearly, AI for many tasks is not worth the work. I hope OpenAI focuses on this. This is a simple Q&A application over structured data. AI should be able to do it 100% and quickly.

Thanks for posting your solution; we appreciate the hard data on what works well for LLMs and what does not. Clearly, you have demonstrated that a certain class of deterministic problems, such as counting the number of bullet-list items in a document, is not the proper application of LLM technology. That’s probably why OpenAI has created an architecture that promotes the use of “tools”. Tools are like a calculator for a human. If you can write a deterministic tool and then reference it in your prompt, I think that will be a much better approach and goal. We shouldn’t expect OpenAI to necessarily create all of the “tools” for us.
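As a rough sketch of what that could look like (the function name and schema here are my own illustration, not something OpenAI provides), you write the deterministic counter yourself and declare it via the tools parameter of the Chat Completions API:

import re

BULLET = re.compile(r"^\s*-\s")

def count_bullets(statefile: str) -> int:
    """Deterministic 'calculator'-style tool: count markdown bullet lines."""
    return sum(1 for line in statefile.splitlines() if BULLET.match(line))

# Declared to the model so it can request the exact count instead of guessing it:
tools = [{
    "type": "function",
    "function": {
        "name": "count_bullets",
        "description": "Count the bullet lines ('- ') in the statefile and return the exact number.",
        "parameters": {
            "type": "object",
            "properties": {"statefile": {"type": "string"}},
            "required": ["statefile"],
        },
    },
}]

# When the model responds with a tool call, run count_bullets() locally and send the result
# back in a role="tool" message; the final answer then uses the exact count.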
I suggest we have OpenAI focus on general intelligence and not “tools”, which can be easily created or reused as part of a tool API.
Thanks.

OpenAI is going to do what it is going to do.

However, the audience should be cognizant of which tool to use: a straight deterministic program, versus a narrowly tailored fine-tuned model that is less capable of general intelligence, versus a model that is more capable of general intelligence.

@mcdow, just for clarity: my application does not require counting. The only reason I was asking gpt-4o to count was to determine why results were often incomplete depending on how the question was asked. When I instead asked “what is the first line you read” and “what is the last line you read”, it was clear that it had all the data, although the result was still incomplete. By jumping through hoops, it finally returns all the results all the time.

1 Like

I think this is the wrong way to think about it: as Analyze Statefile — Prompt Fiddle shows, it’s entirely possible to craft a prompt for GPT-4o that achieves what you want and is much shorter than the input file.

The better way to think about it is that it takes time and effort to learn how to effectively craft prompts for LLMs, very similar to how you need to spend time teaching an intern how to do things the right way; prompts like:

  • “count the number of items in a list”
  • “the first line you read”
  • “the last line you read”

are all, honestly, prompts that I expect to perform poorly, by the nature of how LLMs work. (This has a lot to do with why counting is hard, plus newer research showing that fine-tuning models to be better at math hurts their semantic reasoning skills.)

I was asking those silly questions and adding steps because “best practices” have been suggesting to “slow things down”.

Where things got stuck: I have the prompt read this markdown file, which is mostly lines of the form “factor is/are value {category}”. I ask for values that are red, yellow, chartreuse… Over most runs it would only return 3 out of 4 lines of the file with those colors. If I ask a follow-up question, it will always return the fourth line. Not okay. I tried and tried, but until I started asking busy-work questions, it would not return the fourth line.

In the past, I’ve learned that when I want instruction-following, I move it over to Claude and it works great. If I want a chat bot, OpenAI works great.

I’m looking forward to new features, models, etc. from OpenAI. But what would be better is to help us developers get the current stuff to work better. Like metaprompt tools. Like fixing a well documented bug with the API for input context. Etc.

C3.5 RESPONSE.

Thank you for sharing your challenge with the prompt. I understand you’re having difficulty getting consistent results for counting the number of bullets in the statefile. Let’s consider a few potential solutions:

  1. Explicitly request bullet counting: Modify step 2 to specifically ask for a count of bullet points. For example:

    “2. Once you have read the entire statefile including headings and valuelines, confirm that you have read and understand each line. Then state the number of lines you’ve read, specifying the count of headings, valuelines, and bullet points separately.”

  2. Separate the tasks: Instead of asking for multiple tasks in one step, break it down further:

    “2. Read the entire statefile, including headings and valuelines.
    3. Confirm that you have read and understood each line.
    4. Count and report the number of headings in the statefile.
    5. Count and report the number of valuelines in the statefile.
    6. Count and report the number of bullet points in the statefile.”

  3. Add a verification step: After the model reports the counts, ask it to double-check:

    “7. Review the statefile again and verify your counts. If there are any discrepancies, report the corrected numbers.”

  4. Use regex pattern matching: If possible in your setup, you could add a step that uses a regular expression to count bullet points more accurately:

    “Count the number of lines that match the pattern ‘^\s*[-*]’ in the statefile and report this as the number of bullet points.”

  5. Increase context window: Ensure that your entire statefile fits within the model’s context window. If it’s too long, the model might not be able to process it all at once.

  6. Chunk the input: If the statefile is very large, consider breaking it into smaller chunks and processing each chunk separately, then combining the results.

  7. Use a different delimiter: Instead of XML tags, try using a unique string that’s unlikely to appear in your markdown, such as “<<<STATEFILE_START>>>” and “<<<STATEFILE_END>>>”.

If these adjustments don’t resolve the issue, it might be helpful to see a sample of your statefile to identify any potential formatting issues that could be causing confusion for the model.

GPT has dyscalculia. Every time I ask for an answer with a word count of the generated answer, the count is wrong.

Okay, I’ll try that. Note that the actual prompt is a whopping 12 steps in order for it to give me the correct answer. I only included part of the prompt.

Even though the prompt is ridiculous, I hate to change it because it actually and finally works.

Your instructions fail right where you tell it to fail. You say you will give the statefile later, then you give it instructions to use a file that isn’t there yet, even though it may be. I guess it assumes you mean later, when it gets the file, to do the rest of the instructions. 🙂 That may not be the issue, but you may want to start there. Also, you can tell GPT to use Code Interpreter and just count the lines that start with #; it will write the program to count, run it, and give you the results.

We’ve been experiencing a similar issue in getting GPT to read and process 100% of the contents of an input file in Word docx format. The Word doc contains structured tables of content providing small blocks of text on insights and value propositions for a given product proposition.

To try to make sure that GPT reads all the rows of content in the key tables we ask it to read all the rows, analyze the content of each row, and summarize what it has found using a set of highly structured prompts.

Even with this multistep approach, GPT still sometimes produces summaries for only, say, 6 out of 7 value proposition rows; sometimes it will summarize all 7 rows; and sometimes it will come up with summaries for 8 rows (i.e. it has invented an extra value proposition row of its own)!

The really interesting thing is when you have a follow-up conversation with GPT about why it has done this. Sometimes it talks about resource optimization, other times it says the most important content in a document is usually found at the beginning so it has weighted that content more in its response. It’s very much like talking to a human intern and asking them why they haven’t followed your instructions precisely!

For our application, which is generating marketing content, this variability in response is acceptable. For other applications it won’t be and I’m not sure what you do about it.

Regex should help, @Tahl. This works fine for me (rely on code, not on the LLM directly):


write code to count lines in a file that start with "- " (dash space). use a regex.

then run it on this input:

# Personal Health

- Weight is 68 kg {physical health, !2}
- Sleep quality is good {well-being, !1}
- Daily steps are 8,500 {physical activity, !2}
- Stress level is moderate {mental health, !1}
- Water intake is 2 liters {hydration, !2}
- Meditation practice is 15 minutes daily {mental health, !3}

# Career

- Job satisfaction is high {professional growth, !1}
- Current project status is on track {work performance, !2}
- Skill development is progressing {career growth, !2}
- Networking efforts are minimal {professional relationships, !3}
- Work-life balance is acceptable {well-being, !1}
- Annual performance review is green {job security, !1}
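For reference, the kind of script that request produces looks roughly like this (assuming the input above is saved to a local file such as statefile.md):

import re
import sys

# Count lines that start with "- " (dash space), allowing leading indentation.
bullet = re.compile(r"^\s*-\s")

with open(sys.argv[1], encoding="utf-8") as f:
    count = sum(1 for line in f if bullet.match(line))

print(count)

Saved as, say, count_bullets.py, you run it as python count_bullets.py statefile.md; being plain code, it returns the same number on every run, which is the whole point.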
1 Like

Just out of curiosity, what prevents you from handling them one by one?

  1. You are mentioning a File. Are you uploading this file to some endpoint? Are you attaching it? Or are you sending it as text straight to the model?

  2. As mentioned, counting is not a reliable method. A better method of understanding what the model is seeing is attaching an ID to each item. Something as simple as the person’s name.

  3. Distillation & separation is key for volume management. The larger the context, the more saturated it obviously becomes. As @sergeliatko has mentioned, you want to break the tasks down to be as focused as possible.

For example, you can run some tests using the Batches endpoint. Try a batch of 10 bullet points, then 20, etc. You can do all of this within 1 day. Find the optimal number of bullet points, use the model to distill the information, then gather the results and perform a final evaluation on that.

  1. Remove the noise. Your instructions are verbose and unnecessary.
  • Lines starting with # are headings that provide category context for the lines that follow it.
  • Lines starting with - are “valuelines” in the general format “label(s) is/are value(s) {categories}”.
  • Values can be a string, number, and/or color name.
  • Categories in {} can be multiple values separated by commas. One type of category is “importance level”.

In my opinion this is all useless. Markdown is the model’s native tongue. Work with the model.

Ultimately, you are trying to spell out structures that are already known to the model, which can cause a lot of friction and confusion. There is no benefit here. I promise that the model will be able to understand your formatting unless it’s unorthodox, in which case I highly recommend reformatting.

1 Like

Our application requires GPT to analyze multiple individual value propositions (rows of content) and then synthesize what it has learned to create composite paragraphs of text to serve different content marketing requirements.

The hardest ask is to create an “elevator pitch” for a proposition which essentially boils down perhaps 10 or more rows of content (comprising thousands of words) from the input document into a couple of paragraphs which summarize the most important concepts and points.

Although crafting the optimum set of prompts took a long time, the results generated by GPT under this type of very close direction are remarkably good. But the interesting thing is, every time you run the process using the same input data (Word doc) and the exact same prompts, you get quite different text, much as you would if you asked different human copywriters to do the task!