Poor quality response on trained LLM with pdf files

66c3ad033f2a80a36a93 · February 4, 2024, 2:53am

I trained GPT-4 and GPT-4 Turbo on a PDF study guide and tried to get GPT to write mock exam questions based on it, but the quality is quite bad: some answers are blatantly wrong. What else can I do to improve it? Or is the technology just not good enough for this yet? Besides writing mock exam questions, I intend to use it to generate audio course transcripts. Any help is appreciated!

Diet · February 4, 2024, 3:15am

Hi!

When you write a study guide, what do you do? Do you just start with the title, and then write the whole thing from front to back in one go?

I imagine there’s a method to composing, revising, and validating your study problems.

You’ll have to teach the system how to accomplish your task.

It’s like throwing sand, lithium, nickel, and petroleum into a furnace, and expecting an iPhone to pop out. Kinda. Sorta.

DevGirl · February 4, 2024, 10:31am

I’m very familiar with the issues you’re experiencing and there is a solution.

The first thing to try is opening your PDF in Adobe Acrobat Pro and exporting to XML (specifically XML, not another format).

Then open the document and explore it. You should notice a variety of bookmarks (for chapters/sections/etc), other XML markup, etc.

Before going any further, invest a good deal of time going through this. Is the information presented logically where the LLM would understand it, cohesively, etc?

The reason I suggest this: in many instances PDFs are not stored as linear text, and an XML export is the quickest way to uncover hidden issues you might not be aware of.

Next…

You can start editing your XML to help ChatGPT better understand it. This makes a very substantial difference. Even better, you can strip most of the XML and keep the bare-minimum markup to best focus on the substantive content.
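As a rough illustration of stripping most of the markup while keeping a bare minimum, here is a minimal sketch using Python's standard library. The tag names (`H1`, `H2`, `P`) are assumptions for the example; a real Acrobat export uses its own tag set, so inspect your file and adjust.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; replace with the ones found in your own export.
KEEP_TAGS = {"H1", "H2", "P"}


def strip_xml(xml_text: str) -> str:
    """Flatten XML to plain text, keeping headings as Markdown-style markers."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if elem.tag not in KEEP_TAGS:
            continue  # drop figures, artifacts, and other markup
        # Join nested text and collapse runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if not text:
            continue
        if elem.tag == "H1":
            lines.append(f"# {text}")
        elif elem.tag == "H2":
            lines.append(f"## {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)


sample = "<Doc><H1>Chapter 1</H1><P>Rule  one <b>applies</b>.</P><Figure>x</Figure></Doc>"
print(strip_xml(sample))
```

The heading markers preserve chapter and section boundaries, which helps later when the prompt asks for a chapter or section reference.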

Finally…

When writing your prompt, towards the top and towards the end, explicitly state that it MUST “cite” the data from the attached file and that referring to content outside the attached file will result in a failure.
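Whether the model actually obeyed the cite-only instruction can be checked mechanically, assuming the prompt asks for a short verbatim quote from the file. The sketch below is one way to do that; the function names and the `max_words` limit are illustrative assumptions, not part of any library.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_is_grounded(citation: str, source_text: str, max_words: int = 10) -> bool:
    """Return True if the citation is a short verbatim phrase from the source."""
    cited = normalize(citation)
    if not cited or len(cited.split()) > max_words:
        return False
    return cited in normalize(source_text)


source = "Representatives must not\nclaim personal expenses as business costs."
print(citation_is_grounded("claim personal  expenses", source))
print(citation_is_grounded("claim travel expenses", source))
```

A citation that fails this check is a sign the model drew on content outside the attached file, so that question can be discarded or regenerated.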

After these steps, please return to report your results. Based on your experience, I can custom-tailor additional tips to help you.

66c3ad033f2a80a36a93 · February 4, 2024, 7:16pm

The study guide isn’t written by me, but by an org. The topic is more about regulations or law. I’ll try out the xml method, but not sure whether my app accepts training of xml documents.

What’s the key to train gpt?

66c3ad033f2a80a36a93 · February 4, 2024, 8:59pm

I am trying to strip all the tags and keep the text. But what should I do with the original images such as tables? How can I best train gpt on the info/knowledge for tables and charts?

DevGirl · February 4, 2024, 10:52pm

This is precisely why exporting to XML can be so valuable. You quickly learn what/how the information differs from your view of the material.

If you have a lot of content in images, ideally it would be best to OCR it first. I realize that the bulk of the content is already in text format; however, OCR’ing is the fastest way to get the remainder of the content into useable format.

If you find it too time-consuming to strip the XML, you can always export to straight text (or to RTF, then from RTF to Markdown, or, using a non-Adobe tool, to HTML). These are quicker; however, XML is nice because it tells you a great deal about the content and makes it easier for you to strategize how to catalog it for ChatGPT.

Regarding whether something accepts XML: you can always save it as a .TXT file. LLMs understand markup regardless of what file extension is used.

Regarding the text originating from your organization, I am not sure how that relates. I may be overlooking something. Please let me know if there is a related question and I’ll be happy to help.

66c3ad033f2a80a36a93 · February 5, 2024, 5:26am

One of the more serious problems I have, besides wrong answers, is that despite my specifying in the prompt that answers can only be AB, AC, AD, BC, BD, CD, ABC, ACD or BCD, GPT still violates my specification by writing questions whose answers are A, B, C, D or ABCD. Another violation is that I specified that I want scenario-based questions, but it still gives me simple direct questions like “What is the factors when considering xxx”.

66c3ad033f2a80a36a93 · February 5, 2024, 6:21am

I also specified clearly no option can be “None of the above” or “All of the above” but gpt still violates my constraints. What have I done wrong?

DevGirl · February 5, 2024, 11:27pm

It’s impossible to offer any educated ideas based solely on this.

You’d have to provide the full prompt – and even that isn’t a guarantee if there are issues with your embedded content.

If you’d like to provide your prompt and a sample of what you’re hoping to receive, we can provide specific advice that will help.

Diet · February 5, 2024, 11:35pm

do you wanna share your prompt so we can take a look?

66c3ad033f2a80a36a93 · February 6, 2024, 12:25am

I have been iterating on my prompts and the results have been unreliable. Here is one version:

Take the role of an exam question setter for a regulatory organisation. Write 5 unique sets of exam questions based on the context. The questions must be scenario-based, for example: “Jaime is a corporate banker at Pacific Finance (‘Pacific’), which sets company policies on corporate travel. In June, she went on a 2-day trip to meet Atlantic Corporation (‘Atlantic’) as part of the due diligence process for an upcoming transaction. On this trip, her expenses on food and drinks were 50% higher than the per diem allowance. Knowing that the full claim amount would not be approved, she made a claim for the maximum claimable sum and the difference was deemed a personal out-of-pocket expense. In August, Jamie went on a 5-day trip to attend a series of business conferences and then included the out-of-pocket expense she incurred on the trip in June. As most of the meals in the August trip were covered by the conference organisers, Jaime knew that she could include the out-of-pocket expense from June without raising suspicion. Based on ethical principles, why is Jaime’s conduct unacceptable for a corporate finance representative?”, that test test-takers on applying their knowledge. There must be exactly 4 options to choose from, known as A,B,C,D. Do not use “None of the above” or “All of the above” as any of the options. A question with the answer being “A and B” shall have its answer being represented as “AB”. Do not design questions that have answers being “A”, “B”, “C”, “D” or “ABCD”, instead let the answer be having 2 or 3 of the options being correct. Include an explanation that explains why the correct options are correct and why the incorrect options are not correct. Include a citation of a phrase up to 10 words from the context. Include the relevant chapter and/or section number. 
The final output for each set of question should be a one line of text, where each the question, option A, option B, option C, option D, answer, explanation, citation and relevant chapter number are separated by the character “|”, for example: “What can be the colors of an apple? | Red | Blue | Green | Yellow | AC | explanation | Citation | Chapter 2.5.1”
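Because the model does not reliably follow these constraints, one option is to check every output line in code and reject or regenerate the ones that fail. The sketch below is a minimal validator for the pipe-separated format described above; all names are illustrative. The allowed answers are derived from the "2 or 3 of the options" rule, so the set includes ABD; remove it if that combination should be excluded.

```python
from itertools import combinations

# Every combination of 2 or 3 letters from A-D (10 in total, including ABD).
ALLOWED_ANSWERS = {"".join(c) for n in (2, 3) for c in combinations("ABCD", n)}
BANNED_OPTIONS = {"none of the above", "all of the above"}
FIELDS = ["question", "A", "B", "C", "D", "answer", "explanation", "citation", "chapter"]


def validate_line(line: str) -> list[str]:
    """Return a list of rule violations for one pipe-separated question line."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != len(FIELDS):
        return [f"expected {len(FIELDS)} fields, got {len(parts)}"]
    row = dict(zip(FIELDS, parts))
    problems = []
    if row["answer"] not in ALLOWED_ANSWERS:
        problems.append(f"answer {row['answer']!r} is not 2 or 3 of A-D")
    for key in "ABCD":
        if row[key].lower().rstrip(".") in BANNED_OPTIONS:
            problems.append(f"option {key} uses a banned phrase")
    if len(row["citation"].split()) > 10:
        problems.append("citation longer than 10 words")
    return problems


good = "What can be the colors of an apple? | Red | Blue | Green | Yellow | AC | explanation | Citation | Chapter 2.5.1"
print(validate_line(good))
```

Note that this simple split breaks if a question or explanation itself contains the "|" character, so a less common separator may be safer.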

Diet · February 6, 2024, 12:40am

the technology to do this was around in the 90s

But yes, the technology probably isn’t advanced enough yet to reliably do what you’re asking, the way you’re asking.

Imagine giving this chunk of text to your mother and then telling her to start typing right away, while snapping your fingers and telling her to go faster. No time for thinking! Just type! Snap! Snap!

As I mentioned before, if you want reliable results you need to lay out a process, and allow the model to follow that process.

66c3ad033f2a80a36a93 · February 6, 2024, 1:55am

How do I “lay out a process” then?

Diet · February 6, 2024, 1:59am

Have you tried this?

https://platform.openai.com/docs/guides/prompt-engineering/give-the-model-time-to-think
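One possible reading of "lay out a process" is to split the single large prompt into smaller steps, each handled by its own model call, so the model works out the rule and scenario before committing to an answer. The sketch below assumes a hypothetical `ask` function that sends a prompt to whatever model you use and returns its reply; the step wording is only an example.

```python
from typing import Callable

# Illustrative step prompts; {previous} holds the output of all earlier steps.
STEPS = [
    "Quote one rule from the study guide excerpt below that is worth testing.\n\n{context}",
    "Write a short fictional workplace scenario in which this rule matters:\n\n{previous}",
    "Write one question with options A-D about this scenario, "
    "where exactly 2 or 3 options are correct:\n\n{previous}",
    "Check each option against the quoted rule and state the final answer letters:\n\n{previous}",
]


def run_process(context: str, ask: Callable[[str], str]) -> list[str]:
    """Run each step in order, feeding all earlier outputs into the next prompt."""
    outputs: list[str] = []
    for template in STEPS:
        previous = "\n\n".join(outputs)
        prompt = template.format(context=context, previous=previous)
        outputs.append(ask(prompt))
    return outputs


# Stand-in for a real model call, used here only to show the control flow.
calls: list[str] = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    return f"out{len(calls)}"


print(run_process("GUIDE TEXT", fake_ask))
```

Keeping the final checking step separate gives the model a chance to catch an answer key that contradicts the rule it quoted in the first step.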

66c3ad033f2a80a36a93 · February 6, 2024, 6:11am

The study guide contains some case studies in which the story is fictional. However, when I prompt the trained GPT to write exam questions, it tries to reuse a case study’s scenario and seems to treat what was described in the case study as fact, despite the study guide clearly labelling the case study as an “Example”. How do I overcome this?

DevGirl · February 6, 2024, 8:15am

A lot more effort has to go into organizing the prompt to reinforce context of rules separately from the actual task, etc.

Once you’ve done that, ideally you need to give it a minimum of what I suspect is five shots (very well-thought / diverse examples that help train the LLM) as part of the prompt itself, not the embedding.

For example, something like:

# Task: explain
## Rules: explain w/more succinct language
- Rule 1
- Rule 2 (etc)
## Style: explain (optional in your case: the items that aren't as strict as your rules)
## Examples: explain
- Example 1
- Example 2 ... Example 4+

And yes, use the Markdown format I provided here. The more content you provide an LLM, the more important context and organization become. And the more you attempt to give it “rules” (eg - think of it like a programming language), the more you need to structure for sake of creating pseudo-code (carefully, methodically explained).
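If you iterate on the prompt often, it can help to generate this layout from separate pieces so the rules and examples stay organized. The following is a minimal sketch; `build_prompt` and its parameters are hypothetical names, and the default of five examples simply mirrors the suggestion above.

```python
def build_prompt(task, rules, examples, style="", min_examples=5):
    """Assemble a Markdown prompt with Task, Rules, optional Style, and Examples."""
    if len(examples) < min_examples:
        raise ValueError(f"need at least {min_examples} examples, got {len(examples)}")
    lines = [f"# Task: {task}", "## Rules:"]
    lines += [f"- {rule}" for rule in rules]
    if style:
        lines.append(f"## Style: {style}")
    lines.append("## Examples:")
    lines += [f"- {example}" for example in examples]
    return "\n".join(lines)


prompt = build_prompt(
    task="Write exam questions",
    rules=["Exactly 4 options", "The answer has 2 or 3 correct options"],
    examples=["e1", "e2", "e3", "e4", "e5"],
)
print(prompt)
```

Keeping each rule as its own list item makes it easy to add, remove, or reword one rule at a time and see which change affects the output.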

66c3ad033f2a80a36a93 · February 6, 2024, 7:32pm

Ok, so shall I conclude that having a text based embedded file is almost already the best, and that the problem I’m facing is more of formulating a well written prompt?

larissapecis · February 6, 2024, 11:22pm

As @DevGirl said:

A lot more effort has to go into organizing the prompt to reinforce context of rules separately from the actual task, etc

ChatGPT is conversational. You will have to model it; “train” is not the right word at all (it is also technically wrong).

Take some time to read the guide yourself. Then copy the most important rules into the prompt CLEARLY. It is natural language, so try to use bullet points, line breaks, and everything that makes language readable. In addition to what @DevGirl said, before feeding in the data, prepare the model to receive your file:

[…]
## Rules
[…]
Now, you will receive a study guide containing:

How to study right;

How to not study right;

How to perform tests.

ChatGPT behaves differently with external sources (pdf, xml…). Neither Advanced Analysis (Code Interpreter; I haven’t used it in weeks, so it might have improved since then) nor the plugins are as accurate as the prompt itself.

You might need some prompts/answers to get what you want, but it is worth it. By the way, you can “clean” the bad answers from the conversation so that it will make the process less messy.

DevGirl · February 7, 2024, 4:20am

Question 1: Is a text-based embedding the best? Technically the answer is that the most understandable and semantically-categorized/indexable content is the best.

You mentioned that it requires a lot of work to properly convert the PDF to a text file for embedding purposes. If the conversion takes you more than about an hour, locating a LangChain Python script that you can reuse might help you.

Simply chunk-out the text semantically and then edit/produce your new embedding document.
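As a rough illustration of chunking on natural boundaries, here is a plain-Python sketch that packs whole paragraphs into chunks up to a size limit. It is a simplified stand-in, not LangChain's splitter API, and the function name and `max_chars` default are assumptions for the example.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


print(chunk_paragraphs("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

Splitting on paragraph boundaries keeps each rule or example intact inside one chunk, which is the main point of chunking semantically instead of by a fixed character count.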

This requires some time/effort on your part; there is more than I can explain here. Learning about LangChain text splitters might be a good starting point:

Try this utility (source code available) as an example:

This may not be necessary in this instance; however–

It’s an extremely valuable component to understand if you’ll be doing much with embeddings / vector-indexed data.

Question 2: Importance of formulating a well-written prompt. Yes, this is always important. However, it becomes far more important the further you move from the intended purpose of LLMs.

For example, if you’re not asking an LLM to provide a written treatise on a topic, then you’re typically moving into the realm of instructions/rules that you communicate in the prompt. This is moving a bit away from LLM’s most fundamental design to a minor degree; however, the more complex this becomes (the more it approaches pseudo-code with logical application such as binary IF/THEN, etc) – you’re moving further.

Finally, as you supply more text/rules, you add even more complexity and once you add the requirement of referring to indexed data, you depart native LLM design further.

In short, carefully written prompts (or, to parallel conventional/legacy instructions, clear communication) are always crucial. In your case, however, prompts are even more important.

Therefore, the answer is that first you must ensure your embedding data is being indexed and interpreted as you expect. Then you must iteratively optimize your prompt.

Neither one alone will resolve your issue fully; however, the tips I provided in prior posts, in the prioritized-order I provided them, can produce the results you’re seeking.

Good luck to you

66c3ad033f2a80a36a93 · February 7, 2024, 6:46pm

@DevGirl @larissapecis (at)Diet (sorry, new users can only tag 2 users in a single post)

I thank all of you for being so patient with me, but I conclude that I’m just too dumb to meet my goals on my own.

I’m thinking whether I can hire you to solve this problem for me? Drop me an email at 2kvx95vkl0 (at) secretmail dot net , thanks!

P.S.: This email address is temporary and valid only for a few days. I’ll email you from my real email after I receive your email. Do identify yourself in your email, and make a post here with a matching code-word, just to prevent scammers from impersonating you by sending me an email claiming to be you.
