Split context and prompt into two requests

Hi, I’m trying to use the OpenAI API to reformat unstructured text (parsed from a PDF) into JSON. The problem is that the source text (context) can easily consume 3,000 tokens, so there aren’t enough tokens left for the model to return the whole JSON output. Is there any way to split the source text and the prompt and send them in two requests, like in the ChatGPT frontend? (e.g. using something like GitHub - jupediaz/chatgpt-prompt-splitter: ChatGPT PROMPTs Splitter. Tool for safely process chunks of up to 15,000 characters per request)
Or am I on completely the wrong track and does it need to be done another way?
I probably can’t split the source text into several “chunks” because the content can be in a mixed order (due to parsing from the PDF).

Thanks, J.


Welcome to the community!

Even if you split the source text, you won’t get a single JSON object back from the two requests. Maybe you could merge the two JSON results on the backend, though?
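To make that concrete, here is a minimal sketch of the merge idea, assuming the Python `openai` client (v1+); the two-chunk split, the prompt wording, and the `merge_json` strategy are purely illustrative, and a real version would need to guard against the model returning non-JSON text:

```python
import json
from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_json(chunk: str) -> dict:
    """Ask the model to convert one chunk of source text into a JSON object."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Return only a JSON object with the fields you can find in the text."},
            {"role": "user", "content": chunk},
        ],
    )
    return json.loads(response.choices[0].message.content)


def merge_json(parts: list) -> dict:
    """Naive backend merge: later chunks only fill in fields the earlier ones missed."""
    merged = {}
    for part in parts:
        for key, value in part.items():
            merged.setdefault(key, value)
    return merged


# result = merge_json([extract_json(first_half), extract_json(second_half)])
```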

Thanks!
So the “public” ChatGPT works in a way that isn’t possible via the API, right? I was thinking of keeping the source text as-is, splitting the formatting prompt into several pieces, and then merging the results. But if the source text alone consumes e.g. 3,900 tokens, I can’t execute any meaningful prompt.

I don’t think what you’re describing can be done with ChatGPT (public) either? It’s about the context window and not having enough tokens. I think what ChatGPT does is summarize earlier messages so it can keep going, but if you feed it too much at once, it’ll wipe its “memory” of earlier messages.

This process requires you to chunkify the source document into a collection of objects. Performing that for a PDF while sustaining the flow and order of the text should be possible. I do this all the time - from Google Drive to HTML to text blob, etc. Google Apps Script makes this pretty simple work.
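For what it’s worth, a chunking step like that might look roughly like this in Python; splitting on blank lines and the 1,500-character budget are my assumptions, not a rule:

```python
def chunkify(text: str, max_chars: int = 1500) -> list:
    """Split extracted text into ordered chunk objects, keeping paragraphs intact."""
    chunks, current, index = [], [], 0
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(current + [paragraph])
        if current and len(candidate) > max_chars:
            chunks.append({"index": index, "text": "\n\n".join(current)})
            current, index = [paragraph], index + 1
        else:
            current.append(paragraph)
    if current:
        chunks.append({"index": index, "text": "\n\n".join(current)})
    return chunks  # each chunk keeps its position in the original document
```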

If the objective is to transform PDF document texts into JSON, why even involve GPT? Explain the reason you feel compelled to do this with AI.


I need to “deeply understand” the source text. If it is e.g. a resume, then I need to properly extract the work history, skills, etc.

What you’re describing begins with classifying the document, e.g., is it a resume? But doing that, and other things such as establishing a deep “understanding” of the document, requires transforming the generally unstructured document into something with not only structure but also data attributes that help describe its meaning.

Your first inclination to transform it into JSON was correct. I’m inclined to take that to the next level by parsing each of the paragraphs to make it more manageable for AI operations, the first of which is to extract keywords from each paragraph as well as other entities and skills. Once we have this data, we are in a better position to classify the document, e.g., as a resume.
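Sketched roughly, and assuming the Python `openai` client, that per-paragraph pass might look like this; the prompt wording and the field names are illustrative only:

```python
import json
from openai import OpenAI

client = OpenAI()


def tag_paragraph(paragraph: str) -> dict:
    """Extract keywords, entities, and skills from a single paragraph."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Return a JSON object with three string arrays: keywords, entities, skills."},
            {"role": "user", "content": paragraph},
        ],
    )
    tags = json.loads(response.choices[0].message.content)
    return {"text": paragraph, **tags}


# enriched = [tag_paragraph(chunk["text"]) for chunk in chunks]
```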

But my approach would also use keywords and entities as the basis for building embedding vectors that would allow us to perform semantic similarity discoveries. I would design the system such that all of this data would be used to enhance the original JSON document representation.

Embedding vectors are very inexpensive, and they come in handy when someone needs to query all the documents that mention welding, as an example. This sounds like a keyword-match process, but it’s not. It needs to perform well even when a document is about welding but never uses the word “welding” itself. This is why embedding vectors are so important. They allow us to find documents whose vectors are closely associated with “welding”, such as “iron work”. The cosine similarity test gives us a probability-like score that makes matching easy and precise.
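As a rough sketch (the embedding model name and the example phrases are placeholders), that embedding-plus-cosine step could look like this:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts):
    """Create one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(item.embedding) for item in response.data]


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# "welding" and "iron work" should score far closer to each other
# than either does to an unrelated phrase.
welding, iron_work, unrelated = embed(["welding", "iron work", "tax accounting"])
print(cosine_similarity(welding, iron_work), cosine_similarity(welding, unrelated))
```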

A Unified Database

This approach makes it possible to move the JSON documents into a database, or back into a file system. You could even store the documents as files, but update a database or spreadsheet with the meta-data so that you could perform other workflows that utilize AI.
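A minimal sketch of that idea, assuming SQLite for the metadata index; the table layout and column names are just one possibility:

```python
import json
import sqlite3

# The JSON documents can stay on disk (or in a file store) while a small
# metadata table carries the fields other AI workflows need to query.
conn = sqlite3.connect("documents.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id TEXT PRIMARY KEY, doc_type TEXT, path TEXT, keywords TEXT)"
)


def index_document(doc_id: str, doc: dict, path: str) -> None:
    """Record where a JSON document lives plus the metadata extracted from it."""
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
        (doc_id, doc.get("type", "unknown"), path, json.dumps(doc.get("keywords", []))),
    )
    conn.commit()
```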

A Conversational UX

This architecture also makes it possible to create a chat-like interface. As users ask questions about the documents, an embedding is made for each question and then compared to the document embeddings to narrow the list down to only the closest vectors. This makes it possible to use natural language to find highly suitable documents.
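A rough sketch of that narrowing step, reusing cosine similarity over precomputed document vectors; the function and parameter names are mine:

```python
import numpy as np


def top_matches(question_vector, document_vectors, k=5):
    """Return the indices of the k documents whose vectors sit closest to the question."""
    scores = [
        float(np.dot(question_vector, doc_vector)
              / (np.linalg.norm(question_vector) * np.linalg.norm(doc_vector)))
        for doc_vector in document_vectors
    ]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
```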

A Reporting UX

If you take the structured JSON document representations into a database, you also have the ability to create reporting narratives by simply asking natural language queries. This requires aggregation techniques, but the outcome is a reporting layer that allows you to extract lists and statistics about the documents.
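For example, a question like “How many resumes mention Scrum?” could bottom out in a simple aggregate over the metadata table from the earlier sketch; the LIKE match is a deliberate simplification of whatever query the natural-language layer would generate:

```python
import sqlite3

conn = sqlite3.connect("documents.db")
count = conn.execute(
    "SELECT COUNT(*) FROM documents WHERE keywords LIKE ?", ("%Scrum%",)
).fetchone()[0]
print(f"{count} documents mention Scrum")
```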


It seems to me that the biggest challenge will be the “classification” of each part. E.g., the following text is a fragment of a real resume, exported from a PDF. Although it’s pretty long, it’s just a summary of the CV’s author; the list of his work history starts after this section.

Professional profile

A result-oriented, meticulous, innovative software engineer, also a dedicated team player with a strong
background in designing, planning, and developing software applications. Looking forward to joining a global tech
team to build meaningful products for clients to contribute my experience and satisfy my interests.

Currently I live and work in Frankfurt, Germany. Open to fully remote positions in CET or US time zone.
Key experience

- Have immersed in the software business since 2004, fully participated in many roles: from a developer / techlead in a big outsourcing company to a man doing everything in a startup.

- Always enjoy programming (and I am very good at it). I code, watch tech tutorials and experiment with different open source frameworks in my free time as a hobby.

- Good at refactoring code and analyzing tech problems at a high level perspective. Can apply design patterns and build base framework to make code more abstract and more reusable.

Working style and other domains

- Proactive in communication at work. Speak & write proficiently in English with native colleagues & clients. Hands on Scrum / Agile at projects.

- Good at problem solving. Can work under high pressure and lead a team to overcome technical issues.

- Good sense in business areas: Edtech, Ecommerce, ERP, Payments / Banking.

- Culture-fit in multicultural companies. Germany (onsite for 1 year), US (remote, 5 years), UK (1 year, onsite for 2 months), Asia (Malaysia - onsite for 3 months, Thailand…)

- Self-balance between the mindset of Team Leader - Team Member, Asian - Western working styles.

- Has 5 years proven in establishment, augmentation and management of offshore development team for international software outsourcing company based in the US.

I don’t spend a lot of time looking at resumes, but I have a hunch that when it comes to skills, they could appear anywhere in the document. A key skill may not actually appear inside a section classified as containing skills per se.

Alternatively, I would be inclined to perform a similarity search for the top ten vectors that rank highest as skills and then use them as the basis for developing a narrative skills summary of the person.
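A rough sketch of that approach, assuming the Python `openai` client; the “skills” query phrase, the model names, and the summary prompt are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def skills_summary(paragraphs):
    """Rank paragraphs against a 'skills' query and summarize the top ten."""
    data = client.embeddings.create(
        model="text-embedding-3-small",
        input=["professional skills and competencies"] + paragraphs,
    ).data
    query = np.array(data[0].embedding)
    scored = sorted(
        zip(paragraphs, (np.array(item.embedding) for item in data[1:])),
        key=lambda pair: float(np.dot(query, pair[1])
                               / (np.linalg.norm(query) * np.linalg.norm(pair[1]))),
        reverse=True,
    )
    top_text = "\n".join(text for text, _ in scored[:10])
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write a short skills summary of this person."},
            {"role": "user", "content": top_text},
        ],
    )
    return response.choices[0].message.content
```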

UPDATE: @josef.zamrzla I saw this on Twitter, and it reminded me of your case; no idea if this is a real thing, but likely worth investigating.


I have the same problem: when you chunk the data into different prompts, it doesn’t remember everything. Any solution?