Conversion of entire PDF into JSON Format

I am using GPT-4o, which has around a 16k output-token limit. I am extracting the entire contents of a PDF and converting it into JSON format. The PDF is structured.

My current approach is:
1. extract the text from the PDF using PyMuPDF
2. create chunks of the data (as the PDF can be 8 to 70 pages)
3. convert the extracted data into JSON

The problem is that sending the request N times increases the total time, and getting the final output takes far too long. For example, a 16-page PDF takes around 6 minutes.

I’m guessing that you are sending the requests sequentially. Use parallelization.

OK, thank you. By the way, I am using asynchronous requests now.
Can you tell me whether sending these requests asynchronously affects the output of a request, or mixes up the results of the requests?

No. The result will be the same and you won’t be able to use it until it’s ready.

Also, make sure that you are using async correctly to ensure that the API is being called in parallel, and not sequentially.
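A minimal sketch of what "async correctly" means here, with a stand-in coroutine instead of a real API client (the function names `convert_chunk` and `convert_all` are hypothetical; replace the `asyncio.sleep` with your actual GPT-4o call, e.g. via an async OpenAI client). The key point is that `asyncio.gather` runs the requests concurrently but still returns the results in the same order as the input list, so chunk order is never interchanged:

```python
import asyncio

# Hypothetical stand-in for one chat-completion request; the sleep
# simulates network latency where the real API call would go.
async def convert_chunk(index: int, chunk: str) -> dict:
    await asyncio.sleep(0.05)
    return {"chunk": index, "json": f"parsed:{chunk}"}

async def convert_all(chunks: list[str]) -> list[dict]:
    # Launch all requests concurrently; gather() preserves the order
    # of the input tasks in its result list regardless of which
    # request finishes first.
    tasks = [convert_chunk(i, c) for i, c in enumerate(chunks)]
    return await asyncio.gather(*tasks)

chunks = [f"page-{n}" for n in range(4)]
results = asyncio.run(convert_all(chunks))
```

The common mistake is `await`-ing each call inside a loop (`for c in chunks: await convert_chunk(...)`), which is still sequential even though it uses `async`/`await`; building the task list first and passing it to `gather` is what makes the calls overlap.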