Table extraction using LangChain and GPT-3.5 or GPT-4o

polerajeev7 · August 30, 2024, 8:07am

I'm running into a problem: I only need to extract the tables from a PDF, and since I'm new to this, I'm having a hard time understanding what is going wrong. Is there something wrong with my prompt or my code? What are the best practices or solutions for this? Please help. #developersassemble

Here is my code:

!pip install langchain
!pip install langchain-community

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.document_loaders import PyPDFLoader
import json

# Initialize the ChatOpenAI model with your OpenAI API key
chat = ChatOpenAI(openai_api_key="api-key")

# Load the PDF document
loader = PyPDFLoader(r"C:\Users\Rajeev\Desktop\table_extraction\MsWord Sample Service Order Format.pdf")
docs = loader.load()

# Define a prompt template for extracting tables
# (double braces are literal braces; {text} is the only template variable)
table_extraction_prompt = """
You are an AI trained to extract tables from text. Extract and format all tables found in the provided text into a JSON object.
The JSON format should be as follows:
[
    {{
        "page": <page_number>,
        "tables": [
            {{
                "table_id": <unique_table_id>,
                "headers": [<list_of_column_headers>],
                "rows": [
                    [<row1_values>],
                    [<row2_values>],
                    ...
                ]
            }}
        ]
    }}
]

Here's the text:
{text}
"""

# Create a prompt template with the defined prompt
prompt_template = PromptTemplate(
    input_variables=["text"],
    template=table_extraction_prompt
)

# Create an LLMChain instance
llm_chain = LLMChain(
    llm=chat,
    prompt=prompt_template
)

# Process each document and extract tables
all_tables = []

for idx, doc in enumerate(docs):
    page_number = idx + 1
    text_content = doc.page_content

    # Ensure the correct input format for the LLMChain
    inputs = {"text": text_content}

    # Extract tables using the LLMChain
    result = llm_chain.run(inputs)

    # Parse the result into JSON
    try:
        tables_json = json.loads(result)
        # Overwrite the model's page number with the real one from the loader
        for table in tables_json:
            table["page"] = page_number
        all_tables.extend(tables_json)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON from result for page {page_number}")

# Convert the extracted tables to JSON format
json_output = json.dumps(all_tables, indent=4)

# Print the JSON output
print(json_output)
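
One thing that may be worth checking in the parsing step above: chat models often wrap JSON replies in a Markdown code fence, and a plain json.loads call then raises JSONDecodeError. Below is a minimal sketch of a more tolerant parser, assuming the reply is either bare JSON or JSON inside a single fence. parse_model_json is a hypothetical helper name, not part of LangChain.

```python
import json
import re

# Three backticks, built programmatically so the snippet is easy to paste.
FENCE = "`" * 3

# Matches an optional language tag after the opening fence (e.g. "json"),
# captures everything up to the closing fence.
_FENCED = re.compile(rf"^{FENCE}[a-zA-Z]*\s*\n(.*?)\n?{FENCE}$", re.DOTALL)


def parse_model_json(raw: str):
    """Parse JSON from a model reply, tolerating one surrounding code fence.

    Hypothetical helper: raises json.JSONDecodeError if the remaining
    text is still not valid JSON.
    """
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)
```

In the loop above, this could stand in for the direct json.loads(result) call; the existing except json.JSONDecodeError branch would still catch replies that are not JSON at all.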

Here's the error:

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

I know we need a paid plan, but this does not work for GPT-3.5 either. Please let me know how I should do this and how to improve it. I only need to extract the tables, in tabular format.
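
For reference, the insufficient_quota code in the payload above points at the account's credit or billing state rather than at the prompt or the chain, so it applies to every model on that key and retrying will not clear it. A small sketch of how the two kinds of 429 could be told apart from the error body alone follows. should_retry is a hypothetical helper, and rate_limit_exceeded is used only as an example of a transient code.

```python
def should_retry(error_body: dict) -> bool:
    """Return True if a 429 looks transient, False if it is a quota problem.

    Hypothetical helper: it only inspects the parsed error body, in the
    shape shown in the error message above.
    """
    error = error_body.get("error") or {}
    code = error.get("code") or error.get("type")
    # A quota/billing failure will not go away by waiting and retrying.
    return code != "insufficient_quota"


# The payload from the error above (message shortened).
quota_error = {
    "error": {
        "message": "You exceeded your current quota, please check your plan and billing details.",
        "type": "insufficient_quota",
        "param": None,
        "code": "insufficient_quota",
    }
}

if not should_retry(quota_error):
    print("Quota exhausted: check plan and billing before calling the API again.")
```

If the check returns True instead, waiting and retrying with a growing delay is the usual approach.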
