I'm running into a problem: I only need to extract the tables from a PDF, but I'm new to this and am having a hard time understanding it. Is something wrong with my prompt or my code? What are the best practices or solutions for this? Please help. #developersassemble
Here is my code:
!pip install langchain
!pip install langchain-community
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.document_loaders import PyPDFLoader
import json
# Initialize the ChatOpenAI model with your OpenAI API key
chat = ChatOpenAI(openai_api_key="api-key")
# Load the PDF document
loader = PyPDFLoader(r"C:\Users\Rajeev\Desktop\table_extraction\MsWord Sample Service Order Format.pdf")
docs = loader.load()
# Define a prompt template for extracting tables
table_extraction_prompt = """
You are an AI trained to extract tables from text. Extract and format all tables found in the provided text into a JSON object.
The JSON format should be as follows:
[
    {{
        "page": <page_number>,
        "tables": [
            {{
                "table_id": <unique_table_id>,
                "headers": [<list_of_column_headers>],
                "rows": [
                    [<row1_values>],
                    [<row2_values>],
                    ...
                ]
            }}
        ]
    }}
]
Here's the text:
{text}
"""
# Create a prompt template with the defined prompt
prompt_template = PromptTemplate(
    input_variables=["text"],
    template=table_extraction_prompt
)
# Create an LLMChain instance
llm_chain = LLMChain(
    llm=chat,
    prompt=prompt_template
)
# Process each document and extract tables
all_tables = []
for idx, doc in enumerate(docs):
    page_number = idx + 1
    text_content = doc.page_content
    # Ensure the correct input format for the LLMChain
    inputs = {"text": text_content}
    # Extract tables using the LLMChain
    result = llm_chain.run(inputs)
    # Parse the result into JSON
    try:
        tables_json = json.loads(result)
        # Each top-level entry describes one page; overwrite its page number
        for table in tables_json:
            table["page"] = page_number
        all_tables.extend(tables_json)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON from result for page {page_number}")
# Convert the extracted tables to JSON format
json_output = json.dumps(all_tables, indent=4)
# Print the JSON output
print(json_output)
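One thing that can trip up the `json.loads(result)` step above: chat models sometimes wrap their JSON in a Markdown code fence or add a sentence around it, and then decoding fails even when the tables are there. A minimal sketch of a more forgiving parser, using only the standard library. `parse_json_reply` is a hypothetical helper name, not part of LangChain, and the fence-wrapping behaviour is an assumption about what the model may return:

```python
import json
import re

# Three backticks, built at runtime so this snippet stays easy to paste.
FENCE = "`" * 3


def parse_json_reply(reply: str):
    """Return the JSON value found in a model reply, or None if parsing fails."""
    text = reply.strip()
    # If the reply contains a fenced block (optionally tagged "json"),
    # keep only what is inside the first fence.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

In the loop, `tables_json = parse_json_reply(result)` followed by a check for `None` could stand in for the `try`/`except` around `json.loads`.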
Here is the error:
RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
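Going by the payload above, the `code` field is `insufficient_quota`, and the message itself points at plan and billing rather than at how fast requests are sent. A small sketch, assuming the error body is available as a dict shaped like the one above, of how that case could be separated from a 429 that might be worth retrying. `should_retry` is a hypothetical helper name, and treating every other code as retryable is a simplifying assumption:

```python
def should_retry(error_body: dict) -> bool:
    """Return True when waiting and retrying could plausibly help."""
    error = error_body.get("error") or {}
    # 'insufficient_quota' points at plan and billing; sending the same
    # request again will not change that, so do not retry.
    return error.get("code") != "insufficient_quota"


# The payload from the error message above, shortened for readability.
quota_error = {
    "error": {
        "message": "You exceeded your current quota, please check your plan and billing details.",
        "type": "insufficient_quota",
        "param": None,
        "code": "insufficient_quota",
    }
}

if not should_retry(quota_error):
    print("Quota problem: check plan and billing before running the loop again.")
```

With a check like this, the per-page loop could stop on the first quota error instead of sending a failing request for every page.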
I know a paid plan is needed, but this does not work with GPT-3.5 either. Please let me know how I should do this and how to improve it. I only need to extract the tables, in tabular format.
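Since the goal is tables in tabular form, the JSON structure requested in the prompt could be flattened to CSV once it has been parsed. A rough sketch, assuming the parsed output really follows the page/tables/headers/rows layout from the prompt. `tables_to_csv` and `sample_pages` are made-up names and data for illustration only:

```python
import csv
import io


def tables_to_csv(pages: list) -> dict:
    """Map each table_id to CSV text built from that table's headers and rows."""
    csv_by_table = {}
    for page in pages:
        for table in page.get("tables", []):
            buffer = io.StringIO()
            writer = csv.writer(buffer, lineterminator="\n")
            writer.writerow(table.get("headers", []))  # header line first
            writer.writerows(table.get("rows", []))    # then one line per row
            csv_by_table[table["table_id"]] = buffer.getvalue()
    return csv_by_table


# Illustrative data in the shape the prompt asks for (not from a real PDF).
sample_pages = [
    {
        "page": 1,
        "tables": [
            {
                "table_id": "t1",
                "headers": ["Item", "Qty"],
                "rows": [["Bolt", "4"], ["Nut, hex", "10"]],
            }
        ],
    }
]

for table_id, csv_text in tables_to_csv(sample_pages).items():
    print(table_id)
    print(csv_text)
```

Going through `csv.writer` rather than joining with commas means values that themselves contain commas get quoted, so the columns stay aligned when the file is opened in a spreadsheet.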