Hello! I am trying to build a solution that can read the annual reports and product catalogs of a publicly listed company, and/or crawl the web for additional insight, and produce output that first summarizes what it understood and then answers questions specific to the company or its solutions.
I would appreciate any and all input on how to move forward, as I am stuck on how to read a PDF or web URL into JSONL. I have only just begun exploring ML and Python.
Thanks for taking the time to read my problem, and thanks again for any input.
Hi there, I’m not aware of a one-step process to convert a URL to JSONL, but you could try URL → CSV and CSV → JSONL, for example.
Here’s some simple Python code to turn a CSV into JSONL:
import pandas as pd

def csv_to_jsonl(input_path, output_path):
    df = pd.read_csv(input_path)  # assumes the first line has column names
    df.to_json(output_path, orient="records", lines=True)
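To see the round trip end to end, here is a self-contained sketch that creates a tiny CSV and converts it; the file names and sample data are made up for illustration:

```python
import pandas as pd

def csv_to_jsonl(input_path, output_path):
    # Read the CSV (assumes the first line holds the column names)
    df = pd.read_csv(input_path)
    # Write one JSON object per line (JSONL)
    df.to_json(output_path, orient="records", lines=True)

# Hypothetical example: a two-row CSV of company names and tickers
with open("companies.csv", "w") as f:
    f.write("name,ticker\nAcme Corp,ACME\nGlobex,GBX\n")

csv_to_jsonl("companies.csv", "companies.jsonl")
```

Each line of `companies.jsonl` is now an independent JSON record, which is the shape most fine-tuning and file-upload endpoints expect.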
There are a number of ways to turn a URL (HTML) into CSV, though the easiest might be to use an existing tool. You can also check out this Stack Overflow thread.
For example, you could use a scraper (such as BeautifulSoup) to scrape websites given their URLs. Then you will need to build a script to convert the scraper's output to JSONL (the json library can help, but much of it will have to be hard-coded). Finally, you can provide the data to the Answers API to answer questions, and to the Completions endpoint to summarise, by providing the scraped results + TLDR; in the prompt.
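As a sketch of the hard-coded conversion step, assuming you already have the scraped text for each page (for example from BeautifulSoup's `get_text()`), the standard json module can write one record per line; the URLs and text below are placeholders:

```python
import json

# Hypothetical scraper output: one dict per scraped page
scraped_pages = [
    {"url": "https://example.com/about", "text": "Acme builds widgets."},
    {"url": "https://example.com/products", "text": "Our catalog lists 3 widgets."},
]

with open("scraped.jsonl", "w") as f:
    for page in scraped_pages:
        # json.dumps keeps each record on a single line, as JSONL requires
        f.write(json.dumps(page) + "\n")
```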
If the whole website does not fit in the prompt, then there are workarounds, but it becomes more complicated.
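One common workaround (a sketch of one approach, not the only one) is to split the scraped text into chunks that each fit in the prompt, summarise each chunk separately, and then summarise the summaries; the character budget below is an arbitrary stand-in for a real token limit:

```python
def chunk_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    words = text.split()
    chunks, current, length = [], [], 0
    for word in words:
        # +1 accounts for the joining space between words
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk would then be summarised on its own, and the partial
# summaries combined into a final prompt ("map-reduce" style)
parts = chunk_text("word " * 1000, max_chars=500)
```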
Thanks for the assist; now that you have shared this, it seems so much easier.
Thanks Nicholas for the workaround, much appreciated.
Hi @mohaktnbt ,
I created a Python script for quickly converting PDFs to JSONL.
If this is useful, feel free to try it out directly in Google Colab:
Example of the tool converting a PDF to JSONL, broken up by sentence:
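For reference, here is a rough sketch of that kind of conversion, assuming the PDF text has already been extracted (for example with a library like pypdf); the regex sentence split here is a naive stand-in, not the tool's actual logic:

```python
import json
import re

def text_to_jsonl(text, output_path):
    # Naive sentence split: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    with open(output_path, "w") as f:
        for sentence in sentences:
            if sentence:
                f.write(json.dumps({"text": sentence}) + "\n")

# Placeholder text standing in for extracted PDF content
extracted = "Acme reported record revenue. Margins improved. What's next?"
text_to_jsonl(extracted, "report.jsonl")
```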
Thanks, helpful for sure.