What is the best strategy to split HTML into evenly sized chunks

ashish.fermilab · September 20, 2023, 7:11pm

I am exploring ways to split an HTML file into evenly sized chunks that can be provided as input to GPT-4. My use case requires translating the text in the HTML file while maintaining the original structure in the final translated document. To stay within token limits, I intend to split the original file into chunks that can be merged back after translation to reconstruct the final HTML. Is there a strategy that has worked before?

wclayf · September 20, 2023, 7:17pm

You’ll need to google for “HTML Parsers” and find a library that can parse HTML and give you back a DOM tree that you can iterate over, like walking a tree structure.

If you’re in JavaScript, the DOM in JS itself can easily do this (the browser already has it), and it will be trivial. If you’re in Java or some other language, you’ll need to find an actual HTML parser.

Any parser, btw, will likely have a method to “generate” the HTML back from the syntax tree after you modify the tree structure.
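As a rough illustration of the parse-then-regenerate idea, here is a minimal Python sketch. It does not build a full DOM tree; instead it uses the standard library's `html.parser` to split a document into markup pieces and text pieces and then joins them back together. The names `Segmenter`, `split_html`, and `join_html` are made up for this example, and the round trip is only expected to be exact for simple, well-formed input (end tags are regenerated and entities are re-escaped).

```python
from html import escape
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # their contents are code, not prose


class Segmenter(HTMLParser):
    """Collect ("markup", str) and ("text", str) segments in document order."""

    def __init__(self):
        # convert_charrefs=True hands us unescaped text such as "Tom & Jerry".
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._in_raw = False

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TEXT_TAGS:
            self._in_raw = True
        # Keep the start tag exactly as it appeared in the source.
        self.segments.append(("markup", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.segments.append(("markup", self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in RAW_TEXT_TAGS:
            self._in_raw = False
        self.segments.append(("markup", f"</{tag}>"))

    def handle_data(self, data):
        # Whitespace-only runs and script/style bodies are kept verbatim.
        translatable = bool(data.strip()) and not self._in_raw
        self.segments.append(("text" if translatable else "markup", data))

    def handle_comment(self, data):
        self.segments.append(("markup", f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.segments.append(("markup", f"<!{decl}>"))


def split_html(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.segments


def join_html(segments):
    # Text was unescaped while parsing, so escape it again on the way out.
    return "".join(
        escape(value, quote=False) if kind == "text" else value
        for kind, value in segments
    )
```

Only the `"text"` segments would need to go to the model; replacing their values and calling `join_html` gives back a document with the same tags in the same order.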

ashish.fermilab · September 20, 2023, 7:23pm

Thanks for your suggestion. I used the BeautifulSoup library in Python to parse the HTML and traverse the tags. I can translate the text enclosed between tags and get the final translated document, but it requires too many API calls. That’s why I was looking for a systematic approach to segment the HTML file into larger sections that I can each send in one API request.

PaulBellow · September 20, 2023, 7:25pm

Are you just sending the contents of each tag or something? You’ll want to find a way to append several of them until you get to optimal token length. Not sure if I’ve heard of anything that does it off the shelf, so if you do it, consider open sourcing it on GitHub as I’m sure others will be interested.
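One way to sketch the "append several of them until you reach the token length" idea in Python is a simple greedy packer. Everything here is an assumption for illustration: the function names are invented, and `estimate_tokens` uses a rough guess of about 4 characters per token, so a real tokenizer (for example the tiktoken library) could be passed in as `count_tokens` for better accuracy.

```python
def estimate_tokens(text):
    # Assumption: roughly 4 characters per token (a crude heuristic,
    # not the model's real tokenizer). Rounds up.
    return -(-len(text) // 4)


def batch_by_budget(items, max_tokens, count_tokens=estimate_tokens):
    """Greedily pack consecutive text items into batches under max_tokens.

    Order is preserved. An item that is larger than the budget on its own
    still gets a batch to itself instead of being dropped.
    """
    batches, current, used = [], [], 0
    for item in items:
        cost = count_tokens(item)
        # Start a new batch when adding this item would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches
```

The budget should leave room for the prompt instructions and for the translated output, since both count toward the model's limit.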

Welcome to the OpenAI Developer Community, by the way!

wclayf · September 20, 2023, 8:31pm

Got it. I would first be sure I have a way to assign an identifier to each node in the Parse Tree, and then build JSON that contains an array of things to translate to feed to GPT. Each element in the array will be just an object with “id” and “text” properties.

So you’ll be telling GPT “Please translate the text properties of this JSON and generate for me the translated json, including all IDs”. I bet it’s smart enough to do that.

Then just send batches of N number of array items. Each time you get a response back you can use the IDs to stick the content into the Parse Tree.
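A minimal sketch of that id/text round trip in Python might look like the following. The helper names are invented for this example, and `fake_translate` is only a stand-in for the actual model call (it uppercases the text so the flow can be run offline); the real request and prompt are up to you.

```python
import json


def build_payload(texts_by_id):
    """Serialize {id: text} as a JSON array of {"id", "text"} objects."""
    items = [{"id": node_id, "text": text} for node_id, text in texts_by_id.items()]
    return json.dumps(items, ensure_ascii=False)


def merge_response(texts_by_id, response_json):
    """Apply translated items by id; return (merged, ids_not_returned)."""
    merged = dict(texts_by_id)
    seen = set()
    for item in json.loads(response_json):
        node_id = item.get("id")
        # Ignore ids we never sent, in case the model invents one.
        if node_id in merged and "text" in item:
            merged[node_id] = item["text"]
            seen.add(node_id)
    missing = [node_id for node_id in texts_by_id if node_id not in seen]
    return merged, missing


def fake_translate(payload):
    # Hypothetical stand-in for the API call: echoes the JSON back
    # with each "text" uppercased.
    items = json.loads(payload)
    return json.dumps([{"id": i["id"], "text": i["text"].upper()} for i in items])
```

Checking the `missing` list after each batch makes it possible to resend only the items the model skipped, rather than the whole batch.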

EDIT: Don’t forget you might be able to even describe this program in enough detail to get GPT to WRITE THE WHOLE THING for ya!!! I bet it can do this.

EDIT 2: Tell it what parser you’re using in Python. Tell it you want it to generate all the necessary OpenAI calls to individually translate each piece of text in the Parse Tree.

bruno.vaz · November 21, 2023, 9:52am

Hi!

Do you know of any way to properly convert HTML to Markdown in Python? I’ve tried markdownify and html2text, but when the HTML contains a complex table the conversion is quite bad.

Thanks!

elmstedt · November 21, 2023, 3:49pm

Pandoc.

hifiveszu · December 8, 2023, 9:01am

Hi,

Try out our ChatofAI free parsing API to convert files like PDF, Docx, HTML, Excel, CSV, and more to Markdown format, especially for handling tables.

bruno.vaz · December 8, 2023, 9:55am

Thanks!! I’ll make sure to give it a look
