I am exploring ways an HTML file could be split into evenly sized chunks that can be provided as input to GPT-4. My use case requires translating the text in the HTML file while maintaining the original structure in the final translated document. To avoid token limits, I intend to split the original file into chunks that can be merged back after translation to reconstruct the final HTML. Is there any strategy that has worked before?
You’ll need to google for “HTML Parsers” and find a library that can parse HTML and give you back a DOM tree that you can iterate over like walking any tree structure.
If you’re in JavaScript, the DOM built into the browser can already do this, and it will be trivial. If you’re in Java or some other language you’ll need to find an actual HTML parser.
Any parser, btw, will likely have a method to “generate” the HTML back from the syntax tree after you modify the tree structure.
Thanks for your suggestion. I used BeautifulSoup Library in Python to parse HTML and traverse through the tags. I can translate the text enclosed between tags and get the final translated document. But, it requires too many API calls. That’s why I was looking for a systematic approach to better segment the HTML file into sections which I can feed in one API request.
Are you just sending the contents of each tag or something? You’ll want to find a way to append several of them until you get to an optimal token length. I haven’t heard of anything that does this off the shelf, so if you build it, consider open sourcing it on GitHub, as I’m sure others will be interested.
Welcome to the OpenAI Developer Community, by the way!
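For what it’s worth, the “append several of them until you get to optimal token length” idea could be sketched roughly like this. It’s only a minimal sketch: `estimate_tokens` and `pack_by_budget` are made-up names, and the “about 4 characters per token” estimate is a crude assumption, so you’d probably want to pass in a real tokenizer-based counter instead.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Crude guess at token count: roughly 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def pack_by_budget(
    segments: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[List[str]]:
    """Greedily group consecutive segments so each group stays within max_tokens.

    Order is preserved, so the groups can be translated one request at a time
    and the results put back in the same sequence. A single segment that is
    larger than the budget ends up alone in its own group.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for segment in segments:
        cost = count_tokens(segment)
        # Start a new batch when adding this segment would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(segment)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each inner list would then go out as one API request instead of one request per tag.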
Got it. I would first be sure I have a way to assign an identifier to each node in the Parse Tree, and then build JSON that contains an array of things to translate to feed to GPT. Each element in the array will be just an object with “id” and “text” properties.
So you’ll be telling GPT “Please translate the text properties of this JSON and generate the translated JSON for me, including all IDs”. I bet it’s smart enough to do that.
Then just send batches of N array items. Each time you get a response back you can use the IDs to stick the translated content back into the Parse Tree.
EDIT: Don’t forget you might be able to even describe this program in enough detail to get GPT to WRITE THE WHOLE THING for ya!!! I bet it can do this.
EDIT 2: Tell it what parser you’re using in Python. Tell it you want it to generate all the necessary OpenAI calls to individually translate each piece of text in the Parse Tree.
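Here’s a rough sketch of the id/text JSON idea, in case it helps. Everything in it is hypothetical naming (`build_batches`, `merge_translations`, `translate_all`), and `fake_translate` is just a stand-in for the real model call so the example runs on its own. The `texts` list stands for the text nodes you pulled out of the parse tree; with BeautifulSoup that would presumably be something like the strings from `soup.find_all(string=True)`, written back afterwards with `replace_with`.

```python
import json
from typing import Callable, List


def build_batches(texts: List[str], batch_size: int) -> List[str]:
    """Build JSON payloads shaped like [{"id": ..., "text": ...}, ...].

    The id is the index of the text node, so a response can be mapped back.
    Whitespace-only nodes are skipped because there is nothing to translate.
    """
    items = [{"id": i, "text": t} for i, t in enumerate(texts) if t.strip()]
    return [
        json.dumps(items[start:start + batch_size], ensure_ascii=False)
        for start in range(0, len(items), batch_size)
    ]


def merge_translations(texts: List[str], responses: List[str]) -> List[str]:
    """Return a copy of texts with translated items placed back by id."""
    merged = list(texts)
    for response in responses:
        for item in json.loads(response):
            merged[int(item["id"])] = item["text"]
    return merged


def translate_all(
    texts: List[str],
    translate_batch: Callable[[str], str],
    batch_size: int = 20,
) -> List[str]:
    """Send each batch through translate_batch and merge the results."""
    responses = [translate_batch(p) for p in build_batches(texts, batch_size)]
    return merge_translations(texts, responses)


def fake_translate(payload: str) -> str:
    """Stand-in for the model call: upper-cases text and keeps the ids."""
    items = json.loads(payload)
    return json.dumps([{"id": it["id"], "text": it["text"].upper()} for it in items])
```

In a real run you’d swap `fake_translate` for a function that sends the payload along with the translation instruction and returns the model’s JSON reply. It’s probably worth validating that every id you sent comes back before merging.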
It would be very helpful if you could provide an example HTML document for reference.
Generally, one thing I strongly recommend is to convert the HTML to markdown first, as it typically requires far fewer tokens than HTML and the models are more “fluent” in it.
So, unless there are a lot of esoteric tags and structure going on, you’ll likely get better results.
Alternately, it’s entirely possible you could simply send the entire contents of the HTML and ask it to translate it and it would do fine.
Again seeing an example would go a long way.
Hi!
Do you know of any way of properly converting HTML to Markdown in Python? I’ve tried markdownify and html2text, but when the HTML has some complex table the conversion is quite bad.
Thanks!
Pandoc.
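If you want to drive Pandoc from Python, a minimal sketch could look like the following. It assumes the `pandoc` executable is installed and on your PATH; the helper names and the choice of `gfm` as the output format are my own assumptions, and whether complex tables come out well is something you’d have to check on your own files.

```python
import subprocess
from typing import List


def pandoc_command(src: str, dst: str, to_format: str = "gfm") -> List[str]:
    """Build the pandoc command line for an HTML to Markdown conversion."""
    return ["pandoc", "--from", "html", "--to", to_format, src, "--output", dst]


def html_to_markdown(src: str, dst: str, to_format: str = "gfm") -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(src, dst, to_format), check=True)
```

Calling `html_to_markdown("page.html", "page.md")` would write the converted Markdown next to the source file (both file names are placeholders).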