Seeking Help with Formatting a Large LaTeX File

I’m a newbie in programming, especially with APIs. This question has probably been asked before, but I have an issue with a 40,000-character LaTeX file. I would like ChatGPT to take the entire file as input so it can improve the formatting and make the text more coherent. However, it’s important that it works on the entire text and not on separate parts. How can I do that?


This is a problem I’m currently working on myself as well, and so far this is the best approach I’ve seen (at least for academic PDFs).

I’ve already drafted the entire document in LaTeX. My goal is to process the whole chapter without segmenting it. Is feeding it everything at once the optimal way to ensure ChatGPT can fully grasp the content?

The largest context window currently available via the API is the 16k tokens of gpt-3.5-turbo-16k.

What that means for you is a maximum of around 5,000 words of input and 5,000 words of output (16k tokens is roughly 12,000 words, shared between prompt and completion), and significantly less once you add the overhead of document formatting.

So, no, there is no OpenAI model that can process such a file at once (and it must be presented to the AI as text). You would need to use chunking techniques. For ensuring quality writing, there’s no need to pass the whole document. In fact, the larger the context, the more the AI is likely to do absolutely no rewriting and just give the same text right back to you.

\LaTeX is, in my experience, an incredibly “token-heavy” format.

Don’t get me wrong, it’s great and I use it daily, but it’s not efficient (though it’s considerably better than HTML). There’s a good reason why the models tend to output their content in markdown.

So I would, at a minimum, convert your document (as much as is possible) to markdown and do a token-count comparison between the two.
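To make that comparison concrete, here’s a minimal sketch. Note the caveat: real token counts should come from OpenAI’s tiktoken library (e.g. `tiktoken.encoding_for_model("gpt-4")`); the chars-divided-by-four heuristic below is only a ballpark for English text, and the two sample strings are invented for illustration.

```python
# Rough token-count comparison between a LaTeX source and its markdown
# conversion. Swap in tiktoken for real counts; chars/4 is only a
# ballpark estimate for English text.

def estimate_tokens(text: str) -> int:
    """Very rough estimate: English prose averages ~4 characters per token."""
    return max(1, len(text) // 4)

latex_src = r"\section{Introduction} \textbf{LaTeX} markup adds \emph{overhead}."
markdown_src = "## Introduction\n**LaTeX** markup adds *overhead*."

latex_tokens = estimate_tokens(latex_src)
md_tokens = estimate_tokens(markdown_src)
savings = 100 * (latex_tokens - md_tokens) / latex_tokens
print(f"LaTeX: ~{latex_tokens} tokens, markdown: ~{md_tokens} tokens "
      f"({savings:.0f}% smaller)")
```

Running this on your actual document (e.g. after converting with pandoc) would tell you whether the conversion is worth the trouble.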

It’ll be heavily document-specific, but I’d expect the markdown version to be at least 5%–10% smaller than the \LaTeX.

Another thing to ask yourself is whether or not there is any real benefit to putting the entire document into context at once. Generally, in most situations, using RAG (retrieval-augmented generation) will get you better, faster, and cheaper results.

I understand you’re looking for the model to make the document more coherent, so you think putting the entire doc into context and having it output an entirely new document is the way to go (and in a perfect world with unlimited context it might be), but I fear that path will be fraught with difficulty.

I would suggest a divide-and-conquer approach instead.

I’m going to make some assumptions and try to give you a bit of a roadmap.

40,000 characters is approximately 10,000 tokens (English text averages about four characters per token).

Let’s assume you’re using GPT-4 (8k) to get the best results you can. Also assume maybe 500–1,000 tokens for a great system message, and that the token-count of the target document will be roughly the same as that of the initial document.

With roughly 10,000 tokens of input and a similar amount of output against an 8k window, this means you’ll need to do this in three, maybe four transactions at minimum. But I think you’ll need more than that.

The first thing I would do is try to chunk the document into pieces of no more than, say, 1,500 tokens. Some might be much smaller, but none should be larger.
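A chunker along those lines might look like the sketch below. It cuts only at paragraph boundaries (blank lines) so no chunk breaks mid-thought, and it reuses the rough chars/4 token estimate; in practice you’d count with tiktoken, and a single paragraph larger than the budget would need further splitting.

```python
# Sketch: split a document into chunks of at most ~1,500 tokens, cutting
# only at paragraph boundaries. Token counts are a rough chars/4 estimate;
# use tiktoken for real budgeting. A lone paragraph over the budget is
# kept whole here and would need its own splitting pass.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def chunk_document(text: str, max_tokens: int = 1500) -> list[str]:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if estimate_tokens(candidate) > max_tokens and current:
            chunks.append(current)   # close the current chunk...
            current = para           # ...and start a fresh one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Fake 30-paragraph document, ~280 tokens per paragraph.
doc = "\n\n".join(f"Paragraph {i}. " + "Lorem ipsum dolor sit amet. " * 40
                  for i in range(30))
chunks = chunk_document(doc)
print(len(chunks), "chunks, largest ~",
      max(estimate_tokens(c) for c in chunks), "tokens")
```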

Then, I’d probably summarize them myself or ask gpt-4 to do so. Basically, you want to capture the “flow” of the document but make it small enough to fit into about a third of the available context window.

Once you have that you can get gpt-4 to review the overall structure of the document to make sure it’s not jumping around from point to point or circling back on itself.

Once you have a revised structure/outline you can ask it to, for each chunk, determine what absolutely needs to be established before that chunk and what that chunk’s core purpose is.

Then, starting with the first two chunks and the relevant bits of your augmented outline, have the model rewrite the first chunk. Then, with the new first chunk, the second and third original chunks, and your plan, rewrite the second chunk. Then, with chunks 2, 3, and 4, rewrite chunk 3; with 3, 4, and 5, rewrite 4; and so on.
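The sliding-window loop above can be sketched as follows. Note that `rewrite_chunk()` here is a placeholder, not a real API: in practice it would be a gpt-4 chat-completion call that sends the rewritten previous chunk, the target chunk, the original next chunk, and the outline notes along with a system message describing the rewrite goals.

```python
# Sketch of the sliding-window rewrite: each chunk is rewritten with the
# *rewritten* previous chunk and the *original* next chunk as context,
# plus that chunk's outline notes. rewrite_chunk() is a placeholder for
# an actual gpt-4 chat-completion call.

def rewrite_chunk(prev_rewritten: str, target: str,
                  next_original: str, outline_notes: str) -> str:
    # Placeholder: a real implementation would call the chat API here.
    return target  # pretend the model returned an improved version

def rewrite_document(chunks: list[str], notes: list[str]) -> list[str]:
    rewritten: list[str] = []
    for i, chunk in enumerate(chunks):
        prev = rewritten[i - 1] if i > 0 else ""          # new context
        nxt = chunks[i + 1] if i + 1 < len(chunks) else ""  # old context
        rewritten.append(rewrite_chunk(prev, chunk, nxt, notes[i]))
    return rewritten

chunks = ["Intro...", "Methods...", "Results...", "Discussion..."]
notes = ["sets up the problem", "establishes the approach",
         "reports findings", "ties back to the intro"]
print(len(rewrite_document(chunks, notes)), "chunks rewritten in order")
```

The key design choice is that later chunks always see the already-rewritten version of what precedes them, which is what pulls the document toward a single consistent voice.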

Based on what you’ve shared, you should have 8–10 chunks in total, and I suspect that by the time you get to the end you’ll have a more “cohesive” document.

Alternately, ask someone with access to gpt-4-32k to do it for you. Figure two or three passes, tops:

  1. Ask the model for a critical analysis of the document.
  2. Ask the model to rewrite the document based on the analysis.
  3. Ask the model to review the rewrite.
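Wired up, those three passes are just three calls that feed each other. In this sketch `ask_model()` is a stand-in for a real gpt-4-32k chat-completion call (the 32k window is what lets the whole ~10k-token document plus its rewrite fit in one exchange), and the sample document string is obviously fake.

```python
# Sketch of the three-pass plan: analyze, rewrite from the analysis,
# then review the rewrite. ask_model() is a placeholder for a real
# gpt-4-32k chat-completion call.

def ask_model(prompt: str, document: str) -> str:
    # Placeholder for an actual API call with `document` in context.
    return f"[model response to: {prompt[:30]}...]"

document = "The full LaTeX source would go here."

analysis = ask_model("Give a critical analysis of this document.", document)
rewrite = ask_model("Rewrite the document based on this analysis:\n" + analysis,
                    document)
review = ask_model("Review the rewrite against the original.",
                   document + "\n---\n" + rewrite)
```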

Alternately, alternately, use Claude 2 with its 100k context.

Rough conservative estimates would be on the order of about 40k input tokens and 20k output tokens. Final price under $5.