The data will not be chunked by “logical JSON snippet”. It will be chunked purely by token count across each document.
Your input cost per query, retrieving the top 20 chunks, can be 20 × 800 tokens, or effectively 20 × 1,200 or even 1,600, depending on how the documented overlap is interpreted. The only exception would be the occasional document-end “tail” chunk with fewer than 800 tokens, and that only if they don't simply take the last chunk as the final 800 tokens counted back from the end.
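A quick back-of-the-envelope for that, assuming 800-token chunks with a 400-token overlap and top-20 retrieval (substitute whatever chunk size, overlap, and top-k the service actually documents):

```python
# Assumed parameters: 800-token chunks, 400-token overlap, top-20 retrieval.
def chunk_count(doc_tokens: int, chunk_size: int = 800, overlap: int = 400) -> int:
    """Number of fixed-size chunks a document of doc_tokens produces."""
    stride = chunk_size - overlap
    if doc_tokens <= chunk_size:
        return 1
    # First full chunk, then ceiling division over the stride for the rest.
    return 1 + -(-(doc_tokens - chunk_size) // stride)

def per_query_input_tokens(top_k: int = 20, chunk_size: int = 800) -> int:
    """Retrieved context injected into the prompt for a single query."""
    return top_k * chunk_size

print(chunk_count(10_000))          # ~24 chunks for a 10k-token document
print(per_query_input_tokens())     # 16,000 tokens of retrieved context per query
```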
You could do some clever preprocessing if you perform the embeddings on the original single JSON yourself. After obtaining the embeddings (perhaps 2B tokens' worth?), you could rank them in 1D by distance from your task prompts, or run whatever nearly-free iterative computation you can leave sorting overnight on a workstation, and then cluster them yourself into highly focused sections.
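A minimal sketch of that sort-and-cluster step, assuming you already have one embedding per JSON item as rows of a numpy array (the file names, the task-prompt embedding, and the cluster count are all placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

item_vecs = np.load("item_embeddings.npy")   # shape (n_items, dim), unit-normalized
task_vec = np.load("task_embedding.npy")     # shape (dim,), embedding of one task prompt

# Rank items in 1D by cosine distance from the task: smaller = more relevant.
cosine_dist = 1.0 - item_vecs @ task_vec
ranking = np.argsort(cosine_dist)

# Cluster the items into focused sections; k is a free parameter to tune.
k = 50
labels = KMeans(n_clusters=k, n_init="auto").fit_predict(item_vecs)

# Group item indices by cluster so each section can be assembled separately.
sections = {c: np.flatnonzero(labels == c) for c in range(k)}
print("most task-relevant items:", ranking[:10])
```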
Of course, if you pay for embeddings once on the individual items, with embeddings that target the actual data, why would you continue to pay daily for gigabytes of vector database that weighs in at roughly (DATA × >150% + ~1 KB vector) × chunks, and whose retrieval is confused by chunks that literally mix messages together?
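For a sense of scale, a rough storage estimate under the same reading of that formula (all numbers here are assumptions: overlap inflating stored text to >150% of the raw data, ~1 KB per stored vector):

```python
def vector_store_bytes(data_bytes: float,
                       overlap_factor: float = 1.5,  # >150% of raw data once chunks overlap
                       vector_bytes: int = 1024,     # ~1 KB stored vector per chunk
                       num_chunks: int = 0) -> float:
    """Stored chunk text (inflated by overlap) plus one vector per chunk."""
    return data_bytes * overlap_factor + vector_bytes * num_chunks

# Example: ~1 GB of JSON split into roughly 650k chunks of 800 tokens with 400 overlap.
print(vector_store_bytes(1e9, num_chunks=650_000) / 1e9, "GB")   # ~2.17 GB hosted
```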