Hey everyone,
I’m thinking through the problem of preparing documentation for consumption by both humans and AI, and the best ways to optimize a given document for both.
There are many conversations about semantic chunking and about automating what I’m suggesting after the fact. What I’m thinking about instead is how to formally structure a document in the first place, so it can act as formal, canonical training material with as much of the original meaning preserved as possible.
I have several questions if anyone has thoughts:
Where Do Embeddings Go?
As best I can tell, where the embeddings live relative to the data is up for grabs: you can include them as a column in structured data, or attach them as metadata.
It seems that if a section is properly labeled as “Embeddings,” then a GPT model will interpret it as such? So, hypothetically, they could be included as chunks of information in unstructured data?
For example, if I were to include a section that had embeddings for THIS section of my post, and it was clearly demarcated as such, would a model understand what to do with them?
[Section Embeddings: 0.0000198, 0.00008389, 0.033093839, 0.039983839, 0.0393039… ]
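For concreteness, here's a minimal sketch of how a line like that could be generated, assuming the current openai Python SDK. The truncation to a handful of values is just for display; whether a model reading the document would actually do anything useful with the raw floats is exactly my question above.

```python
# Minimal sketch, assuming the openai Python SDK (v1.x).
# Embeds a section of text and serializes the leading values
# as an inline "[Section Embeddings: ...]" marker like the example above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def section_embedding_line(section_text: str, preview: int = 5) -> str:
    """Embed a section and render its leading values as an inline marker."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # my choice here; any embedding model works
        input=section_text,
    )
    vector = response.data[0].embedding  # list of floats, 1536 dims for this model
    shown = ", ".join(f"{v:.8f}" for v in vector[:preview])
    return f"[Section Embeddings: {shown}… ]"
```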
Preparing Formal AI Documentation
If it’s true that AI would interpret an embedding wherever it happens upon one, and that this would help it better understand the information in the related section, then I propose that it’s possible to formally prepare a document using the following pattern. HTML semantic structure is used to help everyone visualize the structure of the document, and I think actually including these quasi-HTML tags in the document would help with faster digestion:
<h1> Document Title </h1>
<embeddings> Title Long Embeddings </embeddings>
<h2> Document Purpose and Summary </h2>
<embeddings> Document Purpose and Summary Embeddings </embeddings>
<h2> Table of Contents </h2>
I imagine some form of table of contents to allow document search and understanding, if there's some way to structure it for faster search? (The builder sketch after this pattern includes a naive version.)
<h2> Chapter 1 </h2>
<heading-embeddings> long heading embeddings </heading-embeddings>
<keyword-embeddings> section keywords long embeddings </keyword-embeddings>
<paragraph-embeddings> here are the embeddings for the below paragraphs to unknown depth. I imagine they'd be "short embeddings." </paragraph-embeddings>
<p> paragraph 1 </p>
<p> paragraph 2 </p>
<p> paragraph 3 </p>
<h3> Chapter 1: Subsection 1 </h3>
<heading-embeddings> Long heading embeddings </heading-embeddings>
<keyword-embeddings> Section keyword long embeddings </keyword-embeddings>
<paragraph-embeddings> Short paragraph embeddings of unknown depth </paragraph-embeddings>
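To make the pattern reproducible, here's a rough sketch of a builder that emits this structure, including a naive table of contents (per my question above). The Section structure, the hyphenated tag names, and the dimension choices are all my own invention for illustration, not any standard:

```python
# Rough sketch of a builder for the quasi-HTML pattern above.
# Section, the tag names, and the 1024/256 dimension split are illustrative.
from dataclasses import dataclass, field
from openai import OpenAI

client = OpenAI()

def embed(text: str, dimensions: int) -> str:
    """Return a comma-separated embedding string at the requested length."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dimensions,  # text-embedding-3 models support shortened vectors
    )
    return ", ".join(f"{v:.6f}" for v in response.data[0].embedding)

@dataclass
class Section:
    heading: str
    keywords: str
    paragraphs: list[str] = field(default_factory=list)

def render(title: str, summary: str, sections: list[Section]) -> str:
    lines = [
        f"<h1> {title} </h1>",
        f"<embeddings> {embed(title, 1024)} </embeddings>",
        "<h2> Document Purpose and Summary </h2>",
        f"<embeddings> {embed(summary, 1024)} </embeddings>",
        "<h2> Table of Contents </h2>",
        *(f"<li> {s.heading} </li>" for s in sections),  # naive TOC
    ]
    for s in sections:
        lines += [
            f"<h2> {s.heading} </h2>",
            f"<heading-embeddings> {embed(s.heading, 1024)} </heading-embeddings>",
            f"<keyword-embeddings> {embed(s.keywords, 1024)} </keyword-embeddings>",
            # one "short" embedding covering all body paragraphs in the section
            f"<paragraph-embeddings> {embed(' '.join(s.paragraphs), 256)} </paragraph-embeddings>",
            *(f"<p> {p} </p>" for p in s.paragraphs),
        ]
    return "\n".join(lines)
```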
How Deep Do You Go?
The goal here is to create an ultra-understandable document for deep integration into an AI's training; to approach something like zero-shot learning for specifically and formally prepared documentation for a given project. The idea is to give this level of attention to deeply important, canonical documentation that needs to be readily, immediately, and completely understood to reduce errors down the line, *while* that documentation is being prepared by the original author. (I postulate that the original author(s) of a given document would be best positioned to preserve their original meaning if this method were used.)
I think [using a combination of long embeddings](https://platform.openai.com/docs/guides/embeddings/embedding-models) for greater accuracy in headings and keyword sections, and short embeddings for body sections, would help accuracy while keeping the overall document size manageable.
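For what it's worth, the text-embedding-3 models linked above expose a dimensions parameter for exactly this long/short trade-off, and the document-size arithmetic is easy to sketch (the characters-per-value assumption is mine):

```python
# Back-of-envelope size math for inline embeddings of different lengths.
# Assumes each value serializes as roughly "0.123456, " (~10 characters).
def inline_size_kb(dimensions: int, chars_per_value: int = 10) -> float:
    return dimensions * chars_per_value / 1024

for dims in (3072, 1536, 256):  # full-length large, full-length small, shortened
    print(f"{dims}-dim embedding ≈ {inline_size_kb(dims):.1f} KB inline")
# 3072-dim ≈ 30.0 KB, 1536-dim ≈ 15.0 KB, 256-dim ≈ 2.5 KB
```

That's why I think short embeddings for the body sections matter: at full length, every embedded paragraph block adds tens of kilobytes to the document.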
Given this goal, and assuming a model even reads embeddings when they're provided in a document like this in the first place, how "deep" would you make the embeddings for the actual body information?
One option is to only go so far into a given paragraph, relying on the overall structure to let a model search a given chunk more thoroughly later if it becomes relevant; alternatively, an embeddings section could follow each body paragraph.
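Here's a rough sketch of what those two depths look like at retrieval time, assuming OpenAI embeddings (which come back unit-normalized, so a plain dot product gives cosine similarity); the sample paragraphs and query are placeholders:

```python
# Sketch of the two depths: one embedding per section vs one per paragraph.
# Retrieval here is plain cosine similarity; the texts are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in response.data])

paragraphs = ["First paragraph...", "Second paragraph...", "Third paragraph..."]

# Option A: one embedding for the whole section, relying on structure for drill-down.
section_vec = embed([" ".join(paragraphs)])[0]

# Option B: an embeddings entry following each body paragraph.
paragraph_vecs = embed(paragraphs)

query_vec = embed(["how do I configure the widget?"])[0]

# OpenAI embeddings are unit-normalized, so the dot product is cosine similarity.
print("section score:   ", float(section_vec @ query_vec))
print("paragraph scores:", (paragraph_vecs @ query_vec).round(3))
```

Option B triples the number of vectors in this example but lets a match point at the exact paragraph rather than the whole section.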
What do y'all think?