You need to state upfront what your application is. "then the model has a lot of trouble finding the right context for an answer" is the only hint that you want to use database retrieval to augment a chatbot's knowledge by injecting retrieved text into its messages.
You need to consider:
what data is it important to preserve in your database as a permanent fixture;
what chunking method and notation of chunk in document will you use;
what augmentation will you supply for matching, such as example questions;
what data will you send to embeddings AI to obtain a semantic vector;
what data will you present to the AI for maximum understanding and augmentation (and minimum tokens).
Chunking long documents into logical divisions is itself a step that can use AI intelligence.
If you are looking for a similarity match score based on your inputs, and your inputs are also HTML, I don't see why that couldn't work. However, you are likely taking user input prompts and need useful returns based on the context of the question.
HTML format doesn't really help us for matching against user question input. If you embed it, you likely get more semantic embedding of characteristics like "looks like code", "internet", etc. Well, that's great if they are asking about code, but here they are asking about the knowledge inside.
I would store the data in a format that can be returned to AI, along with the (probably different text) vector of what embeddings you used to generate the matches. Useful text would be document name, chunk number, section name. Even better if you can sort the top-n results to the original chunk order and simulate a reconstruction of documents when passing multiples to AI.
The AI can understand markdown well, and markdown is also the typical AI GUI. If you can parse HTML tables into markdown, the AI can just repeat them back.
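As a minimal sketch of that HTML-table-to-markdown step (assuming well-formed `<tr>`/`<th>`/`<td>` rows with no nested tables or colspans; a real pipeline would need to handle those):

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect table rows and cells from simple HTML."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

def table_to_markdown(html: str) -> str:
    """Render the first row as a markdown header, the rest as body rows."""
    p = TableToMarkdown()
    p.feed(html)
    if not p.rows:
        return ""
    header, *body = p.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

print(table_to_markdown(
    "<table><tr><th>Model</th><th>Tokens</th></tr>"
    "<tr><td>gpt-4</td><td>8192</td></tr></table>"))
# | Model | Tokens |
# | --- | --- |
# | gpt-4 | 8192 |
```

The model can then quote the markdown table back to the user unchanged.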
Yes, the application is a chatbot that uses custom data to answer.
Data important to preserve: all paragraphs and tables.
Chunking method: line-by-line (usually a bunch of lines at a time)
I strip the HTML of all tags (preserving table structure with [TABLE_START] and [TABLE_END], so the tables never get split), then store the data in JSON format with each chunk's content and its metadata.
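A minimal sketch of that stripping step: drop all tags but fence table content with [TABLE_START]/[TABLE_END] markers so a later chunker can treat each table as an unsplittable unit. The cell/row separators used here (" | " and newlines) are assumptions, not the exact format described above.

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Emit plain text; wrap table content in split-protection markers."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.in_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = True
            self.out.append("\n[TABLE_START]\n")
        elif tag == "tr" and self.in_table:
            self.out.append("\n")
        elif tag in ("td", "th") and self.in_table:
            self.out.append(" | ")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
            self.out.append("\n[TABLE_END]\n")

    def handle_data(self, data):
        self.out.append(data)

def strip_html(html: str) -> str:
    s = Stripper()
    s.feed(html)
    return "".join(s.out)
```

The chunker can then scan for the markers and refuse to place a split between them.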
I am using Pinecone to store the embeddings.
Can you link a resource for sorting top-n results to the original chunk order and simulating a reconstruction of documents, if such exists, please?
The metadata could be inserted into the text itself as both a sort criteria and to give context to the AI.
Consider, for each chunk (which can be longer than a few lines, to add context to the segment), something like:
{"title": "4 Tips for Spotting AI-Generated Pics",
"summary": "Despite occasional glitches, AI content generators have improved over time, raising concerns about potential plagiarism and the spread of realistic 'deepfake' images that can lead to misunderstandings.",
"document number": 43,
"chunk number": 3,
"chunk text": "For example, in the days surrounding Manhattan prosecutors' move to criminally charge former U.S. President Donald Trump, the Snopes newsroom saw a wave of AI-generated content on social media with Trump as its main character.
With varying views on the case, AI-software users posted all sorts of fictional scenes featuring Trump or his rivals - from imaginary clashes between Trump and law enforcement officers to jail booking mugshots to depictions of his brief stay in New York last week for his arraignment. The images seemingly fooled"
}
You can see how a summary brings the returns of a particular document closer together in semantic search, then you can do a sort by the document and chunk numbers. The JSON meta is even useful to the AI - and yet you could remove it on consecutive pieces retrieved to rejoin the relevant parts of any document.
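A sketch of that programming layer: take the top-n matches from the vector store, group them by document, sort by chunk number, and merge runs of consecutive chunks so the AI sees reconstructed passages with a single metadata header each. The metadata keys mirror the JSON example above; everything else is illustrative.

```python
def reconstruct(matches):
    """Group top-n matches by document, restore chunk order, stitch runs."""
    by_doc = {}
    for m in matches:
        by_doc.setdefault(m["document number"], []).append(m)
    passages = []
    for doc, chunks in sorted(by_doc.items()):
        chunks.sort(key=lambda c: c["chunk number"])
        run = [chunks[0]]
        for c in chunks[1:]:
            if c["chunk number"] == run[-1]["chunk number"] + 1:
                run.append(c)            # consecutive: keep stitching
            else:
                passages.append(_join(run))
                run = [c]
        passages.append(_join(run))
    return passages

def _join(run):
    # keep one metadata header per rejoined passage, drop it on the rest
    head = run[0]
    text = " ".join(c["chunk text"] for c in run)
    return (f'[{head["title"]} | doc {head["document number"]}, '
            f'chunks {run[0]["chunk number"]}-{run[-1]["chunk number"]}]\n{text}')
```

Each returned passage can then be injected into the prompt as one contiguous excerpt instead of n disjointed fragments.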
Just a basic programming layer between the database return and the AI injection.
However, you also have to put this metadata into your stored data.
Then imagine the power of giving the AI a function where it can retrieve more by document and chunk.
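A hypothetical version of that function, written as an OpenAI-style function-calling tool definition plus the backing lookup; the function name, parameters, and the `store` mapping are all assumptions for illustration:

```python
# Tool schema the model can call to read around a retrieved passage.
get_chunks_tool = {
    "type": "function",
    "function": {
        "name": "get_chunks",
        "description": "Fetch chunks of a source document by number, "
                       "e.g. to read the context around a retrieved passage.",
        "parameters": {
            "type": "object",
            "properties": {
                "document_number": {"type": "integer"},
                "first_chunk": {"type": "integer"},
                "last_chunk": {"type": "integer"},
            },
            "required": ["document_number", "first_chunk", "last_chunk"],
        },
    },
}

def get_chunks(document_number, first_chunk, last_chunk, store):
    """store: hypothetical mapping of (doc number, chunk number) -> text."""
    return "\n".join(
        store[(document_number, n)]
        for n in range(first_chunk, last_chunk + 1)
        if (document_number, n) in store
    )
```

When the model's answer is cut off mid-passage, it can call the tool itself to pull the neighbouring chunks rather than relying on another embedding search.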
I like that design. Another possibility, in addition to having the full summary in each chunk, would be to include either 1) the top N most frequently used keywords from each document, or 2) the most "important" keywords from each doc.
You could even use ChatGPT itself to ask it to just give you the top most important keywords in each entire document, and then embed that list into each chunk.
EDIT: So it would be a new JSON property that's just a word list, added to the chunk JSON.
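A quick frequency-based sketch of option 1: top-N keywords per document after dropping stop words. The stop-word list is a stand-in; option 2 ("important" keywords) would swap this for a ChatGPT call or a TF-IDF weighting.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list; use a real one in practice.
STOP = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}

def top_keywords(text: str, n: int = 10) -> list[str]:
    """Most frequent non-stop-words (3+ letters) in a document."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]

# Embed the same document-level list into every chunk of that document:
# chunk["keywords"] = top_keywords(full_document_text)
```

Because every chunk of a document carries the same keyword list, semantically related chunks of one document are pulled closer together in vector space, just as the shared summary does.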