Best way to save html files in vector store

Sorry, a bit off topic for this forum.

I have a directory with a bunch of html files.

HTML files have a lot of tables in them.

What would be the best way to store these files in a vector store like Pinecone with their metadata?

Right now I am following these steps:

  1. Turn html into text
  2. Chunk the text up
  3. Embed the chunks
  4. Store the embeddings in Pinecone.
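In code, steps 1 and 2 of that pipeline look roughly like this (stdlib only; the embedding and Pinecone upsert steps are omitted since they depend on your client libraries, and all function names are just for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text from an HTML document (stdlib only)."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Naive fixed-size chunking -- this is exactly what splits tables apart."""
    words = text.split()
    chunks, current, size = [], [], 0
    for w in words:
        if size + len(w) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(w)
        size += len(w) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The fixed-size cut in `chunk_text` is oblivious to document structure, which is why table rows end up scattered across chunks.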

However, a lot of tables end up split apart during chunking, and then the model has a lot of trouble finding the right context for an answer.

Is there an optimal architecture for accomplishing this?
Thank you in advance!

You need to state upfront what your application is. “then the model has a lot of trouble finding the right context for an answer” is the only hint that you want to use database retrieval for knowledge augmentation of a chatbot by message injection.

You need to consider:

  • what data is it important to preserve in your database as a permanent fixture;
  • what chunking method and notation of chunk in document will you use;
  • what augmentation will you supply for matching, such as example questions;
  • what data will you send to embeddings AI to obtain a semantic vector;
  • what data will you present to the AI for maximum understanding and augmentation (and minimum tokens)

Chunking long documents into logical divisions can also use some intelligence, rather than cutting at fixed sizes.

If you are looking for a similarity match score based on your inputs, and your inputs are also HTML, I don’t see why that couldn’t work. However, you are likely taking user input prompts and need to have useful returns based on the context of the question.

HTML format doesn’t really help us for matching to user question input. If you embed that, you likely get more semantic embedding of characteristics like “looks like code”, “internet”, etc…well, that’s great if they are asking about code, but here they are asking about the knowledge inside.

I would store the data in a format that can be returned to AI, along with the (probably different text) vector of what embeddings you used to generate the matches. Useful text would be document name, chunk number, section name. Even better if you can sort the top-n results to the original chunk order and simulate a reconstruction of documents when passing multiples to AI.

The AI can understand markdown well, and markdown is also the typical AI GUI. If you can parse HTML tables into markdown, the AI can just repeat them back.
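A rough sketch of that table-to-markdown step, using only Python's stdlib `html.parser` (a simplified version: no colspan/rowspan handling, and the first row is assumed to be the header):

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Convert the rows of an HTML <table> into a markdown table."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th") and self.cell is not None:
            self.row.append(" ".join(self.cell).strip())
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data.strip())

    def markdown(self):
        if not self.rows:
            return ""
        header, *body = self.rows
        lines = ["| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(r) + " |" for r in body]
        return "\n".join(lines)

def table_to_markdown(html: str) -> str:
    parser = TableToMarkdown()
    parser.feed(html)
    return parser.markdown()
```

For production HTML you would probably reach for BeautifulSoup or a dedicated converter, but the idea is the same: the markdown version embeds better and can be repeated back verbatim by the model.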


Yes, the application is a chatbot that uses custom data to answer.

  1. Data important to preserve: all paragraphs and tables.
  2. Chunking method: line-by-line (usually a bunch of lines at a time)

I strip the html of all tags (preserving table structure with [TABLE_START] and [TABLE_END], so the tables never get split), then store the data in JSON format with each chunk’s content and its metadata.
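For what it's worth, the table-preserving part of that chunking can be as simple as this (same `[TABLE_START]`/`[TABLE_END]` markers as described above; the function name and chunk-size parameter are just illustrative):

```python
def chunk_with_tables(lines: list[str], max_lines: int = 10) -> list[list[str]]:
    """Group lines into chunks, never cutting inside a
    [TABLE_START] .. [TABLE_END] block."""
    chunks, current, in_table = [], [], False
    for line in lines:
        current.append(line)
        if line.strip() == "[TABLE_START]":
            in_table = True
        elif line.strip() == "[TABLE_END]":
            in_table = False
        # only cut a chunk when we are outside a table
        if not in_table and len(current) >= max_lines:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

A chunk containing a table can exceed `max_lines`, which is the point: an intact oversized table is more useful to the model than two truncated halves.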

I am using Pinecone to store the embeddings.

Can you link a resource for sorting top-n results to the original chunk order and simulating a reconstruction of documents, if such exists, please?

Thank you.

The metadata could be inserted into the text itself as both a sort criteria and to give context to the AI.

Consider, for each chunk (which can be longer than a few lines, to add context to the segment):

{
  "title": "4 Tips for Spotting AI-Generated Pics",
  "summary": "Despite occasional glitches, AI content generators have improved over time, raising concerns about potential plagiarism and the spread of realistic \"deepfake\" images that can lead to misunderstandings.",
  "document number": 43,
  "chunk number": 3,
  "chunk text": "For example, in the days surrounding Manhattan prosecutors' move to criminally charge former U.S. President Donald Trump, the Snopes newsroom saw a wave of AI-generated content on social media with Trump as its main character. With varying views on the case, AI-software users posted all sorts of fictional scenes featuring Trump or his rivals — from imaginary clashes between Trump and law enforcement officers to jail booking mugshots to depictions of his brief stay in New York last week for his arraignment. The images seemingly fooled"
}

You can see how a summary brings the returns of a particular document closer together in semantic search, then you can do a sort by the document and chunk numbers. The JSON meta is even useful to the AI - and yet you could remove it on consecutive pieces retrieved to rejoin the relevant parts of any document.

Just a basic programming layer between the database return and the AI injection.
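That layer can be a few lines of Python (assuming retrieval results carry metadata shaped like the JSON example above; key names are just illustrative):

```python
from itertools import groupby

def reconstruct(matches: list[dict]) -> list[dict]:
    """Re-order top-n retrieval results into original document order and
    stitch adjacent chunks of the same document back together."""
    ordered = sorted(matches,
                     key=lambda m: (m["document number"], m["chunk number"]))
    out = []
    for doc, group in groupby(ordered, key=lambda m: m["document number"]):
        group = list(group)
        out.append({
            "title": group[0]["title"],
            "document number": doc,
            # consecutive chunks rejoined, metadata kept once per document
            "text": "\n".join(m["chunk text"] for m in group),
        })
    return out
```

The sort restores reading order, and the group-by lets you strip the repeated JSON metadata from consecutive pieces before injection, exactly as described above.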

However, you have to put that metadata into your stored data as well.

Then imagine the power of giving the AI a function where it can retrieve more by document and chunk.


I like that design. Another possibility, in addition to having the full summary in each chunk, would be to include either 1) the top N most frequently used keywords from each document, or 2) the most ‘important’ keywords from each doc.

You could even use ChatGPT itself to ask it to just give you the top most important keywords in each entire document, and then embed that list into each chunk.

EDIT: So it would be a new JSON property that’s just a word list, added to the chunk JSON.
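As a cheap stand-in for asking ChatGPT, a plain frequency count already gives you a usable per-document keyword list (the stopword set here is just a stub; a real one would be much longer):

```python
from collections import Counter
import re

# crude stopword list for the sketch
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def top_keywords(document: str, n: int = 10) -> list[str]:
    """Top-n most frequent non-stopword terms in a document."""
    words = re.findall(r"[a-z']+", document.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [w for w, _ in counts.most_common(n)]

def add_keywords(chunk: dict, keywords: list[str]) -> dict:
    """Attach the document-level keyword list as a new JSON property."""
    return {**chunk, "keywords": keywords}
```

Either way (frequency-based or model-generated), the same document-level list gets copied into every chunk's JSON, pulling that document's chunks closer together in semantic search.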