I have a couple of questions about data preparation for RAG/AI Assistants/Vector database.
I have a knowledge file containing:
url
meta title
meta description
H1
content (currently the raw body, including HTML).
Questions on this:
Do I need
to clean up the HTML, so only text remains?
to clean up the text, so only text from semantic elements remains, like main/article (with elements like nav, header, aside, and footer removed)?
If I clean up the HTML, is it a problem that content from tables is no longer tabular?
Do I need to calculate embeddings myself before import? Or do AI assistants/vector databases (Qdrant) calculate embeddings themselves after import?
Are chunked CSV/JSON files enough, or should I create additional metadata?
Two consumers of your data matter here:
the AI-based embedding model that extracts semantic meaning from the chunked text
the AI that receives the chunked knowledge and builds its understanding of the contents
For both of these, HTML is not ideal. It roughly doubles token consumption, and its markup shares semantic meaning with every other web-page chunk rather than with the AI's actual searches.
The AI receives the language as it is extracted from documents, or the plain text of plain documents. You can use an HTML-to-Markdown converter to maintain some structure in the plain text at lower token cost, rather than merely stripping elements, structure, and even paragraph breaks. Markdown has tables; this forum supports Markdown tables.
| Aspect | HTML | Markdown | Advantage |
| --- | --- | --- | --- |
| Syntax complexity | Uses tags enclosed in angle brackets (e.g., `<b></b>`, `<i></i>`) | Uses simpler syntax with special characters (e.g., `**bold**`, `_italic_`) | Markdown |
| Learning curve | Requires understanding of various tags and attributes | Easier to learn and use for beginners | Markdown |
| Flexibility | Highly flexible, with extensive formatting options and precise control | Less flexible, but covers most common formatting needs | HTML |
| AI parsing | Models may struggle with complex nested tags and extensive attributes | Simpler structure makes it easier for models to parse and understand | Markdown |
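As a sketch of that HTML-to-Markdown step, here is a minimal converter using only Python's standard library. The tag subset is illustrative; a real pipeline would more likely use a library such as markdownify or html2text:

```python
# Minimal HTML -> Markdown sketch (stdlib only); handles a small,
# illustrative subset of tags. Real pipelines: markdownify / html2text.
from html.parser import HTMLParser

class ToMarkdown(HTMLParser):
    """Converts headings, bold, italic, list items, and paragraphs."""
    OPEN = {"h1": "# ", "h2": "## ", "h3": "### ",
            "strong": "**", "b": "**", "em": "_", "i": "_", "li": "- "}
    CLOSE = {"h1": "\n", "h2": "\n", "h3": "\n",
             "strong": "**", "b": "**", "em": "_", "i": "_",
             "li": "\n", "p": "\n\n"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing -> their markup is dropped.
        self.out.append(self.OPEN.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(self.CLOSE.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)

def html_to_md(html: str) -> str:
    parser = ToMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

For example, `html_to_md("<h1>Title</h1><p>Some <b>bold</b> text.</p>")` yields `# Title` followed by `Some **bold** text.` — the markup cost disappears while the emphasis survives.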
Do you maybe know any tools or libraries considered standard for such cleanups? Or does everybody handle this task by best guess? BeautifulSoup, JSSoup, HTML Tidy…?
The cleanup task doesn't seem complicated:
Get the meta title and meta description
Exclude non-semantic elements (head, nav, header, aside, footer)
Exclude images and links, but keep anchor text and ALT text
Exclude social-network links completely, including their anchor text
Convert the remaining stuff to Markdown, while maintaining headings, lists, tables, blockquotes, and bold/italic.
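A minimal sketch of the exclusion steps above, using only Python's standard-library `html.parser` (BeautifulSoup's `decompose()` would handle malformed markup more robustly; the social-link filter and the Markdown conversion are left out here):

```python
# Sketch: drop non-semantic containers, keep anchor text, keep img ALT text.
# Stdlib only; BeautifulSoup would be more forgiving of broken markup.
from html.parser import HTMLParser

EXCLUDE = {"head", "nav", "header", "aside", "footer", "script", "style"}

class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0          # nesting depth inside excluded elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDE:
            self.skip += 1
        elif tag == "img" and not self.skip:
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)   # keep ALT text, drop the image

    def handle_endtag(self, tag):
        if tag in EXCLUDE and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Text inside excluded containers is dropped; anchor text survives
        # because <a> is never in EXCLUDE.
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def clean(html: str) -> str:
    cleaner = Cleaner()
    cleaner.feed(html)
    return " ".join(cleaner.parts)
```

For example, `clean("<nav>Menu</nav><p>Hi <a href='/x'>there</a></p>")` drops the nav entirely but keeps the anchor text, returning `Hi there`.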
My previous idea was to scrape something like Reader View to get simplified pages.