I have a couple of questions about data preparation for RAG/AI Assistants/Vector database.
I have a knowledge file containing:
url
meta title
meta description
H1
content (currently the raw body, including HTML).
Questions on this:
Do I need
to clean up the HTML, so only text remains?
to clean up the text, so only text from semantic elements remains, like main/article (with elements like nav, header, aside, and footer removed)?
If I clean up the HTML, is it a problem that content from tables is no longer tabular?
Do I need to calculate embeddings myself before import? Or do AI assistants/vector databases (Qdrant) calculate embeddings themselves after import?
Are chunked CSV/JSON files enough, or should I create additional metadata?
Two consumers of your data matter here:
the AI-based embedding model that extracts semantic meaning from the chunked text
the AI that receives the chunked knowledge and builds its understanding of the contents
For both of these, HTML is not ideal. It roughly doubles token consumption, and its markup shares semantic meaning with every other web-page chunk rather than with the AI's actual searches.
The AI receives the language as it is extracted from documents, or the plain text of plain documents. You can use an HTML-to-Markdown converter to maintain some structure in the plain text at lower token cost, rather than merely stripping elements, structure, and even paragraph breaks. Markdown has tables; this forum supports Markdown tables.
| Aspect | HTML | Markdown | Advantage |
| --- | --- | --- | --- |
| Syntax complexity | Uses tags enclosed in angle brackets (e.g., `<b></b>`, `<i></i>`) | Uses simpler syntax with special characters (e.g., `**bold**`, `_italic_`) | Markdown |
| Learning curve | Requires understanding of various tags and attributes | Easier to learn and use for beginners | Markdown |
| Flexibility | Highly flexible, with extensive formatting options and precise control | Less flexible, but covers most common formatting needs | HTML |
| AI parsing | Models may struggle with complex nested tags and extensive attributes | Simpler structure makes it easier for models to parse and understand | Markdown |
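As a sketch of that HTML-to-Markdown step, here is a minimal converter using only Python's standard library. The tag subset is illustrative; a real pipeline would more likely use a library such as markdownify or html2text:

```python
# Minimal HTML -> Markdown sketch (stdlib only); handles a small,
# illustrative subset of tags. Real pipelines: markdownify / html2text.
from html.parser import HTMLParser

class ToMarkdown(HTMLParser):
    """Converts headings, bold, italic, list items, and paragraphs."""
    OPEN = {"h1": "# ", "h2": "## ", "h3": "### ",
            "strong": "**", "b": "**", "em": "_", "i": "_", "li": "- "}
    CLOSE = {"h1": "\n", "h2": "\n", "h3": "\n",
             "strong": "**", "b": "**", "em": "_", "i": "_",
             "li": "\n", "p": "\n\n"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing -> their markup is dropped.
        self.out.append(self.OPEN.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(self.CLOSE.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)

def html_to_md(html: str) -> str:
    parser = ToMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

For example, `html_to_md("<h1>Title</h1><p>Some <b>bold</b> text.</p>")` yields `# Title` followed by `Some **bold** text.` — the markup cost disappears while the emphasis survives.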
Do you maybe know any tools or libraries considered standard for such cleanups? Or does everybody handle this task by best guess? BeautifulSoup, JSSoup, HTML Tidy…?
The cleanup task doesn't seem complicated:
Get the meta title and meta description
Exclude non-semantic elements (head, nav, header, aside, footer)
Exclude images and links, but keep anchor text and ALT text
Exclude social-network links completely, including their anchor text
Convert the remaining stuff to Markdown, while maintaining headings, lists, tables, blockquotes, and bold/italic.
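A minimal sketch of the exclusion steps above, using only Python's standard-library `html.parser` (BeautifulSoup's `decompose()` would handle malformed markup more robustly; the social-link filter and the Markdown conversion are left out here):

```python
# Sketch: drop non-semantic containers, keep anchor text, keep img ALT text.
# Stdlib only; BeautifulSoup would be more forgiving of broken markup.
from html.parser import HTMLParser

EXCLUDE = {"head", "nav", "header", "aside", "footer", "script", "style"}

class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0          # nesting depth inside excluded elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDE:
            self.skip += 1
        elif tag == "img" and not self.skip:
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)   # keep ALT text, drop the image

    def handle_endtag(self, tag):
        if tag in EXCLUDE and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Text inside excluded containers is dropped; anchor text survives
        # because <a> is never in EXCLUDE.
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def clean(html: str) -> str:
    cleaner = Cleaner()
    cleaner.feed(html)
    return " ".join(cleaner.parts)
```

For example, `clean("<nav>Menu</nav><p>Hi <a href='/x'>there</a></p>")` drops the nav entirely but keeps the anchor text, returning `Hi there`.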
My previous idea was to scrape something like Reader View to get simplified pages.