How to prepare data for AI Assistant?

I have a couple of questions about data preparation for RAG/AI Assistants/Vector database.

I have a knowledge file containing

  • url
  • meta title
  • meta description
  • H1
  • content (at the moment the body with HTML).

Questions on this:

Do I need

  1. to clean up the HTML, so only text remains?
  2. to clean up the text, so only text from semantic elements like main/article remains (and elements like nav, header, aside, footer are removed)?
  3. If I clean up the HTML, is it a problem that content from a table is no longer a table?
  4. Do I need to calculate embeddings myself before import? Or do AI Assistants/vector databases (Qdrant) calculate embeddings themselves after import?
  5. Are chunked CSV/JSON files enough, or should I create additional metadata?

You have two concerns:

  1. the AI-based embedding model that extracts semantic meaning from the chunked text
  2. the AI that receives the chunked knowledge, and how well it understands the contents

For both of these, HTML is not ideal. It consumes almost twice as many tokens, and the markup makes a chunk look semantically more like other chunks of web page markup and less like the AI’s search queries.
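To get a rough feel for the markup overhead, you could compare the length of an HTML snippet with the text it actually contains. This is only a minimal sketch using Python's standard library: the sample snippet is made up, and character count is a crude stand-in for token count (a real tokenizer will give different numbers).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes, discarding every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Ignore whitespace-only nodes between tags
        if data.strip():
            self.parts.append(data.strip())


def strip_tags(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample; real pages usually carry far more attributes and wrappers
sample_html = (
    '<div class="card"><h2 class="card-title">Opening hours</h2>'
    '<p class="card-body">We are open <b>Monday to Friday</b> all year.</p></div>'
)
plain = strip_tags(sample_html)
overhead = len(sample_html) / len(plain)
print(plain)
print(f"HTML is {overhead:.1f}x the length of the plain text")
```

Note that this bare stripping also throws away headings, emphasis and tables, which is the structure a Markdown conversion would keep.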

The AI receives the text as it was extracted from the documents, or the plain text of plain documents. You can use an HTML-to-markdown converter to keep a bit of structure in the plain text at lower cost, rather than merely stripping out elements, structure, and even paragraphs. Markdown has tables. This forum supports markdown tables.

| Aspect | HTML Description | Markdown Description | Advantage |
| --- | --- | --- | --- |
| Syntax Complexity | HTML uses tags enclosed in angle brackets (e.g., `<b></b>`, `<i></i>`) | Markdown uses simpler syntax with special characters (e.g., `**bold**`, `_italic_`) | Markdown |
| Learning Curve | Requires understanding of various tags and attributes | Easier to learn and use for beginners | Markdown |
| Flexibility | Highly flexible with extensive formatting options and precise control | Less flexible but covers most common formatting needs | HTML |
| AI Parsing | AI models may struggle with complex nested tags and extensive attributes | Simpler structure makes it easier for AI models to parse and understand | Markdown |
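
Since markdown has tables, a table in the page body does not have to be flattened into loose text. Here is a minimal sketch of that idea with Python's standard library only. It assumes simple tables (no nesting, no colspan/rowspan, every cell properly closed) and treats the first row as the header; the sample table is made up, and a dedicated converter library would likely handle more cases.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collects the rows of one simple HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html: str) -> str:
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


table_html = """
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>10 EUR</td></tr>
  <tr><td>Pro</td><td>25 EUR</td></tr>
</table>
"""
markdown_table = html_table_to_markdown(table_html)
print(markdown_table)
```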

@_j
Thank you very much, you confirmed my guess.

Do you maybe know any tools or libraries that are considered standard for such cleanups? Or does everybody handle this task on a best-guess basis? BeautifulSoup, JSSoup, HTML Tidy…?

The cleanup task doesn't seem complicated:

  • Get meta title and meta description
  • Exclude non-semantic elements (head, nav, header, aside, footer)
  • Exclude images and links, but keep the anchor text and ALT text
  • Exclude social network links completely, including their anchor text
  • Convert the remaining stuff to markdown, while maintaining headings, lists, tables, blockquotes and bold/italic.
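
The steps above could be sketched roughly like this, using only Python's standard library. Treat it as an illustration under assumptions rather than a finished cleaner: the `PageCleaner` class, the `SOCIAL` domain list and the sample page are all made up for this example, ordered lists come out as plain bullets, and tables and blockquotes are left out to keep it short.

```python
import re
from html.parser import HTMLParser

SKIP = {"nav", "header", "aside", "footer", "script", "style"}
SOCIAL = ("facebook.com", "twitter.com", "linkedin.com")  # hypothetical list
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
INLINE = {"b": "**", "strong": "**", "i": "_", "em": "_"}
BLOCKS = {"p", "ul", "ol"}

class PageCleaner(HTMLParser):
    """Collects meta title/description and a rough markdown body."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.out = []          # markdown fragments, joined in markdown()
        self._skip = 0         # nesting depth inside excluded elements
        self._in_head = False
        self._in_title = False
        self._social = False   # True while inside a social network link

    def _hidden(self):
        return bool(self._skip or self._in_head or self._social)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self._in_head = True
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag in SKIP:
            self._skip += 1
        elif self._hidden():
            return
        elif tag == "a" and any(d in (attrs.get("href") or "") for d in SOCIAL):
            self._social = True            # drop the link and its anchor text
        elif tag == "img" and attrs.get("alt"):
            self.out.append(attrs["alt"])  # keep the ALT text, drop the image
        elif tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag])
        elif tag in BLOCKS:
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in INLINE:
            self.out.append(INLINE[tag])

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag == "title":
            self._in_title = False
        elif tag in SKIP:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._social = False
        elif self._hidden():
            return
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in HEADINGS or tag in BLOCKS:
            self.out.append("\n\n")

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._hidden():
            self.out.append(re.sub(r"\s+", " ", data))

    def markdown(self):
        # Trim each line, then collapse runs of blank lines into one
        lines = [line.strip() for line in "".join(self.out).split("\n")]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

page = """<html><head><title>Opening hours | Example Shop</title>
<meta name="description" content="When the shop is open."></head>
<body><nav><a href="/">Home</a></nav>
<main><h1>Opening hours</h1>
<p>We are open <b>Monday to Friday</b>. <a href="/contact">Contact us</a> for details.</p>
<p><a href="https://facebook.com/example">Follow us</a></p>
<ul><li>Morning</li><li>Afternoon</li></ul>
<img src="shop.png" alt="Shop front"></main>
<footer>Copyright Example Shop</footer></body></html>"""

cleaner = PageCleaner()
cleaner.feed(page)
cleaner.close()
print(cleaner.title, "|", cleaner.description)
print(cleaner.markdown())
```

Real-world pages are messier than this sample (unclosed tags, nested layouts, inline scripts), so an existing HTML-to-markdown converter is probably the safer choice for production use.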

My previous idea was to scrape something like Reader View to get simplified pages.

Something like Selenium delivers high-quality text, but it is also incredibly heavy and requires hosted pages, so we can discount rendering techniques.

Instead, head over here, where the link is to “html-to-markdown”:
