I’m building a generative AI chatbot that answers users’ questions about our various product offerings.
Most of this information lives in tables/matrices in our internal Knowledge Base articles (HTML). To make sure the generated answers are accurate, what is the best text format to convert those tables/matrices to before generating embeddings?
Several of these tables/matrices contain rowspans (one value that applies to multiple rows but is displayed only once), embedded tables, or even ordered/numbered lists.
My initial thought was to convert the HTML to Markdown and repeat any rowspan value on each row it covers. But Markdown tables don’t support embedded tables or lists (unless written as HTML), and to reduce size/cost/noise I don’t want to include HTML when generating the embeddings.
Otherwise, I think your plan of converting the docs to Markdown first should work great. I do the same for my production apps. And yeah, repeat the rowspan values as you planned, since the default Assistants API RAG just uses chunks and vector search underneath (as far as I know), so each row needs to carry its own full context.
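
In case it helps, here is a rough sketch of the rowspan-repeating step, using only Python's standard `html.parser`. The names `TableGrid` and `table_to_markdown` and the sample table are made up for illustration. It assumes a single, non-nested table whose first row is the header, and it does not handle colspans, embedded tables, or lists inside cells, so treat it as a starting point rather than a finished converter.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Hypothetical helper: collects one non-nested HTML table into rows,
    copying each rowspan value into every row it covers."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self.pending = {}   # column index -> (rows still to fill, text)
        self.row = None     # row currently being built
        self.cell = None    # text fragments of the current cell
        self.rowspan = 1    # rowspan of the current cell

    def _fill_pending(self):
        # Copy down values carried over from rowspans in earlier rows.
        while len(self.row) in self.pending:
            col = len(self.row)
            left, text = self.pending.pop(col)
            self.row.append(text)
            if left > 1:
                self.pending[col] = (left - 1, text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.rowspan = int(dict(attrs).get("rowspan") or 1)
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Collapse whitespace and escape pipes so the Markdown row stays valid.
            text = " ".join("".join(self.cell).split()).replace("|", "\\|")
            self._fill_pending()
            if self.rowspan > 1:
                self.pending[len(self.row)] = (self.rowspan - 1, text)
            self.row.append(text)
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self._fill_pending()  # rowspans that sit in trailing columns
            self.rows.append(self.row)
            self.row = None


def table_to_markdown(html):
    """Render the parsed grid as a Markdown table (first row = header)."""
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


# Made-up sample: "Basic" spans two rows in the HTML.
sample = """
<table>
  <tr><th>Plan</th><th>Region</th><th>Price</th></tr>
  <tr><td rowspan="2">Basic</td><td>US</td><td>$10</td></tr>
  <tr><td>EU</td><td>$9</td></tr>
</table>
"""
print(table_to_markdown(sample))
```

With the sample above, the third table row comes out as `| Basic | EU | $9 |`, so a chunk containing only that row still says which plan it belongs to. Embedded tables and numbered lists would need separate handling (for example, flattening them into plain text inside the cell), which this sketch does not attempt.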