How to generate embeddings from jsdocs?

Would really like some suggestions or pointers on the best way to generate embeddings from our jsdocs. The end goal is to provide a chatbot to our users that understands our documentation / API fully.

We could chunk and build embeddings from the HTML that our docs build produces, of course, but I was curious if there is a better approach, perhaps using the JSON / AST that is generated as part of the jsdoc process instead?



Hi and welcome to the forum!

The best way to embed data is a large topic. Typically, the more closely the embedded data matches what it will be searched against, the better. If you put JSON structures into your embeddings, be aware that you will be paying for, and encoding, a great deal of extra syntax. If you expect your searches to be in this format, then it's certainly worth trying.
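If the JSON route appeals, one middle ground is to flatten each doclet into plain sentences before embedding, so you keep the information without paying for the brackets and quotes. Here is a rough sketch in Python, assuming the doclet fields that `jsdoc -X` typically emits (`kind`, `longname`, `description`, `params`, `returns`). The function name `doclet_to_text` and the sample doclet are made up for illustration, so check the field names against your own output.

```python
def doclet_to_text(doclet: dict) -> str:
    """Render one JSDoc doclet (a dict parsed from JSON) as plain text."""
    # Header line, e.g. "function Cart#addItem"
    name = doclet.get("longname", doclet.get("name", ""))
    lines = [f"{doclet.get('kind', 'symbol')} {name}"]

    if doclet.get("description"):
        lines.append(doclet["description"])

    # One sentence per parameter; type names are joined with " | "
    for param in doclet.get("params", []):
        types = " | ".join(param.get("type", {}).get("names", []))
        lines.append(
            f"Parameter {param.get('name', '')} ({types}): "
            f"{param.get('description', '')}"
        )

    for ret in doclet.get("returns", []):
        types = " | ".join(ret.get("type", {}).get("names", []))
        lines.append(f"Returns ({types}): {ret.get('description', '')}")

    return "\n".join(lines)


# Hypothetical doclet, shaped like an entry from the JSON dump
sample_doclet = {
    "kind": "function",
    "longname": "Cart#addItem",
    "description": "Adds an item to the cart.",
    "params": [
        {
            "name": "sku",
            "type": {"names": ["string"]},
            "description": "Product identifier.",
        }
    ],
    "returns": [
        {"type": {"names": ["number"]}, "description": "New item count."}
    ],
}

print(doclet_to_text(sample_doclet))
```

Each resulting passage reads much more like the questions a user would type into a chatbot, which is the matching-format point made above.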

Unlike fine-tuning, embeddings do not require large amounts of data to become useful, so you can embed a small subsection as a test case and evaluate it. That shortens the iteration loop and makes progress easier.

As this entire field is very new, there are still best practices to be found and areas where trial and error is worthwhile.

There are some key points. One is overlapping data, where each chunk contains a percentage of the chunks before and after it, so that relevance carries across chunk boundaries. Another is keeping your input data consistent across chunks, i.e. not switching between JSON-serialised text and plain text.
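To make the overlap idea concrete, here is a minimal sketch in Python. It splits on whitespace rather than real tokens, and the function name `chunk_with_overlap` and the sizes used are just illustrative choices, not a recommendation.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the chunk before it."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window has reached the end of the text
    return chunks


chunks = chunk_with_overlap("a b c d e f g h i j", size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Because each chunk starts with the tail of the previous one, a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.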

As embedding is very cheap ($0.0002 per 1k tokens), it's worth creating some R&D sets to get the best results for your corpus.
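For a sense of scale, a quick back-of-the-envelope calculation in Python. The rate is the one quoted above and may be out of date, and the corpus size of 5 million tokens is a made-up figure, so substitute your own numbers.

```python
PRICE_PER_1K_TOKENS = 0.0002  # USD, rate quoted in this thread; verify current pricing


def estimate_cost(token_count: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Return the approximate cost in USD of embedding `token_count` tokens."""
    return token_count / 1000 * price_per_1k


corpus_tokens = 5_000_000  # hypothetical corpus size
cost = estimate_cost(corpus_tokens)
print(f"${cost:.2f}")  # $1.00
```

At that rate, re-embedding the whole corpus several times with different chunking settings is still only a few dollars.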