How to best structure CSV embeddings to elicit clear and correct answers from an AI chatbot

Hi

Wondering if anyone can help.

I’m currently creating a CSV embedding file for an AI chatbot that answers user queries about certain aspects of a utilities market.

What I’m finding is that when I ask about individual standards, the AI produces an accurate response; however, when I ask it to list them all as a group, only some of the standards are listed and others are left out.

I was wondering what the community finds to be the best structure when creating a CSV embedding file. Is a question/answer structure the way to go, or is it better to categorise each group of questions and responses and, where certain responses link back to a transaction ID or market group, create separate columns for each?

Curious to hear your thoughts. I really want to avoid separate answers in my embedding getting merged together in the overall response, and to ensure every standard in the embedding file gets picked up.

Thanks

The best way would probably be to structure the CSV the same way a user might phrase a question to the AI chatbot.

The second approach is valid as well, but you would need a classifier of sorts first, which could categorise the question and then run the embeddings match. Though if your CSV/database isn’t too big, it shouldn’t be a problem.
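The classify-then-match flow could be sketched roughly as below. Everything here is a hypothetical stand-in: the category names and rows are made up, the toy classifier is just a keyword check (a real setup might use a zero-shot LLM call), and the similarity search is reduced to a category filter.

```python
# Hypothetical rows from a CSV knowledge base, each tagged with a category.
rows = [
    {"category": "Metering", "text": "Verification of Supply Arrangements bilateral"},
    {"category": "Billing", "text": "Settlement dispute standard"},
]

def classify(question, categories):
    # Toy classifier: return the first category whose name appears in the
    # question. A real implementation would likely be an LLM classification call.
    for category in categories:
        if category.lower() in question.lower():
            return category
    return None

def match(question, rows, category=None):
    # Restrict the candidate pool to the predicted category *before* the
    # embedding similarity search, so unrelated groups can't crowd it out.
    return [r for r in rows if category is None or r["category"] == category]

category = classify("What is a Metering bilateral?", ["Metering", "Billing"])
candidates = match("What is a Metering bilateral?", rows, category)
print(category, candidates)
```

The point of the design is that the embedding match only ever runs over rows from one category, which keeps partially related rows in other groups from being retrieved.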

This is interesting. I’m not sure I fully understand you.

I don’t use a question/answer structure. I also make multiple columns, in contrast to the two that most people use. If you’re using 3.5, the AI is quite good at understanding your dataset. Since you’re talking CSV, what does your data structure look like? I got the best results when “grouping” by row and always adding the attribute name to each value. It looks like: title row, attribute: 1, attribute: 2, attribute: 3. I did this with questions and answers as well.
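A minimal sketch of that “attribute name on every value” serialisation, assuming the goal is to turn each CSV row into one self-describing text chunk before embedding it. The column names and the row content here are invented for illustration.

```python
import csv
import io

# Hypothetical CSV with a title column plus named attributes.
raw = """Title,Attribute 1,Attribute 2
Verification of Supply Arrangements,Bilaterals,Metering
"""

def row_to_text(row):
    # Prefix each value with its column name so the embedded chunk is
    # self-describing, rather than a bare comma-separated line.
    return ", ".join(f"{name}: {value}" for name, value in row.items())

texts = [row_to_text(row) for row in csv.DictReader(io.StringIO(raw))]
print(texts[0])
```

Each resulting string carries its own labels, so even when a chunk is retrieved in isolation the model can tell which value belongs to which attribute.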
Also, if you want specific outputs, you can’t be too specific when creating the prompt. Writing out instructions can be very helpful for consistency. I also think prompts are very dataset-specific. One LLM application I built therefore uses an input form, so that all users query the same way, because the dataset has over 20’000 rows with 33 columns each. It is absolutely capable of structuring the response in the format you want.

I’ve provided a snip of my CSV embedding below. All the columns present in the embedding are included in this snip. I thought the best approach would be to place the columns containing the broadest information at the left and get more specific the further right you go, i.e. Group, Sub-Group, Market Term, Description, Market Code (if applicable).

I then took this embedding structure and reformatted it into a question/answer format, to compare the two and see which structure was optimal. I observed similar results with both: unless the user prompt directly matches the question in the embedding, the answer generated didn’t satisfy the question or prompt.

In terms of this user input form, does it generate a query based on one general prompt and then substitute in the values from the columns in your embedding? I.e. “The context of this question surrounds $(group), the user is interested specifically in $(sub-group). The user wishes to know, $(question)”

Which translates to: “The context of this question surrounds Bilaterals, the user is interested specifically in Metering. The user wishes to know what is a Verification of Supply Arrangements bilateral”
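That substitution step could be sketched with a plain string template, assuming the form collects the group, sub-group, and question as separate fields. The placeholder names are adapted for Python (`${sub_group}` rather than `$(sub-group)`, since `Template` identifiers can’t contain hyphens), and the field values are taken from the example above.

```python
from string import Template

# One fixed prompt template; every user query is built by filling in the
# form's field values, so all queries reach the model in the same shape.
prompt = Template(
    "The context of this question surrounds ${group}, "
    "the user is interested specifically in ${sub_group}. "
    "The user wishes to know, ${question}"
)

filled = prompt.substitute(
    group="Bilaterals",
    sub_group="Metering",
    question="what is a Verification of Supply Arrangements bilateral",
)
print(filled)
```

Because the surrounding wording never varies, the only thing that changes between queries is the substituted values, which makes retrieval behaviour much more predictable across users.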

I’ve used the values of the column headings from my snip to provide context. I’d never considered an input form before. Very interesting.

Hope this helps