Thanks a lot for those insights. I think I am starting to get the general idea.
So I have this case that seem to be a multi-shot case but that probably has some more subtle angles to consider. Basically, I want to build a sort of system that is specialized within the data privacy compliance area. So we are talking about a niche with a lot of pretty specific & large amount / diverse amount of information & knowledge to retrieve.
Here is my goal: Based on a single question, I want to be able to retrieve the most relevant information that could help building knowledge to answer to the question. It’s not about answering the question directly and it’s not about retrieving information in bulk. It’s rather about highlighting the most relevant piece of information and making it easy to understand, concise & actionable to help for future decision making.
And I think I will essentially have 3 types of questions
[1] The simple one – The question goes straight to the point about a very specific topic
Blockquote
Question: Considering GDPR law, where am I allowed to store data collected?
Blockquote
Answer: GDPR states that data should essentially be stored in the EU. Stored data can be transferred to another region as long as the destination country enforces data privacy rules that are equivalent to GDPR. Note that transferring data to the US might be in breach of GDPR due to the recent eprivacy developments that exposed a risk regarding data anonymization & access.
[2] The open one – The question requires combining several topics & ask for an opinion / outlook
Blockquote
Question: How did GDPR impact digital marketing and how will it impact it in the near future?
Blockquote
Answer: The gdpr has significant implications on online lead generation. The biggest impact was on ethical data acquisition that requires businesses to have a clear basis for data collection such as consent or legitimate interest. This ensures that leads are genuinely interested in your business thereby improving the effectiveness of your marketing efforts. Furthermore, other requirements have emerged: Companies can no longer store data indefinitely unless they have a legitimate purpose, they also have to justify & keep track of all data processing in place, etc. Within the near future, it is expected that laws such as GDPR will continue to apply constraints to companies operating digital marketing activities while users’ protection should improve over time. Future regulations will most likely be influenced by the upcoming judgments from the different local data authorities that are assessing major tech. companies compliance such as Meta, Facebook & Amazon.
[3] The decision-making one – The question asks for a recommendation / what to do in a certain situation
Blockquote
Question: Is it more strategic to focus on gdpr-compliant user consent collection or on secure data storage for now?
Blockquote
Answer: When it comes to GDPR compliance, it is difficult to prioritize consent collection versus data storage as both are mandatory element to compliance. A valid consent is the first critical & mandatory step before collecting & processing personal data. Without it, anyone collecting data will be automatically in breach. Once the data is collected, it should be stored in a secure manner which involves data encryption, anonymization, etc. Even though it is not advised, we could assume that securing data could be done afterwards in case the company needs a delay to build up its securitization capabilities.
How do I deal with this use case right now?
I approached those questions in a very “traditional” way:
- I have a large text file (50k words → It will probably grow up to 150k words in the future) that contains all the data privacy knowledge.
- Based on the question asked, I run a semantic search to retrieve relevant information from my document (Ada-002 embedding + Chroma + Similarity x MMR search)
- Then I build a few prompts in which I inject the semantic search result as a context and I refine the retrieved information using GPT4 to exclude any irrelevant information / rephrase the text to make it easy to understand.
What I like with this approach is the ability to retrieve a good diversity of information from my document. Diversity is important because some questions (such as [2] & [3]) involve dealing with cross-topic knowledge.
What I don’t like is the fact that I can sometime get “false positive” / unrelated information. And I can also end up retrieving information that is a bit vague or actually not entirely applicable to the question → The more complex or large the question is, the bigger the issue.
What I also don’t like is that I need several prompts to reach a decent result (which can quickly become expensive).
Could multi-shot work be an enhancement?
As far as I understand, the approach above is not multi-shot per say but rather context stuffing. To do a real multi-shot, I should come up with a pairs of question x answer.
But I am not sure about the benefits of shifting from context stuffing to multi shot:
- Can this really bring better accuracy & value compared to context stuffing?
- If I want to cover the complexity of data privacy, I assume I’ll have to come up with an very large amount of questions. In my current document, what used to be a single paragraph regarding a certain topic will suddenly become a list of 10-20 questions x answers to make sure I cover aspect of the same topic.
- Can a multi-shot maintain the right level of information diversity? Some questions are broad or complex ; they require retrieving information from several topics. Intuitively, I would assume that the question x answer pair would narrow the prompt intent and make it less capable of assembling diverse information
So my questions are this point are:
Can we consider such situation as pattern limit due to the complexity described?
If I make the effort of building so many questions-answers pairs, what’s the point in keeping a multi-shot approach rather than feeding a fined-tuned gpt3.5?
Thanks a lot again for your help