@sergeliatko
Thank you for the reply. I am testing a use case where a user asks about a company and the bot replies with a description of that company.
My training data (facts) looks like this. ABC and BCD stand in for company names, and there is a large number of them.
# Training Data
{"prompt":"Tell me about ABC ->","completion":" ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles.\n"}
{"prompt":"Tell me about BCD ->","completion":" BCD belongs to CyberSecurity. HQ is in ISR. Their business is related to Consumer Electronics,Hardware,Information Technology,Privacy and Security,Software. BCD is a breach and attack simulation platform that helps organizations verify their security posture.\n"}
First, I tried fine-tuning a model on 3K+ samples like the ones above. That did not work well with curie or davinci: even when I asked exactly the same prompt as in the training data, the model returned non-factual responses.
Second, I used the embedding API to calculate the similarity between the sample prompts and the user's input, and then provided the 3 most similar prompt/completion pairs from the training data as context. This is working fine so far. The steps are as follows.
1: Pre-process the training data in a CSV file and compute an embedding for each row with the embedding API. The columns of the file look like this.
prompt,completion,babbage_similarity,babbage_search,...<Additional Columns>
2: When a user asks a question, compute the embedding of the input string, calculate its similarity against the pre-processed data, and select the 3 most similar rows (prompt and completion) to include in the completion API request.
3: Build the completion API request. The structure of the request content is:
<Prefix String>
<3 Training Context selected from Embedding API similarity against user's input>
<User's Input>
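As a rough sketch of the selection in step 2: this assumes the embeddings are already computed and loaded as plain lists of floats. The toy 3-dimensional vectors and the names `cosine_similarity` and `top_k_similar` are only illustrative, not my actual code or real embedding output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def top_k_similar(query_embedding, rows, k=3):
    """Return the k rows whose embedding is most similar to the query."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for the pre-processed CSV rows (embeddings are made up).
rows = [
    {"prompt": "Tell me about ABC ->", "completion": " ABC belongs to Web3.\n",
     "embedding": [0.9, 0.1, 0.0]},
    {"prompt": "Tell me about BCD ->", "completion": " BCD belongs to CyberSecurity.\n",
     "embedding": [0.0, 1.0, 0.0]},
    {"prompt": "Tell me about BC ->", "completion": " BC belongs to FinTech.\n",
     "embedding": [0.7, 0.3, 0.1]},
    {"prompt": "Tell me about valuation of ABC ->", "completion": " ABC has valuation of 100M$\n",
     "embedding": [0.8, 0.1, 0.3]},
]

# Made-up embedding of the user's input.
query_embedding = [1.0, 0.0, 0.0]
selected = top_k_similar(query_embedding, rows, k=3)
for row in selected:
    print(row["prompt"])
```

With real data, the embedding of the user's input would come from the same embedding model that was used for the CSV rows, otherwise the similarity scores are not comparable.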
The actual request looks like this. The parts surrounded by <> are labels and are not included in the actual request.
<Prefix String>
The following is a conversation with an AI assistant called BOT. BOT is helpful, creative, clever, and very friendly. If you ask BOT a question that is rooted in truth, BOT will give you the answer. If you ask BOT a question that is nonsense, trickery, or has no clear answer, I will respond with "Sorry, I am not sure. I will learn more.".\n\n
<3 Training Context selected from Embedding API similarity against user's input>
User: Tell me about ABC ->
BOT: ABC belongs to Web3. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. ABC is a blockchain technology company that develops NFTs and digital collectibles. ###
User: Tell me about BC ->
BOT: BC belongs to FinTech. HQ is in USA. Their business is related to Financial Services,Media and Entertainment,Other,Payments,Software. BC is a payment technology company that develops banking solutions. ###
User: Tell me about valuation of ABC ->
BOT: ABC has valuation of 100M$ ###
<User's Input>
User: Tell me about ABC ->
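The request text above could be assembled roughly like this. It is a minimal sketch: `build_prompt` is a hypothetical helper name, and the `examples` list stands in for the 3 rows selected by similarity.

```python
# Prefix string copied from the request shown above.
PREFIX = (
    "The following is a conversation with an AI assistant called BOT. "
    "BOT is helpful, creative, clever, and very friendly. "
    "If you ask BOT a question that is rooted in truth, BOT will give you the answer. "
    "If you ask BOT a question that is nonsense, trickery, or has no clear answer, "
    'I will respond with "Sorry, I am not sure. I will learn more.".\n\n'
)


def build_prompt(prefix, examples, user_input):
    """Concatenate the prefix, the selected examples, and the user's input."""
    parts = [prefix]
    for example in examples:
        # Completions in the training data start with a space and end with "\n",
        # so strip only the trailing newline before adding the "###" separator.
        completion = example["completion"].rstrip()
        parts.append(f"User: {example['prompt']}\nBOT:{completion} ###\n")
    parts.append(f"User: {user_input}\n")
    return "".join(parts)


# Stand-ins for the 3 rows selected by embedding similarity.
examples = [
    {"prompt": "Tell me about ABC ->",
     "completion": " ABC belongs to Web3. HQ is in USA.\n"},
    {"prompt": "Tell me about BC ->",
     "completion": " BC belongs to FinTech. HQ is in USA.\n"},
    {"prompt": "Tell me about valuation of ABC ->",
     "completion": " ABC has valuation of 100M$\n"},
]

prompt = build_prompt(PREFIX, examples, "Tell me about ABC ->")
print(prompt)
```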
So the prompt embeddings are used only to select which training examples to include as context in the completion API request. I wonder what would work better. If you have any suggestions, I would really appreciate them.