Created a custom model with prompts and completions, but the results are not satisfying

Below is my scenario:

I used the prompt and completion file below to train the model, and the training was successful. When I try the model in the playground, I do not get the expected result; the model is hallucinating.

Please help me with the questions below:

  1. Does my prompt and completion file look fine?
  2. The model trained successfully. How do I test it in the playground, that is, how should I phrase the question?
  3. Any help with fine-tuning and getting proper results would be appreciated.

I was not able to attach the file, so here are sample entries:
{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=58| ASSIGNMENT_NUMBER =E58| EMPLOYEE_NAME =Clare Paul, Angela| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =2020/03/11| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 58 2. Employee Name= Clare Paul, Angela 3. Assignment Number= E58 4. Business Unit= Powered US 5. Legal Entity= Powered Enterprise USA 6. Employee Status= ACTIVE 7. Hire Date= 2020/03/11”}
{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=68| ASSIGNMENT_NUMBER =E68| EMPLOYEE_NAME =Dave, Murry| ACTION_CODE =LOCATION_CHANGE| EFFECTIVE_START_DATE =2021/08/25| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=2| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =US Manufacturing Unit 01| BUSINESS_UNIT =US MFG BU1| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 68 2. Employee Name= Dave, Murry 3. Assignment Number= E68 4. Business Unit= US MFG BU1 5. Legal Entity= US Manufacturing Unit 01 6. Employee Status= ACTIVE 7. Hire Date= 2021/08/25”}
{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=95| ASSIGNMENT_NUMBER =E95| EMPLOYEE_NAME =Requisitioner.US1| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =1959/01/01| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 95 2. Employee Name= Requisitioner.US1 3. Assignment Number= E95 4. Business Unit= Powered US 5. Legal Entity= Powered Enterprise USA 6. Employee Status= ACTIVE 7. Hire Date= 1959/01/01”}

I have 800 examples in my file. Please help.

Thanks,
Sheik

Welcome to the forum.

Fine-tuning can be a bit tricky. What did your dataset look like? How many examples? More info is helpful.

What is the main goal you’re trying to achieve? Are you sure fine-tuning, and not RAG, is the appropriate solution for you?

{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=58| ASSIGNMENT_NUMBER =E58| EMPLOYEE_NAME =Clare Paul, Angela| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =2020/03/11| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 58 2. Employee Name= Clare Paul, Angela 3. Assignment Number= E58 4. Business Unit= Powered US 5. Legal Entity= Powered Enterprise USA 6. Employee Status= ACTIVE 7. Hire Date= 2020/03/11”}
{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=68| ASSIGNMENT_NUMBER =E68| EMPLOYEE_NAME =Dave, Murry| ACTION_CODE =LOCATION_CHANGE| EFFECTIVE_START_DATE =2021/08/25| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=2| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =US Manufacturing Unit 01| BUSINESS_UNIT =US MFG BU1| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 68 2. Employee Name= Dave, Murry 3. Assignment Number= E68 4. Business Unit= US MFG BU1 5. Legal Entity= US Manufacturing Unit 01 6. Employee Status= ACTIVE 7. Hire Date= 2021/08/25”}
{“prompt”: “Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:\nPERSON_NUMBER=95| ASSIGNMENT_NUMBER =E95| EMPLOYEE_NAME =Requisitioner.US1| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =1959/01/01| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \nPresent the details as a numbered list\nIgnore specical characters in your response\nFocus on columns you think are relevant, and the relationships between them\n\nDo not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid\nExtract the details about the employee based on user input”, “completion”: “The Employee details are\n 1. Person Number= 95 2. Employee Name= Requisitioner.US1 3. Assignment Number= E95 4. Business Unit= Powered US 5. Legal Entity= Powered Enterprise USA 6. Employee Status= ACTIVE 7. Hire Date= 1959/01/01”}

This is my JSON file. It has 800 unique prompts, but I am not getting the expected results.

My expectation is to get the employee details. I expect the model to understand natural-language questions and give an apt response.

Here’s some advice from our pal GPT-4…

Based on the JSON file you’ve shared, it seems you are working with a dataset for training a model to extract specific employee details from a structured data format. Here are some insights and suggestions to address your concerns:

Assessing Your JSON File for Training

Structure and Formatting:

Your JSON structure appears correct for a training dataset. Each entry contains a prompt and a corresponding completion.
Ensure that all 800 entries maintain this consistent format.
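
As a minimal sketch of that consistency check, the Python snippet below walks the lines of a JSONL file and reports entries that do not parse or that lack the two expected keys. The function name `validate_jsonl_lines` is hypothetical, and the curly-quote check assumes the typographic quotes seen in the samples above may also be present in the real file rather than being only a forum rendering effect.

```python
import json

# Typographic quotes that editors and forum software often substitute for plain ones.
CURLY_TO_STRAIGHT = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are skipped but still counted in the numbering
        try:
            entry = json.loads(text)
        except json.JSONDecodeError:
            # Retry with normalised quotes to tell "bad quotes" apart from "bad JSON".
            try:
                json.loads(text.translate(CURLY_TO_STRAIGHT))
                problems.append((number, "curly quotes instead of straight quotes"))
            except json.JSONDecodeError:
                problems.append((number, "not valid JSON"))
            continue
        if not isinstance(entry, dict) or set(entry) != {"prompt", "completion"}:
            problems.append((number, "expected exactly the keys 'prompt' and 'completion'"))
        elif not all(
            isinstance(entry[key], str) and entry[key].strip()
            for key in ("prompt", "completion")
        ):
            problems.append((number, "prompt and completion must be non-empty strings"))
    return problems
```

To run it on a file, pass the open file object, for example `validate_jsonl_lines(open("train.jsonl", encoding="utf-8"))`, where `train.jsonl` is a placeholder name.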

Data Quality:
It’s crucial that the data accurately represents the type of queries and responses you expect in real-world scenarios.
The completions should be precise, following the guidelines set in the prompt.

Variety and Coverage:
Ensure that your dataset covers a wide range of possible queries and variations in data formatting.
Including a diverse set of examples helps the model generalize better.

Testing the Model in Playground

Consistency with Training Data:
When testing in the playground, your queries should be similar in structure and context to the training data.
For example, if your training data is based on extracting employee details, your test queries should also be about extracting similar information.

Specificity in Queries:
Be specific in your queries. The more your query resembles the training prompts, the better the model’s response will likely be.

Testing with Variations:
Test with variations in your queries to see how well the model generalizes the knowledge it gained from training.
Include edge cases or slightly different formats from what’s in your training data.

Addressing Hallucination in Model Responses

Refine Training Data:
If the model is hallucinating (providing inaccurate or fictional information), this might be due to ambiguities or inconsistencies in the training data.
Review and refine your training dataset for clarity and consistency.
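
One way to review the data for this, sketched below in Python, is to check that every value in a completion actually occurs in the record inside its prompt. The helper name `ungrounded_fields` and the regular expression are assumptions based on the "1. Field= value 2. Field= value" layout of the sample completions. The substring test is deliberately crude (for example, "58" also matches inside "E58"), so treat a clean result as a weak signal only. The same check can be run on model outputs.

```python
import re

# Matches "N. Field Name= value" items, where a value runs until the next
# numbered item or the end of the text.
FIELD_PATTERN = re.compile(
    r"\d+\.\s*([A-Za-z ]+?)=\s*(.*?)(?=\s+\d+\.\s+[A-Za-z ]+=|\s*$)",
    re.DOTALL,
)


def ungrounded_fields(prompt, completion):
    """Return (field, value) pairs whose value does not occur verbatim in the prompt."""
    missing = []
    for name, value in FIELD_PATTERN.findall(completion):
        if value.strip() not in prompt:
            missing.append((name.strip(), value.strip()))
    return missing
```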

Incremental Training:
Consider incremental training where you initially use a subset of your data, test the model, and then gradually add more data, observing how the model’s performance changes.

Feedback Loop:

  • Use the model’s incorrect responses as feedback to further refine your training data.

Fine-tuning and Improving Results

Balanced Data:

  • Ensure your dataset is balanced in terms of the types of queries and responses.
  • Avoid overrepresentation of certain patterns which might lead the model to develop biases.
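
A quick way to look for overrepresented patterns is to count how often each value of a field appears across the prompts. The sketch below counts `ACTION_CODE` values, since the samples show at least `HIRE` and `LOCATION_CHANGE`. The function name `action_code_counts` is hypothetical, and the regular expression assumes the `ACTION_CODE =VALUE|` layout from the sample records.

```python
import re
from collections import Counter

# Tolerates optional spaces around "=", as in "ACTION_CODE =HIRE|".
ACTION_CODE = re.compile(r"ACTION_CODE\s*=\s*([A-Z_]+)")


def action_code_counts(prompts):
    """Count how many prompts contain each ACTION_CODE value."""
    counts = Counter()
    for prompt in prompts:
        match = ACTION_CODE.search(prompt)
        counts[match.group(1) if match else "<missing>"] += 1
    return counts
```

If one value accounts for nearly all 800 examples, the model has seen very little of the others.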

Quality Over Quantity:

  • While having 800 unique prompts is good, the quality and diversity of these prompts are more important than the quantity.

Continuous Evaluation:

  • Regularly evaluate the model’s performance with new, unseen data to gauge its real-world applicability.
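
As a rough illustration of evaluating on unseen data, the Python sketch below holds back part of the examples before training and later scores the model on them by exact match. Both function names are hypothetical, and `predict` stands for whatever function sends a prompt to your fine-tuned model and returns its text; it is not part of any library.

```python
import random


def split_examples(examples, holdout_fraction=0.1, seed=0):
    """Shuffle reproducibly and return (training_examples, held_out_examples)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]


def exact_match_rate(examples, predict):
    """Fraction of examples where the model output equals the expected completion."""
    if not examples:
        return 0.0
    hits = sum(
        1
        for example in examples
        if predict(example["prompt"]).strip() == example["completion"].strip()
    )
    return hits / len(examples)
```

Exact match is strict; for this extraction task it is a reasonable first measure because the expected completion has a fixed layout.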

Documentation and Community Resources:

  • Refer to the documentation of the tool or platform you’re using for model training. They often provide best practices for training data preparation and model evaluation.
  • Engage with community forums or support channels for specific advice related to the tool you are using.

Remember, fine-tuning a model is an iterative process. It often requires multiple rounds of training, testing, and refining to achieve the desired results. Keep experimenting with different approaches and track the changes in performance to understand what works best for your specific use case.

If you can’t get it working, you might look into RAG, retrieval augmented generation…

Thanks for the fantastic response:

My problem is this: given the prompt I trained with, what exact type of question do I need to ask?

Please provide some examples:

I tried: “Please provide the details of the employee with assignment number E324”, and it’s continuously hallucinating.

I had a small prompt with 5 records, and it was working perfectly.


I understand the challenge you’re facing. When dealing with a model trained on specific prompts and completions, it’s crucial that the input query closely aligns with the structure and style of the training data. This ensures that the model can accurately recognize and respond to the query based on its training.

Given your scenario, let’s refine the way you frame your queries to match the training data. Here are some examples based on the prompts you provided:

Original Training Prompt Structure

Your training data follows a structure where a detailed string of employee data is provided, and the model is expected to extract specific information. For example:

PERSON_NUMBER=58| ASSIGNMENT_NUMBER =E58| EMPLOYEE_NAME =Clare Paul, Angela| ...

Suggested Query Formats

  1. Direct Reference to Data Structure:

    • “Extract employee details for the following data: PERSON_NUMBER=324| ASSIGNMENT_NUMBER =E324| EMPLOYEE_NAME =John Doe| …”
    • This query mimics the structure of your training data, which might help the model to understand and process it correctly.
  2. Structured Query Reflecting Training Format:

    • “Given the dataset entry: PERSON_NUMBER=324| ASSIGNMENT_NUMBER =E324| EMPLOYEE_NAME =John Doe| …, extract the employee’s name, assignment number, and person number.”
    • Here, you’re providing a snippet of data and asking for specific details, similar to your training setup.
  3. Question with Embedded Data Format:

    • “From the dataset entry ‘PERSON_NUMBER=324| ASSIGNMENT_NUMBER =E324| EMPLOYEE_NAME =John Doe| …’, what is the employee’s name and assignment number?”
    • This format directly incorporates the data string into the question.

Tips for Refining Queries

  • Mimic Training Data: Ensure your queries closely resemble the structure and content of your training prompts.
  • Clarity and Specificity: Be clear and specific about what information you need the model to extract.
  • Include Context: If your training data includes a specific introduction or context, include this in your queries too.

Addressing Hallucination

  • If the model continues to hallucinate despite well-structured queries, it might indicate a need to retrain the model with more varied or specific examples.
  • Including examples in your training data where the model has to identify and ignore irrelevant information might help reduce hallucination.

Remember, the key is to make your test queries as similar as possible to the training examples. The more your queries deviate from the training format, the higher the chance of the model not understanding or misinterpreting the request.
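
To make a test query match the training examples exactly, one option is to build it from the same template. The sketch below wraps a raw record line in the header and instructions copied from the sample training prompts, including the misspelling "specical", which is kept on purpose because the model was trained on it. The function name `build_prompt` is hypothetical, and the template assumes all 800 prompts share the wording of the three samples shown.

```python
# Template text copied from the sample training prompts, typos included,
# so that a test query has the same wording the model was trained on.
HEADER = (
    "Assume the role of a Human Resource data analyst. Given a dataset, "
    "your job is to extract the employee details from the data below:\n"
)
FOOTER = (
    "\n\n\nThe Detail should contain Person Number,Employee Name, Assignment Number, "
    "Business Unit and Legal Entity and Employee Status with Hire Date in YYYY/MM/DD Format \n"
    "Present the details as a numbered list\n"
    "Ignore specical characters in your response\n"
    "Focus on columns you think are relevant, and the relationships between them\n\n"
    "Do not make assumptions about any information not provided in the data. "
    "If the information is not in the data, any insight derived from it is invalid\n"
    "Extract the details about the employee based on user input"
)


def build_prompt(record_line):
    """Wrap one raw record line in the training template."""
    return HEADER + record_line.strip() + FOOTER
```

Note that this requires the record itself in the query. A question such as "employee with assignment number E324" gives the model no record to extract from.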

Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:
PERSONNUMBER = 58|ASSIGNMENTNUMBER = E58|EMPLOYEE_NAME = Clare Paul Angela|ACTIONCODE = HIRE|EFFECTIVESTARTDATE = 2020/03/11|EFFECTIVEENDDATE = 4712/12/31|EFFECTIVESEQUENCE=1|EMPLOYEESTATUS = ACTIVE|LEGALENTITY = Powered Enterprise USA|BUSINESSUNIT = Powered US|ASGSTATUSCODE = ACTIVE_PROCESS
Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:
PERSONNUMBER= 360|ASSIGNMENTNUMBER = E360|EMPLOYEE_NAME = USGOV-AP.Supervison|ACTIONCODE = HIRE|EFFECTIVESTARTDATE = 1959/01/01|EFFECTIVEENDDATE = 4712/12/31|EFFECTIVESEQUENCE=1|EMPLOYEESTATUS = ACTIVE|LEGALENTITY = Powered Enterprise US Gov|BUSINESSUNIT = PE US GOV Health and Human Services BU|ASGSTATUSCODE = ACTIVE_PROCESS
Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:
PERSONNUMBER= 52|ASSIGNMENTNUMBER = E52|EMPLOYEE_NAME = Nanthakumar Murugesan|ACTIONCODE = HIRE|EFFECTIVESTARTDATE = 2020/03/11| EFFECTIVEENDDATE = 4712/12/31|EFFECTIVESEQUENCE=1|EMPLOYEESTATUS = ACTIVE|LEGALENTITY = Powered Enterprise USA|BUSINESSUNIT = Powered US|ASGSTATUSCODE = ACTIVE_PROCESS
Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:
PERSON_NUMBER=218| ASSIGNMENT_NUMBER =E218| EMPLOYEE_NAME =Pricing.Manager| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =1952/01/01| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS
Assume the role of a Human Resource data analyst. Given a dataset, your job is to extract the employee details from the data below:
PERSON_NUMBER=324| ASSIGNMENT_NUMBER =E324| EMPLOYEE_NAME =Bajpai, Abhishek| ACTION_CODE =HIRE| EFFECTIVE_START_DATE =2020/09/07| EFFECTIVE_END_DATE =4712/12/31| EFFECTIVE_SEQUENCE=1| EMPLOYEE_STATUS =ACTIVE| LEGAL_ENTITY =Powered Enterprise USA| BUSINESS_UNIT =Powered US| STATUS=ACTIVE_PROCESS

  • The Detail should contain Person Number,Employee Name, Assignment Number, Business Unit and Legal Entity and Employee Status with Hire Date in DD MMM YYYY Format
  • Present the details as a numbered list
  • Ignore specical characters in your response
  • Focus on columns you think are relevant, and the relationships between them
  • Do not make assumptions about any information not provided in the data. If the information is not in the data, any insight derived from it is invalid

This prompt works perfectly for any type of question I ask. This is dummy data.

How long will it take to get proper results?
How do I confirm that the model is working fine?

But how BIG is your 800-example dataset? That is the biggest limitation. If it’s bigger than 32k tokens, you’ll have the problem that the model cannot process the whole set as ‘one’, so it will be able to ‘search’ but will have a hard time coming up with aggregate results or combining random records.

Thanks for the response. It has 800 prompts and completions, and the file size is 940 KB.

So consider creating an assistant that can do this work for a single record, and then feed the assistant one record at a time (or a block of 10).
With the current 32k models you will never be able to let the model work directly on a 1 MB file, unless it is for the purpose of searching or quoting from it.
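
To put rough numbers on this, the Python sketch below estimates the token count of a file and splits records into blocks. The figure of about 4 bytes per token is only a common rule of thumb for English text, not an exact tokenizer count, and both function names are hypothetical.

```python
def estimate_tokens(num_bytes, bytes_per_token=4):
    """Very rough token estimate; the 4 bytes/token ratio is an assumption."""
    return num_bytes // bytes_per_token


def batches(records, size=10):
    """Yield consecutive blocks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# A 940 KB file is far beyond a 32k-token context under this estimate.
file_tokens = estimate_tokens(940 * 1024)
fits_in_context = file_tokens <= 32_000
```

Under that assumption the 940 KB file comes to roughly 240,000 tokens, which is why feeding one record, or a block of 10, per request is the workable route.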


You mean to say I should club the records into a single prompt?

The model requires 32 prompt-completion pairs as a prerequisite.

I would say create an Assistant (much easier to ‘train’, by simply replacing the assistant prompt) and teach it how to process a single record or a batch of records, depending on what you can easily ‘feed it’.

I’m really not sure how to proceed on this. I am not able to understand the Assistant pattern.

Start in the sandbox. Create an Assistant in the OpenAI backend, then play with it in the sandbox.