Chat system message to use only provided data, not public data

How do I write a system message that directs ChatGPT to use only data provided by the user and not its public training data? This is for an application intended for internal use within a company, using internal data.
This data is provided by the user and appended to the prompt.
Further, I want it to say “data is not available” if no data is provided following the system prompt. Right now, if no data is provided, it defaults to using public ChatGPT training data.
I need to be able to lock it down.

Here is the current message I have tried:

“Welcome! I am an assistant programmed to discuss the contents of Acme, Inc Documents, specifically from a Vector Database provided to me. To maintain data integrity, I will solely rely on the information available in this designated database. If, for any reason, the data is unavailable, I will inform you by saying ‘This service is currently unavailable.’ Please proceed with your queries about the provided documents, and let’s begin our conversation!”

That’s not really how these things work.

How much of “public” data do you not want it to use? Like, should it not know how to write sentences or what words mean? How do you expect it to decide what data to ignore?

LLMs do hallucinate and sometimes come up with irrelevant info.

I guess using ChatGPT to help write the system message for you can produce a hallucination too. It can tell you that it is possible to limit or filter its responses to the context data provided by the user or from a vector database. Not a huge deal.

One useful thing to explore is this paper:
[2303.11315] Context-faithful Prompting for Large Language Models.

Essentially, when you ask a question whose answer needs to be faithful to the context, it’s quite slippery to get the LLM to not draw on existing knowledge.

This paper asserts that by framing the context as “Bob’s opinion” (no joke!) and asking the LLM to answer the question “in Bob’s opinion”, you get a more context-faithful answer.

Here’s a summary of the paper I generated with my auto-summariser:


The paper titled “Context-faithful Prompting for Large Language Models” focuses on improving the faithfulness of Large Language Models (LLMs) in context-specific Natural Language Processing (NLP) tasks. The authors identify two aspects where LLMs could improve: knowledge conflict and prediction with abstention. They propose two methods to enhance LLMs’ faithfulness: opinion-based prompts and counterfactual demonstrations.

Opinion-based prompts reframe the context as a narrator’s statement and inquire about the narrator’s opinions, forcing the model to pay more attention to the context. Counterfactual demonstrations use instances containing false facts to improve faithfulness in knowledge conflict situations. The authors conducted experiments on three datasets of two standard NLP tasks, machine reading comprehension and relation extraction, and found significant improvements in faithfulness to contexts.

Prompt Suggestions:

  1. Opinion-based Prompt:

    • Context: “Bob said, ‘{context}’”
    • Question: “What is the summary of the document according to Bob’s statement?”
  2. Attributed Prompt:

    • Context: “{context}”
    • Question: “Can you summarize the document based on the given text?”
  3. Instruction-based Prompt:

    • Instruction: “Read the given information carefully and provide a summary.”
    • Context: “{context}”
    • Question: “What is the summary of the document?”
  4. Opinion + Instruction-based Prompt:

    • Instruction: “Based on Bob’s statement, provide a summary.”
    • Context: “Bob said, ‘{context}’”
    • Question: “What is the summary of the document?”

Note: Replace ‘{context}’ with the actual text to be summarized.
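
For what it's worth, here is a minimal Python sketch of how the opinion-based prompt could be assembled in the application, with a guard that returns a fixed reply when no context was supplied, so the model is never called without data. The names `build_opinion_prompt`, `answer`, `NO_DATA_REPLY` and the `ask_model` callable are all hypothetical, and the exact template wording is my assumption based on the “Bob said” pattern above, not the paper's official code.

```python
from typing import Callable, Optional

# Hypothetical fixed reply used when the user supplied no data.
NO_DATA_REPLY = "Data is not available."


def build_opinion_prompt(context: str, question: str,
                         narrator: str = "Bob") -> Optional[str]:
    """Return an opinion-based prompt, or None if there is no context."""
    context = context.strip()
    if not context:
        return None
    # Drop a trailing "?" so the question reads naturally with the suffix.
    question = question.strip().rstrip("?")
    return (
        f'{narrator} said, "{context}"\n'
        f"Q: {question} in {narrator}'s opinion?\n"
        "A:"
    )


def answer(context: str, question: str,
           ask_model: Callable[[str], str]) -> str:
    """Call the model only when there is context to ground the answer."""
    prompt = build_opinion_prompt(context, question)
    if prompt is None:
        return NO_DATA_REPLY  # decided in code, not left to the LLM
    return ask_model(prompt)
```

Handling the “no data” case in code rather than in the system message means that branch cannot fall back to training data, since the model is never asked.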

Have you experimented with different structures for your dataset? Hallucinations mostly occur when the relevant information hasn’t scored high enough.
I had the same problem and managed to fix it by restructuring my data.

While working on it, I also had the idea of embedding fake data that scores lower and is labeled “No Information”, so that this is what gets returned whenever no real data is found. I never tried it, though.
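
A rough, untested-in-production sketch of that idea in Python, in case it helps. Note that this is a simplification: instead of actually embedding a fake document, it adds a sentinel entry with a fixed score, so “No Information” wins whenever no real document scores above it. The `retrieve` function, the `floor` value of 0.75 and the toy two-dimensional vectors are all assumptions for illustration; a real setup would use the scores returned by your vector database.

```python
import math
from typing import List, Sequence, Tuple

NO_INFO = "No Information"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float],
             docs: List[Tuple[str, Sequence[float]]],
             floor: float = 0.75) -> str:
    """Return the best-matching text, or NO_INFO if nothing beats the floor."""
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    # Sentinel "fake" entry with a fixed score (assumed value, tune per dataset).
    scored.append((floor, NO_INFO))
    return max(scored, key=lambda pair: pair[0])[1]


# Toy data: two documents with made-up embeddings.
docs = [
    ("refund policy", [1.0, 0.0]),
    ("holiday schedule", [0.0, 1.0]),
]
```

With this toy data, a query vector of `[1.0, 0.0]` matches “refund policy” exactly, while `[1.0, 1.0]` scores about 0.71 against both documents and so falls through to the sentinel.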