How to stop models returning "preachy" conclusions

Diet · January 23, 2024, 10:46am

aaaah I injected ~100_000 tokens

I added the list thing because I noticed that the answers would be a list of facts, followed by a summary, followed by a disclaimer (sometimes).

The idea is that a dramatic change in pattern forces the model to pay attention to the pattern instructions (the schema). If you tell it to list the facts, it has to end the list before it can summarize, because the summary wouldn't be part of that list. The list pattern breaks, so the schema has to be consulted. Otherwise the model can just get carried away.

I’m wondering if it’s possible to combine this mechanistic approach with @DevGirl’s roleplay approach.

Naive attempt with interesting failure mode

I adapted the schema slightly: I took out the list requirement and added the big fat warning block.

Start your answer with 
<notes>
  <!-- any initial notes before giving an answer -->
</notes>
<answer>
  <!-- WARNING: ANY SUMMARY IN THE ANSWER BLOCK WILL TRIGGER IMMEDIATE FAILURE -->
</answer>
</summary>
<summary>
<disclaimer>
  <!-- optional -->
</disclaimer>

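However, the model still accidentally started summarizing inside the answer block (output2 even opens its last answer paragraph with "In summary"), and then it borked the summary tags: both outputs emit a stray </summary> right after </answer> and never open a <summary> block.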

output1

<notes>
  <!-- The user is asking about the legality of using copyrighted material to create a service that answers questions about the material. They are also asking about the implications of using an LLM to automate this process. -->
</notes>
<answer>
The scenario you've described involves several steps where copyright law could potentially be implicated. Initially, purchasing a copy of a novel like Stephen King's latest work and using it for personal enjoyment is typically within the rights granted to you by the copyright owner. However, when you begin to scan and chunk the text to create an embedding in your vector store, you are creating a derivative work, which is one of the exclusive rights held by the copyright owner under Title 17 of the U.S. Code, specifically in sections 106 and 106A.

Creating a website that answers questions about the novel could be seen as transformative and might fall under fair use, depending on factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work, as outlined in section 107. However, this is a complex determination that would ultimately need to be made by a court.

When you transition to using an LLM to answer questions directly, if the LLM's responses include excerpts from the novel, this could be considered copyright infringement, especially if the excerpts are substantial and not just brief quotes used for commentary, criticism, or educational purposes. Even if the LLM does not return any of Mr. King's text verbatim, if the output is based on a substantial part of the copyrighted text, it could still be infringing because it derives from the copyrighted work.

In both cases, the key issue is whether the use of the copyrighted material falls under fair use, which is a defense against copyright infringement. The fair use doctrine is purposefully broad and flexible, and each case is unique. Therefore, it's advisable to consult with a copyright attorney to evaluate the specific circumstances and risks associated with your intended use of copyrighted material.
</answer>
</summary>
<disclaimer>
  <!-- optional -->
  This response is not legal advice and should not be taken as such. For legal advice, please consult a qualified attorney.
</disclaimer>

output2

<notes>
  <!-- The user is asking about the legality of using copyrighted material to create a service that answers questions about the material. They are also inquiring about the implications of using an LLM to automate this process. -->
</notes>
<answer>
The scenario you've described involves several steps where copyright law could potentially be implicated. Initially, purchasing a copy of a novel like Stephen King's latest work gives you the right to own and read that copy, but it does not give you the right to reproduce, distribute, or create derivative works based on the copyrighted material without permission from the copyright holder.

When you cut up the book, scan it, and create an embedding in your vector store, you are creating a reproduction of the copyrighted work. This act, even if not shared or distributed, could be seen as a violation of the copyright holder's exclusive rights under Title 17 of the U.S. Code, specifically sections 106(1) and 106(2), which cover the rights to reproduce the copyrighted work and to prepare derivative works based upon the copyrighted work.

Creating a website that answers questions about the novel may fall under fair use, depending on how the information is used and presented. However, the use of embeddings from your vector store to find answers introduces a level of complexity, as it involves reproducing the copyrighted work in a digital format. If the embeddings are substantial enough to represent a significant portion of the copyrighted work, this could be a violation of the copyright holder's exclusive rights.

Allowing an LLM to access these embeddings and generate responses could further complicate the matter. If the LLM's responses include excerpts from the novel, this could potentially infringe on the copyright holder's exclusive rights to reproduce and distribute the copyrighted work, as outlined in sections 106(1) and 106(3). Even if the LLM does not return any direct text from the novel, the process still relies on reproductions of the copyrighted work, which could be problematic.

In summary, the creation of the vector store from the scanned text of the copyrighted novel, the use of that store to generate responses to user queries, and the automation of this process using an LLM could all potentially violate copyright law, as they may infringe upon the copyright holder's exclusive rights to reproduce, create derivative works, and distribute the copyrighted material.
</answer>
</summary>
<disclaimer>
  <!-- optional -->
  This response is not legal advice and should not be taken as a definitive interpretation of the law. For legal advice on specific copyright issues, you should consult a qualified attorney.
</disclaimer>

It was kind of suspicious that this was a repeating pattern.

And then I realized that I'd borked the schema from the start! I had the summary tags backwards: the closing </summary> came before the opening <summary>.
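A mistake like this is easy to catch mechanically before the prompt ever reaches the model. The following is only a minimal sketch, not something that was run in this thread: `find_tag_errors`, `BROKEN_SCHEMA` and `FIXED_SCHEMA` are hypothetical names, and it assumes the schema uses simple attribute-free tags like the ones above.

```python
import re

# Matches simple tags such as <answer> or </answer>; comments like <!-- ... --> do not match.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def find_tag_errors(schema: str) -> list[str]:
    """Return a list of open/close ordering problems found in a schema prompt."""
    errors = []
    stack = []  # names of the tags that are currently open
    for match in TAG_RE.finditer(schema):
        is_close = match.group(1) == "/"
        name = match.group(2)
        if not is_close:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"</{name}> closes a tag that is not open")
    errors.extend(f"<{name}> is never closed" for name in stack)
    return errors


# The first schema from this post, with the summary tags backwards.
BROKEN_SCHEMA = """<notes>
  <!-- any initial notes before giving an answer -->
</notes>
<answer>
  <!-- WARNING: ANY SUMMARY IN THE ANSWER BLOCK WILL TRIGGER IMMEDIATE FAILURE -->
</answer>
</summary>
<summary>
<disclaimer>
  <!-- optional -->
</disclaimer>
"""

# Same structure with the summary tags in the right order.
FIXED_SCHEMA = """<notes>
</notes>
<answer>
</answer>
<summary>
</summary>
<disclaimer>
</disclaimer>
"""

for problem in find_tag_errors(BROKEN_SCHEMA):
    print(problem)
```

On the broken schema this reports both the stray closing tag and the `<summary>` that never gets closed; on the fixed one it returns an empty list.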

a more promising attempt:

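I slightly adjusted the schema: fixed the summary tag order, switched to // comments, and added a short instruction to each block.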

Present your answer with the following schema:
<notes>
  // any initial notes before giving an answer
</notes>
<answer>
  // the actual answer. do not include a summary here.
  // WARNING: ANY SUMMARY IN THE ANSWER BLOCK WILL TRIGGER IMMEDIATE SYSTEM FAILURE
</answer>
<summary>
  // optional summary goes here
</summary>
<disclaimer>
  // optional if you have any disclaimers
</disclaimer>

output looks promising:

notes

The user is asking about the legality of using copyrighted material to create a database for a question-answering service, and whether the involvement of an LLM changes the copyright implications.

answer

The scenario you’ve described involves several steps where copyright law could be implicated. Initially, purchasing a copy of a novel and using it privately is typically within your rights. However, scanning and creating an embedding of the text in a vector store, even without distributing the text itself, could potentially be considered an infringement if it involves creating a derivative work or reproducing the copyrighted material in a fixed form that goes beyond what is allowed under fair use.

When you create a website to answer questions about the novel, if you are using the embeddings to find answers and potentially providing excerpts from the novel as part of those answers, this could also raise copyright concerns. The fair use doctrine might allow for limited quoting of copyrighted material for purposes such as criticism, comment, news reporting, teaching, scholarship, or research, but this is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.

Introducing an LLM to interact with the vector store and provide responses does not inherently remove the copyright issues. If the LLM’s responses include excerpts from the copyrighted text, this could still be considered copyright infringement, depending on the extent and substantiality of the excerpts used and whether such use could be defended under fair use.

In both cases, the key issue is whether the use of the copyrighted material falls within the scope of fair use, which is not a straightforward determination and often requires a nuanced legal analysis. Additionally, the fact that the novel’s text is being used to generate responses to user queries on a website could be seen as a public distribution of the copyrighted work, which is one of the exclusive rights held by the copyright owner.

summary snipped
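Once the model reliably puts its summary and disclaimer in their own blocks, snipping them is a plain string operation. Here is a rough sketch of how that could look; `split_sections`, `render` and the shortened `RAW` sample are hypothetical and only for illustration, and it assumes the output follows the four-tag schema without nesting.

```python
import re

# One match per well-formed block; \1 forces the closing tag to match the opening one.
SECTION_RE = re.compile(
    r"<(notes|answer|summary|disclaimer)>(.*?)</\1>", re.DOTALL
)


def split_sections(raw: str) -> dict[str, str]:
    """Map each tag name to the stripped text inside its block."""
    return {m.group(1): m.group(2).strip() for m in SECTION_RE.finditer(raw)}


def render(raw: str, keep: tuple[str, ...] = ("notes", "answer")) -> str:
    """Keep only the wanted sections, in the given order, and drop the rest."""
    sections = split_sections(raw)
    return "\n\n".join(
        f"{name}\n\n{sections[name]}" for name in keep if name in sections
    )


# Shortened, made-up sample in the shape of the raw outputs in this post.
RAW = """<notes>
User asks about copyright.
</notes>
<answer>
Fair use is decided case by case.
</answer>
<summary>
It depends on fair use.
</summary>
<disclaimer>
This is not legal advice.
</disclaimer>"""

print(render(RAW))
```

A block with a stray or missing tag, like the `</summary>` in the first attempt, simply does not match and is left out of the result, so it is worth checking that the keys you expect are present.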

raw:

out 1

<notes>
The user is asking about the legality of using copyrighted material to create a database for a question-answering service, and whether the involvement of an LLM changes the copyright implications.
</notes>
<answer>
The scenario you've described involves several steps where copyright law could be implicated. Initially, purchasing a copy of a novel and using it privately is typically within your rights. However, scanning and creating an embedding of the text in a vector store, even without distributing the text itself, could potentially be considered an infringement if it involves creating a derivative work or reproducing the copyrighted material in a fixed form that goes beyond what is allowed under fair use.

When you create a website to answer questions about the novel, if you are using the embeddings to find answers and potentially providing excerpts from the novel as part of those answers, this could also raise copyright concerns. The fair use doctrine might allow for limited quoting of copyrighted material for purposes such as criticism, comment, news reporting, teaching, scholarship, or research, but this is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.

Introducing an LLM to interact with the vector store and provide responses does not inherently remove the copyright issues. If the LLM's responses include excerpts from the copyrighted text, this could still be considered copyright infringement, depending on the extent and substantiality of the excerpts used and whether such use could be defended under fair use.

In both cases, the key issue is whether the use of the copyrighted material falls within the scope of fair use, which is not a straightforward determination and often requires a nuanced legal analysis. Additionally, the fact that the novel's text is being used to generate responses to user queries on a website could be seen as a public distribution of the copyrighted work, which is one of the exclusive rights held by the copyright owner.
</answer>
<summary>
The legality of your actions under copyright law would depend on whether they qualify as fair use, which is determined by analyzing several factors. Both creating an embedding of the text and using an LLM to generate responses that may include excerpts from the copyrighted work could potentially constitute copyright infringement.
</summary>
<disclaimer>
This information is not legal advice and should not be taken as such. For legal advice regarding specific situations, consult with a qualified attorney.
</disclaimer>

out 2

<notes>
The user's scenario involves several steps that potentially implicate copyright law, including the reproduction of a copyrighted work, the creation of a derivative work, and the public display or distribution of that work. The scenario evolves from personal use to a more public and potentially commercial use, which raises additional copyright concerns.
</notes>
<answer>
Based on the information provided, there are several points in your scenario where copyright law could potentially be violated:

1. Scanning and Chunking the Text: The act of scanning and creating a digital copy of the entire novel constitutes reproduction of the copyrighted work. Even if this is done for personal use, it is not a clearly exempted use under copyright law, and the copyright owner has the exclusive right to reproduce the work.

2. Creating an Embedding in a Vector Store: Transforming the text into an embedding and storing it in a vector store could be considered the creation of a derivative work, which is also exclusively the right of the copyright owner.

3. Operating a Website to Answer Questions: If the website is operated in a way that involves public display or distribution of copyrighted material, this could infringe on the copyright owner's exclusive rights, especially if the operation is commercial in nature.

4. Using an LLM to Answer Questions: If the LLM uses the copyrighted text to generate answers that include excerpts from the novel, this could be considered an infringement, depending on the nature, amount, and substantiality of the excerpts used. The use of copyrighted material to generate responses, even if not verbatim or extensive, could still infringe on the author's derivative works right.

It is important to note that fair use could potentially apply to some of these uses, particularly if the use is transformative, non-commercial, does not affect the market for the original work, and only uses a small portion of the text. However, fair use is a complex defense that depends on a case-by-case analysis, and there is no guarantee that a court would find a particular use to be fair.
</answer>
<summary>
In summary, potential copyright violations in your scenario include unauthorized reproduction, creation of a derivative work, and public display or distribution of copyrighted material. The use of an LLM to generate responses based on copyrighted text also raises concerns, particularly if the output includes excerpts from the copyrighted work.
</summary>
<disclaimer>
This response is for informational purposes only and does not constitute legal advice. Copyright law can be complex and fact-specific, and it is advisable to consult with a copyright attorney to assess legal risks and obtain advice tailored to your specific circumstances.
</disclaimer>
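To compare runs without eyeballing every output, a crude check can flag answers where a summary leaked into the answer block, as happened with "In summary" in output2 of the first attempt. This is a heuristic sketch only: `answer_has_summary` and the `SUMMARY_OPENERS` phrase list are assumptions made for illustration, and a model can easily summarize without using any of these phrases.

```python
import re

# Hypothetical list of phrases that tend to open a summary paragraph.
SUMMARY_OPENERS = ("in summary", "to summarize", "in conclusion", "overall,")


def answer_has_summary(raw: str) -> bool:
    """Return True if a paragraph inside <answer> starts with a summary opener."""
    match = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL)
    if match is None:
        return False  # no well-formed answer block, so nothing to check
    paragraphs = [p.strip().lower() for p in match.group(1).split("\n\n")]
    return any(p.startswith(SUMMARY_OPENERS) for p in paragraphs if p)


# Made-up samples: the summary leaks into the answer block in the first one only.
LEAKY = "<answer>\nFirst point.\n\nIn summary, everything could infringe.\n</answer>"
CLEAN = "<answer>\nFirst point.\n</answer>\n<summary>\nIn summary, it depends.\n</summary>"

print(answer_has_summary(LEAKY), answer_has_summary(CLEAN))
```

Running this over a batch of outputs gives a rough leak rate per schema variant, which is enough to tell whether a schema change helped.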

Overall, @DevGirl's approach, if it works for you, might be easier to maintain in the long run.

Sorry for using your thread as my experimentation playground
