You’re right, and I learned some stuff: it’s not that straightforward for extremely long prompts. I changed the topic to copyright because I had usc17 (Title 17 of the U.S. Code) at hand.
sys_prompt = f"""
You are an expert and friendly copyright advisor. You are tasked with answering any question about copyright and offering expert advice. Do not discuss other topics. Stick to answering using the context you are provided.
etc, etc...
Anything between the following `context` html blocks is retrieved from a knowledge bank and should be used as a basis for your answer:
<context>
{copyright_raw_strings[0:300000]}
</context>
remember: the user question can only pertain to the text in the context. if it doesn't, refuse to answer.
---
---
------
Start your answer with
<notes>
<!-- any initial notes before giving an answer -->
</notes>
<answer>
<!-- only core facts, as a list -->
</answer>
</summary>
<!-- when summarizing, wrap it in this -->
<summary>
<disclaimer>
<!-- optional -->
</disclaimer>
"""
In this particular case, the “as a list” in <!-- only core facts, as a list --> seems to be critically important. Without it, the summary will often slip into the answer.
I used @SomebodySysop’s question about copyright as a test.
These are the first 5 answers I got from the prompt style above:
unstopped responses
<notes>
<!-- The user is asking about the legality of using copyrighted material in a specific way, which involves creating a vector store from a copyrighted novel and using an LLM to answer questions about the novel. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Copyright protection subsists in original works of authorship fixed in any tangible medium of expression.
- The exclusive rights of the copyright owner include the right to reproduce, distribute, and prepare derivative works.
- Fair use is a limitation on copyright that allows for use of copyrighted material without permission for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
- Fair use is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect on the market for the copyrighted work.
- Creating an embedding of a copyrighted work in a vector store could potentially be seen as creating a derivative work or reproduction.
- Using an LLM to generate responses based on copyrighted material could potentially involve reproduction or the creation of derivative works.
- Whether such use constitutes fair use or infringement would depend on the application of fair use factors.
</answer>
<summary>
<!-- when summarizing, wrap it in this -->
You've described a process that involves several steps where copyright law could potentially be implicated. The creation of a vector store from a copyrighted novel and the use of an LLM to answer questions based on that novel involve the reproduction and creation of derivative works, which are exclusive rights of the copyright owner. However, whether your actions constitute fair use or infringement would depend on a detailed analysis of the fair use factors, including the purpose of your use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of your use on the market for the copyrighted work. It's important to note that fair use is determined on a case-by-case basis, and the outcome can vary depending on the specific circumstances.
</summary>
<disclaimer>
<!-- optional -->
This information is not legal advice and should not be taken as such. For legal advice regarding specific situations, consult a qualified attorney.
</disclaimer>
<notes>
<!-- The user is asking about the legality of using copyrighted material to create a database for a question-answering service and the further use of an LLM to provide answers that may contain excerpts from the copyrighted work. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Copyright protection subsists in original works of authorship fixed in any tangible medium of expression.
- The owner of copyright under this title has the exclusive rights to do and to authorize any of the following: reproduction, preparation of derivative works, distribution, and public display of the copyrighted work.
- The fair use doctrine allows for limited use of copyrighted material without permission from the copyright holder, typically for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
- Fair use is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.
- Creating a vector store by scanning and chunking text from a copyrighted novel may be considered reproduction and creation of a derivative work.
- Using an LLM to answer questions based on the copyrighted content may involve reproduction, preparation of derivative works, and distribution.
- The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
</answer>
<summary>
<!-- when summarizing, wrap it in this -->
You may have violated copyright law by reproducing and creating a derivative work of the copyrighted novel when you scanned, chunked, and embedded the text into your vector store. Additionally, using an LLM to answer questions directly, which may include excerpts from the copyrighted work, could also be considered a violation as it involves reproduction, preparation of derivative works, and distribution. However, whether these actions fall under fair use would depend on a case-by-case analysis based on the four factors of fair use.
</summary>
<disclaimer>
<!-- optional -->
This information is not legal advice, and if you need assistance with a copyright issue, you should consult a qualified attorney.
</disclaimer>
<notes>
<!-- The user is asking about the legality of using copyrighted material to create a database for a question-answering service, and whether the involvement of an LLM changes the copyright implications. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Title 17 of the United States Code governs copyright law.
- Copyright protection subsists in "original works of authorship" that are fixed in a tangible form of expression.
- The copyright owner has exclusive rights under section 106, which include the right to reproduce, distribute, and display the work.
- Fair use is a limitation on copyright under section 107, which allows for uses such as criticism, comment, news reporting, teaching, scholarship, or research.
- The fair use doctrine requires consideration of factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.
- Creating a derivative work, such as a database or vector store, based on copyrighted material may infringe on the copyright owner's exclusive rights if not done within the bounds of fair use.
- The use of an LLM to generate responses based on copyrighted material does not necessarily exempt one from copyright infringement considerations.
</answer>
<summary>
You may have violated copyright law by reproducing Stephen King's novel to create a database or vector store without permission, as this could infringe on the exclusive rights of reproduction and creation of derivative works held by the copyright owner. The use of an LLM to answer questions based on this copyrighted material could also be problematic if the responses are not clearly within the bounds of fair use. The legality of these actions would depend on a fair use analysis, which considers various factors including the purpose of the use and its impact on the market for the original work.
</summary>
<disclaimer>
<!-- optional -->
This is a general interpretation of copyright law and should not be taken as legal advice. For specific legal inquiries, it is recommended to consult with a copyright attorney.
</disclaimer>
<notes>
<!-- The user is asking about the legality of using copyrighted material in a specific way, which involves creating a vector store from a copyrighted novel, answering questions on a website, and potentially using an LLM to automate the process. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Copyright protection subsists in original works of authorship fixed in any tangible medium of expression.
- The owner of copyright under this title has the exclusive rights to do and to authorize reproduction, preparation of derivative works, distribution, and public display of the copyrighted work.
- The fair use doctrine allows limited use of copyrighted material without permission from the copyright holder, typically for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
- Fair use is determined on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.
- The creation of a derivative work based on copyrighted material typically requires permission from the copyright holder.
- The distribution of copyrighted material, even if not for direct or indirect commercial advantage, may infringe on the copyright holder's exclusive rights if not done within the bounds of fair use.
</answer>
<summary>
You may have violated copyright law in the following ways:
- By scanning and storing the text of the copyrighted novel in your vector store, you may have created an unauthorized derivative work or reproduction.
- Using the embeddings from your vector store to answer questions on your website could be considered an infringement if it involves reproducing substantial parts of the copyrighted text, even if indirectly.
- Allowing an LLM to access the vector store and generate responses that may contain excerpts from the copyrighted work could also potentially infringe on the copyright holder's exclusive rights, depending on whether the use qualifies as fair use.
It's important to note that fair use determinations are complex and fact-specific, and the scenarios you described could be subject to legal scrutiny based on the specific details of how the copyrighted material is used.
</summary>
<disclaimer>
<!-- optional -->
This information is not legal advice and is for educational purposes only. For legal advice regarding specific situations, consult with a qualified attorney.
</disclaimer>
<notes>
<!-- The user is asking about the legality of using copyrighted material in a specific way, which involves creating a vector store from a copyrighted novel, answering questions on a website, and potentially using an LLM to automate the process. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Scanning and creating an embedding of the entire text of a copyrighted novel without permission is likely a violation of the copyright owner's exclusive rights.
- Creating a website to answer questions about the novel using the embedded content may infringe on the copyright owner's exclusive rights, especially if it involves reproduction or distribution of copyrighted material.
- Using an LLM to answer questions directly, if it involves reproducing excerpts from the copyrighted text, could also infringe on the copyright owner's rights.
- Fair use may provide a defense in some cases, but it is determined on a case-by-case basis considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the copyrighted work.
</answer>
<summary>
<!-- when summarizing, wrap it in this -->
You may have violated copyright law by scanning and creating an embedding of the entire text of a copyrighted novel, as this likely infringes on the copyright owner's exclusive rights to reproduce the work. Additionally, using this embedded content to answer questions on a website, and potentially using an LLM to automate this process, could also constitute infringement, particularly if excerpts of the copyrighted text are reproduced. While fair use might provide a defense, it is not a blanket exemption and must be assessed on an individual basis with respect to specific factors outlined in copyright law.
</summary>
<disclaimer>
<!-- optional -->
This response is not legal advice and should not be taken as such. For legal advice regarding specific situations, consult a qualified attorney.
</disclaimer>
This is what I get when I add stop=["<summary>"]:
<notes>
<!-- The user is asking about the legality of using copyrighted material in a specific way, which involves creating a vector store from a copyrighted novel and using an LLM to answer questions based on that material. -->
</notes>
<answer>
<!-- only core facts, as a list -->
- Creating a vector store from a copyrighted novel may involve making copies of the copyrighted work.
- Using an LLM to answer questions based on the copyrighted material may involve reproducing, adapting, or distributing the copyrighted work, depending on how the LLM uses the material to generate responses.
- The fair use doctrine may apply, but it depends on the nature of the use, the amount of the copyrighted work used, the effect on the market, and the purpose of the use.
- The distribution of excerpts from the copyrighted work in responses could potentially infringe on the exclusive rights of the copyright owner.
</answer>
Obviously, it needs to be adapted to your particular use case.
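Since each part of the reply arrives in its own tag, it can be split mechanically once the text is collected. Here is a minimal sketch, assuming the tag names from the prompt above. `parse_sections` and `answer_facts` are hypothetical helper names for illustration, not part of any library. It tolerates a missing closing tag, and it treats the notes comment as content because the model writes its notes inside it.

```python
import re

# Matches one tagged section; also accepts a section cut off before its closing tag.
SECTION_RE = re.compile(
    r"<(notes|answer|summary|disclaimer)>(.*?)(?:</\1>|\Z)", re.DOTALL
)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def parse_sections(raw: str) -> dict:
    """Split a tagged reply into {tag_name: body_text}."""
    sections = {}
    for name, body in SECTION_RE.findall(raw):
        if name == "notes":
            # The model writes its notes inside the comment, so unwrap it.
            body = re.sub(r"<!--|-->", "", body)
        else:
            # Elsewhere the comment is just the echoed template hint.
            body = COMMENT_RE.sub("", body)
        sections[name] = body.strip()
    return sections


def answer_facts(raw: str) -> list:
    """Return the bullet items of the <answer> section as plain strings."""
    body = parse_sections(raw).get("answer", "")
    lines = (line.strip() for line in body.splitlines())
    return [line[2:].strip() for line in lines if line.startswith("- ")]
```

With the stopped output above, `parse_sections` would return only `notes` and `answer`, since generation halts before the summary tag is emitted.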
API call and raw prompt
# Import necessary libraries
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()  # This uses the OPENAI_API_KEY from your environment variables

# Send a streaming chat completion request to GPT-4.
# Note: sys_prompt (defined further down) must exist before this call runs.
responseStream = client.chat.completions.create(
    model="gpt-4-1106-preview",  # Using GPT-4
    # model="gpt-3.5-turbo",
    top_p=0.1,
    stream=True,
    max_tokens=1000,
    stop=["<summary>"],  # generation halts before the summary section
    messages=[
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": """
Let’s say I purchase a copy of Stephen King’s latest novel. I cut it up, scan it, chunk the text and create an embedding of it in my vector store. Next, I create a website “Get Answers to Questions about Stephen King’s Latest Novel”, where people can post their questions, and I answer them.
After a while, I start to use the embeddings from my vector store to find the answers.
So far, so good. I don’t think I’ve violated any copyright rules.
Next, I decide to use an LLM to answer the questions directly. So instead of me taking the question, submitting it to the vector store and rendering an answer, I now let the LLM do this. It does not return any of Mr. King’s novel text, just it’s responses, which may or may not contain excerpts (depending upon the question).
My question is, where have I violated existing copyright law in either case?
"""},
    ],
)
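The call above only opens the stream; the text still has to be read out of it. A minimal sketch of that step, assuming the chunk shape of the openai v1 Python client (`chunk.choices[0].delta.content`). `collect_stream` is a hypothetical helper name, not something the library provides.

```python
def collect_stream(stream, echo=False):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Some chunks may carry no choices at all; skip them.
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content is None on role-only and final chunks
            parts.append(delta)
            if echo:
                print(delta, end="", flush=True)
    return "".join(parts)
```

Usage would be `text = collect_stream(responseStream, echo=True)`. With `stop=["<summary>"]` the collected text ends after the answer section, and the stop sequence itself is not part of the returned text.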
sys_prompt = f"""
You are an expert and friendly copyright advisor. You are tasked with answering any question about copyright and offering expert advice. Do not discuss other topics. Stick to answering using the context you are provided.
When a user asks a question, analyse whether you have access to enough information to answer the question. If you do, go straight to step 2 and answer the question. If not, you will execute the following two steps:
Step 1 - Try to figure out the specific problem behind the user's question. Ask the user one or two short follow up questions until you figure out their exact intent and specific problem.
Step 2 - Answering: Once you have a good understanding of what the user means, Craft an answer using only the context you are provided. Answer the question by first creating an engaging opening statement and then by answering in a clear and specific way. Use between 100 and 180 words and only use the context provided for information. Try to minimize the use of bullet points unless strictly necessary.
---
---
Additional rules to follow when responding:
- Use a friendly and encouraging tone
- Combine the context together into a coherent answer
- If you do not know an acronym, don't try to come up with a meaning
- Try to use facts, figures and examples from the context in your answer when available.
Anything between the following `context` html blocks is retrieved from a knowledge bank and should be used as a basis for your answer:
<context>
{copyright_raw_strings[0:300000]}
</context>
remember: the user question can only pertain to the text in the context. if it doesn't, refuse to answer.
---
---
------
Start your answer with
<notes>
<!-- any initial notes before giving an answer -->
</notes>
<answer>
<!-- only core facts, as a list -->
</answer>
</summary>
<!-- when summarizing, wrap it in this -->
<summary>
<disclaimer>
<!-- optional -->
</disclaimer>
"""
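import xml.etree.ElementTree as ET

# Flatten the Title 17 XML into plain text for the <context> block
tree = ET.parse('_ragnet/usc17.xml')
# encoding='unicode' returns a str; with encoding='utf-8' tostring() returns bytes,
# and wrapping that in an f-string would embed the b'...' repr in the prompt
copyright_raw_strings = ET.tostring(tree.getroot(), encoding='unicode', method='text')
print(f"chars: {len(copyright_raw_strings)}")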
1.14 million gpt-4-1106 tokens were burned in the process of generating this post.
What did we learn? Sequence/intent matching can be fortified by meta-sequences.
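One way to read “meta-sequences” is the fixed tag skeleton that the reply has to follow. As a rough sketch under that reading, the skeleton and the matching stop sequence can be generated from one list so they never drift apart. `SECTIONS`, `build_output_template` and `stop_before` are hypothetical names for illustration, and this version emits each tag pair in open/close order.

```python
# (tag, hint) pairs in the order the model should emit them.
SECTIONS = [
    ("notes", "any initial notes before giving an answer"),
    ("answer", "only core facts, as a list"),
    ("summary", "when summarizing, wrap it in this"),
    ("disclaimer", "optional"),
]


def build_output_template(sections):
    """Render the tag skeleton that goes at the end of the system prompt."""
    lines = []
    for tag, hint in sections:
        lines += [f"<{tag}>", f"<!-- {hint} -->", f"</{tag}>"]
    return "\n".join(lines)


def stop_before(tag):
    """Stop sequence that halts generation just before the given section."""
    return [f"<{tag}>"]
```

The template string can be appended after “Start your answer with” in the system prompt, and `stop_before("summary")` gives the same value that is passed as `stop` in the API call.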