Send CSV file for use in Chat Completion?

Is there a way to upload a CSV as potential input for a Chat Completion? I see that there is a file upload API (but apparently only for fine-tuning), yet the chat completion API doesn’t mention files.

GPT-4 told me this:

Unfortunately, it is not possible to give the ChatGPT API access to local files directly without uploading them over HTTP. The API runs on OpenAI’s servers and only has access to the data that is passed to it through its API endpoints.

However, if you have a large amount of data that you need to use with the ChatGPT API, you may want to consider using OpenAI’s Data API. The Data API allows you to upload your data to OpenAI’s servers and then use it with their language models, including ChatGPT.

To use the Data API, you would first need to upload your data to OpenAI’s servers using the POST /v1/data endpoint. Once your data is uploaded, you can use it with the ChatGPT API by specifying the data parameter in your API request.

Is this a hallucination, or an undocumented property on chat completion requests?

Hi @oliver.drobnik

Welcome to the community.

This is indeed hallucination/confabulation.

1 Like

Ok, but what about my first question: is there a way to facilitate this flow of data?

If you read the docs, you’ll see that you don’t need to upload the CSV; instead, you break it into chunks and obtain embeddings.

Here’s the boilerplate code given in the docs:

import openai

def get_embedding(text, model="text-embedding-ada-002"):
    # Collapse newlines so each chunk is embedded as a single line of text.
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

# df is assumed to be a pandas DataFrame with a "combined" text column.
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

Hello @oliver.drobnik
Yes you can, and it works very well. Here are the steps you can take.

  1. Parse the CSV file and, for every row, convert the row into one key-value sentence. For example
    Name, Product, Review
    Oliver, SD Card, I think this SD Card is great
    Nelson, SD Card, This SD Card is horrible
    will turn into the following lines…
    Name: Oliver, Product: SD Card, Review: I think this SD Card is great
    Name: Nelson, Product: SD Card, Review: This SD Card is horrible

  2. Turn every line into a separate embedding
    Encode(Name: Oliver, Product: SD Card, Review: I think this SD Card is great) => [223,231,132]
    Encode(Name: Nelson, Product: SD Card, Review: This SD Card is horrible) => [323,351,112]

  3. When a user asks a question, search for the most similar embeddings and include them in the prompt (see the sketch after this list).
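To make those three steps concrete, here is a rough Python sketch. It assumes the 0.x openai SDK (as in the boilerplate earlier in this thread), a hypothetical reviews.csv with the columns shown above, and a plain in-memory cosine similarity instead of a vector database:

import csv
import numpy as np
import openai  # 0.x SDK; an API key must already be configured

def embed(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

# Step 1: parse the CSV and turn every row into one key-value sentence.
with open("reviews.csv", newline="") as f:  # hypothetical file name
    lines = [", ".join(f"{k}: {v}" for k, v in row.items()) for row in csv.DictReader(f)]

# Step 2: one embedding per line.
vectors = [embed(line) for line in lines]

# Step 3: embed the question, pick the most similar lines, and add them to the prompt.
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec = embed("What do people think of the SD Card?")
top = sorted(zip(lines, vectors), key=lambda lv: cosine(q_vec, lv[1]), reverse=True)[:3]
context = "\n".join(line for line, _ in top)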

Here is a screenshot of how it can work (screenshots: Reference Data, Chat Examples).

If you’d like to see how it works in action before you implement it, you can try it at https://www.superinsight.ai

Hope that helps!

2 Likes

Please forgive my ignorance. Also, I’d like to understand the steps at the lowest level using only the JSON-REST API.

So it seems to me that I should be able to change my CSV into embeddings. That’s done via an OpenAI embedding model, right? Like https://platform.openai.com/docs/guides/embeddings/what-are-embeddings.

As I understand it, I have two options here. The first: I should be able to give a list of embeddings as part of the messages in a chat completion request, right?

And if the list of embeddings is too big, then I would have to do it locally: I would put all the embeddings into a local vector database, turn the query itself into an embedding via the OpenAI API, do a local search for similar vectors in my database, and then send those results via a chat completion request to OpenAI to answer, right?

I don’t see any way to send embedding vectors to the API. Would I have to save the source text together with the embedding vectors and then send the most relevant texts as part of the chat messages?

Why use embeddings
The reason to use embeddings is to find the most relevant context before you feed it into your prompt, and to manage prompt tokens. If this is not what you need, then you probably don’t need embeddings.

What data to encode into embeddings
You need to decide how to segment your data. To get good search results, separate your content based on relevance; for example, a short paragraph of 300-500 tokens is a good chunk size. For a CSV, you might also want to separate by rows if you don’t want to mix rows in the same embedding. It really depends on your specific use case and the size of the text. To count your tokens, you can use the openai/tiktoken package on GitHub (a fast BPE tokeniser for use with OpenAI’s models).
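As a small illustration of the token counting mentioned above (this assumes the tiktoken package is installed; the sample text is just the row from the earlier example):

import tiktoken

def num_tokens(text, model="gpt-3.5-turbo"):
    # Count tokens with the same BPE encoding the model uses.
    return len(tiktoken.encoding_for_model(model).encode(text))

# Aim for chunks roughly in the 300-500 token range suggested above.
print(num_tokens("Name: Oliver, Product: SD Card, Review: I think this SD Card is great"))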

Creating and storing the embeddings
You can use the OpenAI API or any other transformer model from the open-source community. And yes, you can store embeddings in a vector database so you can perform semantic search. Regardless of which model you use, make sure you use the same model for all your embeddings.

Feeding it back to the prompt
Most vector databases only manage the embeddings for you, so when you perform a semantic search you will most likely get back an ID and the embedding with a score. You will need to map this ID to another database to retrieve the actual text. Once you have all your text, sort it by relevance and count your tokens again to make sure you don’t go over the limit in your prompt; a sketch of that step follows below.
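A minimal sketch of that last step, assuming the vector database returned (id, score) pairs and that texts is your own id-to-text lookup (both names are hypothetical):

import tiktoken

def build_context(matches, texts, max_tokens=3000, model="gpt-3.5-turbo"):
    # matches: list of (id, score) from the vector database; texts: id -> original chunk.
    enc = tiktoken.encoding_for_model(model)
    selected, used = [], 0
    for match_id, score in sorted(matches, key=lambda m: m[1], reverse=True):
        cost = len(enc.encode(texts[match_id]))
        if used + cost > max_tokens:
            break  # stop before the context pushes the prompt over the limit
        selected.append(texts[match_id])
        used += cost
    return "\n".join(selected)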

Hope that answers your question, good luck!

1 Like

Thank you for the extensive reply. It gives me a lot to think about for my specific use case.

Basically the core message is: I need to provide sufficient relevant facts together with my query for GPT to do its magic. If it is about text then embeddings and vector databases seem to be what most people use. I am beginning to understand why that is.

My goal is to have the data from my credit cards and bank accounts in one place from which I can pull data to build a budget, possibly employing GPT to aid in the process, e.g. categorising transactions or coming up with good categories based on them.

I generally avoid this extra lookup, as it may add latency to the process. Instead, I store the actual text (and other metadata) with the vectors, depending on the data model of the solution. Pinecone supports this as well.
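As a rough sketch of that pattern (this assumes the pinecone-client 2.x interface; the index name, IDs, and metadata fields are hypothetical, and the vectors are placeholders for real embeddings):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("transactions")  # hypothetical index name

row_text = "2023-07-01, SUPERMARKET, -54.20"  # hypothetical transaction line
embedding = [0.1] * 1536                      # placeholder for a real ada-002 vector
query_embedding = [0.1] * 1536                # placeholder for the embedded question

# Store the original text (and any other metadata) right next to the vector.
index.upsert(vectors=[("txn-1", embedding, {"text": row_text, "amount": -54.20})])

# A query can then return the text directly, with no second lookup.
result = index.query(vector=query_embedding, top_k=3, include_metadata=True)
for match in result["matches"]:
    print(match["score"], match["metadata"]["text"])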

Thanks for the added input Bill.
Cheers!

Hi Oliver,

You are correct: most of the techniques use embeddings and a vector database to do searches.
Another important technique is the art of prompt engineering; if you can find creative ways to work it into your workflow, it will make a big difference. Wish you the best of luck in your AI journey. Cheers.

I have a question here. Can we use this technique on quantitative data?

Why use embeddings for quantitative data when a DB can get the job done?

IMO, embeddings are primarily used for data types that have semantic meaning, such as text, where the goal is to capture and represent the relationships and similarities between different elements. Embeddings excel at capturing the contextual and semantic information inherent in such data.

Ok, so the context is: I am working on a use case where my input is a CSV file. I tried using the LangChain agent framework with gpt-3.5-turbo, but unfortunately I’m not getting consistent results, or I’m getting wrong results, and most of the time it gives me agent iteration limit or time limit issues.
I tried setting max_iterations etc., but still the same.
I have also read in a couple of places that GPT-4 is necessary when using agents; gpt-3.5-turbo makes many mistakes.

So I was wondering how I can deal with this while still using gpt-3.5-turbo. Embeddings work effectively for similarity search and work best on text, so it would be helpful to get some advice on how to deal with these issues.

Thanks

I have seen that the legacy ‘/completions’ API supported adding an array of “choices” within the free-form “prompt”.
I’d like to see an example that uses the newer ‘/chat/completions’ API.

Update: The example prompt below seems to work. Is there a better way?
prompt: “Where do elephants live? restrict to the following knowledge:”+ mySemanticSearchReducedText;

Response:

{
  "id": "chatcmpl-myid",
  "object": "chat.completion",
  "created": 1688942778,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on the provided knowledge, elephants do not have a specific habitat mentioned."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 93,
    "completion_tokens": 15,
    "total_tokens": 108
  }
}

Welcome @mandalorian

This is one of the many ways.

Alternatively, you can specify a system message to pass the knowledge/context.
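For instance, here is a minimal sketch using the 0.x Python SDK; the knowledge string stands in for whatever your semantic search returned (mySemanticSearchReducedText in the example above):

import openai  # 0.x SDK, matching the earlier examples

knowledge = "Elephants live in savannas, grasslands and forests in Africa and Asia."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system",
         "content": "Answer only from the CONTEXT INFORMATION that follows:\n" + knowledge},
        {"role": "user", "content": "Where do elephants live?"},
    ],
)
print(response["choices"][0]["message"]["content"])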

1 Like

This example and my prompt that worked yesterday did not work today.
Found another prompt that worked: "Don’t justify your answers. Don’t give information not mentioned in the CONTEXT INFORMATION that follows: "+knowledge

Yes, I agree.
The older model gives you a lot more flexibility on the prompt, but I think your prompt prefix is pretty good already. What I am seeing that makes the biggest difference now is the ability to create the highest-quality mySemanticSearchReducedText, so that’s where I spend most of my effort refining.

So if I have a CSV file of agricultural data, I don’t need to use embeddings, just turn it into a string to use the API?