Is there any sample code to split a json file into smaller chunks?

For example… I got a json file that looks kind of like this:

[
  {
    "id": 1,
    "question": "What are your hours of operations?",
    "answer": "We open at 10am and close at 8pm."
  },
  {
    "id": 2,
    "question": "Do you offer vegetarian food?",
    "answer": "Yes, we have a vegetarian menu to choose from"
  }
]

^^^ Now how can I split that into chunks to add it to text-embedding-ada-002?

The sample code below is giving me TypeErrors

let embeddings = [];
while (inputs.length) {
  let tokenCount = 0;
  let batchedInputs = [];
  while (inputs.length && tokenCount < 2048) {
    let input = inputs.shift();
    batchedInputs.push(input);
    tokenCount += input.slice().length;
  }

  let embeddingResult = await openai.embeddings.create({
    input: batchedInputs,
    model: "text-embedding-ada-002",
  });
  console.log(embeddingResult);
}

One of the glaring issues is that the embeddings API doesn't know what to do with a JSON object.

You could stringify your object first, but I'd urge you to consider what you're actually trying to accomplish here.

what part of

{
  "id": 1,
  "question": "What are your hours of operations?",
  "answer": "We open at 10am and close at 8pm."
},

actually needs to be embedded?
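If what matters for retrieval is only the question (and perhaps the answer) text, a minimal sketch of pulling that out, rather than stringifying the whole record, could look like this (variable names are illustrative, not from the post above):

// faqs is the parsed array of { id, question, answer } objects from the file.
// Option A: embed each whole record as a string (works, but carries JSON noise).
const wholeRecords = faqs.map((faq) => JSON.stringify(faq));

// Option B: embed only the question text and keep the id as a lookup key.
const questionsOnly = faqs.map((faq) => faq.question);
const answersById = new Map(faqs.map((faq) => [faq.id, faq.answer]));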

The structure you show is itself not typical API JSON: an API body usually starts with curly brackets, and an array in square brackets would then be the value of a key. Here we have a bare array.

However, what you have looks like a Python-style list of records, would typically come back as strings from a database, and can be processed as such. Let's put your short example into some AI-written code, with a separate AI analysis below.

const dataString = `[{
    "id": 1,
    "question": "What are your hours of operations?",
    "answer": "We open at 10am and close at 8pm."
},
{
    "id": 2,
    "question": "Do you offer vegetarian food?",
    "answer": "Yes, we have a vegetarian menu to choose from."
},
// Add more objects as needed to reach a total of 20
]`;

// Parse the input data into an array
const jsonData = JSON.parse(dataString);

const batchSize = 5; // Maximum number of inner objects per batch

const splitData = [];
for (let i = 0; i < jsonData.length; i += batchSize) {
    splitData.push(jsonData.slice(i, i + batchSize));
}

// splitData now contains smaller arrays with a maximum of 5 inner objects each
console.log(splitData);

The code reads a JSON string, parses it into a JavaScript array of objects, and then splits this array into smaller arrays, each containing a maximum of 5 objects.

Here’s a breakdown of what the code does:

  1. const jsonData = JSON.parse(dataString); - This line parses the JSON string into a JavaScript array of objects.
  2. const batchSize = 5; - This line sets the maximum number of objects per batch.
  3. The for loop then iterates over the jsonData array, incrementing by batchSize in each iteration. In each iteration, it creates a slice of the jsonData array from the current index i to i + batchSize, and pushes this slice into the splitData array.
  4. console.log(splitData); - This line prints the splitData array to the console. The splitData array contains smaller arrays, each with a maximum of 5 objects.

However, please note that this is JavaScript code, not Python code. If you need a Python solution, the approach would be different.


Well thank you… this seems like it should work… and yes, I am using JavaScript…

I need to embed everything and more… it's for a Q&A bot application…

Say… I noticed a chunk size of 5, and a note of 20 objects…

Do you think 5 is the proper chunk size to feed to Pinecone?

Would you say about 5 more objects of about the same size should be the max size of the document?

Now… this is just to get the data into Pinecone… after that I need to query the file and feed the data to OpenAI to get a good final response

You can embed whatever quantity of data you want, and you get more useful matches back when entries are embedded individually rather than as larger chunks. I just included the batch size so you can alter it to your needs.

The case where documents need to be split, as in the example code, is when you are trying to embed documents that exceed the embedding model's input token limit, and thus must provide the AI with mere chunks of them for retrieval.
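If you ever do need that kind of splitting, here is a rough sketch that uses character count as a crude stand-in for tokens (a real implementation would count tokens with a tokenizer such as tiktoken):

// Split one long document into pieces that stay under a rough size budget.
// Character count only approximates tokens; swap in a proper tokenizer for real use.
function chunkText(text, maxChars = 4000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}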

If each question is separate, why would you embed many of them into one vector?
Isn’t the whole point to get the most relevant questions?
If so, I would embed each of the questions separately (and not the answer.)
Then you can find the 10 closest embeddings when a user asks a question, prime the model with both the question and answer, and ask the model to answer the user's question given that context.
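A minimal sketch of that per-question embedding, assuming the openai v4 Node SDK and the parsed faqs array from earlier (names are illustrative):

import OpenAI from "openai";
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed each question on its own; keep the answer next to the vector for later retrieval.
const records = [];
for (const faq of faqs) {
  const result = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: faq.question,               // embed only the question text
  });
  records.push({
    id: String(faq.id),
    vector: result.data[0].embedding,  // 1536-dimensional vector for ada-002
    question: faq.question,
    answer: faq.answer,                // stored alongside the vector, not embedded
  });
}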

Hello!

Well… why so many questions in one file? … that is a good question… the reason is that I don't really know what I am doing here…

The objective is to create a Q&A bot that will answer common questions related to a specific topic… say a restaurant, for example.

Apparently one file loaded with questions and answers is not the best? – What would be the best way to go about building this bot?..

  1. I looked at prompts… but prompts might get expensive, and loading 20-100 questions/answers might be too much for the prompt… and what if the number of questions grows over time?

  2. I looked at function calling… but function calling, from what I have seen so far with my limited knowledge, is not smart enough on its own to formulate a natural language response given a set of Q&A…

  3. Feeding vector files and providing that information to the bot so that it can answer questions… <<<— on this 3rd one I got LangChain code (20 questions split into chunks) working in plain Node.js, but I want to do it using Next.js… long story short, I want to do it straight with OpenAI in Next.js…

Some of the challenges I have are that, well… I don't know the proper steps needed to prepare the data, and lots of the code samples (from Pinecone and OpenAI) or videos are in Python rather than JavaScript… so with some reading, guessing, and forum questions I'm getting a bigger picture of how it works…

Retrieval Augmented Generation is simple at the core.

  1. For each question, calculate an embedding vector.
  2. Store each of those vectors in a vector database together with the intended answer.
  3. When the user asks a question, find the N closest answers in the database (N=5-15 typically.)
  4. Craft a prompt that says something like: “Your task is to answer the user’s question given the following information: …” followed by each answer you found, concatenated, followed by “The actual question is: …”
  5. Profit!

You may want to add additional instructions, such as “if the answer is not found within the given information, answer that you don’t know” and such.
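A minimal sketch of steps 3-4, assuming the openai v4 Node SDK and a hypothetical findClosestAnswers(vector, n) helper over whatever vector store you use:

// Steps 3-4: retrieve the closest stored answers and fold them into a prompt.
const queryEmbedding = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: userQuestion,
});

// findClosestAnswers is assumed to return the N stored answer strings nearest to the vector.
const answers = await findClosestAnswers(queryEmbedding.data[0].embedding, 10);

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [
    {
      role: "system",
      content:
        "Your task is to answer the user's question given the following information:\n" +
        answers.join("\n") +
        "\nIf the answer is not found within the given information, answer that you don't know.",
    },
    { role: "user", content: userQuestion },
  ],
});
console.log(completion.choices[0].message.content);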


Hey my man,
I am currently teaching myself how to use Pinecone, what vector DBs are, etc.
I am also pretty new to programming, and I use JavaScript.

You seem pretty much on point with what you are trying to do; however, you are over-complicating it a little bit.

Vector DB searches are, to the layman, pretty much magic. We won't talk about the math, but the quality of the search is amazing. You will need to play around with the different distance metrics to check what works best for you. I think "euclidean" was what I was using, but you might have to do a quick Google search.

If you query a VdB with:

  • What are your hours of operations?
  • What days are you open?
  • When can I come in next?

All of these search queries are going to relate to things like hours of operations.

The way you are formatting and processing your data isn’t optimal.

I would reformat your JSON data so that it isn’t in the question-answer format you currently have. For hours of operations I would have something like:
Hours of Operation:
M -
T-
W-…

After you have provided this information I would add a couple of tags to it.
[hours of operations, opening hours, availability].
!!!(NOTE: Pinecone allows you to add metadata tags to the stored embedding; I just haven't played with this enough yet, and putting the tags in the embedded text is a functional way to achieve the result. Probably not best practice, but it works well.)!!!

This is what I would add into the VdB. It is easy to search and easily definable as to what it is.

In regards to the vegetarian food, I would go and create your menu. Inside the menu you have all of your different food categories, and I would create each category as an individual data object.

So Menu
vegetarian
Item 1:
Description:
Ingredients:
Dietary: GF, Veg, etc

Then we want to add data tags
[menu item, vegetarian meals]

Depending on the size of your menu and your acceptable token cost, you may be able to keep this in one embedding.

Additionally, you could create an embedding for each meal type; a VdB lets you get back many results in order of relevance.

What you then do is allow the user to ask their chatbot a question, embed the question, query the VdB, get the id for the data, and retrieve the data from your JSON (a VdB does not store the text).

You then take this question and the retrieved data, build a prompt template, insert the question and data into the prompt template, and boom, your chatbot will be mint.
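A rough sketch of that flow, assuming the @pinecone-database/pinecone client, the openai client from the earlier sketch, and an index keyed by the same ids as your JSON records (check the exact calls against the SDK version you have installed):

import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone();           // reads PINECONE_API_KEY from the environment
const index = pinecone.index("faq-index"); // index name is illustrative

// Embed the user's question, query the VdB, then look the matched ids up in your own JSON.
const { data } = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: userQuestion,
});

const results = await index.query({
  vector: data[0].embedding,
  topK: 5,
  includeMetadata: true,
});

// The VdB returns ids (and optionally metadata), not your source text:
// map the matched ids back to the records in your JSON file.
const matched = results.matches.map((m) =>
  jsonData.find((item) => String(item.id) === m.id)
);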

TLDR:
You aren't structuring your data right; you are trying to use preset questions and answers, almost like a chatbot script, to define your answers. This is a little dangerous, as your chatbot isn't using data to determine its answer but your preset questions and answers, and this might produce issues if

  • the queried question is too close to the answer,
  • the queried question is worded weirdly,

I propose you sit down, think of all the possible questions you want to answer, and then build your data around what data answers those questions, not around what the answers to the questions are; embed that and supply it to the OpenAI API rather than what it should be answering.

Good Luck.

P.s. Hi, I’m Jayden, I really need to break into this industry and get some work experience behind me. If you need a developer, who will work for free, and you have a project where I can learn something please hit me up. I’d love to slave for you.


Thanks Jayden!

So Pinecone apparently has a few ways to do this…

One way, in their GitHub example, is to include the text of the embedding in the metadata; then you can send OpenAI an array of the metadata text under the assistant role… and from there OpenAI has some tools to provide an answer…

Another way that Pinecone says could work is by referencing the ID number of the embeddings; then you would need to match that ID number with your file, and supply that information to OpenAI…

And yes, I ended up modifying the json file to something more like you mentioned.
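For reference, a minimal sketch of that first approach (the source text stored in metadata at upsert time), again assuming the @pinecone-database/pinecone client and the illustrative records array from the earlier sketch:

// Upsert each Q&A record with its text in metadata, so a query result
// already carries everything needed to build the final prompt.
await index.upsert(
  records.map((r) => ({
    id: r.id,
    values: r.vector,
    metadata: { question: r.question, answer: r.answer },
  }))
);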