Feeding my own PHP application data to OpenAI and asking related questions

Hi,
Can anybody help? I want to feed my own PHP application data into OpenAI and ask dynamic questions about it. The data is more than 10K records, and it increases day by day, so I will need to feed it on a daily basis.

Which API is best suited for large datasets?

Please suggest the best solution.

Any help will be highly appreciated.

where is the data coming from? a file? a database?

you can use either the Chat Completions or the Assistants API, then use function calling to interface with the source of your data.


I am putting all my data into a file stored on my server. See my code below for reference:

public function chatgpt_ask_question(Request $request) {

        $question     = $request->input('question');
        $jsonFilePath = public_path('daily_summary.json');
        $context      = file_get_contents($jsonFilePath);
        $answer       = $this->openAIService->askQuestion($question, $context);

        return response()->json(['answer' => 'HVG AI: '.$answer]);
    }
public function askQuestion($question, $context) {
        $contextChunks = $this->chunkText($context, $this->maxTokens - 1000);  // Reserving tokens for the question and response

        $responses = [];
        foreach ($contextChunks as $chunk) {
            $responses[] = $this->sendRequest($chunk, $question);
        }

        return implode("\n", $responses);
    }
private function sendRequest($context, $question)  {
        $response = $this->client->chat()->create([
            'model' => 'gpt-4o',  // Adjust as necessary
            'messages' => [
                ['role' => 'system', 'content' => 'You are a helpful assistant.'],
                ['role' => 'user', 'content' => $context . "\n\nQuestion: " . $question],
            ],
            'max_tokens' => 150,
            'temperature' => 0.7,
        ]);

        return $response['choices'][0]['message']['content'];
    }
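The chunkText() helper referenced above is not shown in the thread; a minimal sketch, assuming a rough 4-characters-per-token heuristic (an exact count would need a tokenizer library), could look like this:

```php
<?php
// Sketch of a chunkText() helper: split text into whitespace-delimited
// chunks that each stay under a rough character budget derived from the
// token limit. The 4-chars-per-token ratio is an assumption.
function chunkText(string $text, int $maxTokens): array {
    $maxChars = $maxTokens * 4;          // rough chars-per-token estimate
    $chunks   = [];
    $words    = preg_split('/\s+/', trim($text));
    $current  = '';
    foreach ($words as $word) {
        // Start a new chunk when adding the next word would overflow.
        if ($current !== '' && strlen($current) + strlen($word) + 1 > $maxChars) {
            $chunks[] = $current;
            $current  = '';
        }
        $current = $current === '' ? $word : $current . ' ' . $word;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}
```

Note that sending every chunk in a loop, as askQuestion() does, still pays for the whole file on every question; the replies below suggest retrieving only the relevant chunk instead.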

i see that you are attaching everything in the user message. this will be okay for short runs, but as you mentioned, your data is growing every day. you want to refer to the data only when it is needed. i am not sure what kind of data you have, but you might want to define some functions that pull info from this json file based on the user's inquiry. let's say your daily_summary.json contains the top upvoted posts and is labeled as such. you can define a simple function, get_top_upvoted_post, and when it is invoked you read the top upvoted posts and send them back to the API as the tool output.


@supershaneski FYI, the json file contains consumer name, email, orders, state, country, etc. (the data is more than 10K records). As an example, the questions could be: how many users are from the USA? what is today's order total? etc.

i see. in that case, let's define a simple function:

{
  "name": "get_users_by_country",
  "parameters": {
    "type": "object",
    "properties": {
      "country": {
        "type": "string",
        "description": "country code based on ISO 3166-2",
        "enum": [
          "UA",
          "UG",
          "US",
          "UY",
          "UZ",
          "VA",
          "VC"
        ]
      }
    },
    "required": [
      "country"
    ]
  },
  "description": "Get number of users by given country"
}

this will be invoked when you ask questions like "how many users from USA?".

you do not need to feed the whole json. you can easily get your answer by parsing the json and sending just the result back as the tool output.
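A minimal PHP sketch of that tool-output step, matching the get_users_by_country definition above (the record shape and variable names here are assumptions for illustration):

```php
<?php
// Sketch: answer get_users_by_country locally by counting matching
// records, then send only the count back as the tool output, instead
// of feeding the whole JSON file to the model.
function get_users_by_country(array $consumers, string $country): int {
    $count = 0;
    foreach ($consumers as $consumer) {
        if (($consumer['country'] ?? '') === $country) {
            $count++;
        }
    }
    return $count;
}

// Example: decode the tool call arguments from the API response, run
// the function, and json_encode() the result as the tool output.
$consumers = [
    ['email' => 'a@example.com', 'country' => 'US'],
    ['email' => 'b@example.com', 'country' => 'US'],
    ['email' => 'c@example.com', 'country' => 'UA'],
];
$args = json_decode('{"country": "US"}', true); // would come from the tool call
echo json_encode(['count' => get_users_by_country($consumers, $args['country'])]);
// prints {"count":2}
```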


OK, but the problem is that the question could be anything related to the data, and I cannot define a function for every possible question.

for all the easy questions, provide functions. for the rest, use semantic search by doing RAG. this means chunking your data and then getting the embedding for each chunk. when there is new data, just chunk it, get the embedding, and save it in your DB. then create a general inquiry function like get_user_inquiry. it might be invoked like this:

get_user_inquiry({"inquiry": "which order contains product ABC?"})

you get the embedding for "which order contains product ABC?" and use semantic search through your saved embeddings. see the Embeddings API docs for reference.
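A minimal sketch of the semantic-search step in PHP, using cosine similarity over stored chunk embeddings (the toy vectors below stand in for real Embeddings API output):

```php
<?php
// Sketch: compare the query embedding against stored chunk embeddings
// with cosine similarity and return the best-matching chunk's text.
function cosineSimilarity(array $a, array $b): float {
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

function bestMatch(array $queryEmbedding, array $chunks): string {
    $bestScore = -2.0; // below any possible cosine similarity
    $bestText  = '';
    foreach ($chunks as $chunk) {
        $score = cosineSimilarity($queryEmbedding, $chunk['embedding']);
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestText  = $chunk['text'];
        }
    }
    return $bestText;
}
```

The matched chunk text, not the whole file, is what you then send back to the model as the tool output for get_user_inquiry.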


@supershaneski I implemented the Embeddings API, but it is still not working. I guess my data is so large that OpenAI is not able to handle it.
Do you have any other suggestions?

Try testing it with a small dataset just to see how it works.


with small data it is working fine. I tested with the gpt-4o model.

You might look into a RAG solution: only add the context you need for the user's query. That way you stay under the limit on every run.

Or you could work on pruning your thread once it gets too large: use tiktoken (or estimate token usage on your end) and manage its size.
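tiktoken is a Python library, so from PHP a rough character-based estimate is a common stand-in for pruning decisions. A sketch (the 4-chars-per-token heuristic and the function names are assumptions):

```php
<?php
// Rough token estimate: ~4 characters per token for English text.
// Good enough for deciding when to prune, not for exact billing.
function estimateTokens(string $text): int {
    return (int) ceil(strlen($text) / 4);
}

// Drop the oldest messages until the estimated total fits the budget,
// always keeping at least the most recent message.
function pruneMessages(array $messages, int $tokenBudget): array {
    $total = array_sum(array_map(fn($m) => estimateTokens($m['content']), $messages));
    while ($total > $tokenBudget && count($messages) > 1) {
        $removed = array_shift($messages);
        $total  -= estimateTokens($removed['content']);
    }
    return $messages;
}
```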

Can you share some code that you used? Did it throw errors, or just not give you the right answers? It could be how you set it up…

Hi @paulBellow, see the code below. I input the question and ask the relevant question based on this json file.

public function chatgpt_ask_question(Request $request){
        if (Auth::check()) {
            $data = [
                'title' => 'Feed Data'
            ];
            $ask_qstn = $request->input('question');
            if($ask_qstn != ''){
                $jsonFilePath = public_path('daily_summary.json');
                $summary      = json_decode(file_get_contents($jsonFilePath), true);
                $prompt       = $ask_qstn." <br> ". json_encode($summary, JSON_PRETTY_PRINT);
                
                $response     = $this->queryChatGPT($prompt);
               
                return response()->json(['success' => 200,'message' => $response['choices'][0]['message']['content']]);
            }
            
            return view('admin.actions.chatgpt');
        }
        return view('admin.auth.login');
    }

public function queryChatGPT($prompt) {
        $client = new Client();
        $apiKey = 'sk-*****************LqzhC***************';
        $response = $client->post('https://api.openai.com/v1/chat/completions', [
            'headers' => [
                'Authorization' => 'Bearer ' . $apiKey,
                'Content-Type' => 'application/json',
            ],
            'json' => [
                'model' => 'gpt-4o',
                'messages' => [
                    ['role' => 'user', 'content' => $prompt]
                ],
                'max_tokens' => 100,
            ],
        ]);
        
        $body = $response->getBody();
        $result = json_decode($body, true);
       
        return $result;
    }

here's a simple RAG sample:

  1. for testing, i made a text file with this content:

mango.txt

Mango
A mango is an edible stone fruit produced by the tropical tree Mangifera indica. It is believed to have originated between northwestern Myanmar, Bangladesh, and northeastern India.[1] M. indica has been cultivated in South and Southeast Asia since ancient times resulting in two types of modern mango cultivars: the “Indian type” and the “Southeast Asian type”.[2][3] Other species in the genus Mangifera also produce edible fruits that are also called “mangoes”, the majority of which are found in the Malesian ecoregion.[4]
Worldwide, there are several hundred cultivars of mango. Depending on the cultivar, mango fruit varies in size, shape, sweetness, skin color, and flesh color, which may be pale yellow, gold, green, or orange.[1] Mango is the national fruit of India, Pakistan and the Philippines,[5][6] while the mango tree is the national tree of Bangladesh.[7]
Etymology
The English word mango (plural “mangoes” or “mangos”) originated in the 16th century from the Portuguese word, manga, from the Malay mangga, and ultimately from the Tamil man (“mango tree”) + kay (“fruit”).[8][9] The scientific name, Mangifera indica, refers to a plant bearing mangoes in India.[9]

  2. i then chunked the text data into 3 parts; then, using the text-embedding-3-small model, i called the Embeddings API to get the vector data for each chunk:
text-embeddings [
  {
    embedding: [
         0.04287149,     0.01839651,   0.025731485,   0.02845928,
      ... 1436 more items
    ],
    text: 'Mango A mango is an edible stone fruit produced by the tropical tree Mangifera indica. It is believed to have originated between northwestern Myanmar, Bangladesh, and northeastern India. [1] M. indica has been cultivated in South and Southeast Asia since ancient times resulting in two types of modern mango cultivars: the "Indian type" and the "Southeast Asian type".'
  },
  {
    embedding: [
        0.02501605,   0.027974246,    0.05487668,   0.064951696,  -0.023451207,      ... 1436 more items
    ],
    text: '[4]  Worldwide, there are several hundred cultivars of mango. Depending on the cultivar, mango fruit varies in size, shape, sweetness, skin color, and flesh color, which may be pale yellow, gold, green, or orange. [1] Mango is the national fruit of India, Pakistan and the Philippines,[5][6] while the mango tree is the national tree of Bangladesh.'
  },
  {
    embedding: [
        0.042156275,  0.010176958,   0.008787963,   0.037421804, -0.024364166,
       -0.005610028,   0.01934865,  0.0076097483,  -0.008755534,  -0.03653544,
              ... 1436 more items
    ],
    text: '. [8][9] The scientific name, Mangifera indica, refers to a plant bearing mangoes in India. [9]'
  }
]
  3. now, using Chat Completions, i sent a query:

where does mango originated from?

  4. i called the Embeddings API to get the query's vector data.

  5. using semantic search, i compared the query's vector data against the previously saved embeddings, got a hit, and sent it back to the Chat Completions API:

{
  id: 'chatcmpl-9bkU4pR1GPUwVJboUfXIjwudNmIqm',
  object: 'chat.completion',
  created: 1718783848,
  model: 'gpt-3.5-turbo-0125',
  choices: [
    {
      index: 0,
      message: [Object],
      logprobs: null,
      finish_reason: 'stop'
    }
  ],
  usage: { prompt_tokens: 385, completion_tokens: 69, total_tokens: 454 },
  system_fingerprint: null
}
{
  role: 'assistant',
  content: 'The mango is believed to have originated between northwestern Myanmar, Bangladesh, and northeastern India.\n' +
    '\n' +
    'I found this information in the file "mango.txt" where it states: "A mango is an edible stone fruit produced by the tropical tree Mangifera indica. It is believed to have originated between northwestern Myanmar, Bangladesh, and northeastern India."'
}
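The final step of the walkthrough above, in PHP terms, is building the prompt from only the retrieved chunk rather than the whole file (the function name and message wording here are hypothetical):

```php
<?php
// Sketch: once the best-matching chunk has been found via semantic
// search, include only that chunk as context in the Chat Completions
// messages, keeping the prompt small regardless of total data size.
function buildRagPrompt(string $question, string $retrievedChunk): array {
    return [
        ['role' => 'system', 'content' => "Answer using only the provided context.\n\nContext:\n" . $retrievedChunk],
        ['role' => 'user', 'content' => $question],
    ];
}
```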


thanks @supershaneski. I also chunked my data; it is a json file. See the example below; maybe it will give you an idea of what I actually need.

Sample json data
It is a single record. I have around 10k+ records in the same json format. I did the chunking, but unfortunately it shows a gateway timeout.

{"consumers":{"email":"example@example.com","full_name":"Abc xzy","consumer_type":"subsriber","investment_amount":"","address":"qwerty","city":"Avon","state":"CT","postal_code":"123456","ip_address":"8.8.8.8","purchased":"Yes","date_paid":"19-06-2024","purchase_date":"19-06-2024"}}

Below are the questions:

  1. How many purchased with "Yes"?
  2. How many users are from the US?
  3. Which date has the most purchases?

There are many more questions I can ask related to this data.
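For what it's worth, each of those three example questions can be answered with a plain aggregation over the decoded records, which makes them good candidates for function-calling tools instead of feeding all 10k records to the model. A sketch (field names follow the sample record above):

```php
<?php
// Sketch: local aggregations over the decoded consumer records that a
// function-calling tool could return as its output.
function countPurchased(array $records): int {
    return count(array_filter($records, fn($r) => ($r['purchased'] ?? '') === 'Yes'));
}

function countByState(array $records, string $state): int {
    return count(array_filter($records, fn($r) => ($r['state'] ?? '') === $state));
}

// Date (as stored, e.g. "19-06-2024") with the most purchases.
function topPurchaseDate(array $records): string {
    $byDate = [];
    foreach ($records as $r) {
        if (($r['purchased'] ?? '') === 'Yes') {
            $byDate[$r['purchase_date']] = ($byDate[$r['purchase_date']] ?? 0) + 1;
        }
    }
    arsort($byDate);
    return array_key_first($byDate) ?? '';
}
```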

Thanks

I solved this myself by using the vector store method.


Nice. Share some code in case someone else stumbles on this thread? :slight_smile:

I would share it once it is finished; there are still a couple of issues, and the AI sometimes gives wrong responses.