Summarizing large amounts of data

hey all,

Need some help understanding the best way to do something. I have a bunch of data in a database that is already summarized, it looks something like this:

date      name     deals_closed  percent_change
Feb 2022  agent d  0
Jan 2022  agent c  3
Feb 2022  agent c  2             -33.33
Jan 2022  agent b  12
Feb 2022  agent b  26            116.67
Jan 2022  agent a  4
Feb 2022  agent a  2             -50.00

I have prompted GPT to “summarize this data,” and it gives me a response like this:

  • Agent d did not close any deals in February 2022.
  • Agent c closed 3 deals in January 2022 and 2 deals in February 2022, representing a percent change of -33.33%.
  • Agent b closed 12 deals in January 2022 and 26 deals in February 2022, representing a percent change of 116.67%.
  • Agent a closed 4 deals in January 2022 and 2 deals in February 2022, representing a percent change of -50.00%.

Which is perfect. Now let’s say I have a database table with thousands of records. What would be the best way to load all this data and ask for this summarization on the full dataset? Do I load the data incrementally and ask it to summarize that portion and then do a final summary of all the summaries? Is there another solution? Are there things I need to be mindful of? Thoughts? TIA!

1 Like

It looks like you have everything to write a simple script instead of using GPT. This will save money and prevent hallucinations.
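To make the point concrete, here is a minimal sketch of what such a script could look like. The row layout (month, agent, deals) and the hard-coded month names are assumptions for illustration; in practice the rows would come from your database query.

```python
# Minimal sketch: generate the same sentences with plain Python, no LLM.
# Assumes rows are (month, agent, deals_closed) tuples from the database.
from collections import defaultdict

rows = [
    ("Jan 2022", "agent a", 4), ("Feb 2022", "agent a", 2),
    ("Jan 2022", "agent b", 12), ("Feb 2022", "agent b", 26),
    ("Jan 2022", "agent c", 3), ("Feb 2022", "agent c", 2),
    ("Feb 2022", "agent d", 0),
]

by_agent = defaultdict(dict)
for month, agent, deals in rows:
    by_agent[agent][month] = deals

def summarize(agent, months):
    jan, feb = months.get("Jan 2022"), months.get("Feb 2022")
    if not feb:
        return f"{agent} did not close any deals in February 2022."
    if jan is None:
        return f"{agent} closed {feb} deals in February 2022."
    pct = (feb - jan) / jan * 100 if jan else None
    change = (f", representing a percent change of {pct:.2f}%"
              if pct is not None else "")
    return (f"{agent} closed {jan} deals in January 2022 and "
            f"{feb} deals in February 2022{change}.")

for agent in sorted(by_agent):
    print(summarize(agent, by_agent[agent]))
```

This reproduces the bullet-point output above deterministically, with no token cost and no risk of hallucinated numbers.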

But if you insist on using GPT, just send chunks of data to GPT in batches to prevent going over the max tokens for whatever model you are using.
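A rough sketch of that batching step, assuming a crude ~4-characters-per-token estimate (a heuristic only; use a real tokenizer such as tiktoken for anything serious) and an assumed token budget:

```python
# Sketch: split rows into batches that stay under a token budget,
# so each batch can be sent as one "summarize this data" request.
MAX_PROMPT_TOKENS = 3000  # assumed budget; leave headroom for the reply

def batch_rows(rows, max_tokens=MAX_PROMPT_TOKENS):
    batches, current, current_tokens = [], [], 0
    for row in rows:
        line = " ".join(str(v) for v in row)
        est = len(line) // 4 + 1  # crude token estimate (~4 chars/token)
        if current and current_tokens + est > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += est
    if current:
        batches.append(current)
    return batches
```

Each returned batch is a list of row strings you would join with newlines and place in the prompt of a single completion call.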

1 Like

Thanks again @curt.kennedy. I prefer to use GPT: when I asked for a detailed summary, it did a better job than I could in any script. You mentioned chunking the data into GPT; in terms of the summaries, should I ask for a summary of each batch and then do a final summary of those batched summaries?

1 Like

I don’t know what your summary of summary would look like, or what you are wanting here.

It looks like you are just linearly translating the data in your database to sentences of the current month and the previous month. So what does an additional summary look like?

Just summarize each batch like you did, and be sure a batch doesn't split a rep's current- and previous-month rows across a date boundary.
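One way to guarantee that boundary is to group rows per agent before batching, so an agent's January/February pair always lands in the same chunk. A sketch, assuming (month, agent, deals) tuples and an assumed chunk size:

```python
from itertools import groupby

# Sketch: keep each agent's rows together so a chunk never splits
# the current/previous month pair for a rep.
def chunk_by_agent(rows, agents_per_chunk=50):
    rows = sorted(rows, key=lambda r: r[1])  # sort by agent name
    groups = [list(g) for _, g in groupby(rows, key=lambda r: r[1])]
    return [sum(groups[i:i + agents_per_chunk], [])
            for i in range(0, len(groups), agents_per_chunk)]
```

Tune `agents_per_chunk` so each chunk stays comfortably under the model's context window.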

1 Like

Thanks @curt.kennedy, I’m not sure what the summary of summaries would look like either. Since it’s a lot of data and I can’t pass it all within the token limit, I was thinking of a way to summarize in chunks.

When I do a bit of prompt engineering, I’m able to get something nicer like this:

  1. Tom: Closed 11 deals in January 2022 and 26 deals in February 2022, representing a percent change of 136.36%. Tom’s performance improved significantly between January and February, and he was one of the top-performing agents on your team in February.
  2. Jake: Did not close any deals in February 2022. You may want to follow up with Jake to see if there are any challenges he’s facing or if there’s anything you can do to support him.
    …

Which is why I want to use GPT to do this.

2 Likes

That certainly looks nice. Just keep doing that in chunks.

Unfortunately, depending on your data size, the max context will be your limiter. GPT-4 has a 32k window on the horizon. Which model are you using?

1 Like

Hi, I agree with @curt.kennedy: this is a perfect use case for a script. You have a really nice hammer and you are seeing nails where there aren’t any.

The fact that you don’t know what a good summary looks like is the problem. You don’t know what kind of data you want or how it should look. If you did, you would know how to make a script return the data you need.

3 Likes

Hi @chirag.shah285,

have you already had the chance to take a closer look at embeddings? If not, please take a look here: Embeddings - OpenAI API. I believe that with some fine-tuning this might be exactly what you are looking for.

Why? Because you can leverage the following features to consolidate the information in a first step, then continue with your LLMs to get further insights:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
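All of the features listed above ultimately reduce to comparing embedding vectors, usually via cosine similarity. A minimal sketch of that comparison (the toy 3-d vectors here are stand-ins for the real 1536-d vectors the embeddings endpoint returns):

```python
import math

# Cosine similarity between two embedding vectors: 1.0 means identical
# direction (very similar text), values near 0 mean unrelated text.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

v_query = [0.1, 0.9, 0.2]   # toy stand-in for an embedded query
v_doc = [0.15, 0.85, 0.25]  # toy stand-in for an embedded record
print(cosine_similarity(v_query, v_doc))  # close to 1.0: highly similar
```

Search, clustering, recommendations, and classification all rank or group items by this similarity score.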

Does this go in the direction you are looking for?

Best regards,
Linus :slight_smile:

2 Likes

@linus As an embedding fan here, I’m scratching my head as to how embeddings could be used for this problem. This is a database scan and summarization problem; where is the clustering/search? Enlighten me.

2 Likes

@nunodonato Totally agree!

My only afterthought was whether the data presented here should later be used with an LLM and be callable from inside the model. In that case this approach would add some complexity, but that was not stated in the original post.

@nunodonato Can you specify whether you want to process the original data later on, or do you only want the summary? In the latter case I’d advise following what @nunodonato said!

1 Like

Thanks for the insight @nunodonato. My rebuttal would be: why not use the benefits of an LLM to think through creative ways of summarizing the data? A script for this would take quite a bit of time, whereas GPT makes quick work of it with a unique summarization I may not have thought of.

1 Like

Hi @curt.kennedy,

thank you for your question - I’ll point out the use cases I saw there. I’d love to hear your feedback, because I have limited experience with this and suggested it as an additional option to consider.

  1. Use the embeddings to query all the datapoints for a certain employee, sorted by date. Feed this information into Leonardo or another model with a prompt to summarize the information. Repeat that for every employee.
  2. For that, please refer to the section “Obtaining user and product embeddings for cold-start recommendation” on Embeddings - OpenAI API; with this you can predict whether a user would like a product before receiving it. My thought is that we replace the user rating with the number of deals closed, as this also represents a variable of interest in this case (but I’m not sure on this one).
  3. Use anomaly detection to figure out whether there are deviations in performance regarding the number of deals closed, based on the user and date. This would potentially be interesting information to point out. For example, if you can say that the deals closed by Mat Gundell were a lot higher than those of his colleagues, this would potentially be of interest.
2 Likes

I can see where you are heading here. But for this, I would go with a database query on the employee ID, sorted by date. You could use embeddings for this, but what if you had two employees named “John Smith”? You need a unique identifier so as not to mix employees together. Remember he has 10,000 employees or rows, so there is a high probability of collision; also, embeddings may not perfectly fetch all the data and could leak spurious data from other employees.

This is interesting, but I’m not sure how embeddings are used for anomalies. Conceptually, yes: take the old embedding, take the new embedding, look at the angle, and if it’s big then do X … it’s still vague to me. Pure scripting and taking derivatives of the data would be more straightforward.
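For what it's worth, a sketch of what that "pure scripting" version could look like: flag agents whose month-over-month percent change is far from the team average. The threshold and data layout are assumptions for illustration.

```python
import statistics

# Sketch: flag agents whose month-over-month percent change deviates
# from the team mean by more than `threshold` standard deviations.
def outliers(changes, threshold=1.0):
    mean = statistics.mean(changes.values())
    stdev = statistics.stdev(changes.values())
    return [name for name, pct in changes.items()
            if abs(pct - mean) > threshold * stdev]

# Percent changes from the example data in the original post:
changes = {"agent a": -50.0, "agent b": 116.67, "agent c": -33.33}
print(outliers(changes))  # ['agent b'] stands out from the team
```

No embeddings needed; the same idea extends to flagging sudden drops worth a follow-up conversation.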

The link shows a similarity of user vs. rating plot. So you would compare similarity of X vs deals plot, probably … this actually could be cool if you found a good similarity of X metric, maybe similarity of how the rep presents the information to the consumer or something like that … in this case, I would say this is where embeddings shine! But you need this data on each rep. I definitely have a bunch of ideas now!

3 Likes

Thank you for your great feedback! This is really helpful and also helps me to understand embeddings better so thanks for taking the time! :slight_smile:

Totally agree! As @nunodonato mentioned in Summarizing large amounts of data - #7, there are a lot of tools with a better fit for this task. I personally would use PowerBI in this case and do further analysis from there :slight_smile:

My thought on this was basically to figure out whether someone performs better or worse on dates correlated with high or low overall deal-closing performance, i.e. whether someone is over- or underperforming. Another approach would be to see whether there are temporal performance effects (e.g. Mat performs better in January than in other months). But this is only a thought of mine, as I have never tried it out.

I am very glad to hear that you could take some ideas from this post! Thank you from my side for letting us know what is worth pursuing! :+1:

3 Likes

Yep. Same here.

When I see a pattern like this, I get the sense that a pre-summary data aggregation or rollup is necessary. Think about the reader of these summaries: for starters, they probably want you to distil the assessment. If you don’t aggregate, you will have hundreds(?) of summaries. How is that a summary, one must ask?

I encourage you to define the requirements a bit more (kind of what @nunodonato was mentioning) and sketch out what it means to summarize in the context of a lot of summarizations.

3 Likes

A possible summary is not at the individual level, but at the organizational behavior layer. For example, say we identify the “best selling reps”, and we capture their outputs to consumers while they are trying to sell widget X. We embed their outputs to consumers, and arrive at a set of embeddings. In fact, we collect and embed all this data for all the sales reps. Then, we find the average vector of the good reps, and we plot all reps against this average “good” vector. You will then see a variation (over angle) of good reps and bad reps. The “good” reps will be close to 0 degrees, and the bad reps will vary off of 0 degrees. You could then start focusing on what makes your reps good and bad with one-on-one interactions and training.
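A sketch of that average-vector idea with numpy. The toy 4-d vectors are stand-ins for real embeddings of each rep's communications with consumers:

```python
import numpy as np

# Build the average "good rep" vector from the embeddings of the
# best-selling reps, then measure each rep's angle off that average.
good_reps = np.array([[0.9, 0.1, 0.3, 0.2],
                      [0.8, 0.2, 0.4, 0.1]])  # toy embeddings
avg_good = good_reps.mean(axis=0)
avg_good /= np.linalg.norm(avg_good)  # unit-length "good" direction

def angle_from_good(rep_vec):
    rep_vec = rep_vec / np.linalg.norm(rep_vec)
    cos = float(np.clip(rep_vec @ avg_good, -1.0, 1.0))
    return np.degrees(np.arccos(cos))

# A rep whose communications resemble the good average sits near 0 degrees;
# a rep far off the average has a much larger angle.
print(angle_from_good(np.array([0.85, 0.15, 0.35, 0.15])))  # ~0 degrees
```

Ranking reps by this angle gives the spread off the “good” direction described above.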

4 Likes

Hi @curt.kennedy et al. - I have also been doing some digging into the best options for querying structured data (a database) with a large language model. I have come across three techniques, two of which I am still investigating:

1) Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models, in which they extract all the data from the individual tables and create embeddings of it, but also provide semantic detail so it’s more interpretable by the large language model.
2) I also saw someone using the LLM to create SQL queries directly against a database, and that seemed to work quite well, though it was a “simple” test. I am concerned that as the database/lake grows, the challenge would be managing the semantic layer in the schema etc.; that could perhaps be addressed via fine-tuning so that the LLM can make sense of the underlying data…
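A minimal sketch of the second pattern, an "LLM writes the SQL" prompt. The schema, table name, and prompt wording are all assumptions; generated SQL should always be validated and run read-only/sandboxed:

```python
# Sketch: build a text-to-SQL prompt by showing the model the schema
# and the user's question. The schema below is an assumption.
SCHEMA = """CREATE TABLE deals (
    date TEXT, name TEXT, deals_closed INTEGER, percent_change REAL
);"""

def text_to_sql_prompt(question):
    return (
        "Given this SQLite schema:\n"
        f"{SCHEMA}\n"
        "Write a single read-only SQL query answering the question below. "
        "Return only SQL, no explanation.\n"
        f"Question: {question}"
    )

print(text_to_sql_prompt("Which agent had the biggest percent change in Feb 2022?"))
```

The returned string would be sent as the completion prompt; the model’s SQL output is then executed against the database and the result summarized.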

1 Like

@curt.kennedy is right: the best bet is to first run a query and get a limited set of rows (so if you have 100,000 rows, get the 100 that best match your query), then send those 100 as a markdown table to GPT-4 and ask for the summary. There is currently no way to send 100,000 rows to GPT-4.
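A sketch of that query-then-format step. The table and column names (`deals`, `deals_closed`, etc.) are assumptions matching the example data:

```python
import sqlite3

# Sketch: pull only the top rows from the database, then render them as
# a markdown table to paste into the GPT-4 prompt.
def top_rows_as_markdown(conn, limit=100):
    cur = conn.execute(
        "SELECT date, name, deals_closed, percent_change "
        "FROM deals ORDER BY deals_closed DESC LIMIT ?", (limit,))
    header = "| date | name | deals_closed | percent_change |\n|---|---|---|---|"
    body = "\n".join(
        "| " + " | ".join("" if v is None else str(v) for v in row) + " |"
        for row in cur)
    return header + "\n" + body
```

The resulting table string goes into the prompt alongside the summarization instruction, keeping the request well within the context window.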

Embeddings come into the picture only if you want to “pick and choose” some of the 100,000 rows based on some natural language text - but you will get tremendously inaccurate results if you do analytical queries like “Who is the best selling rep in Northeast?” or “How many sales were made last month?”

1 Like