What can I do with a huge forum?

tasosv · November 19, 2023, 10:06pm

We have a forum (similar to this one) with roughly the following stats:

More than 300 million posts
A few billion tokens in total
More than 20 languages

We want to use GPT to answer questions such as the following:

what user X said about subject Y
summarize the thread Z
what is the sentiment of thread Z
identify pain points and solutions of subject Z

I am not sure what the best approach is. Should we convert everything into embeddings and query those? Or use the Assistants API with one single thread for the whole forum? Which option gives us the most flexibility at the lowest cost?

I would love to read your opinions.

_j · November 19, 2023, 10:38pm

And do you have…

consent, given at the time each message was created, to license user content for submission to another company to produce new works, etc.?

Embedding with text-embedding-ada-002 costs $0.10 per megatoken (million tokens), or $100 per gigatoken (billion tokens), so a single embedding run over the forum's “few billion tokens” costs a “few hundred dollars”.
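To make the arithmetic above concrete, here is a minimal sketch. The price constant is the figure quoted in this post (late 2023) and may no longer be current; the function name and the sample forum sizes are illustrative assumptions.

```python
# Price quoted in this post for text-embedding-ada-002 (late 2023); verify before relying on it.
ADA_002_USD_PER_MILLION_TOKENS = 0.10


def embedding_cost_usd(total_tokens: int,
                       usd_per_million: float = ADA_002_USD_PER_MILLION_TOKENS) -> float:
    """Cost of one embedding pass over `total_tokens` tokens."""
    return total_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # "A few billion tokens": try a handful of plausible sizes (illustrative values).
    for billions in (1, 3, 5):
        tokens = billions * 1_000_000_000
        print(f"{billions}B tokens -> ${embedding_cost_usd(tokens):,.0f}")
```

This covers a single pass only. Re-embedding after a model change or a different chunking strategy costs the same amount again.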

Then you would need a strategy for individual posts that are longer than the embedding model's context allows, such as truncating them or averaging the vectors of their chunks.
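One way to sketch the chunk-averaging option is shown below. It assumes you already have a token list per post and an embedding vector per chunk from whatever embedding call you use; `chunk_tokens` and `average_embedding` are hypothetical helper names, and weighting each chunk by its length before re-normalising is a design choice, not something the model requires.

```python
import math
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], max_tokens: int) -> List[Sequence[int]]:
    """Split a token sequence into consecutive chunks of at most `max_tokens`."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def average_embedding(chunk_vectors: Sequence[Sequence[float]],
                      chunk_lengths: Sequence[int]) -> List[float]:
    """Length-weighted mean of chunk vectors, rescaled to unit length."""
    total = sum(chunk_lengths)
    dim = len(chunk_vectors[0])
    avg = [
        sum(vec[d] * n for vec, n in zip(chunk_vectors, chunk_lengths)) / total
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in avg))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return avg if norm == 0 else [x / norm for x in avg]


if __name__ == "__main__":
    chunks = chunk_tokens(list(range(10)), max_tokens=4)
    print([len(c) for c in chunks])
    # Toy 2-dimensional vectors standing in for real chunk embeddings.
    print(average_embedding([[1.0, 0.0], [0.0, 1.0]], [4, 4]))
```

Truncation is simpler (embed only the first chunk) but discards the tail of long posts; averaging keeps everything at the cost of blurring distinct topics within one post.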

That just gives you a semantic similarity database. You’d be able to add your own metadata like “is first post of thread”, “is reply to post x”, etc. Or simply attach that vector to every forum post for later use.

Embedding is the cheapest thing you can do. It could power a slow search function, or you could measure how similar posts are to a set of “happy posts” or “angry posts”, which is something you could experiment with to some degree. You could make only the last year searchable for a start.
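The “happy posts” versus “angry posts” idea could be sketched as a comparison against two reference centroids. This is only a rough illustration with toy vectors: the function names are hypothetical, the reference sets would need to be hand-picked from your own forum, and how well such a score tracks real sentiment is something you would have to test.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a set of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sentiment_lean(post_vec: Sequence[float],
                   happy_vecs: Sequence[Sequence[float]],
                   angry_vecs: Sequence[Sequence[float]]) -> float:
    """Positive when the post sits closer to the happy set, negative when closer to the angry set."""
    return cosine(post_vec, centroid(happy_vecs)) - cosine(post_vec, centroid(angry_vecs))


if __name__ == "__main__":
    # Toy 2-dimensional stand-ins for real post embeddings.
    happy = [[1.0, 0.0], [0.9, 0.1]]
    angry = [[0.0, 1.0], [0.1, 0.9]]
    print(sentiment_lean([1.0, 0.1], happy, angry))
    print(sentiment_lean([0.1, 1.0], happy, angry))
```

The same `cosine` function is all a brute-force search needs: embed the query, score it against every stored post vector, and sort. That is the “slow search” mentioned above; a vector index would be the usual way to speed it up.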

Language model inference? Up that to $10,000 per gigatoken for GPT-4-turbo input alone, plus days of processing because of both rate limits and the generation time of whatever output you want.
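A back-of-the-envelope sketch of both numbers follows. The input price is the figure quoted in this post; the forum size and the tokens-per-minute limit are purely hypothetical placeholders, since actual rate limits depend on the account, and output generation time is not modelled at all.

```python
# Input price quoted in this post for GPT-4-turbo (late 2023); verify before relying on it.
GPT4_TURBO_INPUT_USD_PER_BILLION = 10_000.0


def inference_input_cost_usd(total_tokens: int,
                             usd_per_billion: float = GPT4_TURBO_INPUT_USD_PER_BILLION) -> float:
    """Cost of feeding `total_tokens` through the model once, input side only."""
    return total_tokens / 1_000_000_000 * usd_per_billion


def processing_days(total_tokens: int, tokens_per_minute: int) -> float:
    """Lower bound on wall-clock days when throttled to `tokens_per_minute` of input."""
    minutes = total_tokens / tokens_per_minute
    return minutes / (60 * 24)


if __name__ == "__main__":
    forum_tokens = 3_000_000_000  # illustrative "few billion tokens"
    assumed_tpm = 300_000         # hypothetical rate limit, not an official figure
    print(f"input cost: ${inference_input_cost_usd(forum_tokens):,.0f}")
    print(f"at {assumed_tpm:,} tokens/min: {processing_days(forum_tokens, assumed_tpm):.1f} days")
```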

Consider the tool you actually envision. “Summarize on demand” has a cost that grows with the number of sub-summaries needed on gpt-3.5-turbo-1106 (16k context), or costs about a dollar per button push for a 90k-token GPT-4 thread summary.
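The sketch below estimates how many calls a chunk-then-combine summary would take on a 16k-context model, and what a single-pass 90k-token summary costs on the input side. The GPT-4-turbo price is the $10,000 per gigatoken figure from this thread; `summary_tokens` and `prompt_overhead` are assumed values, and the layered scheme is one plausible approach rather than a prescribed one.

```python
import math

GPT4_TURBO_INPUT_USD_PER_BILLION = 10_000.0  # figure quoted in this thread


def summary_calls(thread_tokens: int,
                  context_tokens: int = 16_000,
                  summary_tokens: int = 500,
                  prompt_overhead: int = 500) -> int:
    """Number of model calls for a layered summary: summarise chunks, then summarise the summaries."""
    budget = context_tokens - summary_tokens - prompt_overhead  # input room per call
    if budget < 2 * summary_tokens:
        raise ValueError("context too small for the requested summary size")
    calls = 0
    remaining = thread_tokens
    while remaining > budget:
        n_chunks = math.ceil(remaining / budget)
        calls += n_chunks                      # one sub-summary per chunk
        remaining = n_chunks * summary_tokens  # next layer reads the sub-summaries
    return calls + 1                           # final call produces the overall summary


def single_pass_input_cost_usd(thread_tokens: int) -> float:
    """Input cost of sending the whole thread to GPT-4-turbo in one request."""
    return thread_tokens / 1_000_000_000 * GPT4_TURBO_INPUT_USD_PER_BILLION


if __name__ == "__main__":
    print(summary_calls(90_000))
    print(f"${single_pass_input_cost_usd(90_000):.2f}")
```

With these assumed defaults, a 90k-token thread needs six sub-summaries plus one combining call, while the single-pass route costs roughly $0.90 in input tokens, which matches the “dollar a button push” estimate.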

That’s the end of my considerations, or “opinions”, before this becomes consulting.
