Unearthing Insights: Mining a Message Board

whafa · December 28, 2023, 1:48pm

Hi Everyone, first time short time. I have a technical background but I’m really starting from zero with GPT and LLMs. I am so green that I don’t know the nomenclature well enough to ask concise questions, so please help me by correcting me.

Suppose I have a corpus of ~18 million message board posts from a website I was very active on in the aughts. In a moment of boredom back around 2010, I tried to download every post, just to see if I could do it. Along with the message body, I extracted whatever useful metadata I could and loaded it into a SQL DB. I then sat on it, almost entirely unthought of, until this year when suddenly, for obvious reasons, it seemed relevant.

My nebulous goal is to ingest (is that the right word?) this data into a custom GPT, then 1) ask it questions about various users for whom I have content, and 2) ask it to respond to questions in the style of a given user.

I’ve tried this already, using myself as a guinea pig, and the results are uncanny, especially considering the very low effort I put into extracting the text (I simply dumped the message bodies of my posts into a text file and fed it to GPT, with a CR/LF char between posts). I’m hoping I can make it even better by using the metadata, if possible, to guide the GPT.

I have these fields – can you help me use them more intelligently?
Post Date – Is there a temporal component to the GPT, such that I can somehow tell it when each post was made, and it can use that info later to piece together a history (for example)?
Message ID – An incrementing integer that uniquely identifies each post.
Subject – Relates individual posts to each other within a Board.
Board Name – Groups posts by broad topic.
Folder Name – An even broader grouping. Each Board Name is in only one Folder Name.
Author ID/Name – The person who wrote the post.
Thread ID – I think this correlates with Subject.
Recommendations – Users were allowed to “Recommend” a certain number of posts per day. This is a subjective, crowd-sourced measure of post quality.

I can also extract (but would have to go back to the source html files) the connections between replies and the post they’re replying to, which seems especially useful.

Can I use any of this? If so, how? I would like to prioritize the more highly recommended content, first of all, so a post with 100 recs is considered more “important” than one with zero recs. I also want to associate a specific user’s posts with that user, rather than ingest a mishmash of everyone’s text.
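
A minimal sketch of one way to approach this, assuming the SQL rows have already been loaded into Python objects. The `Post` class, its field names, and the bracketed header format are all hypothetical, invented for illustration; they are not a format any GPT tool requires. The idea is to keep one text blob per author, put the most-recommended posts first, and carry the date, board, and subject along as a plain-text header so the model can see them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    # Hypothetical field names; map them to whatever the SQL schema uses.
    message_id: int
    author: str
    board: str
    subject: str
    posted: date
    recs: int
    body: str


def build_author_corpus(posts, author, min_recs=0):
    """Return one text blob for a single author, highest-recommended posts first."""
    mine = [p for p in posts if p.author == author and p.recs >= min_recs]
    # Sort by recommendations (descending), then by date (ascending) for ties.
    mine.sort(key=lambda p: (-p.recs, p.posted))
    chunks = []
    for p in mine:
        header = (
            f"[date: {p.posted.isoformat()} | board: {p.board} | "
            f"subject: {p.subject} | recs: {p.recs}]"
        )
        chunks.append(header + "\n" + p.body.strip())
    # A blank line between posts keeps them visually separate in the file.
    return "\n\n".join(chunks)
```

Raising `min_recs` is a crude way to favor "important" posts; whether a model actually weights the header fields is something to test rather than assume.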

Also, how does the system deal with html tags? Because I have not cleansed the message text, and there’s some markup in it.
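
Leftover tags are generally just treated as more text, which means they consume tokens and can add noise, so stripping them before ingestion is a common preprocessing step. Here is a minimal sketch using only Python's standard-library `html.parser`; the helper name `strip_html` and the choice of which tags become line breaks are assumptions for illustration, and heavily malformed markup may need a more forgiving parser.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and turns block-level tags into line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "div"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of a post body with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

For example, `strip_html("<p>Hello <b>world</b> &amp; all</p><br>Bye")` yields two lines of text, `Hello world & all` and `Bye`.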

Please feel free to tell me I’m barking up the wrong tree with any or all of this; I only have the most basic understanding of what GPTs are doing “under the hood” – I just yesterday learned the concept of a token and how to tokenize a string of text. Many thanks!

EricGT · December 28, 2023, 2:11pm

An embedding is what you need.

https://platform.openai.com/docs/guides/embeddings
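
To illustrate the idea rather than any particular product: each post is turned into a vector once, the question is turned into a vector at query time, and the posts whose vectors are closest to the question are handed to the model as context. The sketch below shows only the lookup step, with tiny hand-made vectors standing in for real embeddings; `cosine_similarity` and `top_k` are hypothetical helper names, and in practice the vectors would come from an embeddings endpoint and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_posts, k=3):
    """Return the ids of the k posts most similar to the query vector.

    indexed_posts is a list of (post_id, vector) pairs.
    """
    scored = [
        (cosine_similarity(query_vec, vec), post_id)
        for post_id, vec in indexed_posts
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [post_id for _, post_id in scored[:k]]


# Toy 2-dimensional "embeddings" keyed by message ID (made-up values).
index = [(101, [1.0, 0.0]), (102, [0.0, 1.0]), (103, [0.7, 0.7])]
nearest = top_k([1.0, 0.1], index, k=2)
```

Filtering `indexed_posts` by author before the lookup is one simple way to keep one user's posts from mixing with everyone else's.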

In short, Discourse, the forum software powering this platform, offers most, if not all, of the features you’re looking for. Their entire codebase is freely available on GitHub.

To achieve your goals, you can follow these steps:

Download the Discourse software and set up an instance (refer to installation instructions).
Install the Discourse AI plugin.
Migrate or import your data into your Discourse instance (find support in the migration category).
Keep in mind that Discourse frequently updates the Discourse AI plugin (see commits).

By following these steps, you should have most, if not all, of the features you’re looking for.

EricGT · December 28, 2023, 2:54pm

FYI,

As a moderator, I have access to a perk that enables me to request the AI to suggest a topic title. Two titles that appear to be useful are:

Decoding a Decade: Adventures in Post Harvesting
Unearthing Insights: Mining a Message Board

Feel free to change the title to one of these or another.

whafa · December 28, 2023, 6:00pm

@EricGT, thank you so much both for the clear direction and new suggestions for a thread title!

Both of the suggested titles are so weirdly appropriate (in larger ways than what I possibly could have conveyed in my single post) that I am again astounded by it, and now wonder how much the AI knows about me, personally. But it has to be coincidence. They’re equally good in my mind, so I asked ChatGPT to choose a random number between 1 and 2, and it chose 2, thus I am changing the title accordingly.

Thanks again!

EricGT · December 28, 2023, 6:03pm

The AI provided me with 5 titles, and if I ask again, it will provide 5 different titles. I only kept the two that appeared valuable. About half of the suggested titles are not useful, and my goal is not to suggest or change every title. However, when it seems appropriate, I will make a note of them.
