Hi Everyone, first time short time. I have a technical background but I’m really starting from zero with GPT and LLMs. I am so green that I don’t know the nomenclature well enough to ask concise questions, so please help me by correcting me.
Suppose I have a corpus of ~18 million message board posts from a website I was very active on in the aughts. In a moment of boredom back around 2010, I tried to download every post, just to see if I could do it. Along with the message body, I extracted whatever useful metadata I could and loaded it into a SQL DB. I then sat on it, almost entirely unthought of, until this year when suddenly, for obvious reasons, it seemed relevant.
My nebulous goal is to ingest (is that the right word?) this data into a custom GPT, then 1) ask it questions about the various users for whom I have content, and 2) ask it to respond to questions in the style of a given user.
I’ve tried this already, using myself as a guinea pig, and the results are uncanny, especially considering the very low effort I put into extracting the text (I simply dumped the message bodies of my posts into a text file and fed it to GPT, with a CR/LF char between posts). I’m hoping I can make it even better by using the metadata, if possible, to guide the GPT.
I have these fields – can you help me use them more intelligently?
Post Date – Is there a temporal component to the GPT, such that I can somehow tell it when each post was made, and it can use that info later to piece together a history (for example)?
Message ID – Incrementing Int value uniquely identifies each post;
Subject – Relates individual posts to each other within a Board;
Board Name – Groups posts by broad topic;
Folder Name – Even broader grouping. Each Board Name is in only one Folder Name;
Author ID/Name – The person who wrote this post;
Thread ID – I think this correlates with Subject;
Recommendations – Users were allowed to “Recommend” a certain number of posts per day. This is a subjective, crowd-sourced measure of post quality.
I can also extract (but would have to go back to the source html files) the connections between replies and the post they’re replying to, which seems especially useful.
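To make the question concrete, here is a rough sketch in Python (standard library only) of one way the fields above could travel with each post instead of being thrown away. Everything in it is an assumption for illustration: the Post class, its field names, the reply_to field, and the header layout are hypothetical, not something a GPT requires.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Post:
    """One message board post plus its metadata (hypothetical field names)."""
    message_id: int
    thread_id: int
    post_date: str          # ISO 8601 string, e.g. "2009-05-01T12:00:00"
    folder: str
    board: str
    subject: str
    author: str
    recommendations: int
    body: str
    reply_to: Optional[int] = None  # message_id of the parent post, if known


def to_jsonl_line(post: Post) -> str:
    """Serialize a post as one JSON object per line (JSONL)."""
    return json.dumps(asdict(post), ensure_ascii=False)


def to_text_block(post: Post) -> str:
    """Render a post as plain text with a small metadata header on top."""
    header = [
        f"Date: {post.post_date}",
        f"Board: {post.folder} / {post.board}",
        f"Subject: {post.subject}",
        f"Author: {post.author}",
        f"Recommendations: {post.recommendations}",
    ]
    if post.reply_to is not None:
        header.append(f"In reply to message: {post.reply_to}")
    return "\n".join(header) + "\n\n" + post.body
```

The JSONL form keeps the metadata machine-readable, while the text form puts the date, author, and rec count where the model can actually read them alongside the message body.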
Can I use any of this? If so, how? I would like to prioritize the more highly recommended content, first of all, so a post with 100 recs is considered more “important” than one with zero recs. I also want to associate a specific user’s posts with that user, rather than ingest a mishmash of everyone’s text.
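A minimal sketch of what I mean by "prioritize" and "associate", assuming each post is a Python dict with hypothetical keys author, recommendations, and post_date: filter to one user, then put the most-recommended posts first so they are the ones that make it into whatever limited amount of text gets fed to the GPT. This is only one possible interpretation, not an established method.

```python
from typing import Dict, List, Optional


def top_posts_for_author(
    posts: List[Dict],
    author: str,
    limit: Optional[int] = None,
) -> List[Dict]:
    """Return one author's posts, most-recommended first.

    Ties are broken by post date (oldest first), which works for
    ISO 8601 date strings because they sort chronologically.
    """
    mine = [p for p in posts if p["author"] == author]
    mine.sort(key=lambda p: (-p["recommendations"], p["post_date"]))
    return mine if limit is None else mine[:limit]


# Made-up sample data, for illustration only.
sample = [
    {"author": "alice", "recommendations": 0, "post_date": "2005-01-02"},
    {"author": "alice", "recommendations": 100, "post_date": "2006-03-04"},
    {"author": "bob", "recommendations": 50, "post_date": "2004-07-08"},
]
best = top_posts_for_author(sample, "alice", limit=1)
```

Here best would hold only the 100-rec post by the sample user "alice"; the post by "bob" never enters the selection.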
Also, how does the system deal with html tags? Because I have not cleansed the message text, and there’s some markup in it.
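If the markup should come out before ingestion, one option is a small pre-processing step. The sketch below uses Python's standard html.parser module; the strip_html name and the choice to turn <br> and <p> tags into line breaks are my own assumptions, and messy real-world HTML may need more care than this.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and drops tags; <br> and <p> become newlines."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


cleaned = strip_html("<p>Hello <b>world</b> &amp; friends<br>bye</p>")
```

Running this on the sample fragment leaves two plain lines of text, "Hello world & friends" and "bye".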
Please feel free to tell me I’m barking up the wrong tree with any or all of this; I only have the most basic understanding of what GPTs are doing “under the hood” – I just yesterday learned the concept of a token and how to tokenize a string of text. Many thanks!