I’m really fascinated by GPT-3 and how it could be used for language learning. I’m wondering if anyone’s thought about how it could be used with software like Anki. I’d love something like being able to input my vocabulary (exported from Anki), and then have GPT-3 chat with me, but using my vocabulary.
Has anyone played with ideas like this, and if so, any tips/ideas for how to teach this to GPT-3? I think the vocabulary is too big to fit into the prompt (for the language I’m learning it’s about 1,000 words).
I would personally fine-tune a davinci model for this task. Use your vocabulary as the input and provide example outputs restricted to that same vocabulary (e.g. import a WhatsApp chat with someone who shares it).
The longer the list, the better.
However, don’t quote me on this.
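To make the fine-tuning idea above concrete, here’s a rough sketch of preparing training data in the legacy prompt/completion JSONL format that davinci fine-tunes expect. The vocabulary, the chat turns, and the filtering rule are all made-up examples, not a real export format:

```python
import json

# Made-up vocabulary, e.g. exported from an Anki deck.
vocab = {"hola", "como", "estas", "bien", "gracias", "y", "tu"}

# Made-up chat history (question, answer) pairs, e.g. from a WhatsApp export.
chat_turns = [
    ("hola como estas", "bien gracias y tu"),
    ("que haces ahora", "nada especial"),  # contains out-of-vocabulary words
]

def in_vocab(sentence, vocab):
    """True if every word of the sentence is in the allowed vocabulary."""
    return all(word in vocab for word in sentence.lower().split())

# Keep only turns that stay fully inside the vocabulary, and serialize
# them as prompt/completion JSONL lines for the fine-tuning API.
lines = [
    json.dumps({"prompt": q + "\n\n###\n\n", "completion": " " + a})
    for q, a in chat_turns
    if in_vocab(q, vocab) and in_vocab(a, vocab)
]

print(len(lines))  # only the fully in-vocabulary turn survives
```

The point of the filter is that the model only ever sees completions written in your vocabulary, which is what nudges it to respond the same way.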
I have the same problem, so please share if you find something better than the below:
An approach I am trying with mixed results is to create a vector database of embeddings representing the allowed vocabulary. You can then replace disallowed words in a post-processing step on the AI response, searching for the closest allowed word (again via embeddings).
However, if you, like me, are looking for a scalable solution, this creates way too much overhead in the post-processing, especially considering that each word has several forms… Also, because the substitutions scramble syntax and grammar, you need a third post-processing step to get the grammar straight again.
Again, this is a crappy solution, so please share if you find anything better!
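The nearest-word replacement step described above can be sketched like this. The 2-d vectors are invented purely for illustration; in practice you would get them from an embeddings model and store them in a vector database:

```python
import math

# Allowed vocabulary with toy embedding vectors (made up for illustration).
allowed = {
    "happy": (0.9, 0.1),
    "sad":   (-0.8, 0.2),
    "big":   (0.1, 0.9),
}

# Toy embeddings for words the model might emit that are NOT on the list.
extra = {
    "joyful": (0.85, 0.15),
    "huge":   (0.2, 0.95),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_allowed(word):
    """Keep allowed words; map disallowed ones to the closest allowed word."""
    if word in allowed:
        return word
    vec = extra.get(word)
    if vec is None:
        return word  # no embedding available; leave the word untouched
    return max(allowed, key=lambda w: cosine(vec, allowed[w]))

response = "joyful huge sad"
cleaned = " ".join(nearest_allowed(w) for w in response.split())
print(cleaned)  # → "happy big sad"
```

This also shows why the grammar breaks: each word is swapped in isolation, with no regard for inflection or agreement, which is exactly why a third clean-up pass ends up being necessary.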
If you only need a certain reading level, however, just telling GPT to write at the reading level of an x-year-old generally kinda works…
Could you not start a conversation with it and say
“talk to me about [topic of your choice], using the vocabulary list above and using all their relevant forms?”
or even
“Come up with an engaging fireside chat we can have together using the [vocab list] above. The only topics we can discuss should come from the vocabulary list and nothing else.”
I mean, so long as you can retrieve the vocabulary list as a string from whatever software/app you want to extract from, it can be done.
You can also practice using a language with GPT by chatting directly in that language. It’d be a fun exercise to see if you can translate those prompts on your own; then it could role-play with you using the intended vocabulary in the language you’re learning.
Thing is, I need to be able to grow the vocabulary over time, starting very limited and building up to all words that exist. GPT’s input limits (or any LLM’s, for that matter) make that impossible… So yes, this will work in the beginning, but very quickly it won’t. Furthermore, I already need the input limit to instruct other things and feed in the previous story section, character information, …
so I really need to find a solution that works more or less outside of the input prompt.
So, I would recommend reframing the approach here.
That concept won’t scale because language itself doesn’t scale in that way. There is no such list for all words that “exist”, unless you count a dictionary (which is still edited and modified constantly). Language use and vocabulary are constantly evolving and dynamic, meaning such a list is never going to be the same day by day, and even then there is debate over what constitutes a legitimate vocabulary word. Is ‘lol’ a word now? How do you use it?
Also consider that just because you have a list of vocab words doesn’t mean the context and/or meaning will stay the same as time progresses. “That slaps” is a good example of this.
I say this because LLMs at their core are essentially massive datasets of contextualized “words” in action. An LLM like GPT is already the closest thing you’re going to get to a database of all words that exist and have been used in human history. Before LLMs, there were corpora.
Now, I haven’t poked around to explore the capabilities and limitations of OpenAI’s APIs quite yet, but you can input a file via ChatGPT, like a .json file that stores a list of words, and ask it to use that data specifically. With the Advanced Data Analysis plugin at least, the information inside the file does not directly count towards input limits. You could write a program that pulls a list of words from a software or app, match that against a dictionary list (a word dictionary, not a data structure), and create your own list/file to feed GPT from that combination of lists.
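That pull-and-match step could look something like the following. The word lists and the `vocab.json` file name are hypothetical stand-ins for whatever your app actually exports:

```python
import json

# Hypothetical exported deck (e.g. from Anki) and a standard word list.
anki_words = ["hola", "gracias", "lol"]
dictionary_words = {"hola", "gracias", "adios"}

# Keep only words the dictionary recognises, preserving export order.
vocab = [w for w in anki_words if w in dictionary_words]

# Write the combined list to a JSON file you could upload to ChatGPT.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump({"allowed_words": vocab}, f, ensure_ascii=False)

print(vocab)  # → ['hola', 'gracias']
```

Since the file is regenerated from the export each time, growing the vocabulary is just a matter of re-running the script after each study session.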
So in terms of programmatically creating and inputting your own list(s) with minimal effort, that scales. Trying to go from a handful of vocabulary words to memorizing the entire dictionary, however, does not.
If this is about creating a language learning technique personalized to your needs, that is certainly possible, manageable, and fun to do. Language learning with an agent like GPT has a lot of genuine promise, actually. However, it’s also much more difficult and complicated than what you see on the surface. Second Language Acquisition is an entire subfield of linguistics, not to mention education and language teaching. You don’t need those to teach yourself a language, but there’s a reason Duolingo exists as an app and not as a bunch of flashcards.