GPT-3 for terminology extraction

I am considering using GPT-3 for terminology extraction for glossaries.

The idea is this:

  • First, extract all text content from source materials using a regular expression function.

  • Next, segment the text into sentences using a regular expression function.

  • Then, tokenize the text into multiword expressions using GPT-3 (I already have an idea of how to do this).

  • Next, pair each term with three matching sentences from the source text, stored in a Python dictionary.

  • Finally, use the term plus three context sentences as a prompt for GPT-3 to decide if the term is specific and relevant enough for a glossary.
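The non-GPT-3 steps above could be sketched roughly as follows. This is only a minimal illustration under several assumptions: the source material is HTML-like markup, the sentence splitter is a deliberately naive regular expression, and all function names and the wording of the decision prompt are hypothetical, not part of any library.

```python
import re

# Assumption: source materials are HTML-like, so "extracting text" means stripping tags.
TAG_RE = re.compile(r"<[^>]+>")
# Naive boundary: end punctuation, whitespace, then an uppercase letter.
SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def extract_text(raw):
    """Strip tags and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", raw)).strip()


def split_sentences(text):
    """Split text into sentences with the naive regex above."""
    return [s.strip() for s in SENT_RE.split(text) if s.strip()]


def pair_terms(terms, sentences, k=3):
    """Map each term to at most k sentences that contain it (case-insensitive)."""
    contexts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        contexts[term] = [s for s in sentences if pattern.search(s)][:k]
    return contexts


def build_decision_prompt(term, context_sentences):
    """Assemble a term plus its context sentences into a yes/no prompt (hypothetical wording)."""
    lines = [
        "Decide whether the term below is specific and relevant enough for a glossary.",
        "",
        f"Term: {term}",
        "",
        "Context:",
    ]
    lines += [f"- {s}" for s in context_sentences]
    lines += ["", "Answer (yes or no):"]
    return "\n".join(lines)
```

The list of candidate terms passed to pair_terms would come from the GPT-3 tokenization step, which is left out here. A regex splitter like this will break on abbreviations such as "Dr." or "e.g.", so a dedicated sentence boundary detector may be the better choice for real material.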

I am going to attempt this, but am open to discussion with anyone who is interested in this topic and has other ideas or wants to try it themselves.

Best regards.

You might be reinventing the wheel. There are plenty of NLP resources that can do some of this already, for instance sentence boundary detection (SBD). I once fine-tuned GPT-2 to do that task.

Anyway, you’re thinking about this in an old-school way. GPT-3 can do this in one step as a zero-shot task, without fine-tuning. You can get even better performance with fine-tuning or few-shot prompting:

Extract keywords from the following passage:

Passage:
Kowloon Walled City was an ungoverned and densely populated de jure Chinese enclave within the boundaries of Kowloon City, British Hong Kong. Originally a Chinese military fort, the walled city became an enclave after the New Territories were leased to the United Kingdom by China in 1898. Its population increased dramatically following the Japanese occupation of Hong Kong during World War II. By 1990, the walled city contained 50,000 residents[1] within its 2.6-hectare (6.4-acre) borders. From the 1950s to the 1970s, it was controlled by local triads and had high rates of prostitution, gambling, and drug abuse.

In January 1987, the Hong Kong government announced plans to demolish the walled city. After an arduous eviction process, and the transfer of de jure sovereignty of the enclave from China to Britain, demolition began in March 1993 and was completed in April 1994. Kowloon Walled City Park opened in December 1995 and occupies the area of the former walled city. Some historical artefacts from the walled city, including its yamen building and remnants of its southern gate, have been preserved there.

Keywords:
-Kowloon Walled City
-de jure
-ungoverned
-population
-Japanese occupation
-triads
-high rates
-eviction
-Kowloon Walled City Park
-yamen
-southern gate
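For anyone who wants to generate this prompt for arbitrary passages, a small helper might look like the sketch below. The function name and the seed parameter are hypothetical; the layout simply mirrors the example above, and the trailing hyphen is an optional cue for the list format.

```python
def build_keyword_prompt(passage, seed="-"):
    """Build a zero-shot keyword-extraction prompt in the layout shown above.

    seed is appended after "Keywords:" so the model sees how list items start;
    pass seed="" to leave the list unseeded.
    """
    return (
        "Extract keywords from the following passage:\n\n"
        "Passage:\n"
        f"{passage.strip()}\n\n"
        "Keywords:\n"
        f"{seed}"
    )


# The resulting string would be sent as the prompt of a completion request.
example_prompt = build_keyword_prompt(
    "Kowloon Walled City was an ungoverned and densely populated enclave."
)
```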

Awesome, thank you so much for assisting me and stimulating my way of thinking. I’ll definitely try that out. GPT-3 can be counterintuitively intelligent.

I’m gonna try this out quite soon.

My current understanding is that GPT-3 is capable enough that it is always worth trying a task zero-shot first, just explicitly telling it what you want it to do, before moving on to few-shot prompting or to training on quite a number of examples (which I guess people call “fine-tuning”).

This is a minor detail, but do you think even in the zero-shot case it’s essential to add in a template for it to complete? In your example, that would mean adding a few blank hyphens so it understands to fill in more.

Perhaps the answer is just “try it and see” but I’d be interested to know if you have any thoughts about this.

Plus, just out of curiosity: has anyone found GPT-3 does better or worse if you give it instructions in the second person: “Do this:” - or in the first person, as if it’s GPT-3 that’s speaking: “I will:”?

Thanks very much.

Yeah, I found that giving it hyphens is critical so that it knows the format of the list.
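A fixed hyphen format also makes the completion easy to turn back into a Python list. Here is a rough sketch, assuming the prompt ended with a single "-" so that the completion starts in the middle of the first item; the function name is hypothetical.

```python
def parse_hyphen_list(completion, seeded=True):
    """Turn a hyphen-list completion into a list of unique, non-empty items.

    seeded=True assumes the prompt already ended with "-", so the first
    item in the completion has no leading hyphen of its own.
    """
    text = ("-" + completion) if seeded else completion
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("-"):
            continue  # ignore anything that is not a list item
        item = line.lstrip("-").strip()
        if item and item not in items:
            items.append(item)
    return items


keywords = parse_hyphen_list("Kowloon Walled City\n-de jure\n- triads\n\n-triads\n-")
```

Empty trailing hyphens and repeated items are dropped, which helps when the model keeps emitting list markers after it has run out of real content.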

Just for reference, I tried something similar but it didn’t quite work:

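import openai

openai.api_key = "mykey"  # placeholder, replace with a real API key

engine = "ada"  # the engine used in this attempt
max_tokens = 200

# Prompt reproduced exactly as posted, including its indentation and blank lines.
prompt = """ Which of the following are animals?

        book train dolphin penguin happy Simon dog cat bear fish reptile George dinosaur President flying

        - dolphin

        - penguin

        - dog

        -

        - """

# Request a completion and print the raw response.
completion = openai.Completion.create(engine=engine, prompt=prompt, max_tokens=max_tokens)

print(completion)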

Returned:

"text": " \n childbirth\n - cuddle feet\n \n\n \n \n\n ~\n ~ - ~ ~ ~ ~~*~ ~~~ *~ *~ *~ *~ * ~~~ * ~~ ~*~\n ~~ ~~*~ **~~** ~~*~ * ~~~ - ~~*~ * ~~*~ ~~*~ *~~** ~~~~~~~~~~ ~~*~"

Do you think it’s just a matter of switching to the davinci engine, lowering the temperature, and adding more examples?

Thanks very much.
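For concreteness, the adjustments asked about here (davinci, low temperature, more examples) might be combined as in the sketch below. It only builds the request parameters and does not call the API. The function name, the "Words:/Animals:" layout, the max_tokens value, and the stop sequence are all assumptions for illustration, and whether this actually fixes the output is untested.

```python
def build_animal_request(candidates, examples, engine="davinci"):
    """Build few-shot completion parameters for the animal-filtering task.

    examples is a list of (words, animals) pairs used as worked demonstrations.
    """
    shots = []
    for words, animals in examples:
        answer = "\n".join(f"- {a}" for a in animals)
        shots.append("Words: " + ", ".join(words) + "\nAnimals:\n" + answer)
    # The final block is left open, seeded with one hyphen for the model to continue.
    shots.append("Words: " + ", ".join(candidates) + "\nAnimals:\n-")
    prompt = "List the animals among the given words.\n\n" + "\n\n".join(shots)
    return {
        "engine": engine,
        "prompt": prompt,
        "max_tokens": 40,
        "temperature": 0,   # low temperature for more deterministic output
        "stop": ["\n\n"],   # a blank line separates blocks, so stop there
    }


request = build_animal_request(
    ["happy", "dog", "George", "bear"],
    [(["book", "dolphin", "train"], ["dolphin"]),
     (["penguin", "Simon", "flying", "cat"], ["penguin", "cat"])],
)
# With the same legacy library as in the post, this would be sent as:
# openai.Completion.create(**request)
```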

Try one of the Instruct engines instead.