Terminology extractor critique / suggestions

I am trying to design a terminology extractor with GPT-3, and for now I am following the tweet classifier template from the documentation. That means the engine is davinci, the response length is 64, the temperature is 0.2, Top P is 1, the stop sequence is “two enter keys” (two newlines), and the prompt is highly structured: a natural-language instruction, about 9 examples, and 5 sentences for completion.

This is my prompt, which copies the tweet template completely:

This is a terminology extractor.

Sentence: “The rocketship is fueled with Alpha-Omega Jet Fuel.”
Keywords: Alpha-Omega Jet Fuel

Sentence: Barack Obama will visit Japan on Memorial Day.
Keywords: Memorial Day

Sentence: The metatarsals are attached to the cuneiform bones at a diagonal angle.
Keywords: metatarsal, cuneiform

Sentence: The HTTP protocol is used for sending and receiving HTML requests over the internet.
Keywords: HTTP protocol, HTML request

Sentences:

  1. Happy Birthday John! I got you this memory-foam mattress for the occasion.
  2. The band Green Day usually uses Vox foot pedals with heavy reverb and oscillatory delay.
  3. The Amazon rainforest has been heavily damaged by ammonium acetate in the form of acid rain.
  4. The young boy’s occlusion was malaligned and he was in dire need of a surgical implant.
  5. It was the happiest day in the world, except for a V1 Tesla Roadster which had crashed into an embankment.

Keywords:

  1. memory-foam
  2. Vox foot pedals, oscillatory delay
  3. ammonium acetate
  4. occlusion
  5. V1 Tesla Roadster

Sentences:

  1. After all the Shizuku Miyatsu that he had practiced, he had become serene.
  2. The left tetrometer was fully engaged at the time of the incident.
  3. She underwent a tetralogy of fallot surgery early in life.
  4. The corneal gland is inflamed due to lack of lipids.
  5. A gristmill mills grain with the assistance of vanes which are called blades.

Keywords:
1.

Unfortunately, in the Playground GPT-3 returned just the stop sequence, so the prompt or the settings will need at least a little adjusting.
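For anyone who wants to reproduce this setup outside the Playground, here is a minimal sketch in Python that assembles the same prompt layout and records the settings listed at the top of the post. The helper names (`numbered`, `build_prompt`) are made up for illustration, the example data is abridged from the prompt above, and the settings dictionary assumes the parameter names of the legacy Completions API (`engine`, `max_tokens`, `temperature`, `top_p`, `stop`). No request is actually sent.

```python
# Hypothetical helpers: only the prompt text and the setting values come from the post.

def numbered(items):
    """Format items as an indented numbered list, one item per line."""
    return "\n".join(f"  {i}. {item}" for i, item in enumerate(items, start=1))


def build_prompt(single_examples, batch_sentences, batch_keywords, new_sentences):
    """Assemble the few-shot prompt in the same layout as the post."""
    parts = ["This is a terminology extractor."]
    for sentence, keywords in single_examples:
        parts.append(f"Sentence: {sentence}\nKeywords: {keywords}")
    parts += ["Sentences:", numbered(batch_sentences)]
    parts += ["Keywords:", numbered(batch_keywords)]
    parts += ["Sentences:", numbered(new_sentences)]
    # The post ends the prompt with "Keywords:" followed directly by "1."
    parts.append("Keywords:\n1.")
    return "\n\n".join(parts)


# Settings from the post; the key names are an assumption (legacy Completions API).
REQUEST_SETTINGS = {
    "engine": "davinci",
    "max_tokens": 64,
    "temperature": 0.2,
    "top_p": 1,
    "stop": ["\n\n"],  # "two enter keys"
}

# Abridged data from the post's prompt.
prompt = build_prompt(
    single_examples=[
        ("Barack Obama will visit Japan on Memorial Day.", "Memorial Day"),
        ("The HTTP protocol is used for sending and receiving HTML requests over the internet.",
         "HTTP protocol, HTML request"),
    ],
    batch_sentences=[
        "Happy Birthday John! I got you this memory-foam mattress for the occasion.",
        "The Amazon rainforest has been heavily damaged by ammonium acetate in the form of acid rain.",
    ],
    batch_keywords=["memory-foam", "ammonium acetate"],
    new_sentences=[
        "The left tetrometer was fully engaged at the time of the incident.",
        "The corneal gland is inflamed due to lack of lipids.",
    ],
)
print(prompt)
```

Building the prompt in code makes layout differences easy to spot. In the prompt above, for example, the worked example has a blank line and indentation between "Keywords:" and the first numbered item, while the final "Keywords:" is followed directly by "1." Whether that difference affects the completion is an open question, but it is a cheap thing to vary.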

I can think of a few adjustments.

I can increase the temperature.

I can eliminate the stop sequence.

I can provide higher-quality data. These examples are made up, but I can provide real-world sentences with actual terminology to be extracted.

I can provide even more examples.

I will continue to tinker with this, but in case anybody wants to collaborate, I’d be interested in hearing your suggestions.

Thanks very much.

Here’s a brief update for anyone interested in knowing how to do this or collaborating in making it work:

I decided to use @daveshapautomator's template for a keyword extractor.

A first draft was unsuccessful, possibly because it had too little source text, so I used the introduction to a Wikipedia article as the passage and provided 8 examples of terminology:

Extract terminology from the following passage:

Passage:

In mathematics, topology (from the Greek words τόπος, ‘place, location’, and λόγος, ‘study’) is concerned with the properties of a geometric object that are preserved under continuous deformations, such as stretching, twisting, crumpling, and bending; that is, without closing holes, opening holes, tearing, gluing, or passing through itself.

A topological space is a set endowed with a structure, called a topology, which allows defining continuous deformation of subspaces, and, more generally, all kinds of continuity. Euclidean spaces, and, more generally, metric spaces are examples of a topological space, as any distance or metric defines a topology. The deformations that are considered in topology are homeomorphisms and homotopies. A property that is invariant under such deformations is a topological property. Basic examples of topological properties are: the dimension, which allows distinguishing between a line and a surface; compactness, which allows distinguishing between a line and a circle; connectedness, which allows distinguishing a circle from two non-intersecting circles.

The ideas underlying topology go back to Gottfried Leibniz, who in the 17th century envisioned the geometria situs and analysis situs. Leonhard Euler’s Seven Bridges of Königsberg problem and polyhedron formula are arguably the field’s first theorems. The term topology was introduced by Johann Benedict Listing in the 19th century, although it was not until the first decades of the 20th century that the idea of a topological space was developed.

Terminology:

  • topology
  • geometric object
  • continuous deformation
  • crumpling
  • gluing
  • topological space
  • subspace
  • continuity

The engine is davinci, the temperature is 0, the response length is 100, and the stop sequence is “enter enter”.

The response is good enough that this tool probably has some practical use:

  • homeomorphism
  • homotopy
  • dimension
  • compactness
  • connectedness

I was happy to find that GPT-3 produces this response consistently, giving me a feeling that the tool is predictable and reliable.
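To check that consistency outside the Playground, the bulleted completion can be turned into a plain list and compared across runs. The following is a minimal sketch in Python. The function names `parse_terms` and `new_terms` are hypothetical, and the completion string is simply the response shown above pasted in by hand rather than fetched from the API.

```python
def parse_terms(completion):
    """Split a bulleted completion into a list of clean terms."""
    terms = []
    for line in completion.splitlines():
        # Drop surrounding whitespace and a leading bullet character, if any.
        term = line.strip().lstrip("•-*").strip()
        if term:
            terms.append(term)
    return terms


def new_terms(completion, seed_terms):
    """Return extracted terms that are not among the seed examples, in order."""
    seen = {t.lower() for t in seed_terms}
    result = []
    for term in parse_terms(completion):
        key = term.lower()
        if key not in seen:
            seen.add(key)
            result.append(term)
    return result


# Seed examples and response copied from the post.
seeds = ["topology", "geometric object", "continuous deformation", "crumpling",
         "gluing", "topological space", "subspace", "continuity"]
completion = (
    "  • homeomorphism\n"
    "  • homotopy\n"
    "  • dimension\n"
    "  • compactness\n"
    "  • connectedness"
)

print(new_terms(completion, seeds))
# ['homeomorphism', 'homotopy', 'dimension', 'compactness', 'connectedness']
```

Two runs can then be compared with `parse_terms(run_a) == parse_terms(run_b)`, which ignores differences in bullet style and whitespace.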

The only thing I feel is missing is coverage: I think there is more good terminology in the passage to be extracted.

Next, I’m going to try a longer passage and write more terminology examples relative to the passage, to see if I can get GPT-3 to return more terminology itself.


You can also try different TOP_P values, as well as adjectives that describe the kind of terminology you want. For instance, you could specify “academic jargon” or “named entities” or “rare terms”. Remember that GPT-3 has a bigger vocabulary than any single human, so you can use lots of adjectives.

Is there any known reason why TOP_P might have a particular effect vs. temperature or is it just up to chance to see what the result is?

TOP_P changes the distribution of tokens that the model is willing to try. A higher TOP_P gives it access to more tokens (all of them at 1.0) and thus allows rarer terminology to be extracted. For instance, you might have an exceptionally rare term that you want to find, meaning you need a TOP_P of 1.0. If, however, you are looking only for mundane terms, you might want to turn it down to 0.1 or 0.05.
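As a rough illustration of that idea, here is a minimal sketch in Python of top-p (nucleus) filtering as it is commonly described: keep the most probable tokens until their cumulative probability reaches TOP_P, then renormalize. The function name `top_p_filter` and the toy probabilities are invented for the example, and nothing here is confirmed to match how GPT-3 implements it internally.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, then renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up numbers, not real model output).
toy = {"space": 0.5, "property": 0.25, "homeomorphism": 0.1875, "homotopy": 0.0625}

for top_p in (0.1, 0.9, 1.0):
    print(top_p, list(top_p_filter(toy, top_p)))
```

With these toy numbers, a TOP_P of 0.1 leaves only the single most likely token, 0.9 lets the rarer "homeomorphism" in, and 1.0 keeps every token.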

I see.
So TOP_P changes the “distribution” of tokens it will try.
So we can think of tokens as “possible responses”. How specifically does the “distribution” change? It just allows more tokens that are graded lower statistically as being a likely candidate?
But how will that interact with temperature?
For example, could we say that TOP_P changes the choices it has, whereas temperature changes the choice it will make given those options? (Risky or conservative.)
In that case it sounds like low temperature with high TOP_P could make sense - extract unusual terms but select them with rigor and definiteness (I mean, this is a vague approximation of how GPT-3 may or may not work / think.)
On the other hand, for a QA bot, we could imagine wanting low TOP_P and low temperature if we wanted it to go for the most conventional answer possible. It would also be interesting to compare low TOP_P with high temperature vs. low temperature with high TOP_P. Maybe the former would be accurate but expressive, and the latter more adventurously creative and insightful for an objective informational question, yet less certain to be correct.

I don’t know if this is right but I’m trying to get a sense for what these parameters mean.

I just tried asking GPT-3 “Who is the character Legolas?” with both settings low, and then with one or the other high. If TOP_P is low, temperature doesn’t change the result. I assume this means the model is given so few choices that a risky chooser and a conservative chooser would pick the same thing. (My guess.)
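That guess can be explored with a toy model. The sketch below, in Python, assumes that temperature rescales the token scores first and that the TOP_P cut is applied afterwards; the thread does not confirm that ordering for GPT-3, and the function names and scores are invented for illustration. Because temperature never changes which token ranks first, a very low TOP_P leaves the same single candidate at every temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them.
    Temperature must be positive here (0 would divide by zero)."""
    scaled = {t: score / temperature for t, score in logits.items()}
    highest = max(scaled.values())
    exps = {t: math.exp(s - highest) for t, s in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def candidates(logits, temperature, top_p):
    """Tokens that survive the top-p cut after temperature scaling."""
    probs = softmax_with_temperature(logits, temperature)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


# Made-up scores for the next word of an answer about Legolas.
toy_logits = {"elf": 3.0, "archer": 2.0, "prince": 1.0, "wizard": 0.0}

for temperature in (0.2, 1.0, 2.0):
    low = candidates(toy_logits, temperature, top_p=0.05)
    high = candidates(toy_logits, temperature, top_p=1.0)
    print(f"temperature={temperature}: low TOP_P -> {low}, high TOP_P -> {high}")
```

In this toy setup the low TOP_P column is the same at every temperature, which is consistent with the observation above. With a high TOP_P every token stays available, and temperature then decides how often the less likely ones are actually chosen.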
