Entity and relation extraction fine-tuning

I am currently working on a project that synthesizes text from technical documents and scientific papers to provide concise answers to questions, using only information contained within the text. So far I have managed to scrape .pdf files for content, split the content into chunks, embed each chunk, and store the embeddings in a vector database. When the user asks a question, the database is queried and the most relevant chunks of text are retrieved. I then created a GPT-3 prompt that summarizes the retrieved text to produce an answer while guarding against hallucination (a sketch of this pipeline follows the list below). This method somewhat works, but it is inefficient for two reasons:

  1. Not all relevant data is always retrieved. For example, if the top chunk mentions the name of an external source without providing details about it (and those details are necessary to answer the question), the model will fail to produce an accurate answer.

  2. The prompt is very large, resulting in a cost of around 2-4 cents per question. This is because I provide the top 3 chunks (each 500 tokens long) to the model so it can synthesize an answer.
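
To make the cost discussion concrete, here is a minimal, illustrative sketch of the retrieve-then-summarize step, not my exact code: `query_vector_db` is a hypothetical stand-in for the vector-database client, and the prompt wording is indicative only.

```python
import openai  # pre-1.0 openai client, as used at the time of writing

def answer_question(question: str, top_k: int = 3) -> str:
    # Embed the question with the same model used for the stored chunks.
    emb = openai.Embedding.create(
        model="text-embedding-ada-002", input=question
    )["data"][0]["embedding"]

    # Hypothetical helper: returns the top_k most similar text chunks
    # from the vector database.
    chunks = query_vector_db(emb, top_k=top_k)

    # Restrict the model to the retrieved context to limit hallucination.
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say you cannot answer.\n\n"
        "Context:\n" + "\n---\n".join(chunks)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt,
        max_tokens=256, temperature=0,
    )
    return resp["choices"][0]["text"].strip()
```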

Recently, I have been looking into knowledge graphs and entity/relation extraction from text. It seems to me that building a knowledge graph is the most token-efficient way to query a database and supply the relevant information for formulating an answer. I have so far experimented with various models for relation triple extraction, most of which work on a sentence-by-sentence basis; that is, they do not extract relations that span more than one sentence, let alone relations at the document level. There are a couple of publicly available models trained for document-level relation extraction, but they detect a fixed set of relations defined by a schema (and modifying this schema would mean building a new dataset and retraining the model entirely).

To solve this, I am working on fine-tuning a GPT-3 model on a few samples from the DocRED dataset, which contains manually annotated chunks of text from Wikipedia articles. From my understanding, this would likely perform document-level relation extraction far more accurately than other NLP models, and it would allow for a more flexible schema if I somehow incorporate the relation/entity types I want to extract into the prompt. I have so far performed only two fine-tunes: one on davinci-003 with 100 prompt:response pairs, and another on curie-003 with 200 prompt:response pairs. I have achieved mixed results. The biggest issue I have encountered so far is that both models tend to repeat relations a lot (see the examples below), especially when the style of the text differs from the content of the DocRED dataset.
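
For reference, a minimal sketch of how DocRED samples can be converted into prompt:response pairs of the form shown below, assuming the public DocRED files (`train_annotated.json`, plus the `rel_info.json` mapping from relation ids like "P569" to readable names); only the first mention of each entity is used, for brevity.

```python
import json

# rel_info.json (from the DocRED repo) maps relation ids to names.
with open("rel_info.json") as f:
    rel_info = json.load(f)

def docred_to_pair(doc: dict) -> dict:
    # Flatten the tokenized sentences back into running text.
    text = " ".join(" ".join(sent) for sent in doc["sents"])
    triples = []
    for label in doc["labels"]:
        head = doc["vertexSet"][label["h"]][0]["name"]  # first mention only
        tail = doc["vertexSet"][label["t"]][0]["name"]
        rel = rel_info[label["r"]]
        triples.append(
            f"('subject': '{head}', 'relation': '{rel}', 'object': '{tail}')"
        )
    return {
        "prompt": f"ANALYZE: {text} ->",
        "completion": " " + ",".join(triples) + ",##STOP##",
    }

with open("train_annotated.json") as f:
    docs = json.load(f)

with open("docred_finetune.jsonl", "w") as out:
    for doc in docs[:100]:  # e.g. 100 pairs, as in the davinci fine-tune
        out.write(json.dumps(docred_to_pair(doc)) + "\n")
```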

Has anyone attempted to train a GPT-3 model on a relation extraction dataset? Does anybody have any advice for improving my fine-tunes (other than increasing the sample count to >=200)?

Extract from the DocRED dataset (formatted for GPT-3)
PROMPT: "ANALYZE: Malfunkshun : The Andrew Wood Story is a 2005 documentary film based on the life of Andrew Wood , singer / songwriter for the Seattle - based alternative rock bands Malfunkshun and Mother Love Bone . The film includes interviews with fellow Seattle musicians and friends such as Chris Cornell , Kim Thayil , Jeff Ament , and Stone Gossard . […] ->"
COMPLETION: " ('subject': 'Wood', 'relation': 'date of death', 'object': '1990'),('subject': 'Stone Gossard', 'relation': 'member of', 'object': 'Mookie Blaylock'),('subject': 'Eddie Vedder', 'relation': 'member of', 'object': 'Mookie Blaylock'), […] ,##STOP##"

Inference with davinci-003
OpenAI is an artificial intelligence (AI) research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company conducts research in the field of AI with the stated goal of promoting and developing friendly AI in a way that benefits humanity as a whole. The organization was founded in San Francisco in late 2015 by Sam Altman, Elon Musk, and others, who collectively pledged US$1 billion. Musk resigned from the board in February 2018 but remained a donor. → ('entity': 'OpenAI', 'type': 'ORG', 'aliases': []),('entity': 'OpenAI', 'type': 'ORG', 'aliases': []),('entity': 'OpenAI', 'type': 'ORG', 'aliases': []),('entity': 'OpenAI', 'type': 'ORG', 'aliases': []),('entity': 'OpenAI', 'type': 'ORG', 'aliases': ['OpenAI LP', 'OpenAI LP']),('entity': 'OpenAI', 'type': 'ORG', 'aliases': ['OpenAI Inc.', 'OpenAI Inc.']),('entity': 'OpenAI', 'type': 'ORG', 'aliases': []),('entity': 'San Francisco', 'type': 'LOC', 'aliases': []),('entity': 'in late 2015', 'type': 'TIME', 'aliases': []),('entity': 'Sam Altman', 'type': 'PER', 'aliases': []),('entity': 'Elon Musk', 'type': 'PER', 'aliases': []),('entity': 'US$1 billion', 'type': 'NUM', '
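
One way to keep the repeated tuples above out of the knowledge graph is to deduplicate them in post-processing (raising `frequency_penalty` on the completion call may also help). A minimal sketch, with illustrative records standing in for the parsed model output (the parser itself is not shown):

```python
# Records as they might come out of a parser for the completion format above.
records = [
    {"entity": "OpenAI", "type": "ORG", "aliases": []},
    {"entity": "OpenAI", "type": "ORG", "aliases": []},
    {"entity": "OpenAI", "type": "ORG", "aliases": ["OpenAI LP", "OpenAI LP"]},
    {"entity": "San Francisco", "type": "LOC", "aliases": []},
]

def dedupe(records: list[dict], keys: tuple[str, ...]) -> list[dict]:
    """Keep the first record for each unique combination of key fields."""
    seen, out = set(), []
    for rec in records:
        k = tuple(rec.get(f) for f in keys)
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out

print(dedupe(records, keys=("entity", "type")))
# Keeps one OpenAI record and the San Francisco record.
```

A refinement would be to merge the alias lists across duplicates rather than keeping only the first record.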

Hi @gonespral,
So far, I have processed over 250 medical research articles using the following approach:

  1. Breaking up the text into sentences locally.
  2. Matching named entities against an ontology and passing those as suggested entities, to increase their chances of being selected as either the source or the target of a relation.
  3. Feeding the davinci-003 model one sentence at a time, with a prompt that gives it a template for how to classify relations.
  4. Parsing the returned answer to check whether it complies with the allowed answers.
  5. Storing the triplets in a text file and passing them to a neo4j database for analysis (sketched after this list).
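
A minimal sketch of steps 3-5, under stated assumptions: `sentences` is the output of step 1, `extract_triplet` is a hypothetical wrapper around the per-sentence davinci-003 call and prompt template (not shown), and the allowed-relation set is illustrative, not my actual ontology.

```python
from neo4j import GraphDatabase

# Illustrative closed set of relations for the compliance check (step 4).
ALLOWED_RELATIONS = {"causes", "treats", "prevents", "associated_with"}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_triplet(tx, source: str, relation: str, target: str):
    # Relationship types cannot be parameterized in Cypher, so the
    # relation name is stored as a property; MERGE avoids duplicates.
    tx.run(
        "MERGE (a:Entity {name: $source}) "
        "MERGE (b:Entity {name: $target}) "
        "MERGE (a)-[:REL {type: $relation}]->(b)",
        source=source, relation=relation, target=target,
    )

with driver.session() as session:
    for sentence in sentences:                  # output of step 1
        triplet = extract_triplet(sentence)     # hypothetical: steps 2-3
        if triplet and triplet[1] in ALLOWED_RELATIONS:     # step 4
            session.execute_write(store_triplet, *triplet)  # step 5
```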

I am quite satisfied with the results. The cost per document is a bit below 40 cents, but I think this is worth it considering the complexity of the medical texts. I can’t yet provide an F1 score for the model, but based on a crude analysis, the causal relations are extracted correctly in roughly 90-95% of cases. That is, the direction of the relation is extracted correctly.

I intend to try extracting relations that span multiple sentences once ChatGPT’s API is officially released.

Hi, your approach seems very much like what I’m doing right now, and the results look compelling. I’m relying on gpt-4 to suggest the named entities and the relations; it does comparatively well, though I haven’t tested it at a large scale yet. Have you published details about your approach elsewhere, or do you mind sharing more here? For example, at step 2: is this the usual NER step, done with another tool? Do you then supply the NER-extracted entities along with the sentence at step 3? What does the prompt look like? Thanks!

I’m curious. I’ve been working on a similar approach, and also getting promising results. At least I think so. But then I ask: What am I going to do with this?
Options I’ve come up with so far:

  1. Missing-link prediction (the standard GCN task: train a GCN on the extracted relation data, then query a missing link and look at the logits).
  2. Multi-hop queries: is entity1 related to entity2 (i.e., query the transitive closure)? I don’t know how to do this. Classic graph search (see the sketch after this list)? What I really want is for an LLM to be able to use this derived info…

Ideas?
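
For option 2, a minimal sketch of the classic-graph-search route, assuming the extracted triples are already in memory as (subject, relation, object) tuples; the connecting path can then be verbalized and handed to an LLM as context. The example triples are illustrative.

```python
import networkx as nx

# Illustrative triples as they might come out of the extraction step.
triples = [
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane"),
]

g = nx.DiGraph()
for subj, rel, obj in triples:
    g.add_edge(subj, obj, relation=rel)

def relation_path(a: str, b: str) -> list[str] | None:
    """Return the chain of relations linking a to b, if any."""
    if a not in g or b not in g or not nx.has_path(g, a, b):
        return None
    path = nx.shortest_path(g, a, b)
    # Verbalize each hop so the path can be fed to an LLM as context.
    return [
        f"{u} --{g[u][v]['relation']}--> {v}"
        for u, v in zip(path, path[1:])
    ]

print(relation_path("aspirin", "thromboxane"))
# ['aspirin --inhibits--> COX-1', 'COX-1 --produces--> thromboxane']
```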