Keyword extraction from given text using vector DB

Hi,
I’m struggling a bit with a use case: extracting keywords from a given text. Maybe I’m doing something wrong.
I have a set of keywords. Each keyword is a document in ChromaDB (added using OpenAIEmbeddings). Then I have a RetrievalQA chain (LLM gpt-3.5-turbo, Chroma as retriever), which I use to prompt something like “Based on a given context, extract all keywords from the following text…”.
But the responses are too often out of context. When I use the option “return source documents”, the returned docs are almost always out of context.

Is this generally a good practice or should I take a different approach?

Hi and welcome to the developer forum!

I’m not exactly sure what you are trying to do. You mention that a keyword is a document — shouldn’t a keyword be a word? Can you perhaps show some examples of your use case and a typical call flow, along with the prompts given and the data returned by the various API calls?


A keyword (in this case) is mostly a single word, but it can contain multiple words (AFAIK up to 3-5 words) as well. Maybe the term “tag” would be clearer. Let’s say I want to extract all skills from a job description. A skill can be e.g. Java or C#, as well as “software development”, “project development”, etc. I have a database of skills (ID, title, …), so I loaded all skills into Chroma:

import { OpenAIEmbeddingFunction, ChromaClient } from 'chromadb';

const embedder = new OpenAIEmbeddingFunction({ openai_api_key: OPENAI_API_KEY });
const client = new ChromaClient();

const collection = await client.getOrCreateCollection({
  name: 'skills',
  embeddingFunction: embedder
});

// for each skill
await collection.add({
  ids: [skill.id.toString()],
  metadatas: [skill],
  documents: [skill.title],
});

Then, on the other side I want to perform the extraction:

import { RetrievalQAChain } from 'langchain/chains';
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { Chroma } from 'langchain/vectorstores/chroma';

const embedding = new OpenAIEmbeddings({ openAIApiKey: OPENAI_API_KEY });
const vectordb = new Chroma(embedding, { collectionName: 'skills' });

const chain = RetrievalQAChain.fromLLM(
  new ChatOpenAI({ modelName: 'gpt-3.5-turbo', maxTokens: 2000, openAIApiKey: OPENAI_API_KEY }),
  vectordb.asRetriever(),
  { returnSourceDocuments: true }
);

const { text, sourceDocuments } = await chain.call({
  query: `Based on a given context, extract all keywords from the following text: """
  Our Requirements:

  - Bachelor degree in Electrical Engineering or Computer Engineering or equivalent knowledge in the application of software engineering principles, theories, concepts, and techniques
  - Knowledge of C++ and Python
  - 4+ years of experience with software development
  - Fluent in English
  """`
});

Simple prompt with no other instructions.
Sample result:

Keywords extracted from the text:

- Requirements
- Bachelor degree
- Electrical Engineering
- Computer Engineering
- equivalent knowledge
- application
- software engineering principles
- theories
- concepts
- techniques
- C++
- Python
- experience
- software development
- Fluent
- English
[
  Document { pageContent: 'Data Extraction', metadata: {} },
  Document { pageContent: 'Text Mining', metadata: {} },
  Document { pageContent: 'Computer Engineering', metadata: {} },
  Document { pageContent: 'Software Engineering', metadata: {} }
]

There are only 5 real matches: Electrical Engineering, Computer Engineering, C++, Python, and software development.
All the other keywords are completely out of the given context, especially e.g. “Requirements”, “Bachelor degree”, “Fluent”, etc. The same goes for the returned source documents: the source text does not say anything about “Data Extraction” or “Text Mining”.

So my question is mainly - is this the right approach?


You could be a little less lazy: define exactly what you need and do it completely without embeddings.

A CV is normally a small amount of data, so you can do it with prompt engineering like this:

"Take this data I somehow magically extracted from a CV and restructure it in the following way:

[CV - raw]

{
  "firstname": "",
  "lastname": "",

  "experiences": [ "skill1": "", …
}

You should also ask for the average time on the job. If it is less than 5 years, just return { "useless candidate" }.

In your toolchain look for keyword matching algorithms like BM25.

I’m building my own BM25 variant called MIX. The motivation is to retrieve keyword-intensive information directly, instead of relying only on embeddings.
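For reference, classic BM25 is simple enough to sketch in a few lines. This is a minimal, illustrative scorer over pre-tokenized documents (not the MIX variant mentioned above, and the parameter defaults are just the common textbook values):

```typescript
// Minimal BM25 scorer over pre-tokenized documents (illustrative sketch).
type TokenizedDoc = string[];

function bm25Score(
  query: string[],
  doc: TokenizedDoc,
  corpus: TokenizedDoc[],
  k1 = 1.5,
  b = 0.75,
): number {
  const N = corpus.length;
  const avgdl = corpus.reduce((sum, d) => sum + d.length, 0) / N;
  let score = 0;
  for (const term of query) {
    // Document frequency: how many docs contain the term at all.
    const df = corpus.filter(d => d.includes(term)).length;
    if (df === 0) continue;
    const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
    // Term frequency within this document.
    const tf = doc.filter(t => t === term).length;
    score += (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}
```

Ranking all skill titles against a job description with a scorer like this gives a cheap keyword-sensitive shortlist before any model call.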

I’m afraid this is a different use case: I need to extract keywords (skills) that come from a fixed set of possibilities (a database of skills). The source of the extraction is unstructured text (e.g. a job description).

How many key phrases/words are there? The models perform well with fewer than 10, and super well with 5 or fewer. You can think of every keyword/phrase as its own task, and the models do worse the more of them there are.

Potentially you could batch this: look for a key group of 5, and then another key group of 5, in several passes. Might be worth exploring.
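The multi-pass idea could be sketched like this: split the candidate skill list into small batches and build one extraction prompt per batch, so each model call only has to consider about 5 skills at a time. The helper names and the prompt wording here are made up for illustration:

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build one extraction prompt per batch of candidate skills.
function buildPrompts(skills: string[], jobText: string, batchSize = 5): string[] {
  return chunk(skills, batchSize).map(
    batch =>
      `Which of the following skills are required in the text below? ` +
      `Skills: ${batch.join(', ')}\nText: """${jobText}"""`,
  );
}
```

Each prompt would then be sent as its own model call, and the yes-answers merged into one result set.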

Well, the database of skills contains ca. 5000 skills. More than 90% of the skills are one or two words.


Ok, so you potentially have more skills than the model can accept within its maximum number of tokens per call, so that will not work. We need to think about how to reduce the search space to something manageable for the model; it can do a great job at natural-language processing, but it is very bad at traditional database-style tasks.
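One way to sketch that search-space reduction: query the vector store for the k nearest skills first, keep only those under a distance cutoff, and hand just that shortlist to the model instead of all 5000 titles. The result shape below mirrors one entry of Chroma's `query()` response (parallel `documents`/`distances` arrays per query text); the cutoff value is an arbitrary illustrative number:

```typescript
// One query's worth of results, in the shape Chroma's query() returns.
interface SkillQueryResult {
  documents: string[];
  distances: number[];
}

// Keep only the candidates whose embedding distance is below the cutoff.
function shortlistCandidates(result: SkillQueryResult, maxDistance: number): string[] {
  return result.documents.filter((_, i) => result.distances[i] <= maxDistance);
}

// e.g. feed in the result of
//   collection.query({ queryTexts: [jobText], nResults: 50 })
// and put only the shortlist into the extraction prompt.
```

The model then only has to confirm or reject a few dozen candidates instead of searching a 5000-entry "database" it cannot see.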

I think people are getting confused by your use of the term “keywords”; you should probably just say hashtags or tags.

That said: from what you have described, this looks like a simple string search in the job description for your 5000 strings, no? (I am sure Python has some reverse search library where you can load the 5000 hashtags and then search.)
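A plain string search along these lines would already cover exact, case-insensitive, whole-word matches of the 5000 titles. This is a sketch, not a tuned implementation; note the regex escaping so titles like `C++` survive as patterns:

```typescript
// Case-insensitive whole-word search of every skill title in the text.
function matchSkills(skillTitles: string[], text: string): string[] {
  const lower = text.toLowerCase();
  return skillTitles.filter(title => {
    // Escape regex metacharacters so titles like "C++" or "C#" are safe.
    const escaped = title.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // (^|\W) and (\W|$) approximate word boundaries; unlike \b, this also
    // works for titles ending in non-word characters such as "C++".
    return new RegExp(`(^|\\W)${escaped}(\\W|$)`).test(lower);
  });
}
```

With 5000 titles this naive loop is still fast enough for a single job description; a trie or Aho-Corasick automaton would be the next step if it ever became a bottleneck.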

Yes, you can do this task easily, e.g. with Elasticsearch (fuzzy search, Hunspell, etc.), but it’s still a “stupid” form of extraction. You need a deep understanding of the source text.
For example, when the source text contains “experience with database management is mandatory”, then the skill “database” (or the full “database management”) is okay, especially when the text describes some IT job. But the same is not true for a sentence like “clients database management” in the case of a job in the Sales field.
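A possible middle ground, sketched below: keep the cheap string/fuzzy match for candidate generation, then ask the model a yes/no question per candidate so the surrounding context is taken into account. Only a hypothetical prompt builder is shown; the model call itself is omitted, and the prompt wording is just an illustration:

```typescript
// Build a per-candidate verification prompt so the model judges the skill
// in the context of the full job description, not in isolation.
function verificationPrompt(skill: string, jobText: string): string {
  return [
    `Job description: """${jobText}"""`,
    `Is "${skill}" required as a professional skill in this job description?`,
    `Take the context into account: e.g. "clients database management" in a`,
    `sales job does not imply the IT skill "database management".`,
    `Answer only "yes" or "no".`,
  ].join('\n');
}
```

One prompt per shortlisted candidate keeps each model call small and makes the answers easy to parse.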


Then you should first classify whatever you are trying to analyze, and use a knowledge graph or any kind of tree with attributes.

Could be done with SQL.

Create a word bubble where each word is connected to its synonyms.

Then remove stopwords.

Then calculate the synonym group density for all words that are left.

You do that on thousands of documents and then search for the ones with the highest similarity and group them together (the wider you leave it, the fewer groups you have).

And then you could use the combined word bubbles, asking GPT-4 how it would classify this group.

And then you take one of each group and label it manually as one of the category names (e.g. Backend Developer or Sales Manager).

Yeah, I know that’s really old-school stuff, but it should still work like it did >15 years ago…

I mean, for each new document that is added you will then have a classification based on similarity, and it will automatically be added to the right group.

And from there you can go for skills that should be found in that group.
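The old-school pipeline described above could be sketched roughly like this; the stopword list and synonym groups are tiny made-up examples, and a real version would need proper lexicons:

```typescript
// Map words to synonym groups, drop stopwords, and compute each group's
// density (share of remaining words) in a document.
const STOPWORDS = new Set(['the', 'a', 'an', 'and', 'of', 'in', 'with', 'is']);

// word -> synonym-group id (hypothetical groups for illustration)
const SYNONYM_GROUP: Record<string, string> = {
  developer: 'dev', programmer: 'dev', coder: 'dev',
  sales: 'sales', selling: 'sales',
};

function groupDensity(text: string): Record<string, number> {
  const words = text
    .toLowerCase()
    .split(/\W+/)
    .filter(w => w.length > 0 && !STOPWORDS.has(w));
  const counts: Record<string, number> = {};
  for (const w of words) {
    const group = SYNONYM_GROUP[w] ?? w; // ungrouped words stand alone
    counts[group] = (counts[group] ?? 0) + 1;
  }
  for (const g of Object.keys(counts)) counts[g] /= words.length;
  return counts;
}
```

Documents with similar density vectors would then be clustered together, and each cluster labeled once (e.g. “Backend Developer”) as described above.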

https://chat.openai.com/share/e37b825b-2ff9-4cc3-98f5-87083841d16e

you are welcome

Ah, and here is something I fiddled around with last year:

#!/usr/bin/env bash
set -x
gs -sDEVICE=tiffg4 -o "output.tiff" "$1"
tesseract "output.tiff" "test" -l eng --psm 6 hocr
export OPENAI_API_KEY='.....'
model=text-davinci-003
yaml="" # ..... - create a yaml with structured CV and job description
request="here you need a prompt that works.. you can do it! \n\n ${yaml}"
hocrprompt="${request//'\n'/ }"
hocrprompt="${hocrprompt//\\n/ }"
hocrprompt="${hocrprompt//\'/ }"
hocrprompt="${hocrprompt//\"/ }"
response=$(curl -s https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d "{\"model\": \"${model}\", \"prompt\": \"${hocrprompt}\", \"temperature\": 0.5, \"max_tokens\": 4096}")
echo "${hocrprompt}" | jq . >> ~/.openai-gpt-3-history
echo ";\n\n\n\n" >> ~/.openai-gpt-3-history
echo "$response" | jq . >> ~/.openai-gpt-3-history
echo ";\n\n\n\n" >> ~/.openai-gpt-3-history
echo "$response" | jq -r .choices[0].text

I mean, it may not fit your use case. But I bet GPT-4 already knows your keywords.