A keyword (in this case) is usually a single word, but it can contain multiple words (AFAIK up to 3-5 words) as well. Maybe the term “tag” would be clearer. Let’s say I want to extract all skills from a job description. A skill can be e.g. Java, C#, … as well as “software development”, “project development”, etc. I have a database of skills (ID, title, …), so I loaded all skills into a Chroma collection:
import { OpenAIEmbeddingFunction, ChromaClient } from 'chromadb';
const embedder = new OpenAIEmbeddingFunction({ openai_api_key: OPENAI_API_KEY });
const client = new ChromaClient();
const collection = await client.getOrCreateCollection({
  name: 'skills',
  embeddingFunction: embedder
});
// for each skill
await collection.add({
  ids: [skill.id.toString()],
  metadatas: [skill],
  documents: [skill.title],
});
Then, on the other side I want to perform the extraction:
import { RetrievalQAChain } from 'langchain/chains';
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { Chroma } from 'langchain/vectorstores/chroma';
const embedding = new OpenAIEmbeddings({ openAIApiKey: OPENAI_API_KEY });
const vectordb = new Chroma(embedding, { collectionName: 'skills' });
const chain = RetrievalQAChain.fromLLM(
  new ChatOpenAI({ modelName: 'gpt-3.5-turbo', maxTokens: 2000, openAIApiKey: OPENAI_API_KEY }),
  vectordb.asRetriever(),
  // must be true, otherwise `sourceDocuments` below is undefined
  { returnSourceDocuments: true }
);
const { text, sourceDocuments } = await chain.call({
  query: `Based on a given context, extract all keywords from the following text: """
Our Requirements:
- Bachelor degree in Electrical Engineering or Computer Engineering or equivalent knowledge in the application of software engineering principles, theories, concepts, and techniques
- Knowledge of C++ and Python
- 4 + experience with software development
- Fluent in English
"""`
});
Simple prompt with no other instructions.
Sample result:
Keywords extracted from the text:
- Requirements
- Bachelor degree
- Electrical Engineering
- Computer Engineering
- equivalent knowledge
- application
- software engineering principles
- theories
- concepts
- techniques
- C++
- Python
- experience
- software development
- Fluent
- English
[
Document { pageContent: 'Data Extraction', metadata: {} },
Document { pageContent: 'Text Mining', metadata: {} },
Document { pageContent: 'Computer Engineering', metadata: {} },
Document { pageContent: 'Software Engineering', metadata: {} }
]
There are only 5 real matches: Electrical Engineering, Computer Engineering, C++, Python, and software development.
All other keywords are completely outside the given context, especially e.g. “Requirements”, “Bachelor degree”, “Fluent”, etc. The same goes for the returned source documents: the source text does not say anything about “Data Extraction” or “Text Mining”.
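To make “real match” concrete, here is a minimal sketch of the check I have in mind: parse the bullet list from the model's answer and keep only the entries that are also titles in the skills database. The `skillTitles` array is a hypothetical stand-in for my database (the real titles and their casing may differ), and the helper names `normalize`, `parseBullets`, and `keepKnownSkills` are made up for illustration. Exact case-insensitive title matching is an assumption; it is not something the chain does on its own.

```typescript
// Hypothetical stand-in for the skills table; real titles come from the database.
const skillTitles: string[] = [
  'Electrical Engineering',
  'Computer Engineering',
  'Software Engineering',
  'Software Development',
  'C++',
  'Python',
  'Data Extraction',
  'Text Mining',
];

// Normalize for comparison: trim, lowercase, collapse runs of whitespace.
function normalize(s: string): string {
  return s.trim().toLowerCase().replace(/\s+/g, ' ');
}

// Pull the "- item" bullet lines out of the model's plain-text answer.
function parseBullets(answer: string): string[] {
  return answer
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('- '))
    .map((line) => line.slice(2).trim());
}

// Keep only the keywords whose normalized form is a known skill title.
function keepKnownSkills(keywords: string[], titles: string[]): string[] {
  const known = new Set(titles.map(normalize));
  return keywords.filter((k) => known.has(normalize(k)));
}

// Shortened copy of the sample answer from above.
const answer = `Keywords extracted from the text:
- Requirements
- Bachelor degree
- Electrical Engineering
- Computer Engineering
- C++
- Python
- experience
- software development
- Fluent
- English`;

const matches = keepKnownSkills(parseBullets(answer), skillTitles);
console.log(matches);
```

With this stand-in list, entries such as “Requirements”, “Bachelor degree”, and “Fluent” are dropped because they are not skill titles. This is only a post-filter on the answer; it does not explain why the retriever returned unrelated source documents.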
So my main question is: is this the right approach?