Need help regarding a LLM project

I have a dataset of 400 resumes in .txt format. I want to build a model that can generate responses to specific candidate queries like ‘Tell me the skillset of XYZ,’ but also handle generic queries like ‘Tell me the names of people who went to Ivy League schools.’ While RAG using OpenAI works well for candidate-specific queries, it struggles with generic ones.
What should be my further approach?

What is an example of a type of query you’re trying to do that’s not working well?

Hi, very interesting. But kind of complicated, because these queries are more like aggregation of facts about the candidates.

So personally, I would take the following approach:

1 format resumes to a more or less standard format.
2. Slice them into semantic chunks.
3. Extract entities from the resumes
4. Extract facts about the candidates
5. Create a vector database where I would store candidates, resumes, chunks from resumes, involved entities, facts
6. Think of a way to convert the human queries into “search approaches” resulting in a workflow to build queries to those tables to get context and train a model to perform the task of formulating queries to vector db.
7. Think of a prompt format for your answering model.
8. Train a model that would use the context from db to answer questions about one candidate at a time.
9. Train another model that would “aggregate” the multiple results from step #8 into final answer.

Here is the logic:

Supposing #1-#5 are done

Q: Tell me the names of people who went to Ivy League schools?

#6 → query entities for candidates related to Ivy League → from candidates results, get the schools each of them attended

#7 … Based on the provided context could you confirm this candidate attended Ivy League school? + Candidate context

#8 run prompt from #7 and get an answer in expected format for processing in step #9

#9 run prompt: based on the provided context please answer the question: $Q.

Then, it’s just a rough scheme, I’m sure you’ll have fun :blush:

1 Like