Disambiguate terms before RAG process

dariomac · May 23, 2024, 10:58pm

Hi! I’m working on a very specific RAG to answer questions about native trees.

Basically, I’m going to enter a lot of data that talks about trees using the scientific names of the trees to avoid ambiguity. My problem is that the RAG users are ordinary people, so instead of using the scientific names in the questions, they will use the common names.

Dictionary of common - scientific names:
Coronilla - Scutia buxifolia (only in Uruguay)
Coronillo - Scutia buxifolia (only in Argentina)

Arrayan - Blepharocalyx salicifolius (only in Uruguay)
Arrayan - Luma apiculata (only in Argentina)

So, I need a way to answer these kinds of questions:

How to prune a Coronilla? (should use the Scutia buxifolia information)
How to prune a Coronillo? (the same as above)
I’m in Uruguay. How to prune a Arrayan? (should use the Blepharocalyx salicifolius information)
I’m in Argentina. How to prune a Arrayan? (should use the Luma apiculata information)

How can I achieve this?

jr.2509 · May 24, 2024, 5:06am

Welcome to the Forum!

I may not appreciate the full scope of the problem, but what you could try is adding the respective country as metadata. This would enable you to pre-filter results based on a user’s location and narrow down the choices. Of course, as part of the operation you would need to extract the user’s location from the query, but that should be pretty straightforward.
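A minimal sketch of that pre-filter, assuming each chunk carries a set of countries as metadata. The `Chunk` class, `KNOWN_COUNTRIES`, `extract_country`, and `prefilter` are hypothetical names for illustration; a real vector database would apply the same condition through its own metadata filter, and the location extraction here is a plain substring check rather than anything robust.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Countries where this chunk applies; an empty set means "applies everywhere".
    countries: set = field(default_factory=set)

# Hypothetical list of countries the knowledge base covers.
KNOWN_COUNTRIES = ("uruguay", "argentina")

def extract_country(query):
    """Return the first known country mentioned in the query, or None."""
    lowered = query.lower()
    for country in KNOWN_COUNTRIES:
        if country in lowered:
            return country
    return None

def prefilter(chunks, country):
    """Keep chunks tagged with the country, plus chunks with no country tag."""
    if country is None:
        return list(chunks)
    return [c for c in chunks if not c.countries or country in c.countries]
```

The filtered list would then go to the similarity search, so the vector comparison only sees chunks that are valid for the user's country.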

dariomac · May 24, 2024, 12:17pm

Thanks Shawn! Although your answer is a ChatGPT-4o copy and paste, splitting the common names by country is a good idea. I’ve already tested it and it works great.

However, my question was more intended to find out if I was leaving out something obvious, some kind of strategy that is common to those who have been working with ML/AI longer than I have.

A RAG has a “well established” process (basically ingest the raw data, then retrieve the relevant information using the user’s query, and finally augment the LLM prompt).

In this process, how do you handle “synonyms”? (For example, you ingest material that uses fully developed concepts, but those concepts have common acronyms that the people asking questions are likely to use instead.) One strategy is replacing the common terms (e.g. acronyms, trees’ common names) using a dictionary (as you proposed)… Am I missing any other solutions?
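A minimal sketch of the dictionary strategy, assuming the alias table from the first post and a country that is either known or absent. `ALIASES`, `normalize`, `candidates`, and `rewrite_query` are hypothetical names. An ambiguous common name with no country is deliberately left untouched, so the application can ask the user where they are.

```python
import re
import unicodedata

# Hypothetical alias table built from the dictionary in the first post:
# (common name, country) -> scientific name.
ALIASES = {
    ("coronilla", "uruguay"): "Scutia buxifolia",
    ("coronillo", "argentina"): "Scutia buxifolia",
    ("arrayan", "uruguay"): "Blepharocalyx salicifolius",
    ("arrayan", "argentina"): "Luma apiculata",
}

def normalize(word):
    """Lowercase and strip accents so 'Arrayán' and 'Arrayan' match."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def candidates(name):
    """All scientific names a common name may refer to, across countries."""
    return {sci for (common, _), sci in ALIASES.items() if common == name}

def rewrite_query(query, country=None):
    """Replace known common names in the query with scientific names."""
    country_key = country.lower() if country else None

    def substitute(match):
        word = match.group(0)
        name = normalize(word)
        if country_key and (name, country_key) in ALIASES:
            return ALIASES[(name, country_key)]
        options = candidates(name)
        if len(options) == 1:
            # Unambiguous across all countries, so no location is needed.
            return next(iter(options))
        return word  # unknown word, or ambiguous without a country

    # Compose accents first so each accented word is matched as one token.
    return re.sub(r"\w+", substitute, unicodedata.normalize("NFC", query))
```

The rewritten query, not the original one, is what would be embedded for the vector search, since the chunks are written with scientific names.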

dariomac · May 24, 2024, 12:24pm

Thanks Jen. Adding the country to the metadata may work for filtering in a small data universe. However, it does not solve the problem of a user searching using common tree names instead of scientific names.

A simplified example: let’s say the user has their country in their profile and searches for “How to prune a Arrayan”.

I can filter the vector database by the country in the metadata, but then nothing will match the term Arrayan because all the data is ingested using scientific terms. I could ingest data using common names, but there are a bunch of common names per tree… that’s why I want to use scientific names.

jr.2509 · May 24, 2024, 12:31pm

Right. The other thing you can of course do is ensure that both terms, the scientific and the non-scientific one, are included in the text you embed. Normally this should help get relevant results back during the similarity search.
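One way to read this suggestion, as a sketch: keep the stored chunk text as it is, but append the known common names to the text that gets embedded. The `COMMON_NAMES` table and `text_for_embedding` are hypothetical; the entries come from the dictionary in the first post.

```python
# Hypothetical table: scientific name -> common names, with the country
# where each common name is used.
COMMON_NAMES = {
    "Scutia buxifolia": ["Coronilla (Uruguay)", "Coronillo (Argentina)"],
    "Blepharocalyx salicifolius": ["Arrayan (Uruguay)"],
    "Luma apiculata": ["Arrayan (Argentina)"],
}

def text_for_embedding(chunk_text):
    """Append the known common names for every species the chunk mentions."""
    lowered = chunk_text.lower()
    notes = []
    for scientific, commons in COMMON_NAMES.items():
        if scientific.lower() in lowered:
            notes.append(f"{scientific} is also known as: {', '.join(commons)}.")
    if not notes:
        return chunk_text
    return chunk_text + "\n" + " ".join(notes)
```

Because the alias note is added at ingestion time, the original documents can stay written with scientific names only, and a new common name only requires updating the table and re-embedding the affected chunks.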

_j · May 24, 2024, 1:28pm

The thing to understand about AI is how much it already understands. Whether you use embeddings or language inference, the AI already has a strong grasp of semantic meanings and synonyms.

You only need to extend this with concrete information, supplied in context at question time, to make sure it can answer.

I found a few PDFs and uploaded them to ChatGPT to make a tree GPT. I gave it instructions to be aware of local common names and to translate them to the correct species.

Then we try the samples above.

I’m in Argentina. How to prune a Arrayan?
(should use the Luma apiculata information per first post)

The Arrayán, known scientifically as Luma apiculata, is a native tree in Argentina and other parts of South America. Pruning this tree is essential for maintaining its shape, health, and aesthetic appeal.
Steps for Pruning Arrayán (Luma apiculata)

Timing:
    The best time to prune an Arrayán is in late winter or early spring, just before new growth begins. This allows the tree to recover quickly and produce vigorous new growth.

Tools:
    Use clean, sharp pruning shears for small branches.
    Loppers for medium-sized branches.
    A pruning saw for larger branches.

Basic Pruning Principles:
    Sanitation: Always start by removing any dead, diseased, or damaged branches. This helps prevent the spread of disease and pests.
    Thinning: Thin out crowded areas to improve air circulation and light penetration, which can help prevent disease.
    Shaping: Shape the tree to maintain its natural form. Avoid drastic changes that can stress the tree.
    Suckers and Water Sprouts: Remove suckers (shoots growing from the base of the tree) and water sprouts (vigorous vertical shoots from the trunk or branches) as they appear.

Steps to Prune:
    Assess the Tree: Before cutting, step back and look at the overall shape and structure. Identify which branches need to be removed or trimmed.
    Remove Dead and Diseased Wood: Cut back to healthy wood or to the main branch or trunk.
    Thin Out: Selectively remove some branches that cross or are too close together to reduce density.
    Cut Back to Buds: When shortening branches, make cuts just above a bud facing the direction you want new growth to take.
    Maintain Shape: Follow the natural contour of the tree. If the tree is young, select a strong central leader (main upward branch) and remove competing leaders.

Aftercare:
    Water the tree well after pruning.
    Mulch around the base to retain moisture and suppress weeds.
    Monitor for any signs of stress or disease, and take action if necessary.

By following these steps, you can help ensure your Arrayán remains healthy and attractive. Regular pruning not only enhances the tree’s appearance but also promotes better air circulation and sunlight exposure, contributing to the tree’s overall health.

I’m in Uruguay. How to prune a Arrayan?

(should use Blepharocalyx salicifolius)

The Arrayán tree in Uruguay, commonly referred to as “Blepharocalyx salicifolius,” is known for its attractive and aromatic qualities. Pruning this tree is essential to maintain its shape, health, and promote better fruit production. Here’s a guide on how to prune an Arrayán (Blepharocalyx salicifolius):
Pruning Guide for Arrayán (Blepharocalyx salicifolius)

Timing:
    The best time to prune the Arrayán is during the late winter or early spring before new growth begins. This timing minimizes stress on the tree and promotes vigorous growth in the upcoming season.

Tools Needed:
    Clean, sharp pruning shears for small branches.
    Loppers for medium-sized branches.
    A pruning saw for larger branches.

Steps:
    Remove Dead or Diseased Wood:
        Start by cutting away any dead, diseased, or damaged branches. Make the cut at least 6 inches below the affected area into healthy wood.
    Thin Out Crowded Areas:
        Identify areas where branches are crowded and remove some to improve air circulation and light penetration. Focus on removing branches that are crossing or rubbing against each other.
    Shape the Tree:
        Prune to shape the tree, maintaining a natural form. Remove any branches that are growing inward toward the center of the tree.
        Trim back overextended branches to maintain a balanced shape. Make cuts just above a bud that is facing the direction you want the new growth to follow.
    Encourage Fruit Production:
        For better fruiting, thin out some of the older wood to encourage the growth of new, fruit-bearing branches. Cut these older branches back to the main trunk or to a lateral branch.
    Maintain the Height and Spread:
        If the tree is getting too tall or wide, reduce the height and spread by cutting back the longest branches to a lateral branch or a bud.
    Clean Up:
        Remove all pruned material from around the base of the tree to prevent disease and pests. Dispose of the pruned branches properly.

Additional Tips:

Always make clean cuts at a slight angle, just above a bud or branch junction.
Disinfect your tools between cuts, especially when dealing with diseased branches, to prevent the spread of pathogens.
Avoid heavy pruning in one season; it's better to spread major pruning tasks over a few years.

By following these guidelines, you can keep your Arrayán tree healthy, aesthetically pleasing, and productive.

Continuing the Uruguay chat, “How about Coronilla?”

In Uruguay, “Coronilla” typically refers to Scutia buxifolia, a native tree or shrub known for its dense foliage and thorny branches. Pruning Coronilla is important to manage its growth, maintain its shape, and enhance its health and vigor. Here’s a detailed guide on how to prune Coronilla (Scutia buxifolia):
Pruning Guide for Coronilla (Scutia buxifolia)

Timing:
    The best time to prune Coronilla is in late winter or early spring before new growth starts. This minimizes stress on the plant and promotes robust growth in the new season.

Tools Needed:
    Sharp, clean pruning shears for smaller branches.
    Loppers for medium-sized branches.
    Pruning saw for larger branches.
    Gloves and protective clothing to handle thorny branches.

Steps:
    Remove Dead, Diseased, or Damaged Wood:
        Begin by cutting away any dead, diseased, or damaged branches. Make clean cuts into healthy wood, at least 6 inches below the affected area.
    Thin Out Dense Growth:
        Identify and remove branches that are overcrowded to improve air circulation and light penetration. This helps reduce the risk of disease and promotes healthy growth.
    Shape the Plant:
        Prune to maintain a natural shape. Remove any branches growing inward or crossing over others. This prevents rubbing and potential injury to the branches.
    Control Size and Spread:
        If the Coronilla is getting too large, reduce its size by cutting back the longest branches to a lateral branch or bud. Aim to maintain a balanced shape.
    Promote New Growth:
        Encourage the growth of new shoots by cutting back some of the older, woodier branches. This stimulates the plant to produce fresh, vigorous growth.
    Sanitize Tools:
        Disinfect your pruning tools between cuts, especially when cutting diseased branches, to prevent the spread of pathogens.

Clean Up:
    Remove all pruned material from around the base of the plant to prevent disease and pests. Dispose of the cut branches properly.

Additional Tips:

Make cuts at a slight angle just above a bud or branch junction to encourage new growth in the desired direction.
Wear gloves and long sleeves to protect yourself from the thorns.
Avoid removing more than one-third of the plant’s overall size in a single pruning session to reduce stress on the plant.

By following these steps, you can keep your Coronilla (Scutia buxifolia) healthy, well-shaped, and thriving.

So once the AI has the rules and enough retrieved material, picking the species is not the hard part: it gets the scientific and common names right (it likely already knows them; I didn’t try without documents). The real question is how accurate and tailored the pruning advice actually is.

Here is the system (GPT) instruction, just enough to make the AI work:

You are a tree and horticulture expert.
Expect the user to use the simple name of plant species and genus, and you must discover and deduce which true scientific name of species the local custom refers to. Ask for clarification if needed.
Your myfiles_browser is not user uploads: it is filled with your tree knowledge, and a search should be performed for any plant questions.
You then give advice about specific species, citing both the common name and exact latin or scientific name to make clear you have decided on the correct species.

dariomac · May 24, 2024, 4:47pm

Hey Jay! First I want to thank you for the extensive and comprehensive response. I read it several times because there is a lot of interesting information in there.

However, I feel I must apologize because as you answer my questions, my original question is mutating.

I agree that given the right context (in your example the PDFs) GPT can give the right answer based on “localised” questions. I have tried it and it certainly works.

My problem (in the context of your answer) can now be defined as… How do I choose the relevant information to send as context to GPT, as you did with the PDFs?

I have all the chunks loaded in a vector database, and obviously I have more chunks than context space, so I have to choose what to send… and those chunks use scientific names, so I still have my problem of converting the user query with common names to a query with scientific names, to find the closest vectors to send as context.

_j · May 24, 2024, 6:03pm

You have two choices to enhance the AI’s knowledge:

automatically inject knowledge related to the user’s question and some of the conversation context;
have the AI call a function to search the knowledge (which is what Assistants offers).

For the first, a commonly used technique is HyDE (hypothetical document embeddings): you use an AI to transform the user input, and perhaps the preceding turns, into a hypothetical answer. You have another AI write the answer to the question from its pre-existing knowledge alone, and then you embed that answer and search with it instead of the raw question. That AI can also be prompted to answer with both species and common names.
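A minimal sketch of the HyDE flow, with the model calls injected as plain callables so nothing here depends on a particular SDK. `hyde_search`, `generate`, `embed`, and `index` are hypothetical names: `generate` stands for any text-completion call, `embed` for any embedding call, and `index` for a list of `(vector, chunk_text)` pairs that a real vector database would replace.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_search(question, generate, embed, index, top_k=3):
    """Search with the embedding of a hypothetical answer, not the question.

    generate: str -> str           (assumed LLM call)
    embed:    str -> list[float]   (assumed embedding call)
    index:    list of (vector, chunk_text) pairs
    """
    prompt = (
        "Write a short passage answering the question from your own knowledge. "
        "Name each plant by both its common name and its scientific name.\n\n"
        f"Question: {question}"
    )
    hypothetical_answer = generate(prompt)
    query_vector = embed(hypothetical_answer)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]
```

The hypothetical answer may be wrong in its details; it is only used as search text, and the final answer is still written from the retrieved chunks.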

For the second, you can write a function whose description carries more instructions about what to search for, or guidance to produce an entire paragraph to search with.

Both of these additional AI steps take more time than simply embedding the user input, but they can produce higher quality results, because the search text looks more like the knowledge chunks that the vector database’s semantic search compares it against.

Finally, the AI is usually smart enough not to answer from the wrong country’s plant definition, even when that definition is part of the retrieved knowledge.
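A sketch of what such a function definition could look like, assuming the JSON-schema `tools` format used for function calling. The tool name `search_tree_knowledge`, its parameters, and the `handle_tool_call` helper are hypothetical; the instructions live in the description fields, which the model reads when deciding how to fill in the arguments.

```python
import json

# Hypothetical tool definition in the function-calling "tools" format.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_tree_knowledge",
        "description": (
            "Search the tree knowledge base. The knowledge base only uses "
            "scientific names, so translate any common name first, using the "
            "user's country to disambiguate."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "scientific_name": {
                    "type": "string",
                    "description": "Scientific name of the species being asked about.",
                },
                "search_text": {
                    "type": "string",
                    "description": "A paragraph describing the information to find.",
                },
            },
            "required": ["scientific_name", "search_text"],
        },
    },
}

def handle_tool_call(arguments_json, search):
    """Decode the model's JSON arguments and run the given search callable."""
    args = json.loads(arguments_json)
    query = f"{args['scientific_name']}: {args['search_text']}"
    return search(query)
```

Here `search` would be whatever embeds the query and looks up the nearest chunks; the model, not the application code, does the common-to-scientific translation when it fills in `scientific_name`.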

razvan.i.savin · May 24, 2024, 10:01pm

Maybe this will help. You reminded me of a project I did in an older course:
https://cs50.harvard.edu/ai/2020/projects/6/questions/

And here is an introduction:

In the world of text analysis and information retrieval, understanding the importance of words within documents is crucial. One widely used method to achieve this is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is a statistical measure that helps identify the significance of words in a document relative to a collection of documents (corpus). It is commonly used in search engines, text mining, and various natural language processing (NLP) applications.

Term Frequency (TF)

Term Frequency (TF) measures how often a word appears in a document. It helps to understand the importance of a word within that particular document. Common words in the document get higher scores. For example, if the word “data” appears frequently in a document, its term frequency will be high, indicating that it is an important term in that document.

Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how important a word is across multiple documents. It gives lower scores to words that appear in many documents (like “the” or “and”) and higher scores to words that are more unique to a few documents. This helps to reduce the weight of common words and emphasize unique ones. For instance, if the word “analysis” appears in only a few documents, it will have a high IDF score, highlighting its uniqueness.

TF-IDF

TF-IDF combines TF and IDF to find words that are important in a specific document but not too common in other documents. This helps to highlight the key terms that uniquely characterize the content of that document. By multiplying the term frequency by the inverse document frequency, TF-IDF balances the local importance of a term with its global uniqueness.
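As a small sketch of that multiplication, here is one common variant: term frequency as the term's count divided by the document length, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the term. Other weightings exist, and the `tf_idf` function name is just for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score terms per document.

    documents: list of token lists.
    Returns one {term: score} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

With this variant, a term that appears in every document gets an IDF of log(1) = 0, so its TF-IDF score is zero no matter how often it occurs.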

Practical Use

Imagine you have several documents and you want to find the most relevant words for each document. TF-IDF helps by scoring each word based on its importance in a document and its uniqueness across all documents. This way, you can identify the words that best represent the content of each document. For example, in a set of scientific papers, TF-IDF might highlight terms like “neural networks” or “quantum computing” as important, distinguishing them from common words like “experiment” or “results.”

Applications

Search Engines: To rank documents by how relevant they are to a user’s query. When you search for “machine learning,” documents where this term is both frequent and unique will rank higher.
Text Analysis: To extract key terms and features from documents for further analysis. This can help in summarizing articles or identifying main topics.
Recommendation Systems: To suggest documents or articles based on important terms. For instance, a news recommendation system might use TF-IDF to suggest articles that share key terms with articles a user has read previously.

By using TF-IDF, we can better understand and utilize the significance of words within documents, making it a powerful tool for analyzing and processing large amounts of textual data.

You will also have to create a tool that gives your AI access to the information on your computer, so that it always works from the latest version whenever you ask it to look something up with that tool. It is a little complex, but it is possible.
