Extract reference list from a scientific publication through RAG

malee1382 · March 7, 2024, 10:26am

I am quite new to RAG implementations. What I am trying to do is take the body text of a scientific publication, split it into chunks, and index them. Then I create a retriever and try to get all the references listed under the References section. Here is my code (the code below uses Mistral, but I could not get the full list with GPT models either):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding

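import os

# Assumes the key is exported as the MISTRAL_API_KEY environment variable.
mistral_api_key = os.environ["MISTRAL_API_KEY"]

llm = MistralAI(api_key=mistral_api_key, model="mistral-small")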
embed_model = MistralAIEmbedding(model_name='mistral-embed', api_key=mistral_api_key)

Settings.llm = llm
Settings.embed_model = embed_model


# maximum input size to the LLM
Settings.context_window = 4096

# number of tokens to leave room for the LLM to generate
Settings.num_output = 5000

# Settings.chunk_size = 512
# Settings.chunk_overlap = 50

# the text is coming from the link: https://www.mdpi.com/2227-7390/9/23/3034
documents = SimpleDirectoryReader("/kaggle/input/rag-sample/").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()


from llama_index.core.retrievers import VectorIndexRetriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
)

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: \
"""
)

query_str = (
    "list all the references mentioned under the references section"
)

retrieved_nodes = retriever.retrieve(query_str)

def generate_response(retrieved_nodes, query_str, qa_prompt, llm):
    context_str = "\n\n".join([r.get_content() for r in retrieved_nodes])
    fmt_qa_prompt = qa_prompt.format(
        context_str=context_str, query_str=query_str
    )
    response = llm.complete(fmt_qa_prompt)
    return str(response), fmt_qa_prompt

response, fmt_qa_prompt = generate_response(
    retrieved_nodes, query_str, qa_prompt, Settings.llm
)

print(f"*****Response******:\n{response}\n\n")

This almost always returns an incomplete list. When I check the text in retrieved_nodes, I can see that all the references are in it, but the chunks containing references are far apart from each other. The prompt looks something like this (fmt_qa_prompt.split("\n")):

['Context information is below.\n---------------------\n[Google Scholar] [CrossRef]',
 'Guizzardi, A.; Pons, F.M.E.; Angelini, G.; Ranieri, E. Big data from dynamic pricing: A smart approach to tourism demand forecasting. Int. J. Forecast. 2021, 37, 1049–1060. [Google Scholar] [CrossRef]',
 'Priestley, M.B. Spectral Analysis and Time Series; Academic Press: London, UK, 1981; Volumes I and II. [Google Scholar]',
 'Jenkins, G.M.; Watts, D.G. Spectral Analysis and Its Applications; Holden-Day: San Francisco, CA, USA, 1986. [Google Scholar]',
 'Brockwell, P.; Davis, R. Time Series: Theory and Methods; Springer Series in Statistics; Springer: New York, NY, USA, 1987. [Google Scholar]',
 'Meynard, A.; Torrésani, B. Spectral estimation for non-stationary signal classes. In Proceedings of the 2017 International Conference on Sampling Theory and Applications (SampTA), Tallinn, Estonia, 3–7 July 2017; pp. 174–178. [Google Scholar]',
 'Bruscato, A.; Toloi, C.M.C. Spectral analysis of non-stationary processes using the Fourier transform. Braz. J. Probab. Stat. 2004, 18, 69–102. [Google Scholar]',
 'Priestley, M. Power spectral analysis of non-stationary random processes. J. Sound Vib. 1967, 6, 86–97. [Google Scholar] [CrossRef]',
 'Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series. Phys. Lett. A 1994, 185, 77–87. [Google Scholar] [CrossRef]',
 'Wolf, A.; Swift, J.B.; Swinney, H.L.; Vastano, J.A. Determining Lyapunov exponents from a time series. Phys. D Nonlinear Phenom. 1985, 16, 285–317. [Google Scholar] [CrossRef] [Green Version]',
...
...
'. In Figure 10, we observe a sharp increase at the beginning of the curve for',
 '𝜀=2500',
 'that is not present in the other curves. This is probably due to measurement noise. Indeed, when',
 '𝜀',
 'is of the order of the noise level, some points that are inside the balls of radius',
 '𝜀',
 'would be outside the balls if the noise were suppressed. Then, their real distance is larger than',
 '𝜀',
...
...
'Willer, H.; Schaak, D.; Lernoud, J. Organic farming and market development in Europe and the European Union. In Organics International: The World of Organic Agriculture; FiBL; IFOAM—Organics International: Frick, Switzerland; Bonn, Germany, 2018; pp. 217–250. Available online: https://orgprints.org/id/eprint/31187/ (accessed on 22 November 2021).',
 'Selvaraj, J.J.; Arunachalam, V.; Coronado-Franco, K.V.; Romero-Orjuela, L.V.; Ramírez-Yara, Y.N. Time-series modeling of fishery landings in the Colombian Pacific Ocean using an ARIMA model. Reg. Stud. Mar. Sci. 2020, 39, 101477. [Google Scholar] [CrossRef]',
 'Wang, M. Short-term forecast of pig price index on an agricultural internet platform. Agribusiness 2019, 35, 492–497. [Google Scholar] [CrossRef]',
 'Mehmood, Q.; Sial, M.; Riaz, M.; Shaheen, N. Forecasting the Production of Sugarcane Crop of Pakistan for the Year 2018–2030, Using Box-Jenkin’s Methodology. J. Anim. Plant Sci. 2019, 5, 1396–1401. [Google Scholar]',...

As the output shows, the context starts with some references, then contains unrelated body text, and then more references show up.
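
To quantify how references and body text are interleaved in the context, its lines can be sorted into reference-like lines and everything else. The following is a minimal sketch and not part of the original code: `split_reference_lines` is a hypothetical helper, and the pattern is a heuristic that assumes entries contain a four-digit year and end with a tag such as "[Google Scholar]", as on the MDPI page used here. Entries without such a tag (for example the one ending in "accessed on 22 November 2021") would be missed.

```python
import re

# Heuristic (an assumption): a reference entry contains a year (19xx or 20xx)
# and ends with a "[Google Scholar]", "[CrossRef]" or "[Green Version]" tag.
REFERENCE_LINE = re.compile(
    r"\b(19|20)\d{2}\b.*\[(Google Scholar|CrossRef|Green Version)\]\s*$"
)

def split_reference_lines(context_str):
    """Partition the non-empty lines of a context string into
    reference-like lines and all other lines."""
    references, other = [], []
    for line in context_str.splitlines():
        line = line.strip()
        if not line:
            continue
        if REFERENCE_LINE.search(line):
            references.append(line)
        else:
            other.append(line)
    return references, other

# Three lines taken from the prompt shown above.
sample = "\n".join([
    "Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series. Phys. Lett. A 1994, 185, 77–87. [Google Scholar] [CrossRef]",
    "that is not present in the other curves. This is probably due to measurement noise. Indeed, when",
    "Wang, M. Short-term forecast of pig price index on an agricultural internet platform. Agribusiness 2019, 35, 492–497. [Google Scholar] [CrossRef]",
])
refs, other = split_reference_lines(sample)
print(len(refs), len(other))  # 2 1
```

Applied to the joined content of retrieved_nodes, the two counts give a rough idea of how much of the context is unrelated body text.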

Even though I set Settings.num_output = 5000, I still could not get all the references.
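
A note on these settings: as far as I understand, context_window and num_output share a single budget, so the tokens available for the prompt are roughly context_window minus num_output. With 4096 and 5000 that budget is negative. The sketch below only illustrates the arithmetic. `prompt_budget` is a hypothetical helper, not a llama_index function, and the value 1024 is just an example.

```python
def prompt_budget(context_window, num_output, template_tokens=0):
    """Tokens left for retrieved context after reserving room for the
    answer and for the fixed part of the prompt template."""
    return context_window - num_output - template_tokens

print(prompt_budget(4096, 5000))  # -904: the reserved output alone exceeds the window
print(prompt_budget(4096, 1024))  # 3072 tokens left for template plus context
```

Also, generate_response calls llm.complete directly, so these Settings values probably do not affect that call. The completion length there is more likely governed by the LLM object's own output-token limit.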

What would be the best approach for the task? Is this where re-ranking should step in?
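
For context on the question: top-k similarity retrieval returns only the k most similar chunks, and re-ranking only reorders chunks that were already retrieved, so neither is designed to enumerate every entry of a long list. If the goal is just the reference list, one alternative is to skip retrieval and slice the document text at the "References" heading. This is a rough sketch under the assumption that the loaded text contains a line consisting only of the word "References". `extract_references` is a hypothetical helper, not a llama_index function.

```python
def extract_references(full_text, heading="References"):
    """Return the non-empty lines that follow the last line equal to `heading`.

    Returns an empty list if the heading is not found.
    """
    lines = full_text.splitlines()
    # Search from the end so that an earlier mention of the heading
    # (for example in a table of contents) is skipped.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip().lower() == heading.lower():
            return [ln.strip() for ln in lines[i + 1:] if ln.strip()]
    return []

# Tiny stand-in for the publication text (entries shortened for illustration).
sample = (
    "Introduction\n"
    "Some body text.\n"
    "References\n"
    "Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series.\n"
    "\n"
    "Wang, M. Short-term forecast of pig price index on an agricultural internet platform.\n"
)
refs = extract_references(sample)
print(len(refs))  # 2
```

With the code above, the full text could presumably be built by joining the content of the loaded documents. The extracted lines can then be used directly or passed to the LLM in batches for cleanup.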
