Extract reference list from a scientific publication through RAG

I am quite new to RAG implementations. What I am trying to do is take the body text of a scientific publication, split it into chunks, and index them. Then I create a retriever and try to get all the references listed under the References section. Here is my code (the code below uses Mistral, but I could not get the full list with GPT models either):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding

llm = MistralAI(api_key=mistral_api_key,model="mistral-small")
embed_model = MistralAIEmbedding(model_name='mistral-embed', api_key=mistral_api_key)

Settings.llm = llm
Settings.embed_model = embed_model

# maximum input size to the LLM
Settings.context_window = 4096

# number of tokens to leave room for the LLM to generate
Settings.num_output = 5000

# Settings.chunk_size = 512
# Settings.chunk_overlap = 50

# the text is coming from the link: https://www.mdpi.com/2227-7390/9/23/3034
documents = SimpleDirectoryReader("/kaggle/input/rag-sample/").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

from llama_index.core.retrievers import VectorIndexRetriever
retriever = VectorIndexRetriever(index=index)

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: \
"""
)

query_str = (
    "list all the references mentioned under the references section"
)

retrieved_nodes = retriever.retrieve(query_str)

def generate_response(retrieved_nodes, query_str, qa_prompt, llm):
    context_str = "\n\n".join([r.get_content() for r in retrieved_nodes])
    fmt_qa_prompt = qa_prompt.format(
        context_str=context_str, query_str=query_str
    )
    response = llm.complete(fmt_qa_prompt)
    return str(response), fmt_qa_prompt

response, fmt_qa_prompt = generate_response(
    retrieved_nodes, query_str, qa_prompt, Settings.llm
)


This almost always returns an incomplete list. When I check the text in retrieved_nodes, all the references are there, but the chunks that contain them are far apart, with unrelated body text in between. The formatted prompt looks something like this (fmt_qa_prompt.split("\n")):

['Context information is below.\n---------------------\n[Google Scholar] [CrossRef]',
 'Guizzardi, A.; Pons, F.M.E.; Angelini, G.; Ranieri, E. Big data from dynamic pricing: A smart approach to tourism demand forecasting. Int. J. Forecast. 2021, 37, 1049–1060. [Google Scholar] [CrossRef]',
 'Priestley, M.B. Spectral Analysis and Time Series; Academic Press: London, UK, 1981; Volumes I and II. [Google Scholar]',
 'Jenkins, G.M.; Watts, D.G. Spectral Analysis and Its Applications; Holden-Day: San Francisco, CA, USA, 1986. [Google Scholar]',
 'Brockwell, P.; Davis, R. Time Series: Theory and Methods; Springer Series in Statistics; Springer: New York, NY, USA, 1987. [Google Scholar]',
 'Meynard, A.; Torrésani, B. Spectral estimation for non-stationary signal classes. In Proceedings of the 2017 International Conference on Sampling Theory and Applications (SampTA), Tallinn, Estonia, 3–7 July 2017; pp. 174–178. [Google Scholar]',
 'Bruscato, A.; Toloi, C.M.C. Spectral analysis of non-stationary processes using the Fourier transform. Braz. J. Probab. Stat. 2004, 18, 69–102. [Google Scholar]',
 'Priestley, M. Power spectral analysis of non-stationary random processes. J. Sound Vib. 1967, 6, 86–97. [Google Scholar] [CrossRef]',
 'Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series. Phys. Lett. A 1994, 185, 77–87. [Google Scholar] [CrossRef]',
 'Wolf, A.; Swift, J.B.; Swinney, H.L.; Vastano, J.A. Determining Lyapunov exponents from a time series. Phys. D Nonlinear Phenom. 1985, 16, 285–317. [Google Scholar] [CrossRef] [Green Version]',
'. In Figure 10, we observe a sharp increase at the beginning of the curve for',
 'that is not present in the other curves. This is probably due to measurement noise. Indeed, when',
 'is of the order of the noise level, some points that are inside the balls of radius',
 'would be outside the balls if the noise were suppressed. Then, their real distance is larger than',
'Willer, H.; Schaak, D.; Lernoud, J. Organic farming and market development in Europe and the European Union. In Organics International: The World of Organic Agriculture; FiBL; IFOAM—Organics International: Frick, Switzerland; Bonn, Germany, 2018; pp. 217–250. Available online: https://orgprints.org/id/eprint/31187/ (accessed on 22 November 2021).',
 'Selvaraj, J.J.; Arunachalam, V.; Coronado-Franco, K.V.; Romero-Orjuela, L.V.; Ramírez-Yara, Y.N. Time-series modeling of fishery landings in the Colombian Pacific Ocean using an ARIMA model. Reg. Stud. Mar. Sci. 2020, 39, 101477. [Google Scholar] [CrossRef]',
 'Wang, M. Short-term forecast of pig price index on an agricultural internet platform. Agribusiness 2019, 35, 492–497. [Google Scholar] [CrossRef]',
 'Mehmood, Q.; Sial, M.; Riaz, M.; Shaheen, N. Forecasting the Production of Sugarcane Crop of Pakistan for the Year 2018–2030, Using Box-Jenkin’s Methodology. J. Anim. Plant Sci. 2019, 5, 1396–1401. [Google Scholar]',...

As you can see, the context starts with a run of references, then some unrelated body text appears, and then more references follow.
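
To make the interleaving easier to see, here is a minimal sketch that labels each line of the split prompt as a reference entry or body text and reports the runs of consecutive reference lines. The regular expression is a hypothetical heuristic based only on the output above (a "[Google Scholar]" or "[CrossRef]" label, or a year followed by citation punctuation); `label_lines` and `reference_runs` are illustrative names, not part of LlamaIndex.

```python
import re

# Hypothetical heuristic: a line looks like a reference entry if it carries a
# "[Google Scholar]" / "[CrossRef]" label or a year followed by , ; or .
REFERENCE_HINT = re.compile(r"\[Google Scholar\]|\[CrossRef\]|\b(19|20)\d{2}[,;.]")


def label_lines(lines):
    """Return (index, label) pairs, where label is 'ref' or 'text'."""
    return [
        (i, "ref" if REFERENCE_HINT.search(line) else "text")
        for i, line in enumerate(lines)
    ]


def reference_runs(lines):
    """Group consecutive 'ref' lines into inclusive (start, end) index ranges."""
    runs, start = [], None
    for i, label in label_lines(lines):
        if label == "ref" and start is None:
            start = i
        elif label != "ref" and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(lines) - 1))
    return runs


# A few lines copied from the prompt shown above.
sample = [
    "Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series. Phys. Lett. A 1994, 185, 77–87. [Google Scholar] [CrossRef]",
    ". In Figure 10, we observe a sharp increase at the beginning of the curve for",
    "that is not present in the other curves. This is probably due to measurement noise. Indeed, when",
    "Wang, M. Short-term forecast of pig price index on an agricultural internet platform. Agribusiness 2019, 35, 492–497. [Google Scholar] [CrossRef]",
]

print(reference_runs(sample))
```

On the full prompt, more than one run would confirm that the reference chunks are separated by body-text chunks.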

Even after increasing Settings.num_output to 5000, I still cannot get all the references.
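
One thing worth double-checking in the settings above is the token budget. This small sketch assumes the usual rule that prompt tokens plus the tokens reserved for output must fit inside the context window. Whether LlamaIndex enforces that rule on the llm.complete path used here is an assumption I have not verified, and `prompt_budget` is a hypothetical helper, not a library function.

```python
def prompt_budget(context_window: int, num_output: int) -> int:
    """Tokens left for the prompt after reserving room for the generated output."""
    return context_window - num_output


# Values taken from the Settings in the code above.
remaining = prompt_budget(context_window=4096, num_output=5000)
print(remaining)  # -904: the reserved output alone exceeds the context window
```

Under that assumption, reserving 5000 output tokens inside a 4096-token window leaves no room for the retrieved context.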

What would be the best approach for the task? Is this where re-ranking should step in?