Best way to refine source data for LlamaIndex

I’m working on a bot for one of our websites. There’s a large amount of knowledge hidden in a decade of articles and news. So far I’ve just done a DB dump of a recent period and saved it as a CSV with columns for title, date, intro, body text, and the reference URL on the website.

I then used Python and LlamaIndex’s SimpleCSVReader to parse the CSV into documents, which I then index and save:

    from pathlib import Path
    from llama_index import GPTVectorStoreIndex, download_loader

    SimpleCSVReader = download_loader("SimpleCSVReader")
    loader = SimpleCSVReader()
    documents = loader.load_data(file=Path(f'{folder}/{fname}'))

    # from_documents() takes the documents as its first argument;
    # my original snippet was missing it.
    index = GPTVectorStoreIndex.from_documents(
                documents,
                llm_predictor=llm_predictor,
                prompt_helper=prompt_helper)

    index.storage_context.persist(persist_dir=".")  # save in current directory

I can then query it fine using a custom prompt that tells the model to only use the provided context for its answers.

This works pretty well; it’s done a great job of becoming pretty clever about the subject matter, an expert even!

Here’s where I’m stuck, though. The SimpleCSVReader is quite opaque: I don’t know how it presents the CSV as documents. Is everything just concatenated together into one big text string?

My company wants the replies to contain links to resources on our website whenever possible, and since the URL is available for each page/document, I figure this should be possible, right? But even if I tell the custom prompt to include URLs, it won’t. Yet if I ask a question like ‘provide me with links about X’, it produces a list that is correct!

I feel like I’m not formatting or structuring the source data correctly to make this happen.

I’m thinking I should split the CSV into separate documents and then use LlamaIndex’s SimpleDirectoryReader? That way I could customise each document. But what can you customise? What format should it be in? I can’t find any examples of adding embeddings or manipulating the Document object at all :frowning:
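For what it’s worth, one way to get per-article documents (rather than one big CSV blob) is to split the CSV yourself before indexing. Here’s a minimal stdlib-only sketch of that idea; the column names (`title`, `intro`, `body`, `url`) and the `split_csv_to_docs` helper are assumptions about your data, not anything LlamaIndex requires:

```python
import csv
from pathlib import Path

def split_csv_to_docs(csv_path: str, out_dir: str) -> list[Path]:
    """Write one plain-text file per CSV row, so a directory reader
    can treat each article as its own document. The URL is kept on
    the first line so it travels with the text into the index."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            text = (
                f"URL: {row['url']}\n"
                f"Title: {row['title']}\n\n"
                f"{row['intro']}\n\n"
                f"{row['body']}"
            )
            path = out / f"article_{i:05d}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written
```

You could then point SimpleDirectoryReader at `out_dir`; since each document now carries its own URL in its text, the retrieved context actually contains the link you want quoted back.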

Or should I convert them to JSON, like training data, in one big file? Something like this?

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

I’m a bit confused, as I can’t find anything about the document format in LlamaIndex’s docs, or whether it’s the same as OpenAI’s training-data format.

And finally, how in either of these systems would I encourage the replies to contain references to their source?

Any guidance and help would be very much appreciated! Thanks!

(sorry I can’t include links as it’s my first post!)

Hi @Ralpharama

I haven’t used LlamaIndex, but if I understood correctly, your use-case is factual answers (including references, links, etc.). For such cases, embeddings are a better fit than fine-tuning, and much more economical as well.

You’ll have to map the individual pages to their links, so that when the relevant pages are retrieved, you can pass along the links mapped to them in the response.
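To make that mapping concrete, here is a minimal stdlib-only sketch: keep a side table from document id to URL, and append the URLs of whichever documents were retrieved to the generated answer. The ids, URLs, and the `format_answer` helper are hypothetical names for illustration, not part of any library:

```python
# Hypothetical side table: document id -> URL on the website.
PAGE_URLS = {
    "doc-1": "https://example.com/articles/widgets",
    "doc-2": "https://example.com/news/gadgets",
}

def format_answer(answer: str, retrieved_ids: list[str]) -> str:
    """Append source links for the retrieved documents. The links come
    from our own mapping, not from the model's generated text."""
    links = [PAGE_URLS[i] for i in retrieved_ids if i in PAGE_URLS]
    if not links:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(f"- {u}" for u in links)
```

The key point of this approach is that the links never pass through the LLM at all, so they can’t be hallucinated or silently dropped from the reply.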

Thanks @sps , I’ll give that page a read, much appreciated.