Best way to refine source data for LlamaIndex

I’m working on a bot for one of our websites. There’s a large amount of knowledge hidden in a decade of articles and news. So far I’ve just done a DB dump of a recent period and saved it as a CSV with columns for title, date, intro, body text, and the reference URL on the website.

I then used Python and LlamaIndex’s SimpleCSVReader to parse the CSV into documents, which I then index and save:

    from pathlib import Path
    from llama_index import GPTVectorStoreIndex, download_loader

    SimpleCSVReader = download_loader("SimpleCSVReader")
    loader = SimpleCSVReader()
    documents = loader.load_data(file=Path(f'{folder}/{fname}'))

    # from_documents() takes the documents as its first argument;
    # my original snippet was missing it.
    index = GPTVectorStoreIndex.from_documents(
                documents,
                llm_predictor=llm_predictor,
                prompt_helper=prompt_helper)

    index.storage_context.persist(persist_dir=".")  # save in current directory

I can then query it fine using a custom prompt that tells the model to only use the provided context for its answers.

This works pretty well; it’s done a great job of becoming pretty clever about the subject matter, an expert even!

Here’s where I’m stuck, though. The SimpleCSVReader is quite opaque: I don’t know how it presents the CSV as documents. Is everything just concatenated together into one big text string?

My company wants the replies to contain links to resources on our website whenever possible, and since the URL is available for each page/document, I figure this should be possible, right? But even if I tell the custom prompt to include URLs, it won’t. Yet if I ask a question like ‘provide me with links about X’, it produces a list that is correct!

I feel like I’m not formatting or structuring the source data correctly to make this happen.

I’m thinking I should split the CSV into separate documents and then use LlamaIndex’s SimpleDirectoryReader? That way I could customise each document. But what can you customise? What format should it be in? I can’t find any examples of adding embeddings or manipulating the Document object at all :frowning:
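For what it’s worth, one way to get per-article documents (rather than one big CSV blob) is to split the CSV yourself before indexing. Here’s a minimal stdlib-only sketch of that idea; the column names (`title`, `intro`, `body`, `url`) and the `split_csv_to_docs` helper are assumptions about your data, not anything LlamaIndex requires:

```python
import csv
from pathlib import Path

def split_csv_to_docs(csv_path: str, out_dir: str) -> list[Path]:
    """Write one plain-text file per CSV row, so a directory reader
    can treat each article as its own document. The URL is kept on
    the first line so it travels with the text into the index."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            text = (
                f"URL: {row['url']}\n"
                f"Title: {row['title']}\n\n"
                f"{row['intro']}\n\n"
                f"{row['body']}"
            )
            path = out / f"article_{i:05d}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written
```

You could then point SimpleDirectoryReader at `out_dir`; since each document now carries its own URL in its text, the retrieved context actually contains the link you want quoted back.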

Or should I convert them to JSON, like training data, in one big file? Something like this?

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

I’m a bit confused, as I can’t find anything about the document format in LlamaIndex’s docs, or whether it’s the same as OpenAI’s training-data format.

And finally, how in either of these systems would I encourage the replies to contain references to their source?

Any guidance and help would be very much appreciated! Thanks!

(sorry I can’t include links as it’s my first post!)

Hi @Ralpharama

I haven’t used LlamaIndex, but if I understood correctly, your use-case is factual answers (including references, links, etc.). For such cases, embeddings are a better fit than fine-tuning, and much more economical as well.

You’ll have to map the individual pages to their links, so that when the relevant pages are retrieved, you can pass along the links mapped to them in the response.
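To make that mapping concrete, here is a minimal stdlib-only sketch: keep a side table from document id to URL, and append the URLs of whichever documents were retrieved to the generated answer. The ids, URLs, and the `format_answer` helper are hypothetical names for illustration, not part of any library:

```python
# Hypothetical side table: document id -> URL on the website.
PAGE_URLS = {
    "doc-1": "https://example.com/articles/widgets",
    "doc-2": "https://example.com/news/gadgets",
}

def format_answer(answer: str, retrieved_ids: list[str]) -> str:
    """Append source links for the retrieved documents. The links come
    from our own mapping, not from the model's generated text."""
    links = [PAGE_URLS[i] for i in retrieved_ids if i in PAGE_URLS]
    if not links:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(f"- {u}" for u in links)
```

The key point of this approach is that the links never pass through the LLM at all, so they can’t be hallucinated or silently dropped from the reply.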

Thanks @sps , I’ll give that page a read, much appreciated.