I’m working on a bot for one of our websites; there’s a large amount of knowledge hidden in a decade of articles and news. So far I’ve just done a DB dump of a recent period and saved it as a CSV with columns for title, date, intro, body text, and reference URL on the website.
I then used Python with LlamaIndex’s SimpleCSVReader to parse the CSV into documents, which I build into an index and save:
from pathlib import Path
from llama_index import GPTVectorStoreIndex, download_loader

SimpleCSVReader = download_loader("SimpleCSVReader")
loader = SimpleCSVReader()
documents = loader.load_data(file=Path(f'{folder}/{fname}'))

index = GPTVectorStoreIndex.from_documents(
    documents,
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
)
index.storage_context.persist(persist_dir=".")  # save in current directory
I can then query it fine using a custom prompt that says to only use the provided context for answers.
This works pretty well: the bot has become genuinely knowledgeable about the subject matter, an expert!
Here’s where I’m stuck, though. SimpleCSVReader is quite opaque; I don’t know how it presents the CSV as documents. Is everything just concatenated together into one big text string?
My company wants the replies to contain links to resources on our website whenever possible, and since the URL is available for each page/document, I figure this should be possible, right? But even when I tell the custom prompt to include URLs, it won’t. Yet if I ask a question like ‘provide me with links about X’, it returns a list that is correct!
I feel like I’m not formatting or structuring the source data correctly to make this happen.
I’m thinking I should split the CSV into separate documents and then use LlamaIndex’s SimpleDirectoryReader instead, so that I can customise each document. But what can you customise? What format should it be in? I can’t find any examples of adding embeddings or manipulating the Document object at all.
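For what it’s worth, one approach I’ve seen is to skip the CSV reader entirely and build one document per row yourself, keeping the URL as metadata rather than buried in the text. Here’s a minimal sketch using only the standard library; the column names (`title`, `date`, `intro`, `body`, `url`) are assumptions standing in for whatever your dump actually uses, and the final wrapping step assumes LlamaIndex’s `Document(text=..., extra_info=...)` constructor:

```python
import csv
import io

def rows_to_records(csv_text):
    """Parse the dump into one record per article, keeping the URL as metadata."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The text the model should actually read.
        text = f"{row['title']}\n{row['date']}\n{row['intro']}\n{row['body']}"
        # The URL travels alongside the text instead of being lost in it.
        records.append({"text": text,
                        "metadata": {"url": row["url"], "title": row["title"]}})
    return records

# Hypothetical two-line CSV for illustration.
sample = (
    "title,date,intro,body,url\n"
    "Hello,2020-01-01,Intro text,Body text,https://example.com/hello\n"
)
records = rows_to_records(sample)
print(records[0]["metadata"]["url"])  # https://example.com/hello
```

Each record could then become a LlamaIndex document with something like `Document(text=rec["text"], extra_info=rec["metadata"])`, so the URL survives chunking and retrieval instead of being something the model has to fish out of concatenated text.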
Or should I convert them to JSON, like training data, in one big file? Something like this?
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
I’m a bit confused, as I can’t find anything about the document format in LlamaIndex’s docs, or whether it’s the same as OpenAI’s training data format.
And finally, in either of these approaches, how would I encourage the replies to cite their sources?
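One hedged idea for the citation part: if each chunk carries its URL (however you attach it), you can make the URL visible in the context the model sees and ask for it explicitly in the prompt. This is a plain-Python sketch of that pattern, not LlamaIndex’s actual prompt API; the `format_context` helper and the chunk shape are hypothetical:

```python
def format_context(chunks):
    """Prefix every retrieved chunk with its source URL so the model can cite it."""
    parts = []
    for chunk in chunks:
        parts.append(f"[Source: {chunk['metadata']['url']}]\n{chunk['text']}")
    return "\n\n".join(parts)

PROMPT = (
    "Answer using only the context below. "
    "End your answer with the [Source: ...] links of the passages you used.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)

# Hypothetical retrieved chunk for illustration.
chunks = [{"text": "Some article text.",
           "metadata": {"url": "https://example.com/a"}}]
prompt = PROMPT.format(context=format_context(chunks),
                       question="What does the article say?")
print("https://example.com/a" in prompt)  # True
```

The point is that the model can only cite a URL it can actually see in its context window; once the links are in the context, a prompt instruction to cite them tends to work, which would also explain why your ‘provide me with links about X’ query already succeeds.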
Any guidance and help would be very much appreciated! Thanks!
(sorry I can’t include links as it’s my first post!)