Hi,
I’m trying LlamaIndex to create data input for GPT from some Google Docs.
My goal is to have a chatbot that uses my knowledge base to answer my questions.
This is the code that I’m using:
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, PromptHelper, SimpleDirectoryReader

def construct_index(directory_path):
    # Prompt sizing: context window, response length, chunk overlap and chunk size
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    # Load every file in the directory, build the vector index, and persist it
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.save_to_disk('index.json')
    return index
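(For reference, this is roughly how I load and query the index afterwards; the question string is just a placeholder:)

from llama_index import GPTSimpleVectorIndex

# Reload the saved index and ask it a question
index = GPTSimpleVectorIndex.load_from_disk('index.json')
response = index.query("What does the manual say about resetting a device?")
print(response)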
I opened the output file to understand better what’s going on, and I saw the data parsed in a very unusual way… I don’t really comprehend the structure of the data. For example, some paragraphs are split across different index entries.
I then tried ChatGPT with this data and… it kinda sucked.
So my question here is: would ChatGPT perform better if I structured the data better? For example, I think it would be easier to have the data divided by paragraph.
Or maybe I could use FAQs instead of user manuals, so the answers are shorter than entire paragraphs. That way each index entry would hold a question and its corresponding answer, for example, as in the sketch below.
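To make that concrete, here’s a rough sketch of what I have in mind: building the Documents myself before indexing, one per paragraph or one per Q&A pair (the FAQ entries below are made-up placeholders, not my real content):

from llama_index import Document, GPTSimpleVectorIndex

# One Document per paragraph, assuming paragraphs are separated by blank lines
def paragraph_documents(text):
    return [Document(p.strip()) for p in text.split("\n\n") if p.strip()]

# ...or one Document per FAQ entry, keeping question and answer together
faq = [
    ("How do I reset the device?", "Hold the power button for 10 seconds."),
    ("Where do I find my serial number?", "It's printed on the back panel."),
]
faq_documents = [Document(f"Q: {q}\nA: {a}") for q, a in faq]

index = GPTSimpleVectorIndex(faq_documents)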
What do you think?