Document Index Creation Issue

I have created a Google Drive folder, added some demo Word docs that interest me, and successfully authorized it using the connector.
The next step is to create the index.json file, which I have done using the code below:

index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

index.json created successfully.
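For completeness, persisting and reloading the saved index works with the same save_to_disk/load_from_disk calls used in the full script below:

# Persist the freshly built index, then reload it on later runs instead of rebuilding
index.save_to_disk('index.json')
index = GPTSimpleVectorIndex.load_from_disk('index.json')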

Now my question is this:

If we want to add new documents to the existing folder, does the indexing have to be run from scratch on all the documents in the folder, or can the index be updated to include only the newly added documents?

E.g., if we have an index created on 100 documents (say around 1,000 pages) and we add one new document with 5 pages, do we have to recreate the entire index over all 1,005 pages again?

Please help

Are you talking about a tutorial? GitHub code? We need more details, please.

This is basically a question about a custom bot based on documents saved in a Google Drive folder.

The Python script I am using is:

import os
import pickle
from langchain import OpenAI
from flask import Flask, render_template, request
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext, download_loader, MockLLMPredictor, MockEmbedding
from langchain.chat_models import ChatOpenAI

os.environ['OPENAI_API_KEY'] = 'sk-s0RIxaOb'

def authorize_gdocs():
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/drive.readonly",
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    # Reuse a cached credential if we have one
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file("client_secrets.json", google_oauth2_scopes)
            cred = flow.run_local_server(port=0)
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)

authorize_gdocs()

GoogleDriveReader = download_loader('GoogleDriveReader')
folder_id = '1AuhkobVmt0Et0lIrEU0swvwavwXRtJYi'
loader = GoogleDriveReader()
documents = loader.load_data(folder_id=folder_id)

# Define LLM
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo"))

# Define prompt helper
max_input_size = 4096
num_output = 512
max_chunk_overlap = 20
chunk_size_limit = 600
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# Create the index from documents only one time, then comment this out
index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)
# Save the index to a JSON file
index.save_to_disk('index1.json')

index = GPTSimpleVectorIndex.load_from_disk('index1.json')

# Flask app code
app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/query', methods=['POST'])
def query():
    prompt = request.form['prompt']
    print('prompt given is:', prompt)
    response = index.query(prompt)
    print('Response:', response)
    # print('last token used', llm_predictor.last_token_usage)
    return render_template('index.html', prompt=prompt, response=response)

if __name__ == '__main__':
    app.run(debug=True)

If that's your entire code, at a quick glance it doesn't appear to save a record of what it has already indexed so that it can index only the new documents.
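For example, you could keep a small bookkeeping file of the document IDs you've already embedded and only insert the new ones. A rough, untested sketch, reusing loader and folder_id from your script; it assumes your llama_index version exposes index.insert() and a doc_id attribute on each loaded document, and indexed_ids.json is just a name I made up:

import json
import os

SEEN_PATH = 'indexed_ids.json'  # hypothetical bookkeeping file of already-indexed doc IDs

# Load the set of document IDs we have already embedded, if any
seen = set(json.load(open(SEEN_PATH))) if os.path.exists(SEEN_PATH) else set()

index = GPTSimpleVectorIndex.load_from_disk('index1.json')
for doc in loader.load_data(folder_id=folder_id):
    if doc.doc_id not in seen:
        index.insert(doc)       # assumes insert() is available on this index class
        seen.add(doc.doc_id)

# Persist both the updated index and the bookkeeping file
index.save_to_disk('index1.json')
with open(SEEN_PATH, 'w') as f:
    json.dump(sorted(seen), f)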

Where did you grab the code from?

I didn't get you.

I have stored the index in a single file, index1.json.

Now suppose new docs are added to the folder ID mentioned above. Do I need to create the index from scratch, or can it be created only for the newly added files?

I have customized the code for my use case; no direct source code is available.

You'll need to re-run the indexing every time documents are added or changed.
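With the script as written, that just means re-pulling the folder and rebuilding the whole index on each update, roughly:

# Re-fetch everything from the Drive folder and rebuild the index from scratch
documents = loader.load_data(folder_id=folder_id)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk('index1.json')  # overwrite the previously saved index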
