Document Index Creation Issue

I have created a Google Drive folder, added some demo Word docs that interest me, and successfully authorized it using the connector.
The next step is to create the index.json file, which I have done using the code below:

index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

index.json created successfully.
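For completeness, persisting and reloading the saved index works with the same save_to_disk/load_from_disk calls used in the full script below:

# Persist the freshly built index, then reload it on later runs instead of rebuilding
index.save_to_disk('index.json')
index = GPTSimpleVectorIndex.load_from_disk('index.json')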

Now my question is this:

If we want to add new documents to the existing folder, does the indexing have to be run from scratch on all the documents in the folder, or can the index be updated to include only the newly added documents?

E.g., if we have an index created on 100 documents (say around 1,000 pages) and we add one new document with 5 pages, do we have to recreate the entire index over all 1,005 pages again?

Please help

Are you talking about a tutorial? GitHub code? We need more details, please.

This is basically a question about a custom bot based on documents saved in a Google Drive folder.

The Python script I am using is:

import os
import pickle
from langchain import OpenAI
from flask import Flask, render_template, request
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext, download_loader, MockLLMPredictor, MockEmbedding
from langchain.chat_models import ChatOpenAI

os.environ['OPENAI_API_KEY'] = 'sk-s0RIxaOb'

def authorize_gdocs():
    google_oauth2_scopes = [
        "https://www.googleapis.com/auth/drive.readonly",
        "https://www.googleapis.com/auth/documents.readonly"
    ]
    cred = None
    # Reuse a cached credential if we have one
    if os.path.exists("token.pickle"):
        with open("token.pickle", 'rb') as token:
            cred = pickle.load(token)
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file("client_secrets.json", google_oauth2_scopes)
            cred = flow.run_local_server(port=0)
        with open("token.pickle", 'wb') as token:
            pickle.dump(cred, token)

authorize_gdocs()

GoogleDriveReader = download_loader('GoogleDriveReader')
folder_id = '1AuhkobVmt0Et0lIrEU0swvwavwXRtJYi'
loader = GoogleDriveReader()
documents = loader.load_data(folder_id=folder_id)

# Define LLM
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo"))

# Define prompt helper
max_input_size = 4096
num_output = 512
max_chunk_overlap = 20
chunk_size_limit = 600
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# Create the index from documents only one time, then comment this out
index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)
# Save the index to a JSON file
index.save_to_disk('index1.json')

index = GPTSimpleVectorIndex.load_from_disk('index1.json')

# Flask app code
app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/query', methods=['POST'])
def query():
    prompt = request.form['prompt']
    print('prompt given is:', prompt)
    response = index.query(prompt)
    print('Response:', response)
    # print('last token used', llm_predictor.last_token_usage)
    return render_template('index.html', prompt=prompt, response=response)

if __name__ == '__main__':
    app.run(debug=True)

If that's your entire code, at a quick glance it doesn't appear to save a record of what it has already indexed so that it can index only the new documents.
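For example, you could keep a small bookkeeping file of the document IDs you've already embedded and only insert the new ones. A rough, untested sketch, reusing loader and folder_id from your script; it assumes your llama_index version exposes index.insert() and a doc_id attribute on each loaded document, and indexed_ids.json is just a name I made up:

import json
import os

SEEN_PATH = 'indexed_ids.json'  # hypothetical bookkeeping file of already-indexed doc IDs

# Load the set of document IDs we have already embedded, if any
seen = set(json.load(open(SEEN_PATH))) if os.path.exists(SEEN_PATH) else set()

index = GPTSimpleVectorIndex.load_from_disk('index1.json')
for doc in loader.load_data(folder_id=folder_id):
    if doc.doc_id not in seen:
        index.insert(doc)       # assumes insert() is available on this index class
        seen.add(doc.doc_id)

# Persist both the updated index and the bookkeeping file
index.save_to_disk('index1.json')
with open(SEEN_PATH, 'w') as f:
    json.dump(sorted(seen), f)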

Where did you grab the code from?

I didn't get you.

I have stored the index in a single file, index1.json.

Now suppose new docs are added to the folder ID mentioned above. Do I need to create the index from scratch, or can it be created only for the newly added files?

I have customized the code for my use case; no direct source code is available.

You'll need to re-run the indexing every time documents are added or changed.
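With the script as written, that just means re-pulling the folder and rebuilding the whole index on each update, roughly:

# Re-fetch everything from the Drive folder and rebuild the index from scratch
documents = loader.load_data(folder_id=folder_id)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk('index1.json')  # overwrite the previously saved index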
