I’m attempting to use a prompt to match an organisation’s need (typically a charity) with a grant maker. Grant makers all have vastly different criteria for what kind of projects they will fund and there is no standard way for them to articulate that.
We eliminate unsuitable grants based on geography and grant size. Then we are looking to use OpenAI to match a brief plain text explanation from the organisation to the top three options amongst grant makers.
We have no problem doing this using the API and text-davinci-003.
The problem will come when we scale this. Each grant maker’s criteria run to around 500 words of text. We are testing with just 4, but when we load hundreds, we are sure to exceed the token limit.
ChatGPT suggested the dataset endpoint, but either it was guessing or that endpoint has been deprecated. We have also tried uploading the grant maker text to the files endpoint, but that only seems to accept fine-tune data. The other approach we have tried is to put the data on a plain text webpage (https://test.charityradar.uk/grant/test.txt) and ask the API to read that, but no success there either.
Can you suggest an approach to this problem that is scalable?
In case it makes a difference, we are using Perl to connect to the API…
That sounds like a great project! I suggest you try using embeddings (semantic search) to find the few most relevant grants based on the similarity between the project description and the grant criteria. You can then feed just those top grants to GPT. Here is a tutorial:
This may not work if the project descriptions are written in a way that is very different from the grant criteria. In that case, it may be necessary to use ChatGPT to preprocess the project description and the grant criteria into a more similar format before the semantic search.
Just to be clear, if I use embeddings as suggested, these are not persistent - is that right?
The API generates a vector that I store, and then I need to calculate the cosine similarity myself.
If you use a cloud service like Pinecone, you upload the grant criteria as documents with their embedding vectors and associated metadata, and Pinecone stores them persistently for you. Then, for each new project, you calculate its embedding and use that embedding to query Pinecone. Pinecone runs the cosine-similarity search for you and returns the most relevant documents.
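If you would rather store the vectors yourself, as described in the question, the ranking step is small enough to do by hand. Below is a minimal sketch in Python, assuming each grant's embedding has already been fetched once and saved. The names `cosine_similarity`, `top_matches`, `grant_vecs` and `project_vec` are made up for illustration, and the three-number vectors are toy stand-ins for real embeddings, which are far longer. The same arithmetic ports directly to Perl.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(project_vec, grant_vecs, k=3):
    """Return the k (grant name, score) pairs most similar to the project."""
    scored = [
        (name, cosine_similarity(project_vec, vec))
        for name, vec in grant_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: in practice these vectors come from the embeddings endpoint
# and are stored alongside each grant maker's criteria text.
grant_vecs = {
    "grant_a": [1.0, 0.0, 0.0],
    "grant_b": [0.0, 1.0, 0.0],
    "grant_c": [0.7, 0.7, 0.0],
    "grant_d": [0.0, 0.0, 1.0],
}
project_vec = [0.9, 0.1, 0.0]

for name, score in top_matches(project_vec, grant_vecs):
    print(name, round(score, 3))
```

Only the criteria text for the three grants returned here needs to go into the prompt, so the prompt size stays roughly constant however many grant makers are loaded. A linear scan like this should be fine for a few hundred grants. A hosted vector store mainly helps with much larger collections.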