Yes, I would start with the Assistant API, since you are willing to implement RAG yourself eventually anyway.
The main differences between the Assistant API and your own RAG would be control and cost.
If you implement RAG yourself, using your own local or cloud resources to compute the similarity and pull the text from a database, you can do this very cheaply and control how many tokens (history/context) are sent per API call.
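As a minimal sketch of that token control, assuming you already have your retrieved chunks ranked best-first, you can cap the context you send per call. The function and parameter names here are just illustrative; it uses tiktoken for counting.

```python
# Sketch: trim ranked chunks to a fixed token budget before sending them
# in the API call. Names (build_context, max_context_tokens) are illustrative.
import tiktoken

def build_context(ranked_chunks, max_context_tokens=2000, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    picked, used = [], 0
    for chunk in ranked_chunks:
        n = len(enc.encode(chunk))
        if used + n > max_context_tokens:
            break  # stop before the budget is exceeded
        picked.append(chunk)
        used += n
    return "\n\n".join(picked)
```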
The Assistant API, by contrast, currently only has truncation as a strategy and gives you little control; because of the large buffer size available, the conversation just keeps adding tokens and eventually becomes expensive unless you do something about it (start a new thread?). But I expect this to improve over time, so don't think of it as a long-term deterrent; it will be felt in the short term, though, until this is parameterized via the API.
A larger deterrent from using the Assistant API is the limitation on how much content you can store, search, and retrieve. So if you start heading into large amounts of content, you would want to move to your own RAG, since data and compute are readily available and cheap.
But watch out for expensive vector database vendors; their costs can add up, and they are not necessary if you are looking at less than, say, 1 million vectors or so, queried on a sporadic basis.
In that case, it's better to code it by hand as a basic linear search using Python and NumPy. You can always split the vectors into chunks and run them in parallel to hit your latency requirements.
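Here is a minimal sketch of that hand-rolled linear search, assuming the embeddings are unit-normalized (as OpenAI embeddings are), so a dot product is the cosine similarity. File names and dimensions are illustrative.

```python
# Sketch: brute-force top-k search over an in-memory embedding matrix.
import numpy as np

def top_k(query_vec, embeddings, k=5):
    # embeddings: (N, D) float32 matrix, query_vec: (D,) vector
    scores = embeddings @ query_vec          # one matrix-vector product over all N vectors
    idx = np.argpartition(-scores, k)[:k]    # top-k indices, unordered
    idx = idx[np.argsort(-scores[idx])]      # order those k best-first
    return idx, scores[idx]

# Usage (illustrative): load the binary file of embeddings and search it.
# embeddings = np.fromfile("embeddings.bin", dtype=np.float32).reshape(-1, 1536)
# ids, scores = top_k(query_vec, embeddings, k=5)
```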
I can currently do about 400,000 vectors per second per worker in AWS Lambda, and this scales to more workers just by splitting the binary file containing the embedding vectors.
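A sketch of that split-and-parallelize idea, shown here with local processes rather than Lambda; on AWS Lambda you would invoke one worker per shard of the embedding file instead. Shard paths, the dimension, and function names are illustrative.

```python
# Sketch: search each shard of the embedding file in parallel, then merge
# the per-shard top-k lists into a global top-k.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

DIM = 1536  # embedding dimension (illustrative)

def search_shard(args):
    shard_path, query_vec, k = args
    shard = np.fromfile(shard_path, dtype=np.float32).reshape(-1, DIM)
    scores = shard @ query_vec
    idx = np.argpartition(-scores, k)[:k]
    return [(shard_path, int(i), float(scores[i])) for i in idx]

def parallel_top_k(shard_paths, query_vec, k=5):
    with ProcessPoolExecutor() as pool:
        partial = pool.map(search_shard, [(p, query_vec, k) for p in shard_paths])
    # merge the per-shard candidates and keep the global best k
    merged = [hit for hits in partial for hit in hits]
    return sorted(merged, key=lambda h: -h[2])[:k]
```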