We have a dataset of approx. 2 TB of historical IoT data (including geospatial data, coordinates only) which we would like to query with the OpenAI API. The dataset is continuously growing with new data ingestion, approx. 3 GB/day. At the moment the data is stored as CSV files, so we’re free to go in any direction.
What would be a good scalable database solution? Can this be done in Postgres or MySQL, or do we need to go for a vector database? Does anyone have experience with this? Of course, it’s important that the response time is relatively fast.
IMHO, I would focus more on how much you’re willing to spend on database performance. Most modern relational databases can easily handle terabytes of data and are mostly limited by file system size limits (so skip VFAT or any severely limited FS). They also have spatial data types, either natively or through an extension such as PostGIS for Postgres, so your IoT lat+long can be handled efficiently. As for speed, you should first look at the types of queries you’ll be doing so you can arrange your data tables, views, and procedure calls efficiently. What I’m trying to get at is to remove the API as a factor in selecting your database.
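To make the "queries first, schema second" point concrete, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module purely as a stand-in so it runs anywhere; on Postgres you would more likely use a PostGIS geometry column with a spatial index instead of plain `lat`/`lon` columns. The table name `readings`, its columns, the sample rows, and the helper functions are all hypothetical, and the two query shapes (bounding box and per-device time range) are assumptions about what you might ask of the data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical IoT readings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE readings (
            device_id TEXT NOT NULL,
            ts        TEXT NOT NULL,   -- ISO 8601 UTC timestamp
            lat       REAL NOT NULL,
            lon       REAL NOT NULL,
            value     REAL
        )
        """
    )
    # Indexes chosen to match the two assumed query shapes below.
    conn.execute("CREATE INDEX idx_readings_device_ts ON readings (device_id, ts)")
    conn.execute("CREATE INDEX idx_readings_lat_lon ON readings (lat, lon)")

    # Made-up sample rows, for illustration only.
    rows = [
        ("sensor-a", "2024-01-01T00:00:00Z", 52.37, 4.90, 21.5),
        ("sensor-a", "2024-01-02T00:00:00Z", 52.37, 4.90, 22.0),
        ("sensor-b", "2024-01-01T00:00:00Z", 48.85, 2.35, 19.0),
        ("sensor-c", "2024-01-01T00:00:00Z", 40.71, -74.01, 15.0),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def readings_in_bbox(conn, min_lat, max_lat, min_lon, max_lon):
    """Return readings whose coordinates fall inside a bounding box."""
    cur = conn.execute(
        "SELECT device_id, ts, value FROM readings "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? "
        "ORDER BY device_id, ts",
        (min_lat, max_lat, min_lon, max_lon),
    )
    return cur.fetchall()


def device_history(conn, device_id, start_ts, end_ts):
    """Return one device's readings in the half-open range [start_ts, end_ts)."""
    cur = conn.execute(
        "SELECT ts, value FROM readings "
        "WHERE device_id = ? AND ts >= ? AND ts < ? "
        "ORDER BY ts",
        (device_id, start_ts, end_ts),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(readings_in_bbox(conn, 45.0, 55.0, 0.0, 10.0))
    print(device_history(conn, "sensor-a", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"))
```

The idea is that whatever sits in front of the database (the OpenAI API or anything else) ends up issuing a small set of parameterized queries like these, so the indexes and table layout should follow from those queries rather than from the API.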