Suddenly, [Database] Rows Can Now Have Meaning

bill.french · March 15, 2023, 8:15pm

We all know about ChatGPT. It is profoundly expanding the possibility of creating some very smart systems. Pervasive and near-free access to LLMs (large language models) inches us closer to AGI (artificial general intelligence), which can be applied to apps and data in several ways.

Airtable, of course, can readily enjoy the benefits that services such as OpenAI provide developers. Integrating the power of LLMs for text and code completion is almost trivial. These are magical capabilities, but they aren’t the only capabilities.

Most AI experts and analysts agree that AI will become pervasive in all solutions. The ones that create the greatest customer value will blend application data, user context, and LLMs to create extremely relevant and powerful outcomes.

Data Records That Have Meaning

Airtable search is not a pleasant experience at all. The findability of discrete records in a table is terrible. Locating key data across multiple tables and bases is almost impossible. I have explored this challenge with several clients, and I’m thrilled to say all that work is now obsolete. This paper needs to be burned.

Imagine if we could quickly capture the meaning of a row in a sheet or a record in a database.

LLM embeddings make this possible. An embedding is a vector, a fancy term for a long array of floating-point numbers. It’s possible to get a vector for an Airtable record. The vector is a formidable representation of meaning because the embedding model maps your text to a point in its learned vector space, where texts with similar meaning land close together.

By building a simple string of key field values in a table row and passing it to an embedding model such as OpenAI’s text-embedding-ada-002, you capture the mathematical meaning of that record as a single vector. But to transform this approach into a solution, you need one more piece of machinery: a vector datastore.
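As a rough illustration of the "string of key field values" step, here is a minimal sketch. The helper names (`record_to_text`, `embed_record`, `openai_embed`) and the "Field: value" layout are assumptions for illustration, not anything Airtable or OpenAI prescribes. The embedding call assumes the current `openai` Python client and an `OPENAI_API_KEY` in the environment; any callable that turns text into a vector can be swapped in.

```python
from typing import Callable, List, Mapping, Sequence


def record_to_text(fields: Mapping[str, object], key_fields: Sequence[str]) -> str:
    """Join the chosen field values into one 'Field: value' string, skipping blanks."""
    parts = []
    for name in key_fields:
        value = fields.get(name)
        if value is None or str(value).strip() == "":
            continue  # empty fields add no meaning to the embedding
        parts.append(f"{name}: {str(value).strip()}")
    return "\n".join(parts)


def openai_embed(text: str) -> List[float]:
    """Ask OpenAI for an embedding (assumes the openai package and an API key)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding


def embed_record(
    fields: Mapping[str, object],
    key_fields: Sequence[str],
    embed: Callable[[str], List[float]] = openai_embed,
) -> List[float]:
    """Build the record string and hand it to whichever embedding function is supplied."""
    return embed(record_to_text(fields, key_fields))
```

Passing `embed` as a parameter keeps the string-building logic testable without a network call, and makes it easy to try a different embedding model later.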

Vector databases (like Pinecone and Weaviate) have been around for a while. Still, you’ll soon hear a lot more about them because they are necessary to store the natural language essence of any information.
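To make the role of a vector datastore concrete, here is a toy in-memory stand-in. It is only a sketch: `TinyVectorStore` is a hypothetical class, not the Pinecone or Weaviate API, and it scans every stored vector by brute force, whereas real vector databases use approximate nearest-neighbor indexes. The core idea is the same, though: store one vector per record ID, then rank records by cosine similarity to a query vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical stand-in for a vector database: record ID -> embedding."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, record_id: str, vector: Sequence[float]) -> None:
        """Insert or overwrite the vector stored for a record."""
        self._vectors[record_id] = list(vector)

    def query(self, vector: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return the top_k (record_id, similarity) pairs, most similar first."""
        scored = [
            (record_id, cosine_similarity(vector, stored))
            for record_id, stored in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

In use, the query vector would come from embedding the user's natural-language question with the same model that embedded the records.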

Opinion: If Airtable were on its game, it would already have a vector data store baked into its architecture, but sadly, I predict it will try to solve the search and findability crisis with a Lucene-like architecture that I said should now be burned.

I’m using Airtable data, vectors, and LLMs. It’s a bold and profoundly powerful experience when users can employ natural language to locate their own information, or to discover related information without being forced to describe deeply limiting relationships through linked records.

curt.kennedy · March 15, 2023, 8:20pm

Very cool! I can totally see the usefulness of embedding rows in a database. Especially if it’s not obvious which “column” you need to search on, but you have a general query.

bill.french · March 15, 2023, 8:40pm

Indeed. This is one of the key advantages - you get to create the embedding cone with selected and “boosted” or weighted columns. It’s very powerful, and the “training” is far simpler.

I have wondered, though, when I use private data to generate a vector, is that data captured and used by OpenAI? Or, is it safe to say that embeddings help to insulate customer data from OpenAI’s general model training set?

curt.kennedy · March 15, 2023, 8:45pm

The general consensus is that the data you send to the API is not private.

But out of curiosity, how are you boosting or weighting the columns?

bill.french · March 15, 2023, 10:49pm

I use vector filters to do this, but it may also be possible to call out specific values in a prompt resulting in an embedding that is given more sway. I have not tested this approach yet, but it stands to reason that you can prompt-engineer your way to embeddings that emphasize certain terms.

My approach so far is more traditional. With the vector in hand, I also attach metadata to the Pinecone vector itself. This makes it possible to order vector results in a manner that emphasizes certain terms or even tokens in longer strings.
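One way the metadata idea could look in code is sketched below. This is a guess at the general shape, not the exact method described above: it assumes query matches arrive as dictionaries with `id`, `score`, and `metadata` keys, and the `rerank` and `boosted_score` helpers and the additive bonus are hypothetical. Each weighted metadata field adds a bonus to the similarity score when it contains one of the emphasized terms.

```python
from typing import Dict, List, Sequence


def boosted_score(match: Dict, terms: Sequence[str], field_weights: Dict[str, float]) -> float:
    """Similarity score plus a bonus for each weighted field that mentions a term."""
    bonus = 0.0
    for field, weight in field_weights.items():
        value = str(match["metadata"].get(field, "")).lower()
        if any(term.lower() in value for term in terms):
            bonus += weight
    return match["score"] + bonus


def rerank(matches: List[Dict], terms: Sequence[str], field_weights: Dict[str, float]) -> List[Dict]:
    """Return matches ordered by boosted score, highest first."""
    return sorted(
        matches,
        key=lambda m: boosted_score(m, terms, field_weights),
        reverse=True,
    )
```

The weights here are arbitrary knobs; in practice they would need tuning against real queries so that a metadata bonus does not swamp the underlying similarity score.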
