A valid, working PHP library to help compare vectors for embeddings

Drupal is one of the oldest and most well-known content management systems available. It is written in PHP. I am working with an individual to try to create a Drupal module to do vector searching of site content. Here is the full discussion: https://www.drupal.org/project/openai/issues/3337774

We’ve worked out the logistics of embedding the site content and search text. It’s running the vector comparisons where we’ve hit a roadblock. We can do it with a simple PHP function, but given the size of the pages we are talking about, that would be a very inefficient and latency-prone way of doing it.

Note this paragraph specifically: https://www.drupal.org/project/openai/issues/3337774#comment-14908756

We need to identify a valid, working PHP library to help compare these vectors - or find out if a MySQL stored procedure can do the same. Most of the library examples I see are for Python.

This is a bit outside of my skillset, so I’m wondering if anyone has any suggestions here?


The code to compare vectors is trivial and so you do not need a library for that part of your puzzle.

For example, here is my Ruby “dot_product” method, which is the same as the “cosine_similarity” method for OpenAI embeddings (unit vector length of 1):

# Multiply the two arrays element-wise, then sum the products
def self.dot_product(a, b)
    a.zip(b).map { |x, y| x * y }.reduce(:+)
end

It is trivial to convert this to PHP with ChatGPT, but of course you should verify what the Chatty guy gives you.

HTH

Note: here is a simple PHP function for the dot product, courtesy of the OpenAI API. Please test it before using it :slight_smile: :

function dot_product($a, $b) {
    // Multiply the two vectors element-wise, then sum the products
    $result = array_map(function($x, $y) {
        return $x * $y;
    }, $a, $b);
    return array_sum($result);
}
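For completeness: if your vectors were not unit length, you would divide the dot product by the two magnitudes to get the cosine similarity. A quick sketch (untested, so verify before relying on it); for OpenAI embeddings both magnitudes are 1, so it reduces to the dot_product above:

function cosine_similarity($a, $b) {
    // Dot product of the two vectors
    $dot = array_sum(array_map(function($x, $y) {
        return $x * $y;
    }, $a, $b));
    // Magnitudes (Euclidean lengths) of each vector
    $mag_a = sqrt(array_sum(array_map(function($x) { return $x * $x; }, $a)));
    $mag_b = sqrt(array_sum(array_map(function($x) { return $x * $x; }, $b)));
    // For unit-length vectors (like OpenAI embeddings) both magnitudes are 1,
    // so this is identical to the plain dot product
    return $dot / ($mag_a * $mag_b);
}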

Hopefully this helps and you get the idea of how easy it is to compare these vectors in PHP. No libraries required: just put the function in a loop, compare the vectors, and sort the output (rank them).
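As a minimal sketch of that loop-and-rank step (assuming you already have the post vectors in a PHP array keyed by post ID; untested, adjust to your data):

// $search_vector: embedding of the search term
// $post_vectors:  array of post_id => embedding vector
$scores = [];
foreach ($post_vectors as $post_id => $vector) {
    $scores[$post_id] = dot_product($search_vector, $vector);
}
arsort($scores);                                   // highest similarity first
$top_matches = array_slice($scores, 0, 10, true);  // keep post IDs as keys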


I’m not following your statement above.

In your “posts” table, where the text of the posts is stored, you would add a vector column to the DB and then, for all legacy posts, run a one-time routine to add the vector for each post to the table.

For future posts, you could update the vectors via a cron job at night when traffic is low, covering the new posts from the prior day (or you can do it more often, even in near real time; that’s up to you).
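Here is a rough sketch of that one-time backfill. The “posts” table, “embedding” column, connection details, and the get_embedding() helper (which would call the OpenAI embeddings API) are all placeholders; adapt them to your actual schema and client code, and test before using:

// Existing DB connection (adjust the DSN and credentials to your setup)
$pdo = new PDO('mysql:host=localhost;dbname=your_cms', 'db_user', 'db_pass');

// One-time: add a column to hold the vector, stored as JSON text
$pdo->exec("ALTER TABLE posts ADD COLUMN embedding JSON NULL");

// Backfill all legacy posts that do not have a vector yet
$rows   = $pdo->query("SELECT id, body FROM posts WHERE embedding IS NULL");
$update = $pdo->prepare("UPDATE posts SET embedding = :embedding WHERE id = :id");

foreach ($rows as $row) {
    $vector = get_embedding($row['body']);    // hypothetical helper: calls the OpenAI API
    $update->execute([
        ':embedding' => json_encode($vector), // store the array of floats as JSON
        ':id'        => $row['id'],
    ]);
}

The same loop, minus the ALTER TABLE, is what the nightly cron job would run.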

Then, when a search is performed, you simply vectorize the search term, run a simple routine to rank each post by its dot product with that search vector, and output the results on your search results page.
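And a sketch of the search side, under the same assumptions (it pulls every stored vector and ranks in PHP, which is exactly the simple loop being discussed):

// $pdo and get_embedding() as in the previous sketch; dot_product() as defined earlier
$search_vector = get_embedding($search_term);

$scores = [];
$rows = $pdo->query("SELECT id, embedding FROM posts WHERE embedding IS NOT NULL");
foreach ($rows as $row) {
    $vector = json_decode($row['embedding'], true);  // back to a PHP array of floats
    $scores[$row['id']] = dot_product($search_vector, $vector);
}
arsort($scores);                                     // best matches first
$ranked_post_ids = array_keys(array_slice($scores, 0, 20, true));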

The above steps must be accomplished whether you use a “library” or just a few functions, so your original statement that “you need a lib for efficiency” does not “hold water” from a coding perspective.

However you approach it, the steps are basically the same to do a vector-based search of your forum posts.

Hope this helps.

Without a vector search, there are not many avenues for accomplishing this with local storage. You’d be stuck loading several records just to loop and compare, where something like Pinecone can do that heavy lifting a million times faster.

What I was saying above is that we can do it with a looping call to a function like you provided here: A valid, working PHP library to help compare vectors for embeddings - #2 by ruby_coder

But, that’s going to take forever to search relatively large sites. I’m sure I’m not explaining myself correctly, but what we are looking for is a way to do it faster.

It will not take “forever” but it will take a longer time than performing a full-text search.

Assuming your CMS (like this one) is already performing a full-text search, it will not see performance gains (speed) if you switch from full-text DB (indexed) searches to a vector-based approach.

There is no ‘cheap and easy way’ to do a vector-comparison search, because as you have mentioned @SomebodySysop, you must take all the vectors in the posts table in the DB (which you seem to not want to do) and compare them to the search-term vector using some comparison function (such as the dot product, Euclidean distance, etc.).
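(If you prefer Euclidean distance as the comparison function, it is just as small. A sketch, untested:)

function euclidean_distance($a, $b) {
    // Square root of the sum of squared differences; smaller means more similar
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $diff = $value - $b[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

Note that with a distance you would sort ascending (asort) rather than descending, since smaller means closer.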

Personally, I do not see any strong benefit to adding an embeddings-based approach to a MySQL DB-driven CMS unless you just “want to experiment” with embedding vectors, because MySQL already has a robust and very mature (and capable) full-text search feature. After all, it’s a “CMS”, and the keyword searches work fine using full-text search (as well as “LIKE” based matches).
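For reference, those two kinds of MySQL keyword queries look like this from PHP (the table and column names are placeholders, $pdo is an existing PDO connection as in the earlier sketches, and the full-text version needs a FULLTEXT index on the column):

// $search_term: whatever the user typed into the search box

// Full-text search (uses the FULLTEXT index, returns relevance-ranked rows)
$fts = $pdo->prepare(
    "SELECT id, title FROM posts
     WHERE MATCH(body) AGAINST(:term IN NATURAL LANGUAGE MODE)"
);
$fts->execute([':term' => $search_term]);

// Simple wildcard match (no special index required, but slow on large tables)
$like = $pdo->prepare("SELECT id, title FROM posts WHERE body LIKE :term");
$like->execute([':term' => '%' . $search_term . '%']);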

If performing semantic searches in a DB using embedding vectors were “better” than full-text searches, don’t you think MySQL and just about every other DB provider would already have this in their products? Using embedding vectors for searching text in a DB is not SOTA.

What do you hope to gain by all this work?

Is your Drupal CMS set up for full-text search now? Many OOTB CMSs are not configured for full-text search and so they must be configured properly for FTS to work.
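If it is not configured, on MySQL the setup is usually a single statement per table (a sketch with placeholder names; InnoDB supports FULLTEXT indexes from MySQL 5.6, older setups need MyISAM):

// One-time setup so MATCH ... AGAINST queries can use an index
// ($pdo is an existing PDO connection; "posts" and "body" are placeholders)
$pdo->exec("ALTER TABLE posts ADD FULLTEXT INDEX ft_body (body)");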

We hope to reduce the amount of time to do the vector search as much as possible. Actually, this might be a way to do it (as suggested in the post): OpenAI


I asked ChatGPT what are the advantages of a vector search over a keyword search:

Yes, there are advantages to a vector search over a keyword search. Vector search uses semantic similarity to find relevant results, whereas keyword search relies on matching exact terms. This means that vector search can handle synonyms, typos, and semantically similar terms, making it more robust and flexible. Vector search can also handle multi-word queries, whereas keyword search requires exact matches. Additionally, vector search can handle context and relationships between terms, providing more accurate results than keyword search.

I’d like the ability to do both.

You should be careful with ChatGPT. Your question is already biased and the reply was also biased.

Why don’t you ask ChatGPT something less biased, like this?

What are the advantages and disadvantages of vector-based DB searches versus full-text DB searches?

ChatGPT says the following … and the bullet points also seem biased and not really accurate:

Advantages of vector-based DB searches:

  1. Precision: Vector-based searches are precise because they compare exact values in the vectors. This allows for more accurate search results, especially when dealing with numerical data.
  2. Performance: Vector-based searches can be faster than full-text searches, especially when dealing with large amounts of data, because the search only needs to compare values in the vectors rather than searching through all of the text.

Disadvantages of vector-based DB searches:

  1. Complexity: Vector-based searches require a lot of pre-processing and can be complex to set up. This can make it more difficult to implement and maintain vector-based searches compared to full-text searches.
  2. Limitations: Vector-based searches are limited to numerical data and cannot handle text data. This can make it less versatile compared to full-text searches.

Advantages of full-text DB searches:

  1. Versatility: Full-text searches can handle text data, making it more versatile than vector-based searches. This allows for searching through large amounts of text data, such as product descriptions, article content, etc.
  2. Ease of use: Full-text searches are often easier to set up and maintain than vector-based searches, making them more accessible to users who may not have the technical expertise to implement a vector-based search.

Disadvantages of full-text DB searches:

  1. Inaccuracy: Full-text searches can be less precise than vector-based searches because they compare the entire text, rather than exact values. This can result in less accurate search results.
  2. Performance: Full-text searches can be slower than vector-based searches, especially when dealing with large amounts of text data, because the search needs to scan through all of the text data.

Because in most DB applications the text is actually not that large, and the column is indexed, you can easily see that the reply above from ChatGPT appears biased toward vector approaches. In practice (real-world), with most DBs doing full-text search on indexed table entries, many people (like me, when I have been testing) find vector-based approaches slower (and less accurate for the average length of a search term).

Also, the OpenAI vector-based approach is “not free”; there is a cost.

Furthermore, many people using vector-based approaches are using APIs which require additional network API calls that degrade performance (or break) when the network is slow or down.

You must be careful when asking ChatGPT a question @SomebodySysop. You must phrase your question in an unbiased way, and you should have some domain knowledge of the subject so you are aware of the biases in the GPT’s reply.

GPTs are not “expert systems” @SomebodySysop. GPTs are a type of text auto-completion engine :slight_smile:

OK… here is a simple example of both:

Vector DB search: “Hello World?” (screenshot of results)

Basic wildcard “LIKE” DB search: “Hello World” (screenshot of results)

Which is better?

For CMS search engines, there will be a lot of “bad matches” for short phrases because vectorized text must be around 300-500 tokens (or words, need to confirm) to provide a vector which is useful.

Most people doing a CMS or forum search will use much smaller search terms and the imprecision of using this type of vectorization search will return a lot of “useless” or “nonsense” search results (as in the practical example above).

We can improve this with larger search strings, but in general, we will not get better results than a full-text search; especially for a text-based CMS system of articles where users tend to do searches with short phrases and keywords.

In other words, as a systems engineer, I advise you NOT to worry about overall system performance until you have actually implemented a vector-based search approach so you can see the search results yourself.

If you want to implement a vectorized search feature, just do it and worry about the performance issues AFTER you are happy (or not) with the search results. Don’t follow the hype, ChatGPT, trendy tech discussions, or people trying to sell you a product or a consulting service; just “do it” by coding it yourself. It’s not hard to implement, as you have already mentioned @SomebodySysop.

Hope this helps.


Yes, you are right. I’m thinking in many cases this might be an improvement over keyword searching, but I don’t know for sure. We won’t know if this is what we want until we try it, and I’m doing just that now. I can’t believe how far I’ve come in so little time. Thank you for all your assistance!

Yes, the fact that you are even interested in this topic is very cool.

You are welcome. As a guy with over 40 (maybe closer to 50) years of IT experience, basically all of my life, and a formal engineering education, I advise caution when you are discussing tech with a “domain expert” human and posting ChatGPT auto-completion blah, blah as “facts”. ChatGPT is really cool and very fun; but ChatGPT generates a lot of “nonsense” because ChatGPT is not a technical or domain expert. ChatGPT has never written a single line of “real production” code (where it had to debug it and test it for when “the brown stuff hits the fan”), or solved a client’s crisis at 3 AM when they are under a cyberattack or critical systems are down. ChatGPT is a type of auto-completion engine that predicts text based on an incomplete model of “the world”. It has no practical domain knowledge and is just making people happy, “autocompleting” away without any awareness of the domain it is generating text for.

ChatGPT writes very good fiction. Accurate, reliable technical details you can depend on? Not at all. But of course, ChatGPT speaks perfect English and is very confident, LOL!

HTH

I’ve got a few more weeks under my belt. It does write very good fiction. But, I still find it very, very useful for code development.

All these YouTubers exclaiming that AI is going to replace human programmers are just laughable. If you don’t know how to program yourself, ChatGPT is almost useless. But, if you do know how to program, and have specific, small tasks you want coded while you think through the bigger picture, then ChatGPT is priceless.

ChatGPT is helping me figure out how to do my document embeddings using PHP instead of Python. Yes, I could learn Python, but having spent the past few years with PHP, I’d rather leverage that knowledge. For example, I’ve been asking ChatGPT to convert Python code to PHP, and I can immediately see and understand what the code is doing, rather than being clueless and struggling to understand both the code and what it is supposed to be doing.

And, while I can’t get anyone from Pinecone to return an email, ChatGPT will patiently, and politely, hold my hand all day.

All that to say that I agree: ChatGPT is brilliant, in an 8-year old child prodigy sort of way.