Generic Semantic Matching Functions


Hopefully this post is not already covered by someone else. As I’m sure many others have observed, a major use of the API is identifying semantic duplicates in various ways. I work in the print mail industry, so, for example, we get a collection of files from a client containing information about people to mail, and we need to extract the address information and put it through postal sorting software. One file has fields such as FNAME, ADDRESS1, etc., whereas another has first, street_address, etc. Rather than building import templates for all of these different file structures, or trying to write a very long set of search rules to somehow put them all together, it is great to simply pass the API the list of fields I want to find and the list of fields I have, and ask it to match them up. I then take the returned JSON and use it to rename the fields, and voilà, I no longer need hundreds of import templates or rules (we have a lot of clients and a lot of file structures to deal with). It is a little more complicated than that, but the example should be clear enough.
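To make the renaming step concrete, here is a minimal sketch in Python. It assumes the model returns a flat JSON object mapping each source field to a target field (or null when nothing matches); that response shape, the function name rename_fields, and the sample data are all illustrative assumptions, not a description of the actual setup.

```python
import json

def rename_fields(records, mapping_json):
    """Rename keys in each record using a JSON mapping of source field -> target field."""
    mapping = json.loads(mapping_json)
    renamed = []
    for record in records:
        # Fields missing from the mapping, or mapped to null, keep their original name.
        renamed.append({mapping.get(key) or key: value for key, value in record.items()})
    return renamed

# Hypothetical model response and input rows, for illustration only.
response = '{"FNAME": "first_name", "ADDRESS1": "street_address", "ACCT": null}'
rows = [{"FNAME": "Ada", "ADDRESS1": "1 Main St", "ACCT": "42"}]

print(rename_fields(rows, response))
```

Unmatched fields are passed through unchanged here, so nothing is silently dropped before the postal sorting step; a stricter version could raise an error instead.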

There are so many uses for this sort of generic semantic duplicate matching that I feel there ought to be some pre-built feature in the API, a callable option for this kind of matching, something like this:

Function name: semantic match
Right: first list of values to match
Left: second list of values to match
Match type: {“all”: “every value in both the right and left lists must be matched”,
“left”: “every value in the left list must be matched”,
“right”: “every value in the right list must be matched”}
Dupe=False: if True, each item can match only one item in the other list

Returns: A JSON object indicating matches between right and left.
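As a rough illustration of the spec above, here is one way it could be wrapped in Python. This is only a sketch under assumptions: call_model is a hypothetical callable that takes a prompt string and returns the model's reply as text (plug in whatever client you use), and the prompt wording and the JSON shape of the reply are my own guesses rather than anything the API provides.

```python
import json

def semantic_match(left, right, call_model, match_type="all", dupe=False):
    """Ask a model to pair items in `left` with items in `right`, then validate the result.

    call_model is a hypothetical callable: prompt text in, reply text out.
    """
    if match_type not in ("all", "left", "right"):
        raise ValueError(f"unknown match_type: {match_type!r}")

    prompt = (
        "Match each item in LEFT to the item in RIGHT with the same meaning.\n"
        f"LEFT: {json.dumps(left)}\n"
        f"RIGHT: {json.dumps(right)}\n"
        "Reply with only a JSON object whose keys are LEFT items "
        "and whose values are RIGHT items."
    )
    matches = json.loads(call_model(prompt))

    # Discard any pair that mentions a value not present in the input lists.
    matches = {l: r for l, r in matches.items() if l in left and r in right}

    if match_type in ("all", "left"):
        missing = [l for l in left if l not in matches]
        if missing:
            raise ValueError(f"unmatched left values: {missing}")
    if match_type in ("all", "right"):
        matched_right = set(matches.values())
        missing = [r for r in right if r not in matched_right]
        if missing:
            raise ValueError(f"unmatched right values: {missing}")
    # Following the spec above: dupe=True means each item may match only one item.
    if dupe and len(set(matches.values())) != len(matches):
        raise ValueError("a right value was matched more than once")
    return matches

# Stand-in for a real model call, so the sketch runs without network access.
def fake_model(prompt):
    return '{"FNAME": "first", "ADDRESS1": "street_address"}'

print(semantic_match(["FNAME", "ADDRESS1"], ["first", "street_address"], fake_model))
```

The validation after the call is the part I would keep in any version: the model's reply is treated as untrusted, so invented field names are dropped and unmet match requirements raise an error instead of passing through.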

At this point, I’ve basically built API calls to do all this, but it is such a generic utility, one I keep re-applying across tasks, that I feel it should be directly available through the API. Obviously, my suggested names might be bad, and if you all know of something like this already, I’d be glad to hear about it, but I figured I’d throw out my suggestion.

There is a name for what you describe: Named Entity Recognition (NER), also called named entity extraction.

It can be done with predefined key: value pairs, or the model can come up with its own unique fields per document. Chat models can do this by instruction.

I did some looking into this after your post, and this seems worth my time to study further, but I’m guessing it will not be nearly as easy to deploy as what I’m envisioning/using the API for right now. Thanks for the reference!

Any idea whether matching can be done based on some predefined ontology, with donors on one side of the ontology and recipients on the other?

Interesting question. I had not explored this much further than described in the thread. Once I got my basic prompt working, I got a couple of things automated and then moved on to other tasks.

How would it know to pair up donors and recipients? In this case, I’m not sure what would link them, but maybe you have an example in mind.
