Create a crosswalk with GPT

Say I have two datasets, A and B, each with the column msa_name (Metropolitan Statistical Area). I want to merge both datasets on msa_name. The problem is that msa_name is not identical in both datasets. For example, ‘Boston (MA)’ and ‘Boston-Cambridge (MA)’ are clearly the same MSA, but a standard match would not work. There are approximately 400 MSAs in the US.

With ChatGPT, I’ve tried giving it the entire list of msa_name values from A plus one value from B, and asking it to find the match, then looping over all msa_name values in B. It works, but:

  • I feel I’m wasting a lot of tokens by sending the entire list from A with each call.
  • The output is not consistent across calls, even after lowering the temperature. I’d like the output to be just ‘Boston-Cambridge (MA)’, but the result is usually something like ‘The corresponding MSA for Boston (MA) is Boston-Cambridge (MA)’, with different wording, spacing, etc. I tried providing examples in the prompt.

How should I approach this problem? Ideally, I would like to pass the list from A only ONCE, and then supply elements of B and get back the corresponding element of A.

Even better would be to train the model so that it remembers A and B, and then I can ask for the crosswalk in both directions.

Hi and welcome to the developer forum!

GPT models are stateless: when you send a prompt in, that is the first thing the model has ever seen, and that is always the case. There is no memory of past events; a simulation of memory is created by sending the model the old chat text to use as context. So no, there is no way to avoid sending everything needed each time.

On to my main point: what you are describing seems like it could be done with traditional database techniques and methods. Using an LLM for this task seems like applying the wrong technology to the problem. Can I ask why you are not using a Python script and SQL queries to perform this operation?
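For comparison, a minimal non-LLM sketch of such a crosswalk using only Python's standard-library difflib (the sample names are illustrative, not the asker's real data):

```python
import difflib

# Illustrative sample of canonical names from A and messy names from B
a_names = ["Boston-Cambridge (MA)", "Chicago-Naperville (IL)", "Dallas-Fort Worth (TX)"]
b_names = ["Boston (MA)", "Chicago (IL)", "Dallas (TX)"]

# Build a B -> A crosswalk by closest string match
crosswalk = {}
for b in b_names:
    match = difflib.get_close_matches(b, a_names, n=1, cutoff=0.4)
    crosswalk[b] = match[0] if match else None

print(crosswalk["Boston (MA)"])  # Boston-Cambridge (MA)
```

For ~400 names this runs in well under a second and is fully deterministic, which sidesteps both of the problems described above; its limitation is exactly the semantic case raised later (‘meat’ vs ‘beef’), where string similarity fails.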

Thank you Foxabilo. On your first point, it’s true that the GPT model is stateless, but I was wondering whether it is possible to train the model on that data, so that the crosswalk information lives in the model parameters. Worst case, I can keep sending the entire list in the prompt, but train the model to give consistent output.

For the second point, several answers:
(1) I want to experiment/learn with GPT and see if it does the job properly.
(2) Other merging techniques are still not perfect; maybe an LLM can do better, especially for small datasets that fit into the prompt.
(3) Sometimes the crosswalk is based not on text similarity but on meaning. For example, I would like to match ‘meat’ with ‘beef’ or ‘chicken’.

Understood. Regarding your points:
(1) Great! It’s always great to see people experimenting and creating new things.
(2) I think that ties in with point (3).
(3) It might be worth experimenting with the ADA embeddings model to create a vector DB of your records; a semantic similarity search should yield good results for matching.
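The embeddings approach above can be sketched as follows. This is a local stand-in, not the real API: the `embed` function here uses character trigram counts as a placeholder for the dense vectors the ADA embeddings endpoint would return, so that the nearest-neighbour matching logic can be shown end to end.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """PLACEHOLDER embedding: character trigram counts.

    In practice you would call the ADA embeddings endpoint here and get
    back a dense vector; the matching logic below stays the same.
    """
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Embed the canonical list from A once -- this is the "vector DB"
a_names = ["Boston-Cambridge (MA)", "Chicago-Naperville (IL)"]
a_vectors = {name: embed(name) for name in a_names}

def best_match(query: str) -> str:
    """Return the entry of A whose vector is closest to the query's."""
    q = embed(query)
    return max(a_names, key=lambda name: cosine(q, a_vectors[name]))

print(best_match("Boston (MA)"))  # Boston-Cambridge (MA)
```

The key property is that A is embedded only once, matching the asker's "pass the list of A only ONCE" requirement; each element of B then costs a single embedding lookup. With real semantic embeddings in place of the trigram placeholder, the same nearest-neighbour search would also handle meaning-based matches like ‘meat’ vs ‘beef’, which no string-similarity method can.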

On the topic of bringing your data into the model, the only way to influence the weights and biases currently is via a base-model fine-tune. Note that I did not say this affects the model’s data: fine-tuning influences the model’s way of thinking, not what it knows. A good example of this is teaching a model an author’s written works; that teaches the model how the writer wrote, not what they wrote.

I think realistically, embeddings are the way forward. Even with larger-context models, their attention heads (the mechanism by which they determine the important parts of a prompt) are finite, and there is a good possibility that attention spread over a large prompt is diluted to the extent that data is insufficiently prioritised, causing a drop in performance.
