Tabular LLM - new architecture exploration for the analysis of complex datasets without script writing

It would be really cool if, instead of trying to tokenize the tabular data in a CSV for a traditional transformer, we had a model designed to take a very large CSV, process it in chunks if necessary, and generate insights directly from the data: outliers, trends, implications, and so on.
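As a rough sketch of the "process it in chunks" part, assuming plain Python and hypothetical `iter_chunks` and `summarize_in_chunks` helpers (this is not an existing model or API): the file is read a fixed number of rows at a time and only running per-column statistics are kept, so memory use stays flat however large the file is. How an actual model would consume each chunk is the open question.

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv.DictReader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def summarize_in_chunks(handle, chunk_size=1000):
    """Keep running count/sum/min/max for every numeric column (hypothetical helper)."""
    stats = {}
    for chunk in iter_chunks(csv.DictReader(handle), chunk_size):
        for row in chunk:
            for column, raw in row.items():
                try:
                    value = float(raw)
                except (TypeError, ValueError):
                    continue  # skip cells that are not numeric
                s = stats.setdefault(
                    column, {"count": 0, "sum": 0.0, "min": value, "max": value}
                )
                s["count"] += 1
                s["sum"] += value
                s["min"] = min(s["min"], value)
                s["max"] = max(s["max"], value)
    return stats

# Made-up sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("category,amount\ntravel,120.5\nsoftware,300\ntravel,80\n")
summary = summarize_in_chunks(sample, chunk_size=2)
print(summary["amount"])
```

A summary like this is only the bookkeeping; the idea in this post is that the model itself would do the interpretation.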

I don’t know everything that would be necessary to do this, but I imagine it would require a new type of transformer that, instead of predicting the next word, computes how the data points relate to one another across timesteps, with some kind of proportional analysis.

Here’s an example.

Say I’m doing some financial analysis: I input a CSV of our expense reports and ask what the greatest expenses are. The model should be able to determine the answer through direct inference, without relying on a script.
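For comparison, this is roughly the kind of script that question needs today, and the answer a direct-inference model would have to reproduce. The `category` and `amount` column names, the sample rows, and the `top_expenses` helper are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

def top_expenses(handle, n=3):
    """Sum amounts per category and return the n largest totals (hypothetical helper)."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        totals[row["category"]] += float(row["amount"])
    # Sort categories by total spend, largest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up expense report standing in for a real CSV file.
report = io.StringIO(
    "category,amount\n"
    "travel,1200\n"
    "software,450\n"
    "travel,300\n"
    "office,75\n"
)
print(top_expenses(report, n=2))
# [('travel', 1500.0), ('software', 450.0)]
```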

I firmly believe this is possible, and the implications would be tremendous. Smarter people than me will probably figure this out.

It may be possible with a finetune; I just don’t know. It would be worth exploring.