How to make LLMs learn a book (text + diagrams) just like we humans do?

What would be the best way to make an LLM learn a 300-page book that also contains some images?

  1. Is it possible to fine-tune it so that it reads both text and images (diagrams, tables, etc.)?
  2. Does RAG support vectorising and retrieving a PDF that has both text and images?

Alternatively, if anyone has better ideas please do share.

Welcome to the community!

It depends on what you mean by “learn”. What do you ultimately want to achieve by “learning”?

Fine-tuning will probably not help you here, unless you want the model to pick up the style of the book. Fine-tuning is not great at teaching a model new facts.

Multimodal embeddings are still lagging behind, but Google offers them on Vertex AI: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings

That said, I’m not sure if embedding diagrams and tables as images is the best idea. For tables, it’s probably better to extract the rows and columns; for diagrams and graphs, it’s probably better to extract the underlying function and maybe embed some hypothetical data (depending on what you intend to use it for). GPT-4V can help with that, but it’s not super reliable.
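
As a rough illustration of the table idea, here is a minimal Python sketch that turns each row into a standalone sentence you can embed as plain text. The function name `table_to_sentences` and the example table are made up for illustration; they don't come from any library or from your book.

```python
def table_to_sentences(caption, header, rows):
    """Turn each table row into a standalone sentence that can be embedded as text."""
    sentences = []
    for row in rows:
        # Pair every cell with its column name so the row makes sense on its own.
        cells = ", ".join(f"{col}: {val}" for col, val in zip(header, row))
        sentences.append(f"{caption} | {cells}")
    return sentences


# Made-up example table, for illustration only.
header = ["solvent", "temperature_C", "yield_percent"]
rows = [
    ["ethanol", 25, 40],
    ["toluene", 80, 72],
]
chunks = table_to_sentences("Table 3: effect of solvent on yield", header, rows)
for chunk in chunks:
    print(chunk)
```

Repeating the caption and column names in every chunk is deliberate: a retrieved row then still says what the numbers mean, even without the rest of the table.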

To work with the images, you will likely want to use a vision model to describe the image, and then train on that.
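
A minimal sketch of that flow, assuming your vision model is wrapped in a function. `placeholder_describe_image` below is a hypothetical stand-in, not a real API; you would swap in a call to whatever vision model you end up using. The page structure (a dict with `text` and `images`) is also just an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chunk:
    page: int
    kind: str  # "text" or "figure"
    text: str


def build_corpus(pages: List[Dict], describe_image: Callable[[bytes], str]) -> List[Chunk]:
    """Merge page text and image descriptions into one list of text chunks."""
    corpus = []
    for number, page in enumerate(pages, start=1):
        if page["text"].strip():
            corpus.append(Chunk(number, "text", page["text"].strip()))
        for image in page["images"]:
            # The description stands in for the image from here on.
            corpus.append(Chunk(number, "figure", describe_image(image)))
    return corpus


def placeholder_describe_image(image: bytes) -> str:
    # Hypothetical stand-in: a real version would call a vision model here.
    return f"[description of a {len(image)}-byte image]"


# Made-up pages, for illustration only.
pages = [
    {"text": "Chapter 1. Reaction kinetics.", "images": []},
    {"text": "Figure 1.1 shows the energy profile.", "images": [b"\x89PNG..."]},
]
corpus = build_corpus(pages, placeholder_describe_image)
for chunk in corpus:
    print(chunk.page, chunk.kind, chunk.text)
```

Keeping the page number on every chunk lets you trace a description back to the original figure and check it by hand, which matters since the descriptions won't always be reliable.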

Thanks!

I am probably dreaming about the future a bit, but let me explain what I meant by “learning” here.

So basically I have a couple of chemistry books, and I want to optimize certain reactions. This would usually require a person to do a PhD or read a lot of books (including the ones I have).

Or I could somehow figure out a way to use these LLMs to help me out. But RAG is not well suited here, for two reasons: 1) as you mentioned, it’s not the best, and I guess it is even less useful over huge files; 2) RAG’s performance depends on the prompt, since it only fetches the content related to the prompt (whereas I want the model to have knowledge of the entire books while thinking about an answer).
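
To make point 2 concrete, here is a toy sketch of what I mean. It is my own illustration: it uses simple word-overlap (bag-of-words cosine) similarity instead of real embeddings, and the chunks are made up. Only the chunk closest to the prompt is fetched, so the model never sees the rest.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Made-up book chunks, for illustration only.
chunks = [
    "Raising the temperature increases the reaction rate.",
    "Polar solvents stabilise charged intermediates.",
    "A catalyst lowers the activation energy of a reaction.",
]
print(retrieve("which solvents help charged intermediates?", chunks))
print(retrieve("how does a catalyst change the activation energy?", chunks))
```

Each question pulls back a different single chunk, and nothing connects the solvent fact with the catalyst fact unless the prompt happens to mention both.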

I tried collecting numerical data and fitting a random forest regressor to optimise my reaction (it works okay-ish), but I generate far too little data to train a deep network that could maybe learn the laws of chemistry from basic reaction data (that might need tons of data).

I am assuming these models do output human-like language but knowledge ingestion is probably quite inhuman and not so great.
