How to make LLMs learn a book (text + diagrams) just like we humans do?

shyamfoxface · March 4, 2024, 4:31pm

What would be the best way to make an LLM learn a 300 page book that also has some images?

Is it possible to fine-tune it so that it reads both text and images (diagrams, tables, etc.)?
Does RAG support vectorising and retrieving a PDF that has both text and images?

Alternatively, if anyone has better ideas please do share.

Diet · March 4, 2024, 5:16pm

Welcome to the community!

It depends on what you mean by “learn”. What do you ultimately want to achieve by “learning”?

Fine-tuning will probably not help you here, unless you want the model to pick up the style of the book. It’s not amazing at learning facts.

Multimodal embeddings are still lagging behind, but Google has this on Vertex AI: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings

That said, I’m not sure if embedding diagrams and tables as images is the best idea. For tables, it’s probably better to extract the rows and columns, and for diagrams and graphs to extract the function and maybe embed some hypothetical data (depending on what you intend to use it for). GPT-4-V can help with that, but it’s not super reliable.
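
As a rough illustration of the table idea, here is a minimal Python sketch that turns an already-extracted table into one self-describing line of text per row, so that each row can be embedded as text instead of as an image. The function name `table_to_row_texts`, the caption, and the sample values are all made up for illustration; getting the rows out of the PDF in the first place is a separate step that is not shown.

```python
def table_to_row_texts(caption, header, rows):
    """Turn each table row into a standalone line that repeats the column names."""
    texts = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match header {header!r}")
        # Repeating the column names keeps a row meaningful when retrieved on its own.
        pairs = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        texts.append(f"{caption} | {pairs}")
    return texts


# Placeholder values, not taken from any real book.
header = ["solvent", "temperature (C)", "yield (%)"]
rows = [["ethanol", 60, 72], ["toluene", 110, 85]]

for line in table_to_row_texts("Table 3.2 (hypothetical)", header, rows):
    print(line)
```

Each resulting line carries its own caption and column names, so a retrieved row does not depend on the rest of the table being present.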

jwatte · March 4, 2024, 6:30pm

To work with the images, you will likely want to use a vision model to describe the image, and then train on that.
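
A minimal sketch of that pipeline, assuming the pages and image files have already been extracted from the PDF. The `Page` class, `pages_to_text`, and `fake_describe` are hypothetical names; in practice `describe_image` would wrap a call to whichever vision model you use, which is deliberately left out here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    number: int
    text: str
    image_paths: List[str] = field(default_factory=list)


def pages_to_text(pages: List[Page], describe_image: Callable[[str], str]) -> List[str]:
    """Merge each page's text with a text description of every image on it."""
    docs = []
    for page in pages:
        parts = [page.text.strip()]
        for path in page.image_paths:
            # describe_image is supplied by the caller, e.g. a vision-model wrapper.
            parts.append(f"[Figure on page {page.number}: {describe_image(path)}]")
        docs.append("\n".join(p for p in parts if p))
    return docs


def fake_describe(path: str) -> str:
    """Stand-in for a real vision model so the sketch runs offline."""
    return f"placeholder description of {path}"


pages = [Page(1, "Esterification overview.", ["fig1.png"])]
print(pages_to_text(pages, fake_describe)[0])
```

The output is plain text per page, which can then be chunked, embedded, or used as training data like any other text.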

shyamfoxface · March 4, 2024, 6:58pm

Thanks!

I am probably dreaming of the future a bit, but let me explain what I meant by “learning” here.

So basically I have a couple of chemistry books, and I want to optimize certain reactions. This would usually require a person to do a PhD or read a lot of books (including the ones I have).

Or I could somehow figure out a way to use these LLMs to help me out. But RAG is not best suited here, for two reasons. 1) As you mentioned, it’s not the best, and over huge files I guess it is even less useful. 2) RAG’s performance depends on the prompt, since it only fetches the passages that relate to the prompt (whereas I want the model to have knowledge of the entire books while thinking of an answer).
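
To make point 2 concrete, here is a toy retriever in Python. It uses bag-of-words cosine similarity as a stand-in for real embeddings, and the three chunks are invented example sentences, so treat it only as an illustration of the behaviour: whatever is handed to the model is decided by the wording of the query, and the rest of the book is never seen.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Invented example chunks, not quotes from any book.
chunks = [
    "Raising the temperature usually increases the reaction rate.",
    "A catalyst lowers the activation energy of a reaction.",
    "The solvent polarity affects how charged intermediates are stabilised.",
]

print(retrieve("What does a catalyst do?", chunks))
print(retrieve("Which solvent should I use?", chunks))
```

Two different questions pull back two different chunks, and with `k=1` the other two are simply dropped from the model's context.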

I tried collecting numerical data and fitting a random forest regressor to optimise my reaction (it works okay-ish), but I can generate far too little data to train a deep network that could maybe learn the laws of chemistry from basic reaction data (that might need tons of data).

I am assuming these models do output human-like language, but their knowledge ingestion is probably quite inhuman and not so great.
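
For readers who have not seen this kind of setup, here is a minimal sketch of a random forest used as a surrogate model for reaction optimisation, assuming scikit-learn and NumPy are installed. The two input variables, the yield formula, and all numbers are synthetic placeholders invented so the example runs; they are not real measurements and not the setup described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for lab data: temperature (C) and catalyst loading (mol %).
X = rng.uniform([20.0, 0.5], [120.0, 10.0], size=(40, 2))
# Made-up yield surface plus noise, used only so the example has something to fit.
y = 90 - 0.01 * (X[:, 0] - 80) ** 2 - 1.5 * (X[:, 1] - 5) ** 2 + rng.normal(0, 2, 40)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Score a grid of candidate conditions and pick the one with the best predicted yield.
temps = np.linspace(20, 120, 21)
loads = np.linspace(0.5, 10, 20)
grid = np.array([[t, c] for t in temps for c in loads])
pred = model.predict(grid)
best = grid[np.argmax(pred)]

print(f"suggested temperature: {best[0]:.0f} C, catalyst loading: {best[1]:.1f} mol %")
```

A random forest only interpolates between conditions it has seen, so the candidate grid is kept inside the range of the training data.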
