Preparing data for embedding

Since text embedding seems to be the preferred way to build a Q&A bot: as I understand it, one is supposed to divide the knowledge base into "semantically self-contained" chunks of text.
Is there any way GPT-3 can help with this? Or does that have to be done manually?

Can you explain a bit more?

Hey, what does a standard self-contained chunk look like in your case? I had a similar issue with legal docs and had to add "container" labels before the embedded text parts to get better results when retrieving the vectors.

Exactly - it's legal and tax knowledge in text form I'm talking about.

Any shareable piece of text to see as an example?

Not really - yet. We are in the process of collecting the data, so realistically there will be something at the beginning of next year. I hoped there were some general best practices I could look into.

I ran into similar problems. I was able to improve the search significantly with two methods:

1 - Add all the relevant metadata at the beginning of the chunk. Make sure to split the labels by special separators such as “DOCUMENT TITLE: title. USER: user…”

2 - Add contextual information about relevant aspects of previous chunks at the beginning of each chunk. Sometimes there are particular entities that you want to propagate through the chunks. I used to keep a sort of logbook of important info about the document (mainly entities detected via NER) and then propagate this dictionary through the chunks as well.
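For illustration, here is a minimal Python sketch of both ideas, assuming a simple list-of-strings knowledge base. The label names, the toy `extract_entities` stand-in, and the `build_embeddable_text` helper are made up for this example, not part of any library:

```python
import re

def extract_entities(text):
    # Stand-in for a real NER step: here just capitalized words.
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def build_embeddable_text(chunk_text, metadata, entity_logbook):
    # 1) Metadata labels with clear separators ("DOCUMENT TITLE: ... USER: ...").
    header = " ".join(f"{label.upper()}: {value}." for label, value in metadata.items())
    # 2) Entities carried over from previous chunks (the "logbook").
    context = "CONTEXT ENTITIES: " + ", ".join(sorted(entity_logbook)) + "." if entity_logbook else ""
    return "\n".join(part for part in (header, context, chunk_text) if part)

chunks = ["The lessee Acme shall maintain the premises...",
          "Notwithstanding clause 4, the lessee may..."]
logbook = set()   # entities seen so far, propagated forward through the chunks
prepared = []
for i, text in enumerate(chunks):
    prepared.append(build_embeddable_text(
        text,
        {"document title": "Lease agreement 2021", "user": "legal-team", "chunk": str(i)},
        logbook,
    ))
    logbook.update(extract_entities(text))
```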

Here is the workflow I took with legal docs (from raw text or OCR):

1. Clean the text (replace repeating spaces with a single space, trim empty lines and line endings, some other basic formatting, etc.) to get a "cleaned text."
2. Pass the cleaned text to a model for paragraph formatting: the model makes sure each paragraph has no line breaks inside it; if it finds a potential title, it makes sure it is on its own line; if it finds list items, it makes sure they are on separate lines; and it repairs word breaks (e.g. "po-tential" is changed to "potential"). Basically, the code that creates prompts for the model takes a chunk of text of 2k characters, finds an appropriate position to break the chunk without breaking paragraphs, then takes the next 2k characters, and so on, so that I have a list of text chunks to format that I pass to the model. Once formatted, I get a "formatted text" with titles, list items, and paragraphs on separate lines.
3. Break the formatted text into lines and pass them one by one to another model that detects each line's purpose: header (doc header), title (section title), paragraph, list item, metadata.
4. Walk the resulting lines and join list items together with the preceding paragraph (so that I have entire lists), and join metadata into separate paragraphs.
5. Walk through the lines again to build sections: title + all following paragraphs until the next title.
6. Pass the sections to another model that detects their child/parent relations, so that I have a nested array of parent/child sections.
7. For each section that has paragraphs in it, the embedded text looks like the following:
The section title
Root header | Grand Parent Title | Parent Title | The section title
The section paragraphs.

If paragraphs are too long for embedding, I split them with the text chunk splitter from above and make separate embeddings with the same headers.
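As a rough illustration of the splitting and of the per-section embedding text, here is a sketch assuming paragraphs are separated by blank lines and a 2,000-character budget per chunk. The break heuristics and function names are my own, not the exact pipeline above:

```python
MAX_CHARS = 2000  # assumed budget per prompt/embedding chunk

def split_into_chunks(text, max_chars=MAX_CHARS):
    # Break at paragraph boundaries so no paragraph is split in the middle.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para).strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

def section_embedding_text(title, parent_titles, paragraphs):
    # parent_titles: e.g. ["Root header", "Grand Parent Title", "Parent Title"]
    breadcrumb = " | ".join(parent_titles + [title])
    return "\n".join([title, breadcrumb] + paragraphs)
```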

That sounds super useful. Let me digest that. Thanks.

This was really helpful. Thanks. The good thing is: many of those steps you can just ask GPT-3 to do for you 🙂
Now I am wondering how to set up a conversational interface so that the user could say something like "show me court decisions about tax xyz" or something similar.

I opted for fine-tuned models. I mostly used the Playground to generate/test prompts for davinci (1 to 3) to get the "instructions master prompt," then passed that with samples of data to the API endpoint to generate training samples, then checked the training samples manually to improve quality. Finally, I fine-tuned cheaper (Curie) models with that training data.

As a result, pretty much the same quality but faster and cheaper.
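A hedged sketch of that workflow with the legacy (pre-1.0) openai Python library that was current at the time; the master prompt, the sample data, and the model names are placeholders meant only to show the shape of the process:

```python
import json
import openai  # legacy (pre-1.0) interface; newer library versions differ

# Step 1: use a strong davinci model plus the "instructions master prompt"
# to turn raw data samples into candidate training examples.
master_prompt = "Summarise the legal clause below in plain language.\n\nClause:\n{sample}\n\nSummary:"
samples = ["Clause 4.2: The lessee shall...", "Clause 7.1: Either party may..."]

rows = []
for sample in samples:
    prompt = master_prompt.format(sample=sample)
    resp = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=200)
    # Step 2: review these by hand before training; sample quality drives model quality.
    rows.append({"prompt": prompt, "completion": " " + resp["choices"][0]["text"].strip()})

# Step 3: write JSONL and fine-tune a cheaper curie model on the reviewed samples.
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

uploaded = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
openai.FineTune.create(training_file=uploaded["id"], model="curie")
```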

Are you talking about "how to set up the interface" (designing the app front end) or "how to design the interaction workflow"? In my opinion, you would do better to start with the interaction workflow.

Also, I think I'll end up embedding paragraphs one by one (or more than one at a time if a "stitcher model" decides they go nicely together) with the mentioned headers. After some tests, I see much better results in retrieving the info from embedded facts. Just include several of the found facts in the prompt.
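A minimal sketch of that retrieval step, again with the legacy (pre-1.0) openai library; the embedding model name, the example facts, and the helper names are assumptions for illustration:

```python
import numpy as np
import openai  # legacy (pre-1.0) interface

EMBED_MODEL = "text-embedding-ada-002"  # assumed embedding model

def embed(texts):
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# Each "fact" is one paragraph (or a few stitched together) with its headers prepended.
facts = [
    "Tax code | Deductions | Home office\nExpenses for a home office are deductible if...",
    "Tax code | Deductions | Travel\nTravel between home and the workplace is generally...",
]
fact_vectors = embed(facts)

def top_facts(query, k=3):
    q = embed([query])[0]
    # Cosine similarity between the query and every embedded fact.
    scores = fact_vectors @ q / (np.linalg.norm(fact_vectors, axis=1) * np.linalg.norm(q))
    return [facts[i] for i in np.argsort(scores)[::-1][:k]]

# Include several of the found facts in the prompt for the completion model.
prompt = ("Background information:\n" + "\n\n".join(top_facts("Can I deduct my home office?"))
          + "\n\nUser inquiry: Can I deduct my home office?\nAnswer:")
```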

This one might be interesting https://youtu.be/wwbr0fombFs

Valid question 🙂 I mean front end in the sense of a conversational bot, not design. But how to bring the embeddings and a GPT-3 front end together to make life easier for knowledge workers.

I still can't get my head around fine-tuning. If I look at the examples, it's always question/answer pairs. But in many cases I don't have any questions. It's just unstructured knowledge in a text. No idea how to fine-tune with that.

That's a whole other subject. I'll expand on this over the weekend. But to give you some background I often use:

People often think large language models are somewhat "smart," but it is better to assume they are just very good at tricking us. We believe they take input in, "think it over," and produce "thoughtful" answers (completions). I think models are a sort of conditional reflex (on steroids, for sure): they see "this" as input, and they guess the best output would be "that." Sure, they have internal multi-step "thinking" inside, but at the level of abstraction we can operate at today, that's too detailed to be considered.

While thinking, humans operate with concepts interconnected into thoughts. And our process of thinking is a chain (or rather a tree) of thoughts that follow some rules. So if you want to make your app "think," you'll have to give it the things it needs:

  • Background knowledge (your embedded facts)
  • Initial thought (your request)

But then you need to help it extract and understand the concepts in the background facts and the initial thought (analyze the request/facts, extract concepts and their relationships, understand the query’s intent, see patterns in the knowledge, etc.) and show the whole “processing” tree/chain/logic to get the solution (just like kids).

Then based on this understanding, a model (out of many) will be able to do one step in your “thinking” process (they are good at doing one step).

But do not ask it to do the whole thing. You’ll have to create “thinking patterns” that lead to the final result you want and teach your app (a tribe of models in this case) to do one step at a time and send the “business” to the next step (sending to the next step often is just a block of code in your app).

To sum up the idea, you need to thoroughly analyze the ways (patterns) you solve those requests (by type), write down the steps, and input/output at each of those, and it will give you an idea of what your core engine should do. Then start with one pattern at a time and train models for each step with what their input/output should be (and what to do if the model fails). Many steps do not require “models”, just good old-style code.
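To make the "one step at a time" idea concrete, here is a toy Python sketch of such a chain; the step names, the keyword-based intent check, and the canned facts are placeholders, not a prescribed pattern:

```python
def detect_intent(user_input):
    # In a real app this could be a small fine-tuned classifier; here a trivial rule.
    return "find_court_decisions" if "court decision" in user_input.lower() else "general_question"

def retrieve_facts(user_input, intent):
    # E.g. an embedding search over the knowledge base; canned results here.
    return ["Tax code | Deductions | ... relevant fact 1 ...",
            "Tax code | Deductions | ... relevant fact 2 ..."]

def compose_answer_prompt(user_input, intent, facts):
    # One model call that does exactly one job: answer from the given facts only.
    return ("Task: answer using only the background information.\n"
            "Background information:\n" + "\n".join(facts) + "\n"
            f"User intent: {intent}\n"
            f"User inquiry: {user_input}\n"
            "Answer:")

def handle_request(user_input):
    intent = detect_intent(user_input)                        # step 1: model or rules
    facts = retrieve_facts(user_input, intent)                # step 2: plain code + embeddings
    return compose_answer_prompt(user_input, intent, facts)   # step 3: send to the model
```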

As for question/answer training, I would replace “question” with “request” to help better understanding. And the requests might look like this:

Task: task description.
Background information: all your needed facts.
Metadata (whatever label fits): stuff the model needs to know about the request.
Model’s previous answers to similar requests: (if fit, a couple of examples)
Previous conversation: short summary of what happened in the chat.
User inquiry: the user input (filtered and sanitized, of course)
User intent: (from one of the previous steps)
Model’s answer (or possible answers):<|endoftext|><|reply|>
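For illustration, such a request could be assembled with a small helper like the following; the field contents and the function itself are made up, and the trailing separator simply mirrors the convention shown above:

```python
def build_request(task, facts, metadata, examples, history, inquiry, intent):
    # Assemble the labelled sections into one prompt for the fine-tuned model.
    return "\n".join([
        f"Task: {task}",
        "Background information: " + " ".join(facts),
        f"Metadata: {metadata}",
        "Model's previous answers to similar requests: " + " ".join(examples),
        f"Previous conversation: {history}",
        f"User inquiry: {inquiry}",
        f"User intent: {intent}",
        # Same separator convention as in the training data above.
        "Model's answer (or possible answers):<|endoftext|><|reply|>",
    ])
```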

Model replies in the training data should start with a whitespace and (a little tip from me, not necessary but useful for quick tests) end with <|endofreply|><|endoftext|>

In API calls, use <|endoftext|> as a stop sequence and check reply completeness with a regex on <|endofreply|> at the end. Or you can check the stop reason in the API response (better in production).
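Putting the training and inference conventions together, a hedged sketch with the legacy (pre-1.0) openai library; the fine-tuned model name and the example prompt/completion are placeholders:

```python
import json
import re
import openai  # legacy (pre-1.0) interface

prompt_text = ("Task: answer the tax question from the background information.\n"
               "Background information: ...\n"
               "User inquiry: Can I deduct my home office?\n"
               "Model's answer (or possible answers):<|endoftext|><|reply|>")

# Training sample: completion starts with a space and ends with the custom markers.
row = {"prompt": prompt_text,
       "completion": " Home office expenses are deductible if ...<|endofreply|><|endoftext|>"}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")

# Inference: stop on <|endoftext|> and verify the reply looks complete.
resp = openai.Completion.create(
    model="curie:ft-my-org-2023-01-01-00-00-00",   # placeholder fine-tuned model name
    prompt=prompt_text,
    max_tokens=300,
    stop=["<|endoftext|>"],
)
reply = resp["choices"][0]["text"]
is_complete = bool(re.search(r"<\|endofreply\|>\s*$", reply))
# Alternatively, check resp["choices"][0]["finish_reason"] == "stop" in production.
```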

This is so super helpful. And I totally understand this "statistical prediction thing" GPT-3 does (since it works so well in the human domain, one might wonder if we, at the very core, do the same thing).

And if I understand you correctly:
The documentation says the JSON format for fine-tuning is always prompt -> completion.
And the prompt, if I got you right, would be this whole thing below. So in the prompt section, I don't necessarily have to provide a question but can give contextual information as you said below?

Task: task description.
Background information: all your needed facts.
Metadata (whatever label fits): stuff the model needs to know about the request.
Model’s previous answers to similar requests: (if fit, a couple of examples)
Previous conversation: short summary of what happened in the chat.
User inquiry: the user input (filtered and sanitized, of course)
User intent: (from one of the previous steps)
Model’s answer (or possible answers):<|endoftext|><|reply|>

Basically yes. Just remember that it might take a lot of steps beforehand to get the items that compose your prompt, as well as a lot of steps to interpret (reformat, fact-check, reference-check, moderate, etc.) the result. Same as human thinking/problem solving.
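As a small example of those interpretation steps, here is a sketch of a post-processing pass; the citation regex and the known-references check are illustrative, and the moderation call uses the legacy (pre-1.0) openai library:

```python
import re
import openai  # legacy (pre-1.0) interface

def post_process(reply, known_references):
    # Reference check: every cited section must exist in the knowledge base.
    cited = re.findall(r"§\s*\d+[a-z]?", reply)   # illustrative citation pattern
    missing = [c for c in cited if c not in known_references]
    # Moderation: flag problematic content before showing it to the user.
    flagged = openai.Moderation.create(input=reply)["results"][0]["flagged"]
    return {"reply": reply.strip(), "missing_references": missing, "flagged": flagged}
```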

Thanks. Really appreciated

You are welcome. Let me know how it goes. Always useful to see others' pitfalls ;)
