Preparing data for embedding

sergeliatko · December 17, 2022, 9:07am

Personally, I think that if GPT-3 operated directly on embedded vectors of words or word combinations instead of tokens based on word parts, and did so not linearly but in a multi-dimensional space, then it would definitely be closer to how human thoughts form and flow. But no one really knows until we test it.

heiko · December 17, 2022, 10:27am

That's an interesting thought. Makes total sense to me.

sergeliatko · December 17, 2022, 7:47pm

Yes, a sort of “instant flash” of concept associations in the form of a tree with branches in all directions (vectors of simple concepts, words, or word combinations). The model then selects among the flashed vectors (using weights) to construct a “selected” path/root and transforms those vectors into a final thought (linear this time), which gets decoded into language… I think it’s doable at the stage where we are technically. It’s not my primary domain, so I cannot do it myself. But a serious team could definitely try the concept (c) and build something that would work like this. Maybe no one could put this idea into words until now? That would be a beast.

sergeliatko · December 17, 2022, 8:02pm

I have even a name for that:

FLASH

Finsler’s Large Associations Structure Hypothesizer

(c) Serge Liatko 2022

LoL

heiko · December 17, 2022, 8:04pm

sounds great to me :–))

sergeliatko · January 10, 2023, 9:57pm

Diffusion language models – Sander Dieleman: the same idea in more scientific language…

denis.beliauski · May 29, 2023, 10:30pm

@sergeliatko your advice on this topic is super helpful, Serge. Thanks!

Could you maybe share a short sample of unformatted (plain text) data and the same data after the formatting rules have been applied to it?

It would be great for me to see the before/after states to make sure I’m doing it correctly.
Thanks in advance!

sergeliatko · May 29, 2023, 10:43pm

Zyve Belarús! (Long live Belarus!)

So to be clear:

Before having GPT format the text, I clean it (in PHP in my case, but you can apply the same logic in any language of your choice). Then I “chop” the text string into pieces to send for formatting.

Here is the PHP I use:

sergeliatko · May 29, 2023, 10:47pm

Here is the gist with the PHP code I use to pre-format/cut raw text into pieces to send to the API: Php code to preformat raw text and cut it into chunks before sending to OpenAI API · GitHub
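
For readers who are not using PHP, the clean-then-chop step described above can be sketched in Python. This is not the code from the gist: the cleaning rules, the function names `clean_text` and `chop_text`, and the `max_chars` limit are all assumptions chosen for illustration, and a real pipeline would probably size chunks by tokens rather than characters.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in raw extracted text (assumed cleaning rules)."""
    # Unify Windows and old Mac line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit right before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more line breaks into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chop_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at line ends."""
    chunks: list[str] = []
    current = ""
    for line in text.split("\n"):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > max_chars:
            # Adding this line would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk returned by `chop_text(clean_text(raw))` can then be sent for formatting on its own. Breaking at line ends rather than at a fixed offset keeps words intact, at the cost of slightly uneven chunk sizes.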

denis.beliauski · May 30, 2023, 12:25am

@sergeliatko
Zyve Vechna! (May it live forever!)
Thanks again, Serge!

sergeliatko · May 30, 2023, 5:29am

Now, here is what the formatting does.

Input (copy-pasted from a PDF; many OCR tools give you text back like this):

Software Development Agreement

This Agreement made at __________ this ______ day of __________

BETWEEN

M/S A.B.C. & Co. Ltd. a Public Company limited by shares registered

under the Companies Act, 1956 and having its registered office at …

… hereianafter referred to as ‘the Customer’ of the

One Part;

AND

M/S XYZ & Co. Ltd. a Public Company limited by shares registered

under the Companies Act 1956, and having its registered office at …

… hereinafter referred to as ‘The Contractor’ of the

Other Part;

WHEREAS (a) the Contractor has agreed to write certain Computer program for

the customer and to provide other services hereinafter mentioned

upon the terms and conditions hereinafter mentioned.

NOW IT IS AGREED BETWEEN THE PARTIES AS FOLLOWS: (1) The Contractor agrees to design and write the software in the

programming language as per the specifications given in the Schedule

hereunder written and to deliver and install the software on the computer

hardware equipment of the Customer and which is installed at the office

viz.________ (hereinafter referred to as the Computer equipment). The

Contractor shall also supply the Documentation alongwith the software

and to render such other services as hereinafter mentioned.

(2) The Contractor admits that it has received all information about

the specifications above referred to and the said Computer Equipment

and the Contractor will not be entitled to raising any objection on the

ground of misinterpretation of any fact relating to the functions, facilities

and capabilities of the equipment or any part thereof. (3) The Contractor agrees and undertakes to carry out the project

that is the Deliverables

to be supplied and all services to be provided by

the Contractor under this Agreement with reasonable care and skill with

the help of qualified and experienced personnel.

(4) Similarly the Customer will provide the Contractor with accurate

and complete information as will be necessary concerning the customers

operation and shall answer to queries, decisions and approvals reasonably

necessary for the Contractor to undertake the project.

(5) The Customer shall provide the Contractor and its personnel free

access to and the use of the said computer equipment, offer full help of

the Customer’s employees and shall provide electric power, lights, heating

and air conditioning facility at the site of the Computer Equipment and

all other necessary facilities reasonably required for the working of the

Project.

(6) The Customer guarantees that the computer equipment and all

the software and other the deliverables, used in the said computer equip

ment is the property of the Customer and the Contractor is permitted to

use the same.

(7) Each party hereto will appoint its qualified representatives who

will hold meetings between them from time to time to provide progress

reports and to discuss the same and other issues.

(8) The bespoke

3 software will be written with the use of the said

Computer equipment which will be maintained by the Customer and at

its own costs but solely for the purpose of the Project. Protected Information. (9) The parties agree…

sergeliatko · May 30, 2023, 5:33am

Output

Software Development Agreement
This Agreement made at ___ this ___ day of ___
BETWEEN
M/S A.B.C. & Co. Ltd., a Public Company limited by shares registered under the Companies Act, 1956 and having its registered office at … hereinafter referred to as ‘the Customer’ of the One Part;
AND
M/S XYZ & Co. Ltd., a Public Company limited by shares registered under the Companies Act 1956, and having its registered office at … hereinafter referred to as ‘The Contractor’ of the Other Part;
WHEREAS
(a) the Contractor has agreed to write certain Computer program for the customer and to provide other services hereinafter mentioned upon the terms and conditions hereinafter mentioned.
NOW IT IS AGREED BETWEEN THE PARTIES AS FOLLOWS:
(1) The Contractor agrees to design and write the software in the programming language as per the specifications given in the Schedule hereunder written and to deliver and install the software on the computer hardware equipment of the Customer and which is installed at the office viz.________ (hereinafter referred to as the Computer equipment). The Contractor shall also supply the Documentation along with the software and to render such other services as hereinafter mentioned.
(2) The Contractor admits that it has received all information about the specifications above referred to and the said Computer Equipment and the Contractor will not be entitled to raising any objection on the ground of misinterpretation of any fact relating to the functions, facilities and capabilities of the equipment or any part thereof.
(3) The Contractor agrees and undertakes to carry out the project that is the Deliverables to be supplied and all services to be provided by the Contractor under this Agreement with reasonable care and skill with the help of qualified and experienced personnel.
(4) Similarly the Customer will provide the Contractor with accurate and complete information as will be necessary concerning the customers operation and shall answer to queries, decisions and approvals reasonably necessary for the Contractor to undertake the project.
(5) The Customer shall provide the Contractor and its personnel free access to and the use of the said computer equipment, offer full help of the Customer’s employees and shall provide electric power, lights, heating and air conditioning facility at the site of the Computer Equipment and all other necessary facilities reasonably required for the working of the Project.
(6) The Customer guarantees that the computer equipment and all the software and other deliverables used in the said computer equipment is the property of the Customer and the Contractor is permitted to use the same.
(7) Each party hereto will appoint its qualified representatives who will hold meetings between them from time to time to provide progress reports and to discuss the same and other issues.
(8) The bespoke 3 software will be written with the use of the said Computer equipment which will be maintained by the Customer and at its own costs but solely for the purpose of the Project.
Protected Information.
(9) The parties agree…

See how the broken lines became whole paragraphs? Notice that “Protected Information.” is on a separate line and that each paragraph/clause is separated from the others by a line break. The text is much easier to work with now.
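
To make the transformation concrete, here is a rough rule-based approximation in Python of the line merging shown in the input/output pair. In this thread the formatting itself is done by GPT, so this is only a sketch: the function name `merge_broken_lines` and both regular expressions are assumptions, and the heuristic only covers clause markers such as "(1)" or "(a)" and all-caps keyword lines such as "BETWEEN".

```python
import re

# Assumed markers for the start of a new paragraph: a clause label such as
# "(1)" or "(a)", or a line made only of capitals, e.g. "BETWEEN" or "AND".
CLAUSE_START = re.compile(r"^\((?:\d+|[a-z])\)\s")
ALL_CAPS_LINE = re.compile(r"^[A-Z][A-Z .:]*$")


def merge_broken_lines(text: str) -> str:
    """Join OCR-broken lines into one line per paragraph or clause."""
    paragraphs: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines in this kind of OCR output carry no structure
        starts_new = (
            not paragraphs
            or CLAUSE_START.match(line)
            or ALL_CAPS_LINE.match(line)
            or ALL_CAPS_LINE.match(paragraphs[-1])
        )
        if starts_new:
            paragraphs.append(line)
        else:
            # Continuation of the previous paragraph: glue it on with a space.
            paragraphs[-1] += " " + line
    return "\n".join(paragraphs)
```

A heuristic like this misses several things visible in the sample: the title line is not recognized as a heading, clause markers that appear mid-line (such as "(3)" in the input) are not split off, and words broken across lines ("equip" / "ment") are joined with a stray space.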

denis.beliauski · May 30, 2023, 8:17pm

Cool, that looks great. Thanks again Serge!
Not sure if LangChain (which we use) does this automatically, so I will also check (it probably does some data “massaging”), but anyway, it’s helpful to see the input/output.
