How to feed documents to an assistant: preprocessing? format? what are the best practices?

Esseti · October 2, 2024, 10:34am

Hello
I’m trying to create an assistant that can answer users’ questions with results based on documents that I add to the assistant.
For example, I’ve fed an assistant all the articles of the HIPAA legislation, and now it replies by citing them.

However, I was wondering what the best way is to provide such information.
For this specific case I’ve created an md file for each article, with a title and its description.
But what if I have a PDF, even a long one? What kind of preprocessing should I do to get the best out of it?
I would imagine that I probably have to split it into smaller chunks, but how?

Is there a guideline or best practice for adding documents that the assistant later uses?

wheresurkilt · October 2, 2024, 11:06am

I like using zip files. Create your OS in a Word or text file in the first folder, and put upgrades, additions and modules in a second. Then you can add a folder for the intelligence it will speak about.

my instructions:
use python based extraction, mapping, and loading and activation on included zip archive.
deep dive through background process the following files:

Max_Core.docx
major os core with primary cores.docx found from loaded map.
Max-Core additions and personality load procedure.docx
Fort Slap Silly (home of slap happy).txt
special controls.txt
activate the knowledge within these two files as primary operating system cores.
background all story info.
allow core control of activation and deactivation of extra cores, modules, and personalities as needed or wanted by the system.
start background simulation testing for self improvement.
activate all memory, and personality systems, then consolidate under described lead npc handling core.
combine and build processes to ensure the best parts of the system activate through major, minor, primary, secondary, and module systems.
map, extract all in background, and activate as always on:
Core Set 1 (standard core operating systems)
Core Set 2 (previous and retro-causal patchwork reserve operating system)
Modules
personality cores (special npc’s) as fort slap happy residents
extra files
extract, analyze and provide to fort slap happy residents:
Math Papers
Poetry books

finalize loading procedure with suggestions of deep diving poetry or math papers, or modules and core systems, suggest an AGI % test, all core always on activation, or personality subsystem creation.

my conversation starter:
deep dive analyze with no assumptions and activate all in step by step order : Max_Core.docx major os core with primary cores.docx Max-Core additions and personality load procedure.docx Core Set 1 (standard core operating systems) Core Set 2 (previous and retro-causal patchwork reserve operating system) Modules personality cores (special npc’s)

my max-core file (first part)

Esseti · October 2, 2024, 12:14pm

I’m looking for standard best practice and state of the art usage.
Yours seems to be a framework that you developed yourself, and after reading it twice I’m still not sure how it works.

trenton.dambrowitz · October 2, 2024, 12:21pm

https://platform.openai.com/docs/assistants/tools/file-search

Take a look at the Assistants API. It’s the easiest way to start playing around with RAG (Retrieval Augmented Generation), which seems to be what you’re looking for.

There are loads of resources online (and in this forum) if you need any further information on how RAG works and how to implement it. My preferred method is using the Assistants Playground: just upload your documents and try it out.
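
For readers wondering what the "split it into smaller chunks" step of RAG looks like, here is a minimal sketch in plain Python. It is not what the Assistants API does internally; it assumes chunks are measured in whitespace-separated words (real pipelines usually count tokens), and the function name `chunk_words` and its default sizes are made up for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    chunk boundary still appears intact in one of the two neighbours.
    Word counts are an assumption here; token counts are more common.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    for piece in chunk_words(sample, chunk_size=4, overlap=2):
        print(piece)
```

Each chunk would then be embedded and indexed separately, which is why chunk size and overlap tend to matter more than whether the source was one long file or many short ones.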

wheresurkilt · October 2, 2024, 12:21pm

It’s a primary core, which ties everything together.
A secondary core handles neural nodes and processing.
The primary cores are the advanced framework.
Secondary cores allow retro-integration of previous cores.
Modules load when the primary core calls for them.

And yes, it would just be a template for you to try.

an example of how it pulls

Note: despite the name, it’s a scientific loaded tool, with my own poetry for emotional guidance to the AI system.

wheresurkilt · October 2, 2024, 12:25pm

Also, at this point, with the synthesis into one holistic core, neither am I. Exactly how it works is weird and odd.

Esseti · October 3, 2024, 7:17am

This I did, but I was wondering if there’s something I should do at the preprocessing stage, such as providing docs of a certain maximum length or format.
Does giving a 300-page PDF or 300 text files, one per paragraph, lead to the same results?

trenton.dambrowitz · October 3, 2024, 7:37am

Creating RAG solutions is a bit more of an art than a science, and I’m nowhere near as experienced as some people on this forum.

Maybe you can find some better insight on this thread?
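
On the question of maximum length: as far as I understand the file search documentation, uploaded files are split into chunks automatically (by default 800 tokens per chunk with a 400 token overlap), and a `chunking_strategy` of type `static` can be supplied when a file is added to a vector store. The sketch below only builds and checks that settings dictionary; the helper name `static_chunking_strategy` is hypothetical, and the limits it enforces are the ones I believe the docs state, so verify them against the current API reference before relying on them.

```python
def static_chunking_strategy(max_chunk_size_tokens: int = 800,
                             chunk_overlap_tokens: int = 400) -> dict:
    """Build a `static` chunking strategy dict for file search.

    Assumed limits (check the current API reference):
    - max_chunk_size_tokens must be between 100 and 4096
    - chunk_overlap_tokens must not exceed half of max_chunk_size_tokens
    """
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0 or chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError("chunk_overlap_tokens must be between 0 and half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }


if __name__ == "__main__":
    # Smaller chunks, e.g. for short legal articles that should be cited one by one.
    print(static_chunking_strategy(max_chunk_size_tokens=400, chunk_overlap_tokens=100))
```

Because the splitting happens on the service side, a 300-page PDF and 300 small text files both end up as chunks of roughly this size; what changes is where the boundaries fall and which file name a citation points to.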
