Best practices for pre-processing data for a chatbot

I’ve developed a chatbot with GPT-3.5-turbo to answer questions about our technical manual.
The results are quite good (sometimes very impressive), but I have some doubts about how best to process the data.

I’ve divided our technical manual (.docx) into sections using python-docx:

  • every title starts a new section
  • if a section exceeds a token limit (e.g. 500), I split it and add a subtitle
  • every section has a title in this form: MAIN TITLE | subtitle 1 | subtitle 2 | … | paragraph (if split) | \n
  • all tables are converted to JSON, and I use tags to notify GPT of the presence of a table: <start-table> json here <end-table>
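The sectioning steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual pipeline: the input is taken to be a list of `(level, text)` pairs (level 0 for body text, 1..n for heading depth), which with python-docx could be derived from each paragraph's style name; `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of real model tokens; and the `part N` subtitle is an invented label for split sections.

```python
from typing import List, Tuple

TOKEN_LIMIT = 500  # the example limit mentioned in the list above


def count_tokens(text: str) -> int:
    # Assumption: a word count stands in for a real tokenizer.
    return len(text.split())


def build_sections(blocks: List[Tuple[int, str]], limit: int = TOKEN_LIMIT) -> List[str]:
    """Turn (level, text) blocks into titled sections.

    level 0 is a body paragraph; level 1..n is a heading of that depth.
    """
    sections: List[str] = []
    path: List[str] = []  # current heading breadcrumb
    body: List[str] = []  # paragraphs collected under the current heading

    def flush() -> None:
        if not body:
            return
        # Greedily pack paragraphs into chunks that stay under the limit.
        # A single paragraph longer than the limit is kept whole.
        chunks: List[List[str]] = []
        current: List[str] = []
        for para in body:
            if current and count_tokens(" ".join(current + [para])) > limit:
                chunks.append(current)
                current = []
            current.append(para)
        chunks.append(current)
        for i, chunk in enumerate(chunks, start=1):
            parts = list(path)
            if len(chunks) > 1:
                parts.append(f"part {i}")  # hypothetical subtitle for split sections
            title = " | ".join(parts) + " | \n"
            sections.append(title + "\n".join(chunk))
        body.clear()

    for level, text in blocks:
        if level == 0:
            body.append(text)
        else:
            flush()                # a new title closes the previous section
            del path[level - 1:]   # drop headings at the same or deeper level
            path.append(text)
    flush()
    return sections


if __name__ == "__main__":
    demo = [(1, "Manual"), (2, "GET"), (0, "a b c"), (0, "d e f"), (2, "PUT"), (0, "x")]
    for section in build_sections(demo, limit=4):
        print(repr(section))
```

With the tiny limit of 4 used in the demo, the "GET" section is split into two parts and the "PUT" section stays whole.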

Here is an example of a section (translated from Italian):

 Commands and data exchange | GET method | Query String | next | 
You can specify the record number to start from if a next value was returned by a previous GET method.
GET	https://base_address/webapi/risorse/articoli?fields=codice,descrizione&next=0F45AD69

Here is an example of a section with a table (translated from Italian):

 Commands and data exchange | Summary of RESTful commands | 
<-- Table Start --> {"Header": [["", "HTTP method", "PATH", "Query String", "HEADER", "BODY", "Success http Code"], ["Creation", "POST", "https://base_address/webapi/<nome_risorsa>", "YES", "Authorization\nContent-Type\nCoordinate-Gestionali", "{\n<dati>\n}", "201 (Created)"], ["Read", "GET", "https://base_address/webapi/<nome_risorsa>\nhttps://base_address/webapi/<nome_risorsa>/<codice>", "fields\nmax\nnext\ninfo", "Authorization\nContent-Type\nCoordinate-Gestionali", "{}", "200 (OK)"], ["Update", "PUT", "https://base_address/webapi/<nome_risorsa>/<codice>", "YES", "Authorization\nContent-Type\nCoordinate-Gestionali", "{\n<dati>\n}", "204 (No Content)"], ["Delete", "DELETE", "https://base_address/webapi/<nome_risorsa>/<codice>", "NO", "Authorization\nContent-Type\nCoordinate-Gestionali", "{}", "204 (No Content)"], ["Search", "POST", "https://base_address/webapi/<nome_risorsa>", "fields\nmax\nnext", "Authorization\nContent-Type\nCoordinate-Gestionali", "filtri:[\n{\u2026},\n{\u2026}\n]", "200 (OK)"]]} <-- Table End -->
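For reference, a tagged table like the one above could be produced along these lines. This is a minimal sketch under assumptions: the function names are hypothetical, the rows are taken to be a list of lists of cell strings (with python-docx they could be built as `[[cell.text for cell in row.cells] for row in table.rows]`), and all rows are stored under a single "Header" key only because the example does so.

```python
import json
from typing import List

# Default markers taken from the description of the approach; the real
# pipeline may use different ones, so they are plain parameters here.
START_TAG = "<start-table>"
END_TAG = "<end-table>"


def table_to_tagged_json(rows: List[List[str]],
                         start: str = START_TAG,
                         end: str = END_TAG) -> str:
    """Serialize table rows as JSON and wrap them in marker tags."""
    # json.dumps escapes non-ASCII characters by default ("…" -> "\u2026").
    payload = json.dumps({"Header": rows})
    return f"{start} {payload} {end}"


def tagged_json_to_table(text: str,
                         start: str = START_TAG,
                         end: str = END_TAG) -> List[List[str]]:
    """Recover the rows from a tagged string (handy for sanity checks)."""
    begin = text.index(start) + len(start)
    finish = text.index(end, begin)
    return json.loads(text[begin:finish])["Header"]


if __name__ == "__main__":
    rows = [["", "Metodo http", "PATH"],
            ["Lettura", "GET", "https://base_address/webapi/<nome_risorsa>"]]
    tagged = table_to_tagged_json(rows)
    print(tagged)
    assert tagged_json_to_table(tagged) == rows
```

The round trip through `tagged_json_to_table` is only a quick way to check that the wrapped JSON stays parseable before it goes into a section.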

With this approach, I’m unsure how to handle lists (e.g. bulleted lists) and very short subsections.
This is an example:

 HTTP request format | Local installations | PUT | 
https://miodominio:9004/webapi/risorse/clienti/codice_cliente (updates the customer: data in the request body)

So the questions are:

  • first of all, do you think this is the right way to process the data?
  • how should I handle lists or short sections?
  • more generally, what are the best practices?

Thank you in advance :wink: