Creating a chatbot referencing more than 30GB

Hello everyone!
I’m going to develop a chatbot based on my own database whose size is more than 30GB.

And my database has lots of columns like

field_labels = {
“vid_vin”: “Vehicle Identification Number”,
“date_min”: “Started date”,
“date_max”: “End date”,
“days_seen”: “So, days seen”,
“listing_stock”: “Stock”,
“listing_price”: “Price”,
“listing_type”: “Used/New”,
“listing_mileage”: “Mileage”,
“vehicle_year”: “Year”,
“vehicle_make”: “Made Company”,
“vehicle_model”: “Model”,
“vehicle_trim”: “Trim”,
“vehicle_style”: “Style”,
“vehicle_color_exterior”: “Color Exterior”,
“vehicle_color_interior”: “Color Interior”,
“va_seller_id”: “Seller id”,
“va_seller_name”: “Seller name”,
“va_seller_address”: “Seller address”,
“va_seller_city”: “Seller City”,
“va_seller_state”: “Seller State”,
“va_seller_zip”: “Seller Zip”,
“va_seller_county”: “Seller County”,
“va_seller_country”: “Seller Country”,
“va_seller_websites”: “Seller Websites”,
“va_seller_phones”: “Seller Phones”,
“va_seller_type”: “Seller Type”,
“va_seller_makes”: “Seller makes”,
“va_seller_inventory_count_total”: “Seller inventory count total”,
“va_seller_inventory_count_new”: “Seller inventory count new”,
“va_seller_inventory_count_used”: “Seller inventory count used”,
“vdp_url”: “Vehicle Detail Page”,
“ymmt_id”: “Year, Make, Model, and Trim (YMMT) identifier”,
“listing_description”: “Vehicle Description”,
“listing_features”: “Vehicle Features”,
“vehicle_title”: “Vehicle Title”,
“vehicle_subtitle”: “Vehicle Subtitle”,
“vehicle_type”: “Vehicle Type”,
“vehicle_truck_cab_style”: “Vehicle Truck Cab Style”,
“vehicle_truck_bed_style”: “Vehicle Truck Bed Style”,
“vehicle_engine”: “Vehicle Engine”,
“vehicle_engine_size”: “Engine size”,
“vehicle_engine_cylinders”: “Cylinzers”,
“vehicle_transmission”: “Vehicle Transmission”,
“vehicle_transmission_type”: “Transmission Type”,
“vehicle_transmission_speed”: “Transmission Speed”,
“vehicle_drivetrain”: “Vehicle Drivetrain”,
“vehicle_doors”: “Doors”,
“vehicle_fuel_type”: “Fuel Type”,
“vehicle_fuel_efficiency”: “Fuel Efficiency”,
“vehicle_fuel_efficiency_highway”: “Fuel Efficiency highway”,
“vehicle_fuel_efficiency_city”: “Fuel Efficiency city”,
“vehicle_history_description”: “History Description”,
“vehicle_history_critical_count”: “History critical count”,
“vehicle_history_accident_count”: “History accident count”,
“vehicle_history_theft_count”: “History theft count”,
“vehicle_history_salvage_count”: “History salvage count”,
“vehicle_history_service_count”: “History Service count”,
“vehicle_history_owner_count”: “History owner count”,
“portal_urls”: “Portal Urls”,
“portal_deal_ratings”: “Portal Deal Ratings”,
“portal_days_online”: “Portal Days Online”,
“portal_prices”: “Portal Prices”,
“portal_titles”: “Portal Titles”,
“vdp_url_last_crawled”: “Vehicle Detail Page Url Last Crawled”,
“vehicle_history_report_urls”: “Vehicle history report urls”,
“va_seller_portal_websites”: “Seller portal websites”,
“va_seller_portal_urls”: “Seller portal Urls”,
}
As you can see here, it has more than 40 columns. And tons of products.
I’m going to develop a chatbot that referencing the database, and find the most relevant product or suitable products in order to help the customers.

I’ve already built a chatbot with openai embedding, but the accuracy is not good enough.
I’d really appreciate if anyone help me to develop this chatbot.

Welcome to the forum.

What, specifically, have you tried so far? What did you embed as vectors? (ie how much in each one…)

Thanks for your reply.
Actually, I used the postgresql, and I got the data with this function :

def text_from_row(row):
field_labels = {
“vid_vin”: “Vehicle Identification Number”,
“date_min”: “Started date”,
“date_max”: “End date”,
“days_seen”: “So, days seen”,
“listing_stock”: “Stock”,
“listing_price”: “Price”,
“listing_type”: “Used/New”,
“listing_mileage”: “Mileage”,
“vehicle_year”: “Year”,
“vehicle_make”: “Made Company”,
“vehicle_model”: “Model”,
“vehicle_trim”: “Trim”,
“vehicle_style”: “Style”,
“vehicle_color_exterior”: “Color Exterior”,
“vehicle_color_interior”: “Color Interior”,
“va_seller_id”: “Seller id”,
“va_seller_name”: “Seller name”,
“va_seller_address”: “Seller address”,
“va_seller_city”: “Seller City”,
“va_seller_state”: “Seller State”,
“va_seller_zip”: “Seller Zip”,
“va_seller_county”: “Seller County”,
“va_seller_country”: “Seller Country”,
“va_seller_websites”: “Seller Websites”,
“va_seller_phones”: “Seller Phones”,
“va_seller_type”: “Seller Type”,
“va_seller_makes”: “Seller makes”,
“va_seller_inventory_count_total”: “Seller inventory count total”,
“va_seller_inventory_count_new”: “Seller inventory count new”,
“va_seller_inventory_count_used”: “Seller inventory count used”,
“vdp_url”: “Vehicle Detail Page”,
“ymmt_id”: “Year, Make, Model, and Trim (YMMT) identifier”,
“listing_description”: “Vehicle Description”,
“listing_features”: “Vehicle Features”,
“vehicle_title”: “Vehicle Title”,
“vehicle_subtitle”: “Vehicle Subtitle”,
“vehicle_type”: “Vehicle Type”,
“vehicle_truck_cab_style”: “Vehicle Truck Cab Style”,
“vehicle_truck_bed_style”: “Vehicle Truck Bed Style”,
“vehicle_engine”: “Vehicle Engine”,
“vehicle_engine_size”: “Engine size”,
“vehicle_engine_cylinders”: “Cylinzers”,
“vehicle_transmission”: “Vehicle Transmission”,
“vehicle_transmission_type”: “Transmission Type”,
“vehicle_transmission_speed”: “Transmission Speed”,
“vehicle_drivetrain”: “Vehicle Drivetrain”,
“vehicle_doors”: “Doors”,
“vehicle_fuel_type”: “Fuel Type”,
“vehicle_fuel_efficiency”: “Fuel Efficiency”,
“vehicle_fuel_efficiency_highway”: “Fuel Efficiency highway”,
“vehicle_fuel_efficiency_city”: “Fuel Efficiency city”,
“vehicle_history_description”: “History Description”,
“vehicle_history_critical_count”: “History critical count”,
“vehicle_history_accident_count”: “History accident count”,
“vehicle_history_theft_count”: “History theft count”,
“vehicle_history_salvage_count”: “History salvage count”,
“vehicle_history_service_count”: “History Service count”,
“vehicle_history_owner_count”: “History owner count”,
“portal_urls”: “Portal Urls”,
“portal_deal_ratings”: “Portal Deal Ratings”,
“portal_days_online”: “Portal Days Online”,
“portal_prices”: “Portal Prices”,
“portal_titles”: “Portal Titles”,
“vdp_url_last_crawled”: “Vehicle Detail Page Url Last Crawled”,
“vehicle_history_report_urls”: “Vehicle history report urls”,
“va_seller_portal_websites”: “Seller portal websites”,
“va_seller_portal_urls”: “Seller portal Urls”,
}
formatted_data = “”
for field_name, label in field_labels.items():
if field_name in row:
formatted_data += f"{label}: {row[field_name]}\n"

return formatted_data

def get_embedding(text):
try:
response = openai.Embedding.create(
model=“text-embedding-ada-002”, input=text.replace(“\n”, " “)
)
embedding = response[“data”][0][“embedding”]
embedding_array = np.array(embedding)
return embedding_array
except Exception as e:
print(f"An error occurred: {str(e)}”)
return None

And this is the function that gets the embedding from the formatted_data.
I used pgvector.
That’s all.
What I want is if the customer ask “Recommend the most expensive white car.”
In that case, I want the chatbot to reply the most expensive white car.

Vehicle Identification Number: KM8J3CA27HU574933
Started date: 2023-08-27
End date: 2023-09-18
So, days seen: 23
Stock: G23094C
Price: 19898
Used/New: Used
Mileage: 66595
Year: 2017
Made Company: Hyundai
Model: Tucson
Trim: Value
Style: SUV
Color Exterior: White
Color Interior: Gray
Seller id: mansfield-buick-gmc_mansfield_oh
Seller name: Mansfield Buick Gmc
Seller address: 1400 Park Ave W
Seller City: Mansfield
Seller State: OH
Seller Zip: 44903
Seller County: Richland County
Seller Country: USA
Seller Websites: https://www.mansfieldbuickgmc.com
Seller Phones: 419-529-3211
Seller Type: Franchise
Seller makes: Buick,GMC
Seller inventory count total: 116
Seller inventory count new: 46
Seller inventory count used: 70
Vehicle Detail Page: 2017 Hyundai Tucson Value Mansfield OH | Mansfield Buick GMC KM8J3CA27HU574933
Year, Make, Model, and Trim (YMMT) identifier: 2017_hyundai_tucson_se
Vehicle Description:
Vehicle Features:
Vehicle Title: 2017 Hyundai Tucson
Vehicle Subtitle:
Vehicle Type: SUV
Vehicle Truck Cab Style:
Vehicle Truck Bed Style:
Vehicle Engine: 1.6L I4 16V GDI DOHC Turbo
Engine size: 1.6
Cylinzers: I4
Vehicle Transmission: 7-Speed Automatic with Automatic-Shift
Transmission Type: Automatic
Transmission Speed: 7
Vehicle Drivetrain: All-wheel Drive
Doors:
Fuel Type: Gasoline Fuel
Fuel Efficiency:
Fuel Efficiency highway:
Fuel Efficiency city:
History Description:
History critical count: 0
History accident count: 0
History theft count: 0
History salvage count: 0
History Service count: 0
History owner count: 0
Portal Urls: Autotrader - page unavailable
Portal Deal Ratings: cars.com:GOOD,edmunds.com:FAIR
Portal Days Online:
Portal Prices: cars.com:19898,carsforsale.com:19898,edmunds.com:19898,autotrader.com:19898
Portal Titles: cars.com:2017 Hyundai Tucson Value,carsforsale.com:2017 Hyundai Tucson,edmunds.com:2017 Hyundai Tucson,autotrader.com:Used 2017 Hyundai Tucson Value w/ Cargo Package
Vehicle Detail Page Url Last Crawled: 1695009600
Vehicle history report urls: https://www.carfax.com/VehicleHistory/ar20/dC5F4h8tbFWHuAIKl_wKB17l2if2yfIGzMLEhAufbJHQcfGRpKF9p1sbfBR7JFDfQZzYjOnqCZP5KEoXqYENlxlVQx7XojttHAU
Seller portal websites: cars.com:mansfieldbuickgmc.com;edmunds.com:mansfieldbuickgmc.com;autotrader.com:mansfieldbuickgmc.com
Seller portal Urls: Autotrader - page unavailable

This is the style of one product.

Seems like SQL would be better to handle that… then send the data to the LLM API to get a human-like response using the data you got back from the SQL query…

There’s too much “noise” in this dump to be vectorized correctly… Stick with SQL for the query then format an answer with the results…

1 Like

Thanks for your help.
What I feel concerned is the quality of SQL query.
Can you recommend any framework or a method to do this quite well?

I know how to generate sql query with openai.

But can you tell me the quality of it?
Or the method to enhance the quality.
Or other framework is OK.

I’m not sure how easy it would be to go from plain English request to SQL, but I believe there’s a few posts about it here on the forum. You might try with a two-shot example … even if it’s close, you might be able to massage it with regex…

Look into functions (relatively new) which can be called now…

Thanks very much.

Do you know about ln2sql?
I read an article about it.
Can you tell me whether it’s possible to use ln2sql to generate sql query with combining openai.

I’ve not tried it, but looks kinda old… Might give it a try?

Setting up a good system prompt with a few examples might do better… or searching forum as I’m sure a few people have tried natural language to SQL before.

I’m trying to show you how to fish, but @_j might show up with a pile of fish on a silver platter given time! :wink:

You’re on the right path… I’d start experimenting some after searching for previous attempts by others… We’ve got a great resource on this forum, but it can take a bit of time to search around and learn from the mistakes and successes of others.

Good luck!

ETA: Here’s a few…

https://community.openai.com/search?q=natural%20language%20sql

I’ve not needed that functionality, so I haven’t followed closely. The path has been traveled before, though!

2 Likes

Thanks for your kind help.

Can you recommend a book or a website for generating SQL query with NLP?

I’ve read many articles related to this.

This is helpful, but does it generate queries for about 40 columns correctly?

What I mean, is GPT is powerful for complex queries?

Obviously functions are needed here. They could be direct access to SQL, but the AI could easily overwhelm itself with data by doing dumb queries.

I would identify the top tasks, and then make them into more general and reusable SQL tasks, and then figure out how to approach them as “top 10 results” or “best of” results, or booleans, so you can offer functions that have the breadth of the database but limited returns only for criteria.

Could AI write queries? Not with imagination, I ask for more general purpose queries and types of functions they could power, and just basically get my prompt repeated back to me, but in AI language:

Sure, here are my recommendations:

  1. SQL Interface Function:
    This function will directly interface with your SQL database and execute the query provided as a parameter. It will return the result of the query.
def execute_sql_query(query: str):
    # Connect to your database
    # Execute the query
    # Fetch the result
    # Return the result
  1. Task-based Subfunctions:
    These functions will translate high-level tasks into SQL queries and use the execute_sql_query function to execute them.

a. get_most_expensive_car - This function will return the most expensive car in the database.

def get_most_expensive_car():
    query = "SELECT * FROM vehicles ORDER BY listing_price DESC LIMIT 1"
    return execute_sql_query(query)

b. get_highest_accident_count_car - This function will return the car with the highest accident count in the database.

def get_highest_accident_count_car():
    query = "SELECT * FROM vehicles ORDER BY vehicle_history_accident_count DESC LIMIT 1"
    return execute_sql_query(query)

c. get_car_details - This function will return the details of a car given its Vehicle Identification Number (VIN).

def get_car_details(vin):
    query = f"SELECT * FROM vehicles WHERE vid_vin = '{vin}'"
    return execute_sql_query(query)
  1. Prompting Technique:
    To prepare the AI for using these functions, you can use a prompting technique where you first introduce the AI to the structure of the database and the list of fields. This can be done through a function that returns the schema of the database.
def get_database_schema():
    # Connect to your database
    # Fetch the schema
    # Return the schema

You can then use this schema to guide the AI in formulating the correct SQL queries. For example, if the AI needs to find the most expensive white car, it can first check the schema to find the appropriate fields (e.g., listing_price and vehicle_color_exterior), and then construct the SQL query accordingly.

1 Like

Thanks very much for your detailed explanation.

What if you submit a request for In which the text is strictly written "This is how the request from the client {Request} generate Request to the database with such fields (list of fields). "Send a response from the database to the client.

How many functions can I add?

To maintain the accuracy of my bot?

Can I ask one thing?

If the customer ask ‘Recommend the black car that the price is over 3000$’. In that case, how can I set the function description for this?

The “function description” is the text that the AI sees, telling it why a particular function is useful.

The question you pose is one that doesn’t seem likely answered by a SQL function.

It seems like you would get a massive return, and none of them are “recommendation”:

  • if car is black;
    and
  • if car price is over $3000;

then enjoy this list of 100000 cars???

You need another criteria.

Is this possible?

I divide the price into two parts.
Like minimum, and maximum.

And set function parameter as like this:
“maximum_price”: {
“type”: “number”,
“description”: “The maximum price limit of the car, if not mentioned, it’s 100000000”,
},
“minimum_price”: {
“type”: “number”,
“description”: “The minimum price limit of the car, if not mentioned, it’s 0”,
},
“color” : {
“type”: “string”,
“description”: “The color of the vehicle’s exterior, such as ‘black’, ‘red’, ‘white’, ‘silver’, ‘Satin Steel Metallic’, etc”,
},
}
How’s this?

I think you need to approach the actual question you are trying to answer, and how it can be answered in under 20 results that can be understood by an AI, in a considered and thought out method for producing the query that powers a function.

This forum and my expertise is beyond teaching “how to approach structured queries and data post-processing for question-answering on massive datasets”.

This little trick could definitely help with your mileage: break the big problem into little pieces. Collect information from the customer to guide the model to a better understanding of what is actually important.
And what this means in reality is: guide the customer to understand what the actual need is.
You are pretty much doing the same sales pitch as any other experienced car salesman would do. Need a family car or a car to help you find a partner to start a family? How much money do you want to spend? What’s your (partner’s) favorite color?

From there you can query your database one by one and focus on the solutions that actually matter.