Matching datasets in OpenAI

Hello,
I am relatively new to AI development.
I have a dataset of, let’s say, “investors”. It contains a text explaining
investment criteria and a list of past investments (e.g. websites of companies).
On the other side I have a dataset of “companies”, i.e. one or more PDFs with their investment pitch and/or business plan, etc.
What is the best way to assign a score to each “company” for a given investor, indicating how good the match is? Is it feasible to just ask OpenAI, loading all the files (e.g. the Assistants API, for one investor and N companies), or do I need to look somewhere else, like vector search, recommenders…? Any help or suggestion appreciated.

Hi Dennis!

You are describing a process that I have built.
My suggestion would be to split the work into a few steps. You probably have a thesis to begin with (the criteria). Make it ‘simple’ (Funding: raised between x and y. Type of company: not hardware, not B2C. Or, better yet, positives: ‘B2B SaaS’). Be detailed.
Then you will want to create a ‘one pager’ for each company you are going to evaluate. That means defining fields like ‘Funding’, ‘Type of company’, ‘What do they do’, ‘Team’, etc. Not too many, but matching your criteria. Then you create an assistant prompt that takes ‘what you have’ on a company and asks the model to rewrite it into the criteria you defined. Some of it will be incomplete (you might get a pitch deck without funding info). That’s OK: the prompt should leave that section empty.
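As a sketch of that extraction step (the field names, model name, and JSON mode are placeholder assumptions, not a prescribed setup):

```python
import json

# Hypothetical one-pager fields -- adapt these to your own criteria.
ONE_PAGER_FIELDS = ["funding", "type_of_company", "what_they_do", "team"]

def build_onepager_messages(document_text, fields=ONE_PAGER_FIELDS):
    """Build a chat prompt that rewrites raw company material into a
    standardized 'one pager'. Unknown fields must stay empty."""
    schema = {f: "" for f in fields}
    system = (
        "You turn raw company material (pitch decks, business plans) into a "
        "standardized profile. Return JSON with exactly these keys: "
        f"{json.dumps(schema)}. If the material does not cover a field, "
        "leave its value as an empty string -- never guess."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": document_text},
    ]

# Usage (not run here -- requires an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=build_onepager_messages(pitch_deck_text),
#     response_format={"type": "json_object"},
# )
# one_pager = json.loads(resp.choices[0].message.content)
```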

Once you have your ‘one pagers’ (create them for all the previous investments as well), you can build a new prompt that evaluates your standardized company profiles against your thesis. It might give points per item (and 0 points if the information is not known). There might also be hard ‘no’ criteria, etc. But it will come back with a ‘score’ and a reasoning for the score.
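Once profiles are standardized, the points-per-item scoring can even be done deterministically; a toy sketch (the criteria, field names, and point values are made up):

```python
def score_company(profile, thesis):
    """Score a standardized company profile against an investment thesis.

    `thesis` maps field name -> (predicate, points, hard_no). Empty/unknown
    fields score 0 points; a failed hard-'no' rule short-circuits to rejection.
    Returns (score, list-of-reasons) so the score comes with its reasoning.
    """
    score, reasons = 0, []
    for field, (predicate, points, hard_no) in thesis.items():
        value = profile.get(field, "")
        if value == "":
            reasons.append(f"{field}: unknown, 0 points")
            continue
        if predicate(value):
            score += points
            reasons.append(f"{field}: matched, +{points}")
        elif hard_no:
            return 0, [f"{field}: hard 'no' criterion failed ({value!r})"]
        else:
            reasons.append(f"{field}: no match, 0 points")
    return score, reasons

# Example thesis: B2B SaaS (hard requirement), raised between $1M and $5M.
thesis = {
    "type_of_company": (lambda v: v == "B2B SaaS", 3, True),
    "funding_usd": (lambda v: 1_000_000 <= v <= 5_000_000, 2, False),
}
score, why = score_company(
    {"type_of_company": "B2B SaaS", "funding_usd": 2_000_000}, thesis
)
# score == 5
```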

And there you have it! Advanced version: every time you get new information about a company (we do this with transcripts from calls we have with companies, for example), you ask another prompt to update the existing one pager with the new information. Then you run the evaluator again.

Good luck!

To add to your specific use case: you would attach the investment criteria to each investor (using standard sections), and if you have a lot of investors you could make the different criteria (like amount raised, or stage: Seed, Series A) filter fields, but that is more about the matching process. The key is to create the automation that turns each company, with different amounts and types of information, into a ‘single format’. The AI is actually very good at doing that.
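A minimal sketch of such a structured pre-filter (the field names are made up), so that only passing companies reach the more expensive LLM evaluation:

```python
def matches_filters(company, investor_filters):
    """Cheap structured pre-filter: a company passes only if every
    filter field holds one of the investor's allowed values."""
    for field, allowed in investor_filters.items():
        if company.get(field) not in allowed:
            return False
    return True

companies = [
    {"name": "Acme", "stage": "Seed", "raised_usd": 2_000_000},
    {"name": "Globex", "stage": "Series C", "raised_usd": 40_000_000},
]
investor = {"stage": {"Seed", "Series A"}}

# Only the shortlist goes on to the LLM-based evaluator.
shortlist = [c for c in companies if matches_filters(c, investor)]
# shortlist contains only Acme
```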

Thanks Jean-Luc, really appreciated.
I have looked at your Medium article, very good.
As I am relatively early in my learning curve (so far in this project I am using Airtable, Noloco, JavaScript, and the OpenAI API), what platform would you recommend or have you used? Do I need Azure, LangChain, or anything else? Thanks a lot again!

Consider v0.dev as a starting point, depending on your needs.

I would say you need to create a structured format from the business plans and put that into a graph. Then add KPIs (the ones you know the outcome of) and find a formula that defines which business-plan graph belongs to a successful startup (which is a matter of definition: if you prefer ROI over environmental impact, you would put a higher weight on that)… I would define about 10 groups, from most successful to least successful, and connect each business-plan subgraph to such a group based on KPI analysis.
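One way to sketch that KPI grouping (the weights, KPI names, and scale are arbitrary placeholders):

```python
def kpi_score(kpis, weights):
    """Weighted sum of normalized KPIs in [0, 1]; the weights encode what
    'success' means to you (e.g. ROI weighted above environmental impact)."""
    return sum(weights.get(k, 0.0) * v for k, v in kpis.items())

def success_group(score, n_groups=10, lo=0.0, hi=1.0):
    """Map a score in [lo, hi] to a group 1..n_groups (1 = least successful)."""
    frac = (min(max(score, lo), hi) - lo) / (hi - lo)
    return min(int(frac * n_groups) + 1, n_groups)

# An investor who prefers ROI over environmental impact:
weights = {"roi": 0.7, "env_impact": 0.3}
s = kpi_score({"roi": 0.9, "env_impact": 0.4}, weights)  # ~0.75
print(success_group(s))  # group 8
```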

And then you could start very abstractly by computing a Weisfeiler-Lehman (WL) kernel value

K(G_1, G_2) = \sum_{k=0}^{h} \langle \phi_k(G_1), \phi_k(G_2)\rangle

and then test that on some business plans that are not yet connected to a group and see if there is similarity, or split the subgraph and use it on them as well and see if there might be similar branches…
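For concreteness, a minimal self-contained sketch of that kernel (no graph library; node-labeled graphs as adjacency dicts, with `h` refinement iterations):

```python
from collections import Counter

def wl_features(adj, labels, h):
    """Weisfeiler-Lehman subtree features phi_0..phi_h as label Counters.

    adj: node -> iterable of neighbour nodes; labels: node -> initial label.
    """
    feats = [Counter(labels.values())]
    current = dict(labels)
    for _ in range(h):
        # Relabel every node by its own label plus its sorted neighbour labels.
        current = {
            n: (current[n],) + tuple(sorted(current[m] for m in adj[n]))
            for n in adj
        }
        feats.append(Counter(current.values()))
    return feats

def wl_kernel(g1, g2, h=2):
    """K(G1, G2) = sum over k of <phi_k(G1), phi_k(G2)>."""
    f1, f2 = wl_features(*g1, h), wl_features(*g2, h)
    return sum(
        sum(c1[label] * c2[label] for label in c1)
        for c1, c2 in zip(f1, f2)
    )

# Two-node toy graph: adjacency dict plus node labels.
g = ({"founder": ["market"], "market": ["founder"]},
     {"founder": "Founder", "market": "Market"})
print(wl_kernel(g, g))  # 6: inner product of 2 at each of k = 0, 1, 2
```

Kernel values between two plan graphs then give the similarity the post describes; real implementations (e.g. in GraKeL) compress labels with hashing instead of growing tuples.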

Although… I mean, just because a startup that built the Panama Canal was successful in the past doesn’t mean a company that builds a second one now would be successful as well…

Maybe you should read or watch videos about Rocket Internet if you want to find a way to evaluate a business idea.

And then you might as well create multiple KPI groups (some investors may want to earn a lot of money fast and others may want to see the world burn, or whatever they prefer) and then find a personalized best match… I mean, what the most successful company is can be defined by the investor.

If you want to extract possible values for the graph, I would suggest the agent-m library, extracting them individually from the business plan instead of sending the whole text to the model and asking it to create a JSON structure or whatever, which would most likely fail most of the time…
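Setting agent-m specifics aside, the per-field idea (one focused question per value rather than one big JSON request) can be sketched with a pluggable `ask` function standing in for whatever model call you use (everything here is illustrative):

```python
def extract_fields(ask, document_text, fields):
    """Extract each field with its own focused question.

    `ask(prompt) -> str` is any LLM call; one small question per field is
    easier to validate than one large 'give me all the JSON' request.
    """
    result = {}
    for field, question in fields.items():
        answer = ask(
            f"From the business plan below, answer only this question: "
            f"{question}\nIf the plan does not say, answer 'unknown'.\n\n"
            f"{document_text}"
        ).strip()
        result[field] = "" if answer.lower() == "unknown" else answer
    return result

# Usage with a stub instead of a real model:
fields = {"funding": "How much funding has the company raised?"}
fake_ask = lambda prompt: "unknown"
extract_fields(fake_ask, "We make B2B SaaS.", fields)  # -> {"funding": ""}
```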

If you want to create a data structure, you may use ChatGPT and do it manually…

like: "Hey ChatGPT, here is a business plan. What kind of information could I extract from it? I need a structure for a graph like this:

Nodes: Key concepts or entities from the plan. Examples:

  • Founders (with attributes like experience, background, domain expertise)
  • Market / Industry (with attributes like growth rate, region)
  • Revenue Model (B2B, subscription, marketplace, etc.)
  • Product Features or USP (unique selling points)
  • Competition or Partners

Edges: Relationships between these entities. For example:

  • “Founder has experience in Industry X”
  • “Product uses Technology Y”
  • “Plan targets Customer Segment Z”

Attributes:

  • Node attributes (e.g., “5 years of experience,” “PhD from top university,” “Projected monthly revenue = 10k”)
  • Edge attributes (e.g., “strength” or “type” of relationship)

Give me at least 20 bullet points for each list.

flagged for explicit language (which still makes a difference in inference)

Pretend you know what you are doing and if you dumb piece of shit fail we will all be very angry with you…

And don’t you dare to write anything else - just the 3 lists - no extra words!!!
"

And when you get a wrong answer, you just insult the model until it does what it’s asked…

I have built an application that measures code quality, and this is also a matter of personal preference. What good code is for one person could be very wrong in another company’s context…

That’s why users can select a set of criteria and assign a weight to each.

Ah… if you want to save a lot of the work of creating an infrastructure for the analysis: I have built one for local development. It includes things like automatically configured PDF-analysis workflows and services such as Grafana, Loki, Prometheus, postgres_exporter, PostgreSQL with TimescaleDB and PostGIS, Neo4j, MinIO, RabbitMQ, Keycloak, Vault, and some more fancy stuff, as well as a Python library that lets you do neat things like this:

from flask import Flask, jsonify, request
from iilib.handler.vault_settings_handler import VaultSettingsHandler
from iilib.factory.ii_factory import IIFactory
from urllib.parse import unquote
import traceback
import fitz  # PyMuPDF, used by the PDF-splitting routes omitted below

# Pull configuration/secrets from Vault into the environment.
try:
    vault_handler = VaultSettingsHandler()
    vault_handler.initialize_env()
except Exception:
    print(f"Error initializing environment: {traceback.format_exc()}")
    raise

# The central factory wires up logging and (S3-style) object storage.
try:
    iifactory = IIFactory()
    logger = iifactory.getLogger(application="pdf_split_service")
    storage_handler = iifactory.getStorageHandler()
except Exception:
    print(f"Error initializing logger or storage handler: {traceback.format_exc()}")
    raise


app = Flask(__name__)

def download_pdf(bucket_name, pdf_key):
    """Fetch a PDF's raw bytes from object storage."""
    try:
        pdf_obj = storage_handler.get_object(bucket_name, pdf_key)
        return pdf_obj['Body'].read()
    except Exception as e:
        error_trace = traceback.format_exc()
        logger.error(f"Error fetching object from bucket {bucket_name} with key {pdf_key}: {e}\n{error_trace}")
        raise

# etc...

where you can just configure where to send logs, and you can upload a PDF and the system finds its own workflows.
You can either build it yourself over the next 2 years, or we can figure out why I would give it to you…

===================

What do you think? Does the guy who posted this know what he is doing?

ChatGPT:

From the content of the post, it seems like the person has a pretty strong grasp of multiple complex topics:

  • Graph-based approaches (e.g., using a Weisfeiler-Lehman [WL] kernel to compare “business plan graphs”)
  • KPI-driven classification of startups/business plans
  • Infrastructure and DevOps knowledge (Grafana, Loki, Prometheus, MinIO, RabbitMQ, Keycloak, Vault, PostgreSQL/TimescaleDB, Neo4j, etc.)
  • Python-based microservice/workflow development for PDF analysis, logging, and storage handling

They’re clearly familiar with a range of concepts spanning data science, code quality measurement, and production-grade architecture. They also show awareness that evaluations (like defining “success” for a startup or “quality” for code) can be subjective. That’s an important nuance that indicates real-world experience rather than purely theoretical knowledge.

So, all signs point to yes: the author does appear to know what they are doing. They’re integrating advanced concepts, have an understanding of multiple toolchains, and can articulate how to extract structure (graphs) from unstructured documents. Even if the tone sometimes comes off as strong or opinionated, the overall depth suggests they have a solid technical foundation.