What is the best way to get the OpenAI API to respond with more specific and statistical answers related to financial markets?

I am building an app related to financial markets, and I want the OpenAI API to respond with more specific knowledge of the events that have happened in financial markets, stating facts and prices of various stocks. I know for sure that we need to fine-tune the base model to make it aware of these events.
For training, we have some news source APIs that fetch the latest news and events daily, and I also have a lot of other training data, including graphs (images), PDFs, etc., that I want to feed into ChatGPT for it to learn from.
First of all, my question is: which approach is better for my use case? Should I use the RAG approach, the fine-tuning approach, or maybe a hybrid approach that uses both?

Secondly, I saw in the documentation that fine-tuning can only be done using JSONL files containing prompt/completion pairs. Another concern is that creating these JSONL files ourselves would require a lot of manual effort, so I was thinking of automating the task with a Python script that creates the JSONL file from the data I provide. Is this a valid approach? A script is going to ignore the semantics of the data in a file, and there won't be meaningful prompt/completion pairs, because obviously a script wouldn't know what questions to put in the file.
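For illustration, this is the kind of rough script I have in mind (just a sketch; `load_articles()` is a stand-in for however I load my news data, and the prompt generation is deliberately naive):

```python
import json

def load_articles():
    # Stand-in for my real data loading; in practice these would come from the news APIs.
    return [
        {"ticker": "AAPL", "body": "Example article text about AAPL ..."},
        {"ticker": "TSLA", "body": "Example article text about TSLA ..."},
    ]

def build_examples(articles):
    # This is exactly my worry: the script has no understanding of the
    # semantics of each article, so the generated questions are generic.
    for article in articles:
        yield {
            "messages": [
                {"role": "system", "content": "You are a financial markets assistant."},
                {"role": "user", "content": f"What happened recently with {article['ticker']}?"},
                {"role": "assistant", "content": article["body"]},
            ]
        }

with open("training_data.jsonl", "w") as f:
    for example in build_examples(load_articles()):
        f.write(json.dumps(example) + "\n")
```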

Anyone had a similar experience? Please share your thoughts. Thanks

Hi @zainanwar6234 and welcome to the community!

Since markets are dynamic and you want up-to-date knowledge, fine-tuning is not the best approach. Also, the models inherently have a good grasp of macroeconomics, so they can respond appropriately given the right data and context.

I would start with a simple approach: for a given ticker, company, or sector, simply fetch the latest (or historical) data from your data sources and combine it with a nice tight system prompt to produce the necessary output. You can most likely do this in a single API call.
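Roughly something like this (a minimal sketch using the current Python SDK; `fetch_latest_data()` is a placeholder for whatever your news/price APIs return, and the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_latest_data(ticker: str) -> str:
    # Placeholder for your own data sources -- return headlines, prices,
    # filings, etc. as plain text for the given ticker.
    return f"Latest headlines and prices for {ticker} ..."

def market_summary(ticker: str) -> str:
    context = fetch_latest_data(ticker)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a financial markets analyst. Answer using only "
                    "the data provided below, and cite concrete figures.\n\n"
                    f"DATA:\n{context}"
                ),
            },
            {"role": "user", "content": f"Summarise the latest situation for {ticker}."},
        ],
    )
    return response.choices[0].message.content

print(market_summary("AAPL"))
```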

If you want more finesse, e.g. focusing on different aspects like alpha, derivative insights, or sentiment, I would just define functions with appropriate system prompts and data sources for each of those, as in the sketch below.
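For example (reusing the `client` and the `fetch_latest_data()` placeholder from the previous sketch; the prompts here are purely illustrative):

```python
ASPECT_PROMPTS = {
    "sentiment": "You analyse news sentiment for a given ticker based on the data provided.",
    "derivatives": "You analyse options flow and implied volatility based on the data provided.",
    "alpha": "You look for potential alpha signals in the data provided.",
}

def analyse(ticker: str, aspect: str) -> str:
    # One function per aspect: same pattern, different system prompt and data source.
    context = fetch_latest_data(ticker)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"{ASPECT_PROMPTS[aspect]}\n\nDATA:\n{context}"},
            {"role": "user", "content": f"Give me the {aspect} view on {ticker}."},
        ],
    )
    return response.choices[0].message.content
```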

Hope that helps!


What is fine-tuning for, then? And can we somehow increase the knowledge base of the model? I am talking about the specific model we are using through our API key. I am not referring to the RAG approach, whereby a system prompt is sent along with information/articles relevant to the prompt; rather, I want to know whether we can train the model just by feeding it information/articles relevant to our domain-specific use case.

OpenAI provides a very good guide on fine-tuning, and describes when you should use it here. In essence, the recommendation is to really try prompting strategies and problem breakdown first, along the lines of what I described previously.

The trouble with fine-tuning is that you are trying to add your custom knowledge to the model, and that knowledge is guaranteed to be many orders of magnitude smaller than the knowledge in the base model itself. If your custom data is not out of distribution, and you don't have a huge amount of it at high quality, the risk is that it will simply be "lost" among the model's billions of weights. Since GPT models are trained on massive web crawls, they already possess a significant amount of specialized knowledge across engineering, the sciences, legal, and definitely finance and economics. What fine-tuning can do is, for example, change the style and format of responses (if that is tricky to steer with prompting), but adding new knowledge is very hard to get right. Even the likes of Bloomberg tried, and failed.

Just to clarify: we are going to have a specific, precise format for our queries (defined system prompts), and we will use fine-tuning to train the LLM to give us outputs in the desired format. However, as the prompt/completion examples in the JSONL files, we intend to use new information that corresponds one-to-one to the prompts users will give and the answers they should get. I hope this is clear, and if so, the question is: do you think this approach is likely to enable us to generate the answers we want on the new information we provided in training?

Ok, I see. My answer is: possibly :sweat_smile:! I don't dare give a more precise answer than that; a lot of this is just pure alchemy, and you just have to try it out and evaluate. If you are sure prompting strategies don't get you there, and you have plenty of fine-tuning samples (thousands of them), then why not?