Transitioning from Assistant API to ChatCompletions or LangChain: Seeking Guidance and Suggestions

Hi Everyone,

I’m an AI enthusiast and developer, currently learning and building. I have developed a few RAG models using the Assistant API and built sample ones using LangChain and Langflow as well.

I have become too comfortable with the Assistant API because of the ease of integration it provides, but I need more control over my database and the rest of the pipeline. So I have started building an advanced version of a RAG model using LangChain, the OpenAI API, and Chroma DB. I want to implement function calling, file search, and vision as well. I’m mostly using the GPT-4o and o1 models.

On another front, I’m building a prototype for a web-scraping agent: I provide the HTML content of a page and a full screenshot of the website, and it analyzes what components are present on that webpage. I have already built a POC that works for single or bulk URLs. I used the Assistant API for this but am now integrating it with LangChain. This will be a separate agent using the 4o model. The problem right now is that it’s difficult to feed a huge amount of extracted HTML to the LLM. That’s why I’m chunking it down and generating a report, which is working fine. But is there any other way I can implement it?

I need your help with some suggestions, reference resources, or codebases I can refer to in order to build an advanced version of a RAG model using LangChain.

Your help will be really appreciated.


Hey there and welcome to the forum!

Sounds like a fun question!

So you’re using LangChain for this? I’ll be honest, LangChain gives me pretty mixed results, and I’d typically recommend making the switch over to straight Python or another language altogether.

Might I recommend the beautiful concept of multithreading?

The problem you’re facing is one of speed and efficiency I’m guessing. The thing is, that’s langchain’s biggest bottleneck imho. This is a very good problem to have though, because I believe when you start needing to address and solve these kinds of questions, you’re transitioning from an intermediate programmer to an advanced one.

I would either look up or ask ChatGPT to discuss the topics of asynchronous programming and multithreading. These concepts are your missing links that should allow you to build a tool that can run several “batches” or other tasks simultaneously.
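
To make the suggestion concrete, here is a minimal sketch of running several independent LLM calls concurrently with a thread pool. The `analyze_url` function is a hypothetical placeholder standing in for a real OpenAI API call; only the concurrency pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_url(url: str) -> str:
    # Placeholder: in practice this would call the OpenAI API with the
    # page's HTML and screenshot and return the model's analysis.
    return f"report for {url}"

def analyze_batch(urls: list[str], workers: int = 8) -> list[str]:
    # Threads fit here because the work is I/O-bound: each thread spends
    # most of its time waiting on the network, so the GIL is not a problem.
    # pool.map preserves input order in the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_url, urls))
```

With a real API call inside `analyze_url`, a batch of 50 URLs finishes in roughly the time of the slowest few calls rather than the sum of all of them.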

Hi, thank you for the reply; I just wanted to provide a clarification. For the HTML data, that is, the component-identification analysis, I’m using the ChatCompletions API from OpenAI with the o1 model and vision.

So the flow right now is:

1. Extract the HTML using bs4 and take a full-page screenshot using Selenium.
2. Provide the image and the first chunk of HTML data, and try to identify the components.
3. Send the second chunk, identify its components, and so on.
4. Consolidate the chunk results into a final response that generates a well-structured report.
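
That flow can be sketched roughly as below. `ask_model` is a stub standing in for a ChatCompletions call; the message shapes (the `image_url` content part, attached only with the first chunk) follow the OpenAI vision message format, but the helper names and consolidation prompt are assumptions, not the poster's actual code.

```python
def ask_model(messages: list[dict]) -> str:
    # Placeholder for client.chat.completions.create(...); returns a stub reply.
    return f"analysis of {len(messages)} message(s)"

def analyze_page(html_chunks: list[str], screenshot_b64: str) -> str:
    findings = []
    for i, chunk in enumerate(html_chunks):
        content = [{"type": "text", "text": chunk}]
        if i == 0:
            # Attach the full-page screenshot only with the first chunk.
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"},
            })
        findings.append(ask_model([{"role": "user", "content": content}]))
    # Consolidate the per-chunk findings into one structured report.
    summary_prompt = "Combine these findings into one report:\n" + "\n".join(findings)
    return ask_model([{"role": "user", "content": summary_prompt}])
```

The key design choice is that each chunk call is independent of the others and only the final consolidation call sees all the findings.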

This is done purely using ChatCompletions API.

The reason I’m chunking the data is the context window: I send the first 100,000 tokens, then the next 100,000 tokens, and then generate the report.

o1’s total context window is 200k tokens, and I couldn’t fit the HTML data, when it’s huge, in one go, so I chunked it.

Alternatively, I have tried Firecrawl and similar tools, but they didn’t work out. So I’m using bs4 and will later shift to another library like Scrapy. It’s still a prototype!

I hope that makes sense; please note I’m not using LangChain for the HTML-analysis part.

I did try this with the Assistant API manually in the Playground. It works, but the Streamlit app I built only accepts up to 256,000 characters.

The HTML data of certain websites goes beyond 600,000 characters; if I copy and paste that much data into the Playground, the Assistant API goes nuts and gives random replies.

> Might I recommend the beautiful concept of multithreading?

And yes, I’ll eventually be implementing multithreading to achieve faster execution. I’ve implemented this concept multiple times when developing scripts!

Ah, okay. Gotcha. Now I see what your issue is.

So, with the way you’re feeding the data, are these chunks necessarily dependent on one another? As in, if you send chunk 1, are you waiting for further processing from o1 before sending chunk 2, or are you siloing them out?

You might be able to send these chunks at the same time asynchronously, and once the responses all come back process the report. Otherwise, you can’t really escape the context window limit. That’s unfortunately just the fundamental limitation of the resource available to us right now.

Some other models, like Gemini, are built for high context window lengths, but they notably don’t have reasoning. Therein lies the tradeoff right now. Do you genuinely need that reasoning for the HTML processing? If not, go with something that has 1M+ context window length like Gemini, and use o1 for more intelligence-intensive tasks. If you do need that reasoning for this HTML stuff, you’re already doing much of what we would recommend here. The balance would lie in figuring out how to asynchronously cast these API requests to increase the speed and processing times as much as possible.
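
If the chunks turned out to be independent, the "cast these API requests asynchronously" idea could look like this. The coroutine bodies are placeholders (a real version would await an async OpenAI client call); only the fan-out pattern is the point:

```python
import asyncio

async def analyze_chunk(chunk: str) -> str:
    # Placeholder: a real version would await an async API call here.
    await asyncio.sleep(0)  # yield control, simulating network I/O
    return f"findings for {len(chunk)} chars"

async def analyze_all(chunks: list[str]) -> list[str]:
    # Fire all chunk requests at once; total latency is roughly that of
    # the slowest single call rather than the sum of all of them.
    return await asyncio.gather(*(analyze_chunk(c) for c in chunks))

results = asyncio.run(analyze_all(["<div>", "<p>hi</p>"]))
```

`asyncio.gather` preserves input order, so the consolidation step can still assume the findings arrive in document order.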

Yes, they are dependent on each other: after the chunked analysis, I take in those responses and generate a final report in a well-structured format, which is then provided to the RAG model for further analysis.

I’ll give this a shot, though I’m afraid those will be two separate instances instead of one.

And the only reason I’m using o1 is its reasoning and vision capabilities; it’s awesome and works well with the use case.