Seeking Advice: Optimizing Domain Classification Workflow + Issues with Pydantic Schema and JSON Output

Hello everyone,

I’m working on a domain classification problem with the following workflow:

1.	Website Scraping: I scrape website content using libraries like Beautiful Soup and Requests.
2.	Prompt Generation: The scraped content is plugged into a prompt template, which asks the AI model to categorize the website based on its content and score the website according to its entertainment tendency.
3.	Multiple Categories: Since the categories are not mutually exclusive, a domain can have multiple categories assigned to it.
4.	Structured Output: I use LangChain’s JSON output parser, which parses the AI model’s response into a structured output according to a Pydantic schema I’ve set up beforehand.
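
For reference, steps 1 and 2 can be sketched roughly as follows. This is a simplified illustration, not my exact code: it uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without dependencies, and the category list, the 0 to 10 score scale, the prompt wording, and the `max_chars` cutoff are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>, <style>, and <noscript> content."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str, max_chars: int = 4000) -> str:
    """Reduce raw HTML to whitespace-normalized text, truncated to max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text[:max_chars]


# Hypothetical category labels; the real list would be project-specific.
CATEGORIES = ["news", "sports", "shopping", "gaming"]

PROMPT_TEMPLATE = (
    "Classify the website below. Choose every category that applies from: "
    "{categories}.\n"
    "Also rate its entertainment tendency from 0 to 10.\n"
    "Respond with JSON only.\n\n"
    "Website content:\n{content}"
)


def build_prompt(html: str) -> str:
    """Plug the extracted page text into the prompt template."""
    return PROMPT_TEMPLATE.format(
        categories=", ".join(CATEGORIES),
        content=html_to_text(html),
    )
```

Truncating the text before it reaches the prompt keeps long pages from crowding out the instructions; the right cutoff depends on the model's context window.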

Questions:

1.	Workflow Optimization: Do you have any suggestions for optimizing the workflow? Specifically, should the scraper integrate the content directly into the prompt, or should I let the LLM browse the website for additional context?
2.	Pydantic Schema Issues: I’m having trouble with the structured output when using the Pydantic schema and JSON output parser. Any tips or best practices for this part of the workflow?
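
To make the second question more concrete, here is a minimal sketch of the kind of schema and parsing step involved. It assumes Pydantic v2 and validates the raw model response directly, independent of LangChain's parser. The field names, the 0 to 10 range, and the `extract_json` helper are hypothetical and only illustrate the shape of the problem.

```python
import re
from typing import List

from pydantic import BaseModel, Field, ValidationError


class DomainClassification(BaseModel):
    """Hypothetical schema; field names and bounds are illustrative."""

    # Categories are not mutually exclusive, so this is a list with at least one entry.
    categories: List[str] = Field(min_length=1)
    # Assumed scale: 0 (not entertainment-oriented) to 10 (purely entertainment).
    entertainment_score: int = Field(ge=0, le=10)


def extract_json(text: str) -> str:
    """Pull the outermost {...} span out of a response that may include
    extra prose or a Markdown code fence around the JSON."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    return match.group(0)


def parse_response(text: str) -> DomainClassification:
    """Validate a raw model response against the schema.

    Raises pydantic.ValidationError when the JSON does not match the schema.
    """
    return DomainClassification.model_validate_json(extract_json(text))
```

Catching `ValidationError` around `parse_response` gives a natural place to log the offending response or retry the request.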

Any insights or suggestions are much appreciated!

Thanks in advance!

Hi @engg !

Sounds fun, and similar to what I have been doing.

Regarding workflow optimization - what you are currently doing sounds perfectly fine to me. If a scraped website is lacking context, that’s probably a bit of a signal in itself about the website, and maybe that can guide your categorization. I have used Apify in the past, and there are a lot of tunables you can configure, like the depth of the crawl and what kind of info to strip away, that may influence your results.

Regarding the second question - I suggest you make a new topic just about that and provide enough information, e.g. your current schema/Pydantic model, sample data, and the error or unexpected output you are seeing. This way the community can focus on that specifically, and if we get a good solution, we can tag it as a “solution” so it can help others as well. Thanks!