Hello everyone,
I’m working on a domain classification problem with the following workflow:
1. Website Scraping: I scrape website content using libraries like Beautiful Soup and Requests.
2. Prompt Generation: The scraped content is inserted into a prompt template that asks the AI model to categorize the website based on its content and to assign a score for how entertainment-oriented the site is.
3. Multiple Categories: Since the categories are not mutually exclusive, a domain can have multiple categories assigned to it.
4. Structured Output: I use LangChain’s JSON output parser, which parses the AI model’s response into a structured output according to a Pydantic schema I’ve set up beforehand.
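For concreteness, steps 2 to 4 could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual project code: the category labels, the field names, the 0.0 to 1.0 score range, the prompt wording, and the `build_prompt` helper are all hypothetical, the model call is replaced by a hard-coded response, and Pydantic v2 is assumed (`model_validate_json`, `model_json_schema`).

```python
import json
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class Category(str, Enum):
    # Hypothetical labels; replace with the real taxonomy.
    NEWS = "news"
    GAMING = "gaming"
    STREAMING = "streaming"
    SHOPPING = "shopping"


class DomainClassification(BaseModel):
    # A list (not a single value) so one domain can carry several categories.
    categories: list[Category] = Field(min_length=1)
    # Assumed scale: 0.0 (not entertainment) to 1.0 (purely entertainment).
    entertainment_score: float = Field(ge=0.0, le=1.0)


PROMPT_TEMPLATE = (
    "Classify the website below.\n"
    "Allowed categories (choose one or more): {labels}\n"
    "Also rate its entertainment tendency from 0.0 to 1.0.\n"
    "Answer with JSON only, matching this schema:\n{schema}\n\n"
    "Website content:\n{content}"
)


def build_prompt(content: str, max_chars: int = 4000) -> str:
    """Fill the template, truncating the scraped text to keep the prompt bounded."""
    labels = ", ".join(c.value for c in Category)
    schema = json.dumps(DomainClassification.model_json_schema())
    return PROMPT_TEMPLATE.format(
        labels=labels, schema=schema, content=content[:max_chars]
    )


prompt = build_prompt("Watch live esports and read the latest gaming news.")

# Stand-in for the model's reply; a real run would send `prompt` to the LLM.
raw_response = '{"categories": ["gaming", "streaming"], "entertainment_score": 0.9}'
result = DomainClassification.model_validate_json(raw_response)

# An unknown label is rejected instead of slipping through silently.
try:
    DomainClassification.model_validate_json(
        '{"categories": ["cooking"], "entertainment_score": 0.2}'
    )
    rejected = False
except ValidationError as exc:
    rejected = True
    print(len(exc.errors()), "validation error(s)")
```

Restricting `categories` to an enum and bounding the score means an off-taxonomy label or an out-of-range score surfaces as a validation error rather than as bad data downstream.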
Questions:
1. Workflow Optimization: Do you have any suggestions for optimizing the workflow? Specifically, should I keep inserting the scraped content directly into the prompt, or should I let the LLM browse the website itself for additional context?
2. Pydantic Schema Issues: I’m having trouble with the structured output when using the Pydantic schema and JSON output parser. Any tips or best practices for this part of the workflow?
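Possibly relevant to question 2: a model reply is sometimes wrapped in a Markdown code fence or surrounded by extra prose, which a strict JSON parse rejects. Whether that is the cause here is only a guess. The sketch below shows a pre-cleaning step using the standard library alone; the `extract_json` helper and the sample reply are hypothetical, and it assumes the reply contains a single top-level JSON object.

```python
import json
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a reply that may include fences or prose."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return json.loads(candidate[start : end + 1])


# Hypothetical noisy reply: prose plus a fenced JSON payload.
noisy = (
    "Here is the classification:\n"
    + FENCE
    + 'json\n{"categories": ["news"], "entertainment_score": 0.1}\n'
    + FENCE
)
cleaned = extract_json(noisy)
```

The resulting dict can then be validated against the Pydantic schema, so formatting problems and schema problems show up as separate errors.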
Any insights or suggestions are much appreciated!
Thanks in advance!