Hi all, I’m using LangChain’s HTML loader to extract structured info from pages of a particular type. This is my schema and chain:
structured_schema = {
    "properties": {
        "event_title": {"type": "string"},
        "event_date": {"type": "string", "format": "date"},
        "All_Artists": {"type": "string"},
        "Start_time": {"type": "string"},
        "City": {"type": "string"},
        "URL": {"type": "string", "description": "extract current URL or og:URL canonical URL from the page"},
    },
    "required": ["event_title", "event_date"],
}
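from langchain.chains import create_extraction_chain
# llm is the model and data is the loaded page content (both defined elsewhere)
extraction_chain = create_extraction_chain(structured_schema, llm)
extraction_chain.run(data)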
When I pass it a candidate page that has structured meta tags, the URL field comes back as exampledotcom/<extracted_part>.
Can someone help me understand what is happening? Is this a limitation of the model? How do you specify, in the structured schema, the extraction logic for each field? I tried the description field, e.g. for URL, but the LLM output doesn’t change.
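
For comparison, here is a minimal sketch of reading the URL from the page deterministically instead of asking the LLM for it. It assumes the raw HTML string is still available before the loader strips it down to text, and that the page exposes either a `<link rel="canonical">` or an `og:url` meta tag. `MetaURLParser` and `extract_page_url` are hypothetical names for illustration, not part of LangChain.

```python
from html.parser import HTMLParser


class MetaURLParser(HTMLParser):
    """Collects the first og:url and rel=canonical values seen in the HTML."""

    def __init__(self):
        super().__init__()
        self.og_url = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        attr = dict(attrs)
        if tag == "meta" and (attr.get("property") or "").lower() == "og:url":
            self.og_url = self.og_url or attr.get("content")
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = self.canonical or attr.get("href")


def extract_page_url(html):
    """Return the canonical URL if present, else og:url, else None."""
    parser = MetaURLParser()
    parser.feed(html)
    return parser.canonical or parser.og_url
```

The returned value could then be written into the "URL" field of whatever the extraction chain returns, assuming the chain output is a list of dicts keyed by the schema property names.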