Using create_extraction_chain but getting "example.com" URLs back from the OpenAI API

Hi all, I’m using LangChain’s HTML loader to extract structured info from pages of a particular type.

from langchain.chains import create_extraction_chain

structured_schema = {
    "properties": {
        "event_title": {"type": "string"},
        "event_date": {"type": "string", "format": "date"},
        "All_Artists": {"type": "string"},
        "Start_time": {"type": "string"},
        "City": {"type": "string"},
        "URL": {"type": "string", "description": "extract current URL or og:URL canonical URL from the page"},
    },
    "required": ["event_title", "event_date"],
}
# llm (the chat model) and data (the loaded page text) are defined elsewhere
extraction_chain = create_extraction_chain(structured_schema, llm)
extraction_chain.run(data)

When I pass it a candidate page that has structured meta tags, the URL field comes back as example.com/<extracted_part> instead of the page's actual URL.

Can someone help me understand what is happening? Is this a restriction in the model? How do you specify in the structured schema what each extracted field’s extraction logic should be? I tried the description field, e.g. on the URL property, but there’s no change in the LLM output.

This happens when you pass a relative URL to the API. You can check whether the output contains example.com and replace it with the correct host, or make sure you always pass absolute URLs in the first place.
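
The check-and-replace step could look roughly like the sketch below. It is only an illustration: `fix_placeholder_url`, `PLACEHOLDER_HOSTS`, and the sample URLs are made-up names, not part of LangChain, and it assumes you already know the real URL of the page you loaded.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts treated as placeholders (an assumption; extend as needed).
PLACEHOLDER_HOSTS = {"example.com", "www.example.com"}


def fix_placeholder_url(extracted_url: str, page_url: str) -> str:
    """Replace a placeholder host in extracted_url with the scheme and host of page_url.

    URLs whose host is not a known placeholder are returned unchanged.
    """
    # Tolerate output without a scheme, e.g. "example.com/events/42".
    candidate = extracted_url if "://" in extracted_url else "//" + extracted_url
    parts = urlsplit(candidate)
    if parts.hostname not in PLACEHOLDER_HOSTS:
        return extracted_url
    page = urlsplit(page_url)
    return urlunsplit(
        (page.scheme, page.netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(fix_placeholder_url("https://example.com/events/42", "https://realsite.org/listing"))
```

You would run this on the URL field of each extracted record, passing the address you actually fetched as `page_url`.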


Thank you. To clarify, if the output includes example.com, then I replace that with
rel=canonical + / part-after-the-slash

where I have parsed the current URI of the page from the head or meta tags.
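
For the parsing part, a minimal sketch using only the standard library might look like this. `CanonicalFinder` and `page_url_from_html` are hypothetical helper names, and the sketch assumes the page exposes either a `<link rel="canonical">` or an `og:url` meta tag, falling back to the URL that was fetched if neither is present.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Collect the first rel=canonical link and the first og:url meta tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.canonical = self.canonical or attr.get("href")
        elif tag == "meta" and attr.get("property") == "og:url":
            self.og_url = self.og_url or attr.get("content")


def page_url_from_html(html: str, fetched_url: str) -> str:
    """Return the canonical URL, else og:url, else the URL that was fetched."""
    finder = CanonicalFinder()
    finder.feed(html)
    found = finder.canonical or finder.og_url
    # urljoin resolves a relative canonical href against the fetched URL.
    return urljoin(fetched_url, found) if found else fetched_url


if __name__ == "__main__":
    sample = '<html><head><link rel="canonical" href="/events/42"></head></html>'
    print(page_url_from_html(sample, "https://realsite.org/listing?x=1"))
```

The value this returns can then be used as the real page URL when correcting an extracted URL that came back with example.com as the host.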