I want to take the text of a single camera shot from a screenplay and generate a machine-readable summary (JSON or easily parsable text): the characters, the pose of each character, where the camera should be placed, and so on.
For example:

Sam is sitting at his desk, sad. Side closeup shot from the left. --> CHARACTER Sam, EXPRESSION sad, POSITION at-desk, BODYPOSE sitting. CAMERA LOOK AT Sam FROM left, SHOT close-up.
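Either output format works for me because the structured text is trivially machine-parsable. As a sanity check, here is a minimal sketch of a parser that turns the example line above into JSON; the key names and the nested layout are my own assumptions, not a fixed spec:

```python
import json

def parse_shot(line: str) -> dict:
    """Parse a 'CHARACTER ... CAMERA ...' line into a JSON-ready dict."""
    result = {"characters": [], "camera": {}}
    for clause in line.strip().rstrip(".").split(". "):
        parts = [p.strip() for p in clause.split(",")]
        if clause.startswith("CHARACTER"):
            # first token pair is "CHARACTER <name>", the rest are KEY value pairs
            char = {"name": parts[0].split(" ", 1)[1]}
            for p in parts[1:]:
                key, _, value = p.partition(" ")
                char[key.lower()] = value
            result["characters"].append(char)
        elif clause.startswith("CAMERA"):
            for p in parts:
                if p.startswith("CAMERA LOOK AT"):
                    # "CAMERA LOOK AT <target> FROM <direction>"
                    target, _, direction = p[len("CAMERA LOOK AT "):].partition(" FROM ")
                    result["camera"]["look_at"] = target
                    result["camera"]["from"] = direction
                else:
                    key, _, value = p.partition(" ")
                    result["camera"][key.lower()] = value
    return result

example = ("CHARACTER Sam, EXPRESSION sad, POSITION at-desk, BODYPOSE sitting. "
           "CAMERA LOOK AT Sam FROM left, SHOT close-up.")
print(json.dumps(parse_shot(example), indent=2))
```

So the hard part is only the text-to-commands step, not the commands-to-JSON step.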
This is to speed up the layout of a computer-generated animation (put the required characters into a shot and roughly position them, reducing the manual effort of setting up the shot).
I tried classification (“what is the pose in this shot?”), but I need the pose of each specific character, not a single label for the whole shot.
I tried “text-to-command” in the playground on short examples and it worked pretty well (see the text-to-command playground). So I tried to fine-tune the model… not initially realizing that “davinci” is NOT shorthand for “text-davinci-003”. Fine-tuning “davinci” was not useful, and “text-davinci-003” cannot be fine-tuned (How do davinci and text-davinci-003 differ? | OpenAI Help Center). So that did not work.
So what is the best way to generate machine-readable instructions like the above from text? Structured summarization, perhaps? Or do I just inject learning examples at the start of every text-davinci-003 prompt? There are about 20 character properties I want to extract (when present), and probably 10 for the camera (a panning shot can have different settings at the start and end of the pan).
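To make the "inject learning examples" option concrete, here is a minimal sketch of few-shot prompt construction. The instruction wording, the `Text:`/`Commands:` labels, and the example pairs are my own placeholders, and the actual API call is omitted; in practice the example set would need to cover the ~20 character and ~10 camera properties:

```python
# Hypothetical few-shot examples: (shot description, expected command string).
FEWSHOT = [
    ("Sam is sitting at his desk, sad. Side closeup shot from the left.",
     "CHARACTER Sam, EXPRESSION sad, POSITION at-desk, BODYPOSE sitting. "
     "CAMERA LOOK AT Sam FROM left, SHOT close-up."),
    # ... more pairs covering the remaining character and camera properties
]

def build_prompt(shot_text: str) -> str:
    """Prepend the same worked examples to every request, then ask for the
    commands for the new shot. The completion model continues after 'Commands:'."""
    parts = ["Convert the shot description into structured layout commands.\n"]
    for src, cmd in FEWSHOT:
        parts.append(f"Text: {src}\nCommands: {cmd}\n")
    parts.append(f"Text: {shot_text}\nCommands:")
    return "\n".join(parts)

print(build_prompt("Maya leans against the doorway, smiling. Wide shot from behind."))
```

The obvious cost of this approach is that the examples are re-sent (and re-billed) with every request, which is what fine-tuning was supposed to avoid.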