Realtime API: When do we have multiple contents in an item?

In the realtime API, the content of each item is a list:

'item': {
                'id': 'item_ATXmvvZKYPLtYkuJadG5V',
                'object': 'realtime.item',
                'type': 'message',
                'status': 'completed',
                'role': 'user',
                'content': [{'type': 'input_audio', 'transcript': None}]
            }

However, in practice I have never seen an item content with more than one element. I wonder in which situations this list can contain more than that.

Thanks in advance for your replies.

The content is a collection of multi-modal text and audio content blocks.

Currently, message items on the RT API, with the role user, support input_text and input_audio content. It wouldn’t be far-fetched if, in future iterations of the API, content blocks also support images as well, like the chat completion messages.

1 Like