Which model is used for working with video (for example, extracting text from a video)?

Hello, there are detailed prices for the different models. I would like to ask which model is used for extracting data from video. For example, a video will be uploaded and I will need to extract all the text in the video (video as input, text as output).
Thank you