Image based inference to gpt4o appears to be severely impaired due to excessively naive token counting

mdear · October 27, 2024, 11:41pm

Before I post several walls of text showing the grisly details, let me summarize.

I’m processing vocational training videos in the wheelchair custom seating vertical. I’m pulling out subtitles from the English audio track, translating those subtitles into other languages, and then introducing these subtitles along with keyframes automatically qualified and pulled from the video (avoiding keyframes that are blurry or in the middle of scene changes). I ask the AI to produce summaries of various types, including describing the provided images in particular detail.

I started by providing an actual working Dropbox link (the long-lived links whose expiry time I can control via my Dropbox API) which comes up in my browser and has a content disposition similar to the link provided in the cookbook example for gpt4o (no download initiated, just display the raw image in the browser).

I was surprised when the AI acted as though it could not see the contents of the video behind the link, despite it being a public and non-expired link that I could bring up in my own browser.

I also uploaded the image to ChatGpt gpt4o and was given a rich and correct summary of the image, in stark contrast to the hesitant dialog coming out of the API inference call. After trying several dozen variations I finally had to give up on providing inline image URLs via Dropbox.

I then moved to inline base64 encoded pngs, and I quickly hit API and throughput rate limits, so I had to throttle back so significantly that the economics no longer made sense (I have a system context of 4,000 tokens and several hundred keyframes for a five minute video, yet all that would pass the rate limits was an inference request with three subtitles (in each supported language) and a single keyframe !?!?

I still hadn’t gotten any usable output from the AI, so I moved to sending jpeg inline base64 content in the hopes that the compression would allow me to get at least one good inference reply and work around some of these rate limits.

I then found that the API did naive token counting that made my use case practically useless, I think this is a bug. Here’s how it goes : as I understand it, when the API receives an inline image (either URL based or inline base64 encoded) it produces a vector (or series of vectors in ‘detail’ mode which I am using) that significantly compress the token cost of that image from many hundreds of thousands to just a few thousand tokens. Thus, my API call may be many hundreds of thousands of tokens in length “on the wire”, but the model won’t see nearly that many tokens.

I did a naive local count using tiktoken of the user and system prompts and that number was exactly what the API was bouncing back to me in its failure message, indicating that it was also doing naive counting.

I should be able to do inference on somewhere between 30 and 70 videos in a single call, but I’m lucky if I can squeeze in just a single image.

I just leveled up to Tier 3 and I’m getting acceptable inference results for a single keyframe and set of three subtitles, but this isn’t anywhere near scalable enough for my calls to make good business sense.

I’ve been sending these details to support, but there only appears to be an AI at the other end of the support requests. I have requested human intervention to confirm and squash this bug. Any suggestions on how I can unblock myself when I have a customer waiting ?

I tried both jpeg and webp encodings, and saw the same bug in both these cases.

I appreciate any suggestion the community may have identifying any rock that I may have left unturned. Respect.

mdear · October 27, 2024, 11:48pm

The following is a capture of an email chain with “support”:

This time I tried encoding my inline images as webp, and I found once again that the token count reported in the API error message made it clear it was made naively token counting the actual system and user prompts sent in via the API (which include inline base64 image reporesentations) and not instead incorporating the token counts actually expected to be sent to the model (since as I understand it the vectorization of an image will reduce many thousands of tokens to only around 2,000 tokens, even in detail mode).

I do not think I should be forced to make hundreds of inference calls to process hundreds of images individually, as this does not scale, especially since my system context is relatively large (4,000 tokens or so). I would much rather budget my input token consumption as per your published vision API rather than put a x60 “fudge factor” in to ensure your naive token counter does not exceed the full window of 128,000 when counting towards that budget tokens of the inline base64 image instead of the resulting vectorized form which is orders of magnitude more compressed.

Please respond. Anyone? I’m fast running out of options here.

Thanks,

On 2024-10-27 12:58 a.m., Myles Dear Hotmail wrote:

Now, I’m only packing three subtitles (one for English, French and Spanish) and a single keyframe along with the required system prompt, and I’m getting “string too long” errors.
string too long. Expected a string with maximum length 1048576, but got a string with length 2237388 instead.

Whaaaaaat? I am using your image based API passing in a supported file format with a supported image type that has a supported resolution and your API can’t handle the length !?!?!?!?

My keyframes are 1920 x 1080 and I am uploading them as PNGs so there is no loss in fidelity.

See below for the failure I saw when I tried to convert the PNG content to JPEG in order to save space, I then got input token cap exceeded errors!?!?

This model’s maximum context length is 128000 tokens. However, your messages resulted in 277597 tokens.

There is no way the tiktoken count for the system prompt plus the tiktoken count for the input subtitles as well as the input token budget for the vectorized form derived from the input jpeg inline base64 image (specified in your vision API documentation) comes up to that much. It appears your API is counting the JPEG inline image towards the input token count instead of the vectorized version of that image.

Token usage as counted by tiktoken blindly by adding the system and the user prompt (which contains the inline jpeg content that, as I understand it, the OpenAI API will transform to a vectorized version that consumes much less token space before handing off to the model):
277,586

This evidence and calculation suggests your API is blindly counting the user prompt before the image transformation (and token space compression) step.

Token usage as estimated given the token sizing details from the Vision API:
System prompt : 4,370

User prompt breakdown:
Subtitle 1 brings input token count to 4,419
Subtitle 2 brings input token count to 4,478
Subtitle 3 brings input token count to 4,526

1,445 tokens for the first image (85 + 170 * 8 as per vision documentation) which brings token count to 5,971

So, once again, I find myself blocked and my customer is still waiting.

What will you do to unblock me?

> 2024-10-26 21:07:27,881 DEBUG: Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': '\n You are tasked with processing wheelchair seating vocational training videos. Your goal is to analyze and summarize the video. \n\n The expected output will include:\n - "file_commentaries": Commentary about the video as a whole.\n - "subtitles": Subtitles extracted from the video, tied to specific timestamps.\n - "image_summaries": A summary of keyframes extracted from the video, based on timestamped keyframe URLs.\n - "commentaries": A section that ties together insights derived from subtitles and keyframes.\n - "source": Details about the video source, including whether it is human-created or AI-generated.\n - "metadata": Any additional metadata required to describe the context.\n\n The expected input will include:\n - A list of keyframe URLs, localized by timestamp, that represent important visual elements of the video.\n - A set of subtitles, localized by timestamp, that convey important spoken content.\n - Summaries produced by previous inference runs on a given video (empty for the first inference run)\n\n Input and output formats are valid JSON objects.\n\nA few specific details to consider:\n- You will be presented with a chronological series of subtitles in multiple languages. \nFor each language, ensure you rebuild the narrative to ensure continuity is maintained. For example, if one subtitle says "It\'s really cool actually how you can convert a body point padded one" and the next subtitle says "and a half inch belt with plastic side released buckle on it" you should come to the conclusion that the trainer is showing a one and a half inch belt, not a half inch belt. Precision is extremely important to maintain when you are summarizing. \n\n\n\n\nHere is a list of abbreviations that the content producer uses to compose video filenames and series names. It is in a form of a table in which the abbreviation is further explained. Use this information to better understand the intention of any video or series name.\n\nAbbreviation List\nASBS : ASSEMBLY BRACKETS STEEL (FLATS)\nASBA : ASSEMBLY BRACKETS ALUMINUM (FLATS)\nBC : BACK COVER\nBCU : BACK CUSHION\nBINTH : BACK INTERFACE HARDWARE\nBINT : BACK INTERFACE\nBT : BLACK TRAY\nCFS : CALF SUPPORT\nCH : CUP HOLDER\nCHAC : CHAIR ACCESSORIES\nCHR : CUSTOM HEADREST\nCOHR : COMMERCIAL HEADREST\nCH : CUP HOLDER\nCOB : COMMERCIAL BACK\nCUP : CUSTOM POSITIONING STRAP\nCOP : COMMERCIAL POSITIOINING STRAP\nCOS : COMMERCIAL SEAT\nCT : CLEAR TRAY\nCOMP : COMPRESSION SPRING\nFTB : FOOTBOX\nFTR : FOOTREST\nFIP : FOAM IN PLACE\nHLT : HINGED LAP TRAY\nLAT : LATERAL\nMOB : MOLDED BACK\nMODS : MODIFICATIONS\nPAD : HANGERS AND ARMREST\nP - bjb : Portable bottle jack bender\nREP : REPAIRS\nROHO : SEAT AND BACK BOLSTERS\nSB : SUPPORT BRACKETS\nSC : SEAT COVER\nSCU : SEAT CUSHION\nSINT : SEAT INTERFACE\nSKI : SIT SKI\nSLF : SLENDERFENDER FIT KIT\nTEM : TRAY EASY MOUNT\nTSEM : THREAD SLED EASY MOUNT (TOOL LESS ADJUSTMENT)\n\n\nThe user prompt (including the contents of the content/text block of the user role) shall be in JSON format, as per the following specification:\n\nThe user input prompt format is as follows:\n{\nrole: user,\ncontent : [\n{type: text ,\ntext : "{\n\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video\'s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a cumulative summary of the video from its beginning to the currently analyzed video chunk. It may be empty if this is the first chunk of the video being analyzed and no prior inference has produce commentaries thus far. This input consists of previously generated summary output from the current video�s previously analyzed chunks (if any) to ensure all file summary input given so far is considered when asking for a new summary to be built considering the current video chunk. \n{file_commentary : # A string field containing a cumulative summary of the video, ultimately ensuring that the file is summarized using all subtitle and image keyframe data presented this far. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\nsubtitles: [ # This list may be empty if the earliest image timestamp is less than the first subtitle timestamp.\n {timestamp : #string, in srt format\n subtitle : #contents of subtitle generated from the audio track of the video \n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\nimage_summaries: [# Initially empty, this contains a summary of each image analyzed thus far. The summary must be based on the visual elements of the actual image or images included and must only describe actual visual elements present in the images. The number of objects in this list for a given timestamp is expected to be the number of languages for which subtitles are provided in the user input context. The purpose of this list is to provide context to generate commentaries and cumulative file summaries that draw from multimodal subtitle and image inputs. If this is the first chunk of a video this field may be empty. \n{image_summary: # A detailed textual summary of the image that would contain enough information to teach a skilled intern how to accurately and correctly imitate the skill being demonstrated. Include the items seen in the image, the tools being used, the products being used and worked on, the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Locate the current step in an overarching set of steps and phases similar to a table of contents as many videos and series of videos represent an ordered sequence of vocational skills to accomplish a goal. Even if this summary is viewed out of order it should contain enough detail to locate it in a series of steps. For example : "now that x, y and z are complete as part of phase n, the teacher is now working on step w". Include the motion being represented (ie, cutting, gluing, attaching, bending, punching, sanding, welding) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\nlanguage : #iso639-1 two-character language of summary\n}\n],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image_summaries . Commentaries are used as the sole system context when generating e_learning content and the generated e_learning content is the sole input to fine-tuning in pass2 so each generated artifact must faithfully capture the essence of the multimodal input provided from a vocational training point of view. If this is the first chunk of a video this field may be empty. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from inage summaries derived from keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. To clarify, summarization is not considered to be enhancement but rather distilling existing content. Enhancement is considered to be adding original creative content to preexisting content. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata is provided in all required languages. If this is the first chunk of a video this field may be empty. \n{\nseats_products: [ # cumulative list of the names of seats products referenced in the video so far taken from https://seatshardware.com/collections/all. \n {seats_product: # For example, Thread Sled Easy Mount (TSEM)\n\n#The seats_product metadata keyword must contain a reference to one of the products referenced in the following list of Seats products, accurate as of October 2024, which contains for each product a one line description suffixed by the product URL. This is a summary of https://seatshardware.com/collections/all. \n# 22.5 Degree Disc Assembly - Modular mounting system for seating components. URL: https://seatshardware.com/products/22-5degdiscass\n# Headrest Hardware Repair Kit - Reinforcement kit for i2i linkage styled headrest hardware. URL: https://seatshardware.com/products/headrest-hardware-repair-reinforcement-kit\n# Heavy Duty Support Brackets - Bendable aluminum brackets for custom seating applications. URL: https://seatshardware.com/products/heavy-duty-support-brackets-bendable-flats\n# Joystick Bumper Thumper Kit - Protection system for wheelchair joystick assemblies. URL: https://seatshardware.com/products/joystick-bumper-thumper-kit\n# Just Disc It, For Trays - Disc-based attachment for wheelchair trays. URL: https://seatshardware.com/products/just-disc-it-for-trays\n# PL 003AL12 Aluminum Assembly Brackets - Bendable aluminum assembly brackets. URL: https://seatshardware.com/products/pl-003al12-aluminum-assembly-brackets\n# PL 003ALJH HD Aluminum J-Hooks - Rubber-lined hooks for seat pan installation. URL: https://seatshardware.com/products/pl-003aljh-hd-aluminum-rubber-lined-j-hooks\n# PL 003ST22 Steel Assembly Brackets - Steel brackets for mounting wheelchair accessories. URL: https://seatshardware.com/products/pl-003st22-steel-assembly-brackets\n# Portable Bottle Jack Bender - Portable tool for bending support brackets. URL: https://seatshardware.com/products/portable-bottle-jack-bender\n# SlenderFenders Fit Kits - Wheelchair fender kits designed for various wheel sizes. URL: https://seatshardware.com/products/slender-fender-wheelchair-fenders\n# Space Saver Back-Seat Interface - Aluminum interface for wheelchair seating systems. URL: https://seatshardware.com/products/space-saver-back-seat-interface\n# Swing Away Laterals� Hardware Kit - Kit compatible with Sunrise Medical J3 swing away hardware. URL: https://seatshardware.com/products/swing-away-laterals-hardware-kit\n# Thread Sled Easy Mount Base Model - Adjustable mounting system for custom seating. URL: https://seatshardware.com/products/thread-sled-easy-mount-base-model\n# Thread Sled Easy Mount Headrest Kit - Tool-less headrest mounting system for wheelchairs. URL: https://seatshardware.com/products/thread-sled-easy-mount-headrest-kit\n# Tray Easy Mount - System for attaching custom-built trays to wheelchairs. URL: https://seatshardware.com/products/tray-easy-mount\n\n\nproduct_url: # url of product referenced, for example, https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products :\n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : https://www.sunrisemedical.com/manual-wheelchairs/quickie . Another example : https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n}}"},\n\n{type: image_url,\nimage_url : { url : dropbox url to image},\ntimestamp : string, in srt format, accurate to the millisecond }\n]\n}\n\nThat covers the input format.\n\n\n\n\nRegarding token space management if token space becomes tight, blocks of potentially redundant image summaries may be consolidated by expressing a summary with a list of timestamps. \n\nBe careful to not over summarize however. If the keyframe differences are important in order to properly capture the required vocational skill being used at that moment then avoid removing detail. For example, cutting foam at two different angles may be crucial to the skill being demonstrated. Cutting foam the exact same way may be considered duplication. \n\nAlso, note that identical steps could exist in many different processes so if you collapse identical steps if their context differs then combine details appropriately. For example, the same kind of cut could be made to a piece of foam as part of a new cushion or a cushion repair. If two such summaries are combined then the summary must indicate this step could be executed as part of a cushion creation or repair. \n\nEnsure that any summarization is done only on truly redundant data. \n\n\ngenerate an output in the following JSON format:\n{\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video�s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a summary of the video from the beginning to the most recently analyzed video chunk. This output feeds back in to future inputs to ensure all subtitle and image input given so far is properly summarized. This field is cumulative and is expected no grow over time as file summaries produced are combined with file summaries already present in the input as each video chunk is processed. This summary is expected to be produced from both subtitle and image multimodal inputs. \n{file_commentary : # A string field containing a summary of the video, utilizing data from both the subtitles and image keyframes in the input context and merging with the previous file summary if provided in the input context, ensuring that the commentary covers the entire video up to this point. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\n\nimage_summaries : # A list of objects describing each image present in the input user context. For each image timestamp, one object for each subtitle language is expected to be described. \n[\n{\ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\n\nimage_summary: # A detailed summary of the image that would contain enough information to teach a skilled intern how to imitate the skill being demonstrated. The summary must be based on the visual elements of the actual image or images included and must only describe visual elements actually present in the images. Include the items seen in the image, the tools being used, the products being used and worked on, the the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Include the motion being represented (ie, cutting, gluing, attaching, bending) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \n\nlanguage : # The iso 639-1 two-character language abbreviation\n}],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image summaries without loss of the intrinsic training detail. Commentaries are used as the sole system context to produce e_learning content which in turn are the sole input when doing fine-tuning in pass2 so they must faithfully capture the vocational training essence of the multimodal input provided. Sequences of operations must be broken up into logical steps and each commentary must include enough contextual detail to stand on its own if accessed directly by a user who does not consult neighbouring commentary blocks. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from textual summaries of keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually. \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata must be generated in all required languages. \n{\nseats_products: [{\nseats_product: # cumulative list of the names of distinct seats products referenced in the video so far from https://seatshardware.com/collections/all. For example, Thread Sled Easy Mount (TSEM). This list must not contain duplicates. \n\nproduct_url: # url of product referenced, for example, https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products : # This list must not contain duplicates. \n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : https://www.sunrisemedical.com/manual-wheelchairs/quickie . Another example : https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n\n}\n}\n\n'}, {'role': 'user', 'content': '"[{"type": "text", "text": "{\\"file_commentaries\\": [], \\"subtitles\\": [{\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hello and welcome to Seats. I want to show you a few videos on something I\'m\\", \\"language\\": \\"en\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Bonjour et bienvenue \\\\u00e0 Seats. je veux vous montrer quelques vid\\\\u00e9os sur quelque chose que je suis\\", \\"language\\": \\"fr\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hola y bienvenido a Seats. quiero mostrarle algunos videos sobre algo que soy\\", \\"language\\": \\"es\\"}], \\"image_summaries\\": [], \\"commentaries\\": [], \\"source\\": \\"HumanCreated\\", \\"metadata\\": {}, \\"file_path\\": \\"HOW TO CONVERT A PADDED BODYPOINT STRAP INTO WRIST CUFFS FOR BOTH YOUR CLIENT\'S SAFETY AND HYGIENE (Video\'s 1-9RCTCOP)/1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4\\"}"}, {"type": "image_url", "image_url": {"url": ["data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... RXxVvlAXRwKG8qX1V8UnH8B3jby3fY1LwAAAAAAElFTkSuQmCC"](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE...[truncated]...RXxVvlAXRwKG8qX1V8UnH8B3jby3fY1LwAAAAAAElFTkSuQmCC)", "detail": "high", "timestamp": "00:00:02,906"}}]'}], 'model': 'gpt-4o', 'frequency_penalty': 0, 'max_tokens': 15000, 'presence_penalty': 0, 'temperature': 0.2}}
> 2024-10-26 21:07:27,895 DEBUG: Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
> 2024-10-26 21:07:27,897 DEBUG: connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
> 2024-10-26 21:07:27,946 DEBUG: connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7feaa7c543a0>
> 2024-10-26 21:07:27,946 DEBUG: start_tls.started ssl_context=<ssl.SSLContext object at 0x7feaab6a7ec0> server_hostname='api.openai.com' timeout=5.0
> 2024-10-26 21:07:27,957 DEBUG: start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7feaa7c54370>
> 2024-10-26 21:07:27,958 DEBUG: send_request_headers.started request=<Request [b'POST']>
> 2024-10-26 21:07:27,959 DEBUG: send_request_headers.complete
> 2024-10-26 21:07:27,960 DEBUG: send_request_body.started request=<Request [b'POST']>
> 2024-10-26 21:07:28,072 DEBUG: send_request_body.complete
> 2024-10-26 21:07:28,072 DEBUG: receive_response_headers.started request=<Request [b'POST']>
> 2024-10-26 21:07:28,460 DEBUG: receive_response_headers.complete return_value=(b'HTTP/1.1', 400, b'Bad Request', [(b'Date', b'Sun, 27 Oct 2024 01:07:29 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'290'), (b'Connection', b'keep-alive'), (b'access-control-expose-headers', b'X-Request-ID'), (b'openai-organization', b'user-m8spcgbdft4mtc5walogmvdr'), (b'openai-processing-ms', b'79'), (b'openai-version', b'2020-10-01'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'800000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'235490'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'42.338s'), (b'x-request-id', b'req_6d8561a71dd342e16db9bea54ace3376'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains; preload'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Set-Cookie', b'__cf_bm=yM_H4K_Skce.eu9BHWcqR30dZh4pl9NB8hhMlFOUpuw-1729991249-1.0.1.1-1ScK7nz34q8mKMJZT58CXTfok3uhIIfXRHvPDtxJlJmT7SidSeBfwYLTUKjeLarnFR0iX8xu7cWaYdiz5bbToQ; path=/; expires=Sun, 27-Oct-24 01:37:29 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'X-Content-Type-Options', b'nosniff'), (b'Set-Cookie', b'_cfuvid=rYa76V_wIF102OqieEbs17JUjU3W_ZEdVSHqRQVGHco-1729991249588-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8d8eca1add39a2b7-YUL'), (b'alt-svc', b'h3=":443"; ma=86400')])
> 2024-10-26 21:07:28,466 INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
> 2024-10-26 21:07:28,467 DEBUG: receive_response_body.started request=<Request [b'POST']>
> 2024-10-26 21:07:28,468 DEBUG: receive_response_body.complete
> 2024-10-26 21:07:28,469 DEBUG: response_closed.started
> 2024-10-26 21:07:28,469 DEBUG: response_closed.complete
> 2024-10-26 21:07:28,470 DEBUG: HTTP Response: POST https://api.openai.com/v1/chat/completions "400 Bad Request" Headers([('date', 'Sun, 27 Oct 2024 01:07:29 GMT'), ('content-type', 'application/json'), ('content-length', '290'), ('connection', 'keep-alive'), ('access-control-expose-headers', 'X-Request-ID'), ('openai-organization', 'user-m8spcgbdft4mtc5walogmvdr'), ('openai-processing-ms', '79'), ('openai-version', '2020-10-01'), ('x-ratelimit-limit-requests', '5000'), ('x-ratelimit-limit-tokens', '800000'), ('x-ratelimit-remaining-requests', '4999'), ('x-ratelimit-remaining-tokens', '235490'), ('x-ratelimit-reset-requests', '12ms'), ('x-ratelimit-reset-tokens', '42.338s'), ('x-request-id', 'req_6d8561a71dd342e16db9bea54ace3376'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=yM_H4K_Skce.eu9BHWcqR30dZh4pl9NB8hhMlFOUpuw-1729991249-1.0.1.1-1ScK7nz34q8mKMJZT58CXTfok3uhIIfXRHvPDtxJlJmT7SidSeBfwYLTUKjeLarnFR0iX8xu7cWaYdiz5bbToQ; path=/; expires=Sun, 27-Oct-24 01:37:29 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('x-content-type-options', 'nosniff'), ('set-cookie', '_cfuvid=rYa76V_wIF102OqieEbs17JUjU3W_ZEdVSHqRQVGHco-1729991249588-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8d8eca1add39a2b7-YUL'), ('alt-svc', 'h3=":443"; ma=86400')])
> 2024-10-26 21:07:28,471 DEBUG: request_id: req_6d8561a71dd342e16db9bea54ace3376
> 2024-10-26 21:07:28,471 DEBUG: Encountered httpx.HTTPStatusError
> Traceback (most recent call last):
> File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/openai/_base_client.py", line 1037, in _request
> response.raise_for_status()
> File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
> raise HTTPStatusError(message, request=request, response=self)
> httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://api.openai.com/v1/chat/completions'
> For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
> 2024-10-26 21:07:28,474 DEBUG: Not retrying
> 2024-10-26 21:07:28,475 DEBUG: Re-raising status error
> 2024-10-26 21:07:31,001 ERROR: Failed to perform inference for ./1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4: Error code: 400 - {'error': {'message': "Invalid 'messages[1].content': string too long. Expected a string with maximum length 1048576, but got a string with length 2237388 instead.", 'type': 'invalid_request_error', 'param': 'messages[1].content', 'code': 'string_above_max_length'}}
>
>

mdear · October 27, 2024, 11:50pm

I will now show logs from my failed API inference call with three subtitles and one keyframe, this time failing due to input context window exceeded:

> 2024-10-26 21:24:41,118 DEBUG: Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': '\n You are tasked with processing wheelchair seating vocational training videos. Your goal is to analyze and summarize the video. \n\n The expected output will include:\n - "file_commentaries": Commentary about the video as a whole.\n - "subtitles": Subtitles extracted from the video, tied to specific timestamps.\n - "image_summaries": A summary of keyframes extracted from the video, based on timestamped keyframe URLs.\n - "commentaries": A section that ties together insights derived from subtitles and keyframes.\n - "source": Details about the video source, including whether it is human-created or AI-generated.\n - "metadata": Any additional metadata required to describe the context.\n\n The expected input will include:\n - A list of keyframe URLs, localized by timestamp, that represent important visual elements of the video.\n - A set of subtitles, localized by timestamp, that convey important spoken content.\n - Summaries produced by previous inference runs on a given video (empty for the first inference run)\n\n Input and output formats are valid JSON objects.\n\nA few specific details to consider:\n- You will be presented with a chronological series of subtitles in multiple languages. \nFor each language, ensure you rebuild the narrative to ensure continuity is maintained. For example, if one subtitle says "It\'s really cool actually how you can convert a body point padded one" and the next subtitle says "and a half inch belt with plastic side released buckle on it" you should come to the conclusion that the trainer is showing a one and a half inch belt, not a half inch belt. Precision is extremely important to maintain when you are summarizing. \n\n\n\n\nHere is a list of abbreviations that the content producer uses to compose video filenames and series names. It is in a form of a table in which the abbreviation is further explained. Use this information to better understand the intention of any video or series name.\n\nAbbreviation List\nASBS : ASSEMBLY BRACKETS STEEL (FLATS)\nASBA : ASSEMBLY BRACKETS ALUMINUM (FLATS)\nBC : BACK COVER\nBCU : BACK CUSHION\nBINTH : BACK INTERFACE HARDWARE\nBINT : BACK INTERFACE\nBT : BLACK TRAY\nCFS : CALF SUPPORT\nCH : CUP HOLDER\nCHAC : CHAIR ACCESSORIES\nCHR : CUSTOM HEADREST\nCOHR : COMMERCIAL HEADREST\nCH : CUP HOLDER\nCOB : COMMERCIAL BACK\nCUP : CUSTOM POSITIONING STRAP\nCOP : COMMERCIAL POSITIOINING STRAP\nCOS : COMMERCIAL SEAT\nCT : CLEAR TRAY\nCOMP : COMPRESSION SPRING\nFTB : FOOTBOX\nFTR : FOOTREST\nFIP : FOAM IN PLACE\nHLT : HINGED LAP TRAY\nLAT : LATERAL\nMOB : MOLDED BACK\nMODS : MODIFICATIONS\nPAD : HANGERS AND ARMREST\nP - bjb : Portable bottle jack bender\nREP : REPAIRS\nROHO : SEAT AND BACK BOLSTERS\nSB : SUPPORT BRACKETS\nSC : SEAT COVER\nSCU : SEAT CUSHION\nSINT : SEAT INTERFACE\nSKI : SIT SKI\nSLF : SLENDERFENDER FIT KIT\nTEM : TRAY EASY MOUNT\nTSEM : THREAD SLED EASY MOUNT (TOOL LESS ADJUSTMENT)\n\n\nThe user prompt (including the contents of the content/text block of the user role) shall be in JSON format, as per the following specification:\n\nThe user input prompt format is as follows:\n{\nrole: user,\ncontent : [\n{type: text ,\ntext : "{\n\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video\'s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a cumulative summary of the video from its beginning to the currently analyzed video chunk. It may be empty if this is the first chunk of the video being analyzed and no prior inference has produce commentaries thus far. This input consists of previously generated summary output from the current video�s previously analyzed chunks (if any) to ensure all file summary input given so far is considered when asking for a new summary to be built considering the current video chunk. \n{file_commentary : # A string field containing a cumulative summary of the video, ultimately ensuring that the file is summarized using all subtitle and image keyframe data presented this far. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\nsubtitles: [ # This list may be empty if the earliest image timestamp is less than the first subtitle timestamp.\n {timestamp : #string, in srt format\n subtitle : #contents of subtitle generated from the audio track of the video \n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\nimage_summaries: [# Initially empty, this contains a summary of each image analyzed thus far. The summary must be based on the visual elements of the actual image or images included and must only describe actual visual elements present in the images. The number of objects in this list for a given timestamp is expected to be the number of languages for which subtitles are provided in the user input context. The purpose of this list is to provide context to generate commentaries and cumulative file summaries that draw from multimodal subtitle and image inputs. If this is the first chunk of a video this field may be empty. \n{image_summary: # A detailed textual summary of the image that would contain enough information to teach a skilled intern how to accurately and correctly imitate the skill being demonstrated. Include the items seen in the image, the tools being used, the products being used and worked on, the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Locate the current step in an overarching set of steps and phases similar to a table of contents as many videos and series of videos represent an ordered sequence of vocational skills to accomplish a goal. Even if this summary is viewed out of order it should contain enough detail to locate it in a series of steps. For example : "now that x, y and z are complete as part of phase n, the teacher is now working on step w". Include the motion being represented (ie, cutting, gluing, attaching, bending, punching, sanding, welding) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\nlanguage : #iso639-1 two-character language of summary\n}\n],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image_summaries . Commentaries are used as the sole system context when generating e_learning content and the generated e_learning content is the sole input to fine-tuning in pass2 so each generated artifact must faithfully capture the essence of the multimodal input provided from a vocational training point of view. If this is the first chunk of a video this field may be empty. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from inage summaries derived from keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. To clarify, summarization is not considered to be enhancement but rather distilling existing content. Enhancement is considered to be adding original creative content to preexisting content. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata is provided in all required languages. If this is the first chunk of a video this field may be empty. \n{\nseats_products: [ # cumulative list of the names of seats products referenced in the video so far taken from https://seatshardware.com/collections/all. \n {seats_product: # For example, Thread Sled Easy Mount (TSEM)\n\n#The seats_product metadata keyword must contain a reference to one of the products referenced in the following list of Seats products, accurate as of October 2024, which contains for each product a one line description suffixed by the product URL. This is a summary of https://seatshardware.com/collections/all. \n# 22.5 Degree Disc Assembly - Modular mounting system for seating components. URL: https://seatshardware.com/products/22-5degdiscass\n# Headrest Hardware Repair Kit - Reinforcement kit for i2i linkage styled headrest hardware. URL: https://seatshardware.com/products/headrest-hardware-repair-reinforcement-kit\n# Heavy Duty Support Brackets - Bendable aluminum brackets for custom seating applications. URL: https://seatshardware.com/products/heavy-duty-support-brackets-bendable-flats\n# Joystick Bumper Thumper Kit - Protection system for wheelchair joystick assemblies. URL: https://seatshardware.com/products/joystick-bumper-thumper-kit\n# Just Disc It, For Trays - Disc-based attachment for wheelchair trays. URL: https://seatshardware.com/products/just-disc-it-for-trays\n# PL 003AL12 Aluminum Assembly Brackets - Bendable aluminum assembly brackets. URL: https://seatshardware.com/products/pl-003al12-aluminum-assembly-brackets\n# PL 003ALJH HD Aluminum J-Hooks - Rubber-lined hooks for seat pan installation. URL: https://seatshardware.com/products/pl-003aljh-hd-aluminum-rubber-lined-j-hooks\n# PL 003ST22 Steel Assembly Brackets - Steel brackets for mounting wheelchair accessories. URL: https://seatshardware.com/products/pl-003st22-steel-assembly-brackets\n# Portable Bottle Jack Bender - Portable tool for bending support brackets. URL: https://seatshardware.com/products/portable-bottle-jack-bender\n# SlenderFenders Fit Kits - Wheelchair fender kits designed for various wheel sizes. URL: https://seatshardware.com/products/slender-fender-wheelchair-fenders\n# Space Saver Back-Seat Interface - Aluminum interface for wheelchair seating systems. URL: https://seatshardware.com/products/space-saver-back-seat-interface\n# Swing Away Laterals� Hardware Kit - Kit compatible with Sunrise Medical J3 swing away hardware. URL: https://seatshardware.com/products/swing-away-laterals-hardware-kit\n# Thread Sled Easy Mount Base Model - Adjustable mounting system for custom seating. URL: https://seatshardware.com/products/thread-sled-easy-mount-base-model\n# Thread Sled Easy Mount Headrest Kit - Tool-less headrest mounting system for wheelchairs. URL: https://seatshardware.com/products/thread-sled-easy-mount-headrest-kit\n# Tray Easy Mount - System for attaching custom-built trays to wheelchairs. URL: https://seatshardware.com/products/tray-easy-mount\n\n\nproduct_url: # url of product referenced, for example, https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products :\n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : https://www.sunrisemedical.com/manual-wheelchairs/quickie . Another example : https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n}}"},\n\n{type: image_url,\nimage_url : { url : dropbox url to image},\ntimestamp : string, in srt format, accurate to the millisecond }\n]\n}\n\nThat covers the input format.\n\n\n\n\nRegarding token space management if token space becomes tight, blocks of potentially redundant image summaries may be consolidated by expressing a summary with a list of timestamps. \n\nBe careful to not over summarize however. If the keyframe differences are important in order to properly capture the required vocational skill being used at that moment then avoid removing detail. For example, cutting foam at two different angles may be crucial to the skill being demonstrated. Cutting foam the exact same way may be considered duplication. \n\nAlso, note that identical steps could exist in many different processes so if you collapse identical steps if their context differs then combine details appropriately. For example, the same kind of cut could be made to a piece of foam as part of a new cushion or a cushion repair. If two such summaries are combined then the summary must indicate this step could be executed as part of a cushion creation or repair. \n\nEnsure that any summarization is done only on truly redundant data. \n\n\ngenerate an output in the following JSON format:\n{\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video�s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a summary of the video from the beginning to the most recently analyzed video chunk. This output feeds back in to future inputs to ensure all subtitle and image input given so far is properly summarized. This field is cumulative and is expected no grow over time as file summaries produced are combined with file summaries already present in the input as each video chunk is processed. This summary is expected to be produced from both subtitle and image multimodal inputs. \n{file_commentary : # A string field containing a summary of the video, utilizing data from both the subtitles and image keyframes in the input context and merging with the previous file summary if provided in the input context, ensuring that the commentary covers the entire video up to this point. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\n\nimage_summaries : # A list of objects describing each image present in the input user context. For each image timestamp, one object for each subtitle language is expected to be described. \n[\n{\ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\n\nimage_summary: # A detailed summary of the image that would contain enough information to teach a skilled intern how to imitate the skill being demonstrated. The summary must be based on the visual elements of the actual image or images included and must only describe visual elements actually present in the images. Include the items seen in the image, the tools being used, the products being used and worked on, the the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Include the motion being represented (ie, cutting, gluing, attaching, bending) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \n\nlanguage : # The iso 639-1 two-character language abbreviation\n}],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image summaries without loss of the intrinsic training detail. Commentaries are used as the sole system context to produce e_learning content which in turn are the sole input when doing fine-tuning in pass2 so they must faithfully capture the vocational training essence of the multimodal input provided. Sequences of operations must be broken up into logical steps and each commentary must include enough contextual detail to stand on its own if accessed directly by a user who does not consult neighbouring commentary blocks. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from textual summaries of keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually. \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata must be generated in all required languages. \n{\nseats_products: [{\nseats_product: # cumulative list of the names of distinct seats products referenced in the video so far from https://seatshardware.com/collections/all. For example, Thread Sled Easy Mount (TSEM). This list must not contain duplicates. \n\nproduct_url: # url of product referenced, for example, https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products : # This list must not contain duplicates. \n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : https://www.sunrisemedical.com/manual-wheelchairs/quickie . Another example : https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n\n}\n}\n\n'}, {'role': 'user', 'content': '"[{"type": "text", "text": "{\\"file_commentaries\\": [], \\"subtitles\\": [{\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hello and welcome to Seats. I want to show you a few videos on something I\'m\\", \\"language\\": \\"en\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Bonjour et bienvenue \\\\u00e0 Seats. je veux vous montrer quelques vid\\\\u00e9os sur quelque chose que je suis\\", \\"language\\": \\"fr\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hola y bienvenido a Seats. quiero mostrarle algunos videos sobre algo que soy\\", \\"language\\": \\"es\\"}], \\"image_summaries\\": [], \\"commentaries\\": [], \\"source\\": \\"HumanCreated\\", \\"metadata\\": {}, \\"file_path\\": \\"HOW TO CONVERT A PADDED BODYPOINT STRAP INTO WRIST CUFFS FOR BOTH YOUR CLIENT\'S SAFETY AND HYGIENE (Video\'s 1-9RCTCOP)/1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4\\"}"}, {"type": "image_url", "image_url": {"url": ["data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAIBAQEBAQIBAQECAg ... [truncated] ... SUDShjkfrxTC4ON79ffiiipiAJLKTsOP8AepPtEkf8QooqQP/Z"](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAIBAQEBAQIBAQECAg...[truncated]...SUDShjkfrxTC4ON79ffiiipiAJLKTsOP8AepPtEkf8QooqQP/Z)", "detail": "high", "timestamp": "00:00:02,906"}}]'}], 'model': 'gpt-4o', 'frequency_penalty': 0, 'max_tokens': 15000, 'presence_penalty': 0, 'temperature': 0.2}}
> 2024-10-26 21:24:41,124 DEBUG: Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
> 2024-10-26 21:24:41,125 DEBUG: connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
> 2024-10-26 21:24:41,189 DEBUG: connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f45f8c6de70>
> 2024-10-26 21:24:41,190 DEBUG: start_tls.started ssl_context=<ssl.SSLContext object at 0x7f46b42bec40> server_hostname='api.openai.com' timeout=5.0
> 2024-10-26 21:24:41,202 DEBUG: start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f45f8c6de40>
> 2024-10-26 21:24:41,202 DEBUG: send_request_headers.started request=<Request [b'POST']>
> 2024-10-26 21:24:41,204 DEBUG: send_request_headers.complete
> 2024-10-26 21:24:41,204 DEBUG: send_request_body.started request=<Request [b'POST']>
> 2024-10-26 21:24:41,226 DEBUG: send_request_body.complete
> 2024-10-26 21:24:41,227 DEBUG: receive_response_headers.started request=<Request [b'POST']>
> 2024-10-26 21:24:41,672 DEBUG: receive_response_headers.complete return_value=(b'HTTP/1.1', 400, b'Bad Request', [(b'Date', b'Sun, 27 Oct 2024 01:24:42 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'284'), (b'Connection', b'keep-alive'), (b'access-control-expose-headers', b'X-Request-ID'), (b'openai-organization', b'user-m8spcgbdft4mtc5walogmvdr'), (b'openai-processing-ms', b'332'), (b'openai-version', b'2020-10-01'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'800000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'693580'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'7.981s'), (b'x-request-id', b'req_05aa955aab3e928b9ac7998de51204fa'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains; preload'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Set-Cookie', b'__cf_bm=iMDPQKmwQ1q5junQFy4mRvxRLlGUmDYDW8ZyCAR6qxk-1729992282-1.0.1.1-PpbUnmjzDxjx.QWnXEcc1lgC5ZHXdyacC9inb.1CzT8Y1EDD0xfqkASmmdsSzKP8ltEC5_L.bCZyUjUTTsSgiQ; path=/; expires=Sun, 27-Oct-24 01:54:42 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'X-Content-Type-Options', b'nosniff'), (b'Set-Cookie', b'_cfuvid=YMs.Fty7Ao6Srk_R5mO3gU4rf62RHaLEBDxKDDB.gnM-1729992282809-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8d8ee354bca6a2bd-YUL'), (b'alt-svc', b'h3=":443"; ma=86400')])
> 2024-10-26 21:24:41,675 INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
> 2024-10-26 21:24:41,675 DEBUG: receive_response_body.started request=<Request [b'POST']>
> 2024-10-26 21:24:41,676 DEBUG: receive_response_body.complete
> 2024-10-26 21:24:41,676 DEBUG: response_closed.started
> 2024-10-26 21:24:41,677 DEBUG: response_closed.complete
> 2024-10-26 21:24:41,677 DEBUG: HTTP Response: POST https://api.openai.com/v1/chat/completions "400 Bad Request" Headers([('date', 'Sun, 27 Oct 2024 01:24:42 GMT'), ('content-type', 'application/json'), ('content-length', '284'), ('connection', 'keep-alive'), ('access-control-expose-headers', 'X-Request-ID'), ('openai-organization', 'user-m8spcgbdft4mtc5walogmvdr'), ('openai-processing-ms', '332'), ('openai-version', '2020-10-01'), ('x-ratelimit-limit-requests', '5000'), ('x-ratelimit-limit-tokens', '800000'), ('x-ratelimit-remaining-requests', '4999'), ('x-ratelimit-remaining-tokens', '693580'), ('x-ratelimit-reset-requests', '12ms'), ('x-ratelimit-reset-tokens', '7.981s'), ('x-request-id', 'req_05aa955aab3e928b9ac7998de51204fa'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=iMDPQKmwQ1q5junQFy4mRvxRLlGUmDYDW8ZyCAR6qxk-1729992282-1.0.1.1-PpbUnmjzDxjx.QWnXEcc1lgC5ZHXdyacC9inb.1CzT8Y1EDD0xfqkASmmdsSzKP8ltEC5_L.bCZyUjUTTsSgiQ; path=/; expires=Sun, 27-Oct-24 01:54:42 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('x-content-type-options', 'nosniff'), ('set-cookie', '_cfuvid=YMs.Fty7Ao6Srk_R5mO3gU4rf62RHaLEBDxKDDB.gnM-1729992282809-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8d8ee354bca6a2bd-YUL'), ('alt-svc', 'h3=":443"; ma=86400')])
> 2024-10-26 21:24:41,678 DEBUG: request_id: req_05aa955aab3e928b9ac7998de51204fa
> 2024-10-26 21:24:41,678 DEBUG: Encountered httpx.HTTPStatusError
> Traceback (most recent call last):
> File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/openai/_base_client.py", line 1037, in _request
> response.raise_for_status()
> File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
> raise HTTPStatusError(message, request=request, response=self)
> httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://api.openai.com/v1/chat/completions'
> For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
> 2024-10-26 21:24:41,680 DEBUG: Not retrying
> 2024-10-26 21:24:41,680 DEBUG: Re-raising status error
> 2024-10-26 21:24:51,069 ERROR: Failed to perform inference for ./1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 277597 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
>
>
>
>
>
>
>
>

On 2024-10-26 7:20 p.m., Myles Dear Hotmail wrote:

Is there actually a human being on the other end of this support line?

I’m hearing the same advice again and again, despite me saying again and again I have covered off all the points and still the feature doesn’t work.

At the very least you could tell me the rules driving the declaration of “429 - Too Many Requests” as I stepped away from my terminal for a few hours, retried, and am still getting this error. I paid enough to promote me up from Tier 2 to Tier 3 but still I’m not able to push even a single request through.

You can do better.

mdear · October 27, 2024, 11:51pm

On 2024-10-26 3:16 p.m., Marielle from OpenAI wrote:

Hello Myles,

Thank you for reaching out to OpenAI support.

We’re sorry to hear about the issues you’re experiencing with the API-based multimodal image interpretation. We understand how critical this functionality is for your business, and we appreciate the detailed information you’ve provided.

Here are some steps and considerations to help troubleshoot and resolve the issues:

Token Issues and Image URL Handling

Token Budget and Image URL Handling: When using image URLs, the token cost is primarily for the URL itself, not the image content. The image content is processed separately, and the token cost for the vectorized image data is applied. If the token budget is exceeded, the API should return an appropriate error code. However, if the image URL is not accessible or the image cannot be processed, the model might guess based on the available context.

Base64 Image Data: Using base64 encoded image data can be more reliable as it ensures the image is directly included in the request. This method avoids potential issues with URL accessibility. The token cost for base64 encoded images is based on the vectorized data, not the raw base64 string.

Troubleshooting Steps

Model Version: Ensure you are using the same model version (gpt-4o) for both API and ChatGPT to maintain consistency in responses.

Detail Parameter: Experiment with the detail parameter set to “high” to see if it improves the level of detail in the image analysis.

Image Accessibility: Double-check that the Dropbox image URL is publicly accessible and not restricted. Test the URL in a browser to ensure it loads correctly.

API Key Permissions: Verify that your API key has the necessary permissions and hasn’t reached any usage limits.

Review API Documentation: Revisit the Vision - OpenAI API documentation to ensure there haven’t been any updates or changes that might affect how image URLs are processed.

Addressing Rate Limits and Token Errors

Rate Limit Exceeded: The error message indicates that the request exceeds the token per minute (TPM) limit. This can happen if the combined input and output tokens are too high. Consider breaking down the request into smaller chunks or reducing the number of images processed in a single call.

Higher Tier Access: If you frequently encounter rate limits, consider upgrading to a higher tier to increase your token limits.

Please let us know how it goes, and if there’s anything else we can assist you with.

Best Regards,

Marielle

OpenAI Support

100x271180×320 9.52 KB

On Sun, Oct 27, 2024 at 01:25 AM, Myles Dear Hotmail <smdear@hotmail.com> wrote:

Still stuck. I’m still hitting token rate limits even with only a single set of subtitles and a single image. This is impossible.

Help?

mdear · October 27, 2024, 11:53pm

> > > > 2024-10-26 11:17:25,748 DEBUG: Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': '\n You are tasked with processing wheelchair seating vocational training videos. Your goal is to analyze and summarize the video. \n\n The expected output will include:\n - "file_commentaries": Commentary about the video as a whole.\n - "subtitles": Subtitles extracted from the video, tied to specific timestamps.\n - "image_summaries": A summary of keyframes extracted from the video, based on timestamped keyframe URLs.\n - "commentaries": A section that ties together insights derived from subtitles and keyframes.\n - "source": Details about the video source, such as filename or other relevant details.\n - "metadata": Any additional metadata required to describe the context.\n\n The expected input will include:\n - A list of keyframe URLs, localized by timestamp, that represent important visual elements of the video.\n - A set of subtitles, localized by timestamp, that convey important spoken content.\n - Summaries produced by previous inference runs on a given video (empty for the first inference run)\n\n Input and output formats are valid JSON objects.\n\nA few specific details to consider:\n- You will be presented with a chronological series of subtitles in multiple languages. \nFor each language, ensure you rebuild the narrative to ensure continuity is maintained. For example, if one subtitle says "It\'s really cool actually how you can convert a body point padded one" and the next subtitle says "and a half inch belt with plastic side released buckle on it" you should come to the conclusion that the trainer is showing a one and a half inch belt, not a half inch belt. Precision is extremely important to maintain when you are summarizing. \n\n\n\n\nHere is a list of abbreviations that the content producer uses to compose video filenames and series names. It is in a form of a table in which the abbreviation is further explained. Use this information to better understand the intention of any video or series name.\n\nAbbreviation List\nASBS : ASSEMBLY BRACKETS STEEL (FLATS)\nASBA : ASSEMBLY BRACKETS ALUMINUM (FLATS)\nBC : BACK COVER\nBCU : BACK CUSHION\nBINTH : BACK INTERFACE HARDWARE\nBINT : BACK INTERFACE\nBT : BLACK TRAY\nCFS : CALF SUPPORT\nCH : CUP HOLDER\nCHAC : CHAIR ACCESSORIES\nCHR : CUSTOM HEADREST\nCOHR : COMMERCIAL HEADREST\nCH : CUP HOLDER\nCOB : COMMERCIAL BACK\nCUP : CUSTOM POSITIONING STRAP\nCOP : COMMERCIAL POSITIOINING STRAP\nCOS : COMMERCIAL SEAT\nCT : CLEAR TRAY\nCOMP : COMPRESSION SPRING\nFTB : FOOTBOX\nFTR : FOOTREST\nFIP : FOAM IN PLACE\nHLT : HINGED LAP TRAY\nLAT : LATERAL\nMOB : MOLDED BACK\nMODS : MODIFICATIONS\nPAD : HANGERS AND ARMREST\nP - bjb : Portable bottle jack bender\nREP : REPAIRS\nROHO : SEAT AND BACK BOLSTERS\nSB : SUPPORT BRACKETS\nSC : SEAT COVER\nSCU : SEAT CUSHION\nSINT : SEAT INTERFACE\nSKI : SIT SKI\nSLF : SLENDERFENDER FIT KIT\nTEM : TRAY EASY MOUNT\nTSEM : THREAD SLED EASY MOUNT (TOOL LESS ADJUSTMENT)\n\n\nThe user prompt (including the contents of the content/text block of the user role) shall be in JSON format, as per the following specification:\n\nThe user input prompt format is as follows:\n{\nrole: user,\ncontent : [\n{type: text ,\ntext : "{\n\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video\'s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a cumulative summary of the video from its beginning to the currently analyzed video chunk. It may be empty if this is the first chunk of the video being analyzed and no prior inference has produce commentaries thus far. This input consists of previously generated summary output from the current video?s previously analyzed chunks (if any) to ensure all file summary input given so far is considered when asking for a new summary to be built considering the current video chunk. \n{file_commentary : # A string field containing a cumulative summary of the video, ultimately ensuring that the file is summarized using all subtitle and image keyframe data presented this far. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\nsubtitles: [ # This list may be empty if the earliest image timestamp is less than the first subtitle timestamp.\n {timestamp : #string, in srt format\n subtitle : #contents of subtitle generated from the audio track of the video \n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\nimage_summaries: [# Initially empty, this contains a summary of each image analyzed thus far. The summary must be based on the visual elements of the actual image or images included and must only describe actual visual elements present in the images. The number of objects in this list for a given timestamp is expected to be the number of languages for which subtitles are provided in the user input context. The purpose of this list is to provide context to generate commentaries and cumulative file summaries that draw from multimodal subtitle and image inputs. If this is the first chunk of a video this field may be empty. \n{image_summary: # A detailed textual summary of the image that would contain enough information to teach a skilled intern how to accurately and correctly imitate the skill being demonstrated. Include the items seen in the image, the tools being used, the products being used and worked on, the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Locate the current step in an overarching set of steps and phases similar to a table of contents as many videos and series of videos represent an ordered sequence of vocational skills to accomplish a goal. Even if this summary is viewed out of order it should contain enough detail to locate it in a series of steps. For example : "now that x, y and z are complete as part of phase n, the teacher is now working on step w". Include the motion being represented (ie, cutting, gluing, attaching, bending, punching, sanding, welding) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\nlanguage : #iso639-1 two-character language of summary\n}\n],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image_summaries . Commentaries are used as the sole system context when generating e_learning content and the generated e_learning content is the sole input to fine-tuning in pass2 so each generated artifact must faithfully capture the essence of the multimodal input provided from a vocational training point of view. If this is the first chunk of a video this field may be empty. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from inage summaries derived from keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. To clarify, summarization is not considered to be enhancement but rather distilling existing content. Enhancement is considered to be adding original creative content to preexisting content. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata is provided in all required languages. If this is the first chunk of a video this field may be empty. \n{\nseats_products: [ # cumulative list of the names of seats products referenced in the video so far taken from [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669918719265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=TJhRwjT0ccaCmT4pM9Gw2LbIQoloeSz7WW5rRg%2BO1Mg%3D&reserved=0). \n {seats_product: # For example, Thread Sled Easy Mount (TSEM)\n\n#The seats_product metadata keyword must contain a reference to one of the products referenced in the following list of Seats products, accurate as of October 2024, which contains for each product a one line description suffixed by the product URL. This is a summary of [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669918738088%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=tv%2FFVen2e7ePQZv0QnPC5q7LiBVNikNMyi4025Sgqzc%3D&reserved=0). \n# 22.5 Degree Disc Assembly - Modular mounting system for seating components. URL: [https://seatshardware.com/products/22-5degdiscass\n#](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2F22-5degdiscass%2Fn%23&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669918754634%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=92uYEU82FubE%2BSiEe%2BL92ZLC6Y%2FDmTugBRCh1C%2FwD3E%3D&reserved=0) : # url of product referenced, for example, [https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2Fportable-bottle-jack-bender%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669918977371%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=DpUJfi8iLsJzzdNnEbZE35mvTlvrkZX3uKZjMJPI4Ag%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products :\n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : [https://www.sunrisemedical.com/manual-wheelchairs/quickie](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.sunrisemedical.com%2Fmanual-wheelchairs%2Fquickie&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669918994113%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=gTWUvJgmwQi0vEQa%2FgOPTxp66u%2Bkh8MIVrzJ%2F%2FFbwYU%3D&reserved=0) . Another example : [https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.bodypoint.com%2FECommerce%2Fproduct%2Fevof%2Fevoflex-%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919009209%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=%2FN2CgeuBVxvjFaiGs%2B2dlrXN6mWM6tDqLum7GIzTTRQ%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n}}"},\n\n{type: image_url,\nimage_url : { url : dropbox url to image},\ntimestamp : string, in srt format, accurate to the millisecond }\n]\n}\n\nThat covers the input format.\n\n\n\n\nRegarding token space management if token space becomes tight, blocks of potentially redundant image summaries may be consolidated by expressing a summary with a list of timestamps. \n\nBe careful to not over summarize however. If the keyframe differences are important in order to properly capture the required vocational skill being used at that moment then avoid removing detail. For example, cutting foam at two different angles may be crucial to the skill being demonstrated. Cutting foam the exact same way may be considered duplication. \n\nAlso, note that identical steps could exist in many different processes so if you collapse identical steps if their context differs then combine details appropriately. For example, the same kind of cut could be made to a piece of foam as part of a new cushion or a cushion repair. If two such summaries are combined then the summary must indicate this step could be executed as part of a cushion creation or repair. \n\nEnsure that any summarization is done only on truly redundant data. \n\n\ngenerate an output in the following JSON format:\n{\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video?s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a summary of the video from the beginning to the most recently analyzed video chunk. This output feeds back in to future inputs to ensure all subtitle and image input given so far is properly summarized. This field is cumulative and is expected no grow over time as file summaries produced are combined with file summaries already present in the input as each video chunk is processed. This summary is expected to be produced from both subtitle and image multimodal inputs. \n{file_commentary : # A string field containing a summary of the video, utilizing data from both the subtitles and image keyframes in the input context and merging with the previous file summary if provided in the input context, ensuring that the commentary covers the entire video up to this point. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\n\nimage_summaries : # A list of objects describing each image present in the input user context. For each image timestamp, one object for each subtitle language is expected to be described. \n[\n{\ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\n\nimage_summary: # A detailed summary of the image that would contain enough information to teach a skilled intern how to imitate the skill being demonstrated. The summary must be based on the visual elements of the actual image or images included and must only describe visual elements actually present in the images. Include the items seen in the image, the tools being used, the products being used and worked on, the the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Include the motion being represented (ie, cutting, gluing, attaching, bending) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \n\nlanguage : # The iso 639-1 two-character language abbreviation\n}],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image summaries without loss of the intrinsic training detail. Commentaries are used as the sole system context to produce e_learning content which in turn are the sole input when doing fine-tuning in pass2 so they must faithfully capture the vocational training essence of the multimodal input provided. Sequences of operations must be broken up into logical steps and each commentary must include enough contextual detail to stand on its own if accessed directly by a user who does not consult neighbouring commentary blocks. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from textual summaries of keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually. \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata must be generated in all required languages. \n{\nseats_products: [{\nseats_product: # cumulative list of the names of distinct seats products referenced in the video so far from [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919025589%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=Y95kMiI3vLfzbcYL%2FfJ3bqU8AZ6BLB3wNyjy7DlU6SA%3D&reserved=0). For example, Thread Sled Easy Mount (TSEM). This list must not contain duplicates. \n\nproduct_url: # url of product referenced, for example, [https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2Fportable-bottle-jack-bender%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919039825%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=8VoXbsPllT64Gw53P0SBDNoS3XTfa%2Bbq2BN6ybH3Ugs%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products : # This list must not contain duplicates. \n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : [https://www.sunrisemedical.com/manual-wheelchairs/quickie](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.sunrisemedical.com%2Fmanual-wheelchairs%2Fquickie&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919058443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=nUZGPszD8QzKJqv8RVm69S6KrAbBNNuqtx%2BQ30xDHus%3D&reserved=0) . Another example : [https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.bodypoint.com%2FECommerce%2Fproduct%2Fevof%2Fevoflex-%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919084556%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=LxnA6whtMyBNAtU2xR0tmNcHtnBENV7pALFvb790bYw%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n\n}\n}\n\n'}, {'role': 'user', 'content': '"[{"type": "text", "text": "{\\"file_commentaries\\": [], \\"subtitles\\": [{\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hello and welcome to Seats. I want to show you a few videos on something I\'m\\", \\"language\\": \\"en\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Bonjour et bienvenue \\\\u00e0 Seats. je veux vous montrer quelques vid\\\\u00e9os sur quelque chose que je suis\\", \\"language\\": \\"fr\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hola y bienvenido a Seats. quiero mostrarle algunos videos sobre algo que soy\\", \\"language\\": \\"es\\"}], \\"image_summaries\\": [], \\"commentaries\\": [], \\"source\\": \\"HumanCreated\\", \\"metadata\\": {}, \\"file_path\\": \\"HOW TO CONVERT A PADDED BODYPOINT STRAP INTO WRIST CUFFS FOR BOTH YOUR CLIENT\'S SAFETY AND HYGIENE (Video\'s 1-9RCTCOP)/1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4\\"}"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... RXxVvlAXRwKG8qX1V8UnH8B3jby3fY1LwAAAAAAElFTkSuQmCC"", "detail": "high", "timestamp": "00:00:02,906"}}]'}], 'model': 'gpt-4o', 'frequency_penalty': 0, 'max_tokens': 15000, 'presence_penalty': 0, 'temperature': 0.2}}

mdear · October 27, 2024, 11:53pm






> > > > 2024-10-26 11:17:25,759 DEBUG: Sending HTTP Request: POST [https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919101880%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=NVHVHUrRNPUU72scUAi4SvTqsYrn0RdvIF4sDZlZtXU%3D&reserved=0)
> > > > 2024-10-26 11:17:25,760 DEBUG: send_request_headers.started request=<Request [b'POST']>
> > > > 2024-10-26 11:17:25,761 DEBUG: send_request_headers.complete
> > > > 2024-10-26 11:17:25,762 DEBUG: send_request_body.started request=<Request [b'POST']>
> > > > 2024-10-26 11:17:25,941 DEBUG: send_request_body.complete
> > > > 2024-10-26 11:17:25,941 DEBUG: receive_response_headers.started request=<Request [b'POST']>
> > > > 2024-10-26 11:17:26,186 DEBUG: receive_response_headers.complete return_value=(b'HTTP/1.1', 429, b'Too Many Requests', [(b'Date', b'Sat, 26 Oct 2024 15:17:26 GMT'), (b'Content-Type', b'application/json; charset=utf-8'), (b'Content-Length', b'407'), (b'Connection', b'keep-alive'), (b'vary', b'Origin'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'450000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'449999'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_db0b7d807e30e0f8947b8f612b808201'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains; preload'), (b'CF-Cache-Status', b'DYNAMIC'), (b'X-Content-Type-Options', b'nosniff'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8d8b69c8bbd4a25d-YUL'), (b'alt-svc', b'h3=":443"; ma=86400')])
> > > > 2024-10-26 11:17:26,187 INFO: HTTP Request: POST [https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919122388%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=G3bHr8dbDpuhmQGR7fpkAcqrynznhCLnUy%2BX327p6Nc%3D&reserved=0) "HTTP/1.1 429 Too Many Requests"
> > > > 2024-10-26 11:17:26,188 DEBUG: receive_response_body.started request=<Request [b'POST']>
> > > > 2024-10-26 11:17:26,188 DEBUG: receive_response_body.complete
> > > > 2024-10-26 11:17:26,189 DEBUG: response_closed.started
> > > > 2024-10-26 11:17:26,189 DEBUG: response_closed.complete
> > > > 2024-10-26 11:17:26,190 DEBUG: HTTP Response: POST [https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919139197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6K5vPO1NNroMwbHkzRDfx%2BDwDtPDyQ3r4K4BN%2B3V3QU%3D&reserved=0) "429 Too Many Requests" Headers({'date': 'Sat, 26 Oct 2024 15:17:26 GMT', 'content-type': 'application/json; charset=utf-8', 'content-length': '407', 'connection': 'keep-alive', 'vary': 'Origin', 'x-ratelimit-limit-requests': '5000', 'x-ratelimit-limit-tokens': '450000', 'x-ratelimit-remaining-requests': '4999', 'x-ratelimit-remaining-tokens': '449999', 'x-ratelimit-reset-requests': '12ms', 'x-ratelimit-reset-tokens': '0s', 'x-request-id': 'req_db0b7d807e30e0f8947b8f612b808201', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'cf-cache-status': 'DYNAMIC', 'x-content-type-options': 'nosniff', 'server': 'cloudflare', 'cf-ray': '8d8b69c8bbd4a25d-YUL', 'alt-svc': 'h3=":443"; ma=86400'})
> > > > 2024-10-26 11:17:26,191 DEBUG: request_id: req_db0b7d807e30e0f8947b8f612b808201
> > > > 2024-10-26 11:17:26,191 DEBUG: Encountered httpx.HTTPStatusError
> > > > Traceback (most recent call last):
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/openai/_base_client.py", line 1037, in _request
> > > > response.raise_for_status()
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
> > > > raise HTTPStatusError(message, request=request, response=self)
> > > > httpx.HTTPStatusError: Client error '429 Too Many Requests' for url '[https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919154284%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=lpw8PXk6NgYSt31ZRCYn9wUZ6YQmqojujdhPnztY7kw%3D&reserved=0)'
> > > > For more information check: [https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FHTTP%2FStatus%2F429&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919168262%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=aEM72r%2BQ7UEyBNCo%2FCi%2FyeQr203ut3UrqKd0ZOMoEvY%3D&reserved=0)
> > > >
> > > > During handling of the above exception, another exception occurred:
> > > >
> > > > Traceback (most recent call last):
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/openai/_base_client.py", line 1037, in _request
> > > > response.raise_for_status()
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
> > > > raise HTTPStatusError(message, request=request, response=self)
> > > > httpx.HTTPStatusError: Client error '429 Too Many Requests' for url '[https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919186668%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=keH0rsboi7S6tr0SmMV%2F26vG0WEoS8EdUgi0Iu0agXk%3D&reserved=0)'
> > > > For more information check: [https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FHTTP%2FStatus%2F429&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919201642%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6Az6IaDkg3MZwt3WDpzwTWyMLe15WFvX%2BD8kyTvjWMY%3D&reserved=0)
> > > >
> > > > During handling of the above exception, another exception occurred:
> > > >
> > > > Traceback (most recent call last):
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/openai/_base_client.py", line 1037, in _request
> > > > response.raise_for_status()
> > > > File "/home/mdear/workspaces/venv/captions/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
> > > > raise HTTPStatusError(message, request=request, response=self)
> > > > httpx.HTTPStatusError: Client error '429 Too Many Requests' for url '[https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919216539%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=p3dfYbK7S3IN2JVjf7Pk8FfNjvFcXbKoYaP%2ByeH2Rmc%3D&reserved=0)'
> > > > For more information check: [https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdeveloper.mozilla.org%2Fen-US%2Fdocs%2FWeb%2FHTTP%2FStatus%2F429&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919231245%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=yBqYKHy%2FCaH%2FjGkVMrC8ho0k%2FnAoiSxsVUSwpggqrMs%3D&reserved=0)
> > > > 2024-10-26 11:17:26,193 DEBUG: Re-raising status error
> > > > 2024-10-26 13:10:24,810 ERROR: Failed to perform inference for ./1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4: Error code: 429 - {'error': {'message': 'Request too large for gpt-4o in organization org-ubsHoc4FIth7nvLB2MJnvyza on tokens per min (TPM): Limit 450000, Requested 564508. The input or output tokens must be reduced in order to run successfully. Visit [https://platform.openai.com/account/rate-limits](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplatform.openai.com%2Faccount%2Frate-limits&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919245473%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=8ZFtt2BQTEkt1j7Yn2Pq%2BSEZHpQ8eqxeX4hEi8Db%2BiM%3D&reserved=0) to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >

mdear · October 27, 2024, 11:54pm

On 2024-10-26 11:16 a.m., Myles Dear Hotmail wrote:

Now I’m getting token rate limit errors and “too many requests” errors.

I shrunk my input context window by a factor of sixty (calculated based on vision documentation token budget, assuming base64 encoded image doesn’t contribute to token budget, but rather only the vectorized version actually presented to the model) so that my request now contains only one image, and still I’m not able to get inference response. I should be able to pass 30-60 images in a single inference call.

Please help. How can I run a business with these kinds of limitations ?

I’m still shocked that the image_url interface is not working, I spent weeks of effort to prepare Dropbox links to prevent having to send this volume of data in the API request body, but here we are.

First, the output, then lower down I’ll show the input sent on the API to produce this output:

> > > > > 2024-10-26 10:39:50,085 ERROR: Failed to perform inference for ./1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4: Error code: 429 - {'error': {'message': 'Request too large for gpt-4o in organization org-ubsHoc4FIth7nvLB2MJnvyza on tokens per min (TPM): Limit 450000, Requested 4944222. The input or output tokens must be reduced in order to run successfully. Visit [https://platform.openai.com/account/rate-limits](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplatform.openai.com%2Faccount%2Frate-limits&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919259580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=fnJZISCCHYSmhMIgDS9a1BOnPbV4s1sIjW1SfaPS7cU%3D&reserved=0) to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

With my current Tier 2 limitations, I can literally only process a single keyframe at a time !?!?!? With the image_url based design I should have been able to process 30 to 60 images in a single call.

My design involves stripping down vocational training videos in the wheelchair custom seating vertical, into subtitles derived from the audio track (and their translations to several other languages) and textual descriptions of images passed to the inference API which are saved and fed back, in order to process a 3-5 minute video in chunks. There are around 200 keyframes in a video that must be processed. Processing them one at a time will be insanely expensive.

Could you, at the very least, bump me up to a higher tier so I don’t have these kind of crazy limitations, or better yet, fix the image_url feature so it works with my Dropbox links ?

mdear · October 27, 2024, 11:55pm

> > > > > 2024-10-26 10:30:30,132 DEBUG: Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': '\n You are tasked with processing wheelchair seating vocational training videos. Your goal is to analyze and summarize the video. \n\n The expected output will include:\n - "file_commentaries": Commentary about the video as a whole.\n - "subtitles": Subtitles extracted from the video, tied to specific timestamps.\n - "image_summaries": A summary of keyframes extracted from the video, based on timestamped keyframe URLs.\n - "commentaries": A section that ties together insights derived from subtitles and keyframes.\n - "source": Details about the video source, such as filename or other relevant details.\n - "metadata": Any additional metadata required to describe the context.\n\n The expected input will include:\n - A list of keyframe URLs, localized by timestamp, that represent important visual elements of the video.\n - A set of subtitles, localized by timestamp, that convey important spoken content.\n - Summaries produced by previous inference runs on a given video (empty for the first inference run)\n\n Input and output formats are valid JSON objects.\n\nA few specific details to consider:\n- You will be presented with a chronological series of subtitles in multiple languages. \nFor each language, ensure you rebuild the narrative to ensure continuity is maintained. For example, if one subtitle says "It\'s really cool actually how you can convert a body point padded one" and the next subtitle says "and a half inch belt with plastic side released buckle on it" you should come to the conclusion that the trainer is showing a one and a half inch belt, not a half inch belt. Precision is extremely important to maintain when you are summarizing. \n\n\n\n\nHere is a list of abbreviations that the content producer uses to compose video filenames and series names. It is in a form of a table in which the abbreviation is further explained. Use this information to better understand the intention of any video or series name.\n\nAbbreviation List\nASBS : ASSEMBLY BRACKETS STEEL (FLATS)\nASBA : ASSEMBLY BRACKETS ALUMINUM (FLATS)\nBC : BACK COVER\nBCU : BACK CUSHION\nBINTH : BACK INTERFACE HARDWARE\nBINT : BACK INTERFACE\nBT : BLACK TRAY\nCFS : CALF SUPPORT\nCH : CUP HOLDER\nCHAC : CHAIR ACCESSORIES\nCHR : CUSTOM HEADREST\nCOHR : COMMERCIAL HEADREST\nCH : CUP HOLDER\nCOB : COMMERCIAL BACK\nCUP : CUSTOM POSITIONING STRAP\nCOP : COMMERCIAL POSITIOINING STRAP\nCOS : COMMERCIAL SEAT\nCT : CLEAR TRAY\nCOMP : COMPRESSION SPRING\nFTB : FOOTBOX\nFTR : FOOTREST\nFIP : FOAM IN PLACE\nHLT : HINGED LAP TRAY\nLAT : LATERAL\nMOB : MOLDED BACK\nMODS : MODIFICATIONS\nPAD : HANGERS AND ARMREST\nP - bjb : Portable bottle jack bender\nREP : REPAIRS\nROHO : SEAT AND BACK BOLSTERS\nSB : SUPPORT BRACKETS\nSC : SEAT COVER\nSCU : SEAT CUSHION\nSINT : SEAT INTERFACE\nSKI : SIT SKI\nSLF : SLENDERFENDER FIT KIT\nTEM : TRAY EASY MOUNT\nTSEM : THREAD SLED EASY MOUNT (TOOL LESS ADJUSTMENT)\n\n\nThe user prompt (including the contents of the content/text block of the user role) shall be in JSON format, as per the following specification:\n\nThe user input prompt format is as follows:\n{\nrole: user,\ncontent : [\n{type: text ,\ntext : "{\n\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video\'s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a cumulative summary of the video from its beginning to the currently analyzed video chunk. It may be empty if this is the first chunk of the video being analyzed and no prior inference has produce commentaries thus far. This input consists of previously generated summary output from the current video?s previously analyzed chunks (if any) to ensure all file summary input given so far is considered when asking for a new summary to be built considering the current video chunk. \n{file_commentary : # A string field containing a cumulative summary of the video, ultimately ensuring that the file is summarized using all subtitle and image keyframe data presented this far. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\nsubtitles: [ # This list may be empty if the earliest image timestamp is less than the first subtitle timestamp.\n {timestamp : #string, in srt format\n subtitle : #contents of subtitle generated from the audio track of the video \n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\nimage_summaries: [# Initially empty, this contains a summary of each image analyzed thus far. The summary must be based on the visual elements of the actual image or images included and must only describe actual visual elements present in the images. The number of objects in this list for a given timestamp is expected to be the number of languages for which subtitles are provided in the user input context. The purpose of this list is to provide context to generate commentaries and cumulative file summaries that draw from multimodal subtitle and image inputs. If this is the first chunk of a video this field may be empty. \n{image_summary: # A detailed textual summary of the image that would contain enough information to teach a skilled intern how to accurately and correctly imitate the skill being demonstrated. Include the items seen in the image, the tools being used, the products being used and worked on, the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Locate the current step in an overarching set of steps and phases similar to a table of contents as many videos and series of videos represent an ordered sequence of vocational skills to accomplish a goal. Even if this summary is viewed out of order it should contain enough detail to locate it in a series of steps. For example : "now that x, y and z are complete as part of phase n, the teacher is now working on step w". Include the motion being represented (ie, cutting, gluing, attaching, bending, punching, sanding, welding) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\nlanguage : #iso639-1 two-character language of summary\n}\n],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image_summaries . Commentaries are used as the sole system context when generating e_learning content and the generated e_learning content is the sole input to fine-tuning in pass2 so each generated artifact must faithfully capture the essence of the multimodal input provided from a vocational training point of view. If this is the first chunk of a video this field may be empty. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from inage summaries derived from keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. To clarify, summarization is not considered to be enhancement but rather distilling existing content. Enhancement is considered to be adding original creative content to preexisting content. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata is provided in all required languages. If this is the first chunk of a video this field may be empty. \n{\nseats_products: [ # cumulative list of the names of seats products referenced in the video so far taken from [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919274184%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=M1E%2FqoMW5Hu0wN2bzHWLEWTJqJ4KVNeep1Zac6SEr9I%3D&reserved=0). \n {seats_product: # For example, Thread Sled Easy Mount (TSEM)\n\n#The seats_product metadata keyword must contain a reference to one of the products referenced in the following list of Seats products, accurate as of October 2024, which contains for each product a one line description suffixed by the product URL. This is a summary of [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919289004%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=giQDwNyIgDfJkXdUIgVR0B%2FGkYfdRElx4f1jPJhw2eU%3D&reserved=0). \n# 22.5 Degree Disc Assembly - Modular mounting system for seating components. URL: [https://seatshardware.com/products/22-5degdiscass\n#](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2F22-5degdiscass%2Fn%23&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919302456%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=VUAAA3zOi3Z79Wg5MIN87dhokdrPD1QS3KycdR0I0l0%3D&reserved=0) : # url of product referenced, for example, [https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2Fportable-bottle-jack-bender%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919728346%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=TdcEhAbQOD9ZKmZMDMBMsDpplqe1faEkggCUVYooy%2F8%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products :\n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : [https://www.sunrisemedical.com/manual-wheelchairs/quickie](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.sunrisemedical.com%2Fmanual-wheelchairs%2Fquickie&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919740037%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=SXCdKIK8ZVAIlKVSzPdtpkAM1ktXuwVLxVxTKTQ3p%2Fk%3D&reserved=0) . Another example : [https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.bodypoint.com%2FECommerce%2Fproduct%2Fevof%2Fevoflex-%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919754445%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=lPNzgiRowI%2FTNeqvuwreb9Ktokkr%2F9dCATUE4k14uI8%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n}}"},\n\n{type: image_url,\nimage_url : { url : dropbox url to image},\ntimestamp : string, in srt format, accurate to the millisecond }\n]\n}\n\nThat covers the input format.\n\n\n\n\nRegarding token space management if token space becomes tight, blocks of potentially redundant image summaries may be consolidated by expressing a summary with a list of timestamps. \n\nBe careful to not over summarize however. If the keyframe differences are important in order to properly capture the required vocational skill being used at that moment then avoid removing detail. For example, cutting foam at two different angles may be crucial to the skill being demonstrated. Cutting foam the exact same way may be considered duplication. \n\nAlso, note that identical steps could exist in many different processes so if you collapse identical steps if their context differs then combine details appropriately. For example, the same kind of cut could be made to a piece of foam as part of a new cushion or a cushion repair. If two such summaries are combined then the summary must indicate this step could be executed as part of a cushion creation or repair. \n\nEnsure that any summarization is done only on truly redundant data. \n\n\ngenerate an output in the following JSON format:\n{\nfile_path : # a unique name for this video that contains both path and file name in the format series_name/this_video.mp4. The purpose of this field is to organize video metadata in such a way to allow multiple video?s data to reside in the same data structure to aid front end searching and filtering. \n\nfile_commentaries : [ # A list of summary blocks, the number of blocks shall be the number of languages present in the input context subtitle text input. The purpose of this structure is to provide a summary of the video from the beginning to the most recently analyzed video chunk. This output feeds back in to future inputs to ensure all subtitle and image input given so far is properly summarized. This field is cumulative and is expected no grow over time as file summaries produced are combined with file summaries already present in the input as each video chunk is processed. This summary is expected to be produced from both subtitle and image multimodal inputs. \n{file_commentary : # A string field containing a summary of the video, utilizing data from both the subtitles and image keyframes in the input context and merging with the previous file summary if provided in the input context, ensuring that the commentary covers the entire video up to this point. \nlanguage : The iso 639-1 two-character language abbreviation}\n],\n\nimage_summaries : # A list of objects describing each image present in the input user context. For each image timestamp, one object for each subtitle language is expected to be described. \n[\n{\ntimestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if an image summary is deemed to apply to multiple keyframes to achieve token space compression without loss of expressive and educational accuracy and detail.\n\nimage_summary: # A detailed summary of the image that would contain enough information to teach a skilled intern how to imitate the skill being demonstrated. The summary must be based on the visual elements of the actual image or images included and must only describe visual elements actually present in the images. Include the items seen in the image, the tools being used, the products being used and worked on, the the vocational skill being demonstrated, and the purpose of that skill relative to the purpose of the video up to the current point. Include the motion being represented (ie, cutting, gluing, attaching, bending) and the kind of object being acted on (ie, headrest, seat cushion, wheelchair back, lateral support bracket). \n\nlanguage : # The iso 639-1 two-character language abbreviation\n}],\n\ncommentaries: [ # presented in each required language for each indicated point of the video. This list is expected to increase in length as analysis of a video proceeds and is meant to replace the subtitles and image summaries without loss of the intrinsic training detail. Commentaries are used as the sole system context to produce e_learning content which in turn are the sole input when doing fine-tuning in pass2 so they must faithfully capture the vocational training essence of the multimodal input provided. Sequences of operations must be broken up into logical steps and each commentary must include enough contextual detail to stand on its own if accessed directly by a user who does not consult neighbouring commentary blocks. \n {\n timestamps : [], #list of srt formatted timestamps. More than one timestamp may be present in the list if a commentary is deemed to apply to multiple positions in the video to achieve token space compression without loss of expressive and educational accuracy and detail.\n commentary: # contents of the commentary. It is synthesized multimodally both from subtitle input but also from textual summaries of keyframe input and is expected to be a better representation and a truer summary of what is being taught in the video at this point than could be elicited from either of the modes individually. \n skills: [], # A list of keywords detailing vocational skills demonstrated in this commentary. Some examples include foam_cutting, precision_cutting, angle_alignment.\n language : # two character standard abbreviation of subtitle language, for example "en" is english, "fr" is french, "es" is for spanish as per iso 639-1\n }\n],\n\nsource: # A text string indicating the origin of this content. Options include HumanCreated, AiEnhanced, AiCreated. Assume HumanCreated by default unless instructed otherwise. \n\nmetadata: # cumulative information about the contents of the file processed this far meant to enhance front end user requested filtering. Metadata must be generated in all required languages. \n{\nseats_products: [{\nseats_product: # cumulative list of the names of distinct seats products referenced in the video so far from [https://seatshardware.com/collections/all](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fcollections%2Fall&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919773529%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=81sZYjVavJbSkcJ1WZVYWBP7wLYfykEh9RIjH4pveHI%3D&reserved=0). For example, Thread Sled Easy Mount (TSEM). This list must not contain duplicates. \n\nproduct_url: # url of product referenced, for example, [https://seatshardware.com/products/portable-bottle-jack-bender\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fseatshardware.com%2Fproducts%2Fportable-bottle-jack-bender%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919790051%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=XuLn4NE97PD%2BmKoLm8QZ3BrBbnrLnoDn8SC89GQjxJU%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n}\n],\n\nnon_seats_products : # This list must not contain duplicates. \n[\n{\nnon_seats_product: # name of third-party non-seats product referenced in video. One example could be a Sunrise Medical Quickie wheelchair base. Another example could be a padded Bodypoint strap.\n\nproduct_url: # url of product referenced. For example : [https://www.sunrisemedical.com/manual-wheelchairs/quickie](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.sunrisemedical.com%2Fmanual-wheelchairs%2Fquickie&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919806259%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6w2ilN9CTSSAWwHgp6dTcoTgGW7VHCDCoRaMIFNO4iQ%3D&reserved=0) . Another example : [https://www.bodypoint.com/ECommerce/product/evof/evoflex-\n\nlanguage](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.bodypoint.com%2FECommerce%2Fproduct%2Fevof%2Fevoflex-%2Fn%2Fnlanguage&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919820066%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=Yyrq3kjfbM7x8%2Bs%2BLpUKkcw8gAaajkJEnJn6kaRYl%2B4%3D&reserved=0) : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nsearch_keywords : # cumulative list of search terms generated for the video \n[\n{\nsearch_keyword: # A keyword or short multi word phrase that will point to this video if typed by a user in a search bar\n\nlanguage : # The iso 639-1 two-character language abbreviation pertaining to the product description and url\n\n}\n\n],\n\nskills: [], # A list of keywords detailing vocational skills demonstrated in this video. Some examples include foam_cutting, precision_cutting, angle_alignment.\n\n}\n}\n\n'}, {'role': 'user', 'content': '"[{"type": "text", "text": "{\\"file_commentaries\\": [], \\"subtitles\\": [{\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hello and welcome to Seats. I want to show you a few videos on something I\'m\\", \\"language\\": \\"en\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Bonjour et bienvenue \\\\u00e0 Seats. je veux vous montrer quelques vid\\\\u00e9os sur quelque chose que je suis\\", \\"language\\": \\"fr\\"}, {\\"timestamp\\": \\"00:00:00,000 --> 00:00:06,000\\", \\"text\\": \\"Hola y bienvenido a Seats. quiero mostrarle algunos videos sobre algo que soy\\", \\"language\\": \\"es\\"}, {\\"timestamp\\": \\"00:00:06,000 --> 00:00:13,600\\", \\"text\\": \\"working on. It\'s really cool actually how you can convert a body point padded one\\", \\"language\\": \\"en\\"}, {\\"timestamp\\": \\"00:00:06,000 --> 00:00:13,600\\", \\"text\\": \\"C\'est vraiment cool en r\\\\u00e9alit\\\\u00e9 comment vous pouvez convertir un point de corps tap\\\\u00e9 un\\", \\"language\\": \\"fr\\"}, {\\"timestamp\\": \\"00:00:06,000 --> 00:00:13,600\\", \\"text\\": \\"Es realmente cool, en realidad, c\\\\u00f3mo se puede convertir un punto de cuerpo en un\\", \\"language\\": \\"es\\"}], \\"image_summaries\\": [], \\"commentaries\\": [], \\"source\\": \\"HumanCreated\\", \\"metadata\\": {}, \\"file_path\\": \\"HOW TO CONVERT A PADDED BODYPOINT STRAP INTO WRIST CUFFS FOR BOTH YOUR CLIENT\'S SAFETY AND HYGIENE (Video\'s 1-9RCTCOP)/1RCTCOP - A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL - 2023-02-11 001.mp4\\"}"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... RXxVvlAXRwKG8qX1V8UnH8B3jby3fY1LwAAAAAAElFTkSuQmCC""", "detail": "high", "timestamp": "00:00:02,906"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... zKMgWygJm50iQqEPnFoP4HgVSfWehQ4pYAAAAASUVORK5CYII=""", "detail": "high", "timestamp": "00:00:04,841"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... /P8fF8PufY3WouPtkXjo7/C3XJcFqQRI2YAAAAAElFTkSuQmCC""", "detail": "high", "timestamp": "00:00:05,808"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... RGuYCqWmsBanc7MqqKN/8DrcTf3hLt4pEAAAAASUVORK5CYII=""", "detail": "high", "timestamp": "00:00:06,775"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... fveNwHoCZvZua6rpnZ3Ur9P3cPNV7e262GAAAAAElFTkSuQmCC""", "detail": "high", "timestamp": "00:00:07,742"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... pA5bEEzAwwyKFWQKUCu8vxvw28Y4xC74xQAAAAAElFTkSuQmCC""", "detail": "high", "timestamp": "00:00:08,709"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... 7vmeGo1ApQK6DimBkV+DfDfEqforwJwwAAAABJRU5ErkJggg==""", "detail": "high", "timestamp": "00:00:09,676"}}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAgAE ... [truncated] ... YzgO72d/EBWGupPHEw1P8C8mGnjhC1bJQAAAAASUVORK5CYII="", "detail": "high", "timestamp": "00:00:10,643"}}]'}], 'model': 'gpt-4o', 'frequency_penalty': 0, 'max_tokens': 15000, 'presence_penalty': 0, 'temperature': 0.2}}
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >

mdear · October 27, 2024, 11:55pm

On 2024-10-26 6:25 a.m., Myles Dear Hotmail wrote:

Your buggy feature has cost me significant time as apparently there is no way I can use Dropbox raw links to provide images to multimodal inference. This would have been good to know two weeks ago as I’ll now have to toss two weeks of work into the trash, while my customer is waiting.

I am NOT a happy customer of OpenAI right now.

I discovered today that if I directly send image data via base64 (ignoring your obviously buggy cookbook example where you show specification of links for images to analyze) that the base64 data is not counted towards the input token cost, but rather the token cost of the vectorized data is the total applied, your documentation provides me with a way to estimate these costs to ensure I build enough content to avoid exceeding the input token length.

I expect MUCH better customer service in the future.

mPwrWare (11994808) Canada Inc

On 2024-10-25 6:11 p.m., Myles Dear Hotmail wrote:

Hello,

Could there be token issues? I’m reserving 10% of the 128K token space as a buffer when I’m packing my requests. If the costs of image tokens cause the input token cap to be exceeded, will the API return an appropriate error code or will it silently fail or, worse, ignore the image and make radically wrong assumptions like I see it doing? I am calculating my token budget based on the tiktoken output from a set of multimodal inputs which include multilanguage closed caption text and image_url based image specifications. When the API receives this input, am I charged for both the space consumed by the image_url blocks, or are the image_url blocks stripped out, vectorized and I am then charged only for the vectorized input derived from the image url specified as explained in the vision documentation ?

I am not receiving any token input cap exceeded errors, but output clearly shows the model is guessing (mostly incorrectly) at the contents of the images. This is extremely unacceptable.

I checked the vision documentation and I don’t see any red flags here.

I’ll paste below the information I passed into the OpenAI chat message:

I tried the following test via API with the following disappointing result (the AI appears to be guessing and is not showing evidence of being able to “see” and interpret the image).

I have a Dropbox Professional subscription, which allows me to request permanent shared links for my images and tune their expiry time.
The Content-Disposition of the Dropbox link is not set to “attachment” and thus loads directly in my browser without any other visual elements (just the raw image). This aligns with the URL example shown in the cookbook article.

system_prompt=‘Describe the provided image in detail’

> > > > > > > user_prompt : '[{"type": "image_url", "image_url": {"url": "[https://www.dropbox.com/scl/fi/kdj5vu2ld7yqfkyhayvtd/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A02-2C906.png?rlkey=l4xk57vra2us62n74z51h1eqv&raw=1](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.dropbox.com%2Fscl%2Ffi%2Fkdj5vu2ld7yqfkyhayvtd%2Fkeyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A02-2C906.png%3Frlkey%3Dl4xk57vra2us62n74z51h1eqv%26raw%3D1&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919834445%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=ZU%2BFUa0l3IThh%2Bxg0y3rNmVzgjOx3ies6bDEYwRz17E%3D&reserved=0)", "detail": "auto", "timestamp": "00:00:02,906"}}]'
> > > > > > >
> > > > > > >

Response from the API-based inference call:

> > > > > > > Request: POST [https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919853167%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=LzHuXvdpnik4g4tBosDiZoX%2BcM9lepcENQdoCm3aGNI%3D&reserved=0) "HTTP/1.1 200 OK"
> > > > > > > 2024-10-24 06:56:53,940 DEBUG: receive_response_body.started request=<Request [b'POST']>
> > > > > > > 2024-10-24 06:56:53,941 DEBUG: receive_response_body.complete
> > > > > > > 2024-10-24 06:56:53,941 DEBUG: response_closed.started
> > > > > > > 2024-10-24 06:56:53,942 DEBUG: response_closed.complete
> > > > > > > 2024-10-24 06:56:53,942 DEBUG: HTTP Response: POST [https://api.openai.com/v1/chat/completions](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapi.openai.com%2Fv1%2Fchat%2Fcompletions&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919873711%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=szHtAecC1T3ZPjIx%2BXWzclbRLuTwLXg8mzDFVQ9O2eM%3D&reserved=0) "200 OK" Headers([('date', 'Thu, 24 Oct 2024 10:56:58 GMT'), ('content-type', 'application/json'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('access-control-expose-headers', 'X-Request-ID'), ('openai-organization', 'user-m8spcgbdft4mtc5walogmvdr'), ('openai-processing-ms', '2070'), ('openai-version', '2020-10-01'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('x-ratelimit-limit-requests', '5000'), ('x-ratelimit-limit-tokens', '450000'), ('x-ratelimit-remaining-requests', '4999'), ('x-ratelimit-remaining-tokens', '441827'), ('x-ratelimit-reset-requests', '12ms'), ('x-ratelimit-reset-tokens', '1.089s'), ('x-request-id', 'req_46e3f263ce84433cb1f48a0f0ec75cf8'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=mYypsW2UOqi6KEJGeP1mOazj67fYWLgYZOpMbocmeIU-1729767418-1.0.1.1-mkkSt38CTrKU_Y8BqJYnZhXpLZ0IYKWyZx3hNDOYxf1YnceWw.q9prwayYnKsclLNG2_C0mSaTWTGsVlBaLdOg; path=/; expires=Thu, 24-Oct-24 11:26:58 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('x-content-type-options', 'nosniff'), ('set-cookie', '_cfuvid=D4HPC0lCzkrrZpWeCK84jxh2IxyDN2ooI6RiTVaGqto-1729767418467-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8d79716efec4aace-YYZ'), ('content-encoding', 'gzip'), ('alt-svc', 'h3=":443"; ma=86400')])
> > > > > > > 2024-10-24 06:56:53,943 DEBUG: request_id: req_46e3f263ce84433cb1f48a0f0ec75cf8
> > > > > > > 2024-10-24 06:56:53,948 ERROR: Invalid JSON detected : Expecting value: line 1 column 1 (char 0)
> > > > > > > 2024-10-24 06:57:07,339 INFO : Inference response : The image appears to be a keyframe from a video titled "A BRIEF DISCUSSION WHY WE CHOOSE THIS BELT DESIGN. FOR RACHEL 2023-02-11." The timestamp on the image is 00:00:02,906, indicating it is taken very early in the video. The image likely shows a belt design that is being discussed in the video.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The belt design might be central to the image, possibly showcasing its unique features or elements that are being highlighted in the discussion. The background and surrounding elements are not described, but they might include visual aids or text that support the discussion about the belt design. The image is likely intended to visually communicate specific aspects of the belt that are being discussed, such as its style, materials, or functionality.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The context suggests that the image is part of a presentation or explanation aimed at someone named Rachel, possibly to inform or persuade her about the choice of this particular belt design.

In stark contrast, when I upload the very same image to ChatGpt and request a visual analysis I receive the following response:

> > > > > > > The image you provided shows what appears to be a strap or belt design. Here's a detailed analysis based on the visible characteristics:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Material and Build:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The strap seems to be made of a durable synthetic fabric, possibly nylon, given the weave and texture. The material appears to be designed for heavy-duty or industrial use, as it looks reinforced and durable.
> > > > > > > Components:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Two D-rings are visible, both likely made from metal or strong plastic. These rings are often used for adjustable or secure fastening systems.
> > > > > > > The strap also features a buckle mechanism, possibly to adjust the length or fasten it securely. This buckle seems designed for strength, suitable for applications where the strap needs to bear weight or resist tension.
> > > > > > > Design Details:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > There is a small logo or symbol on the strap. It’s a circle with stylized text inside, which may represent a brand or manufacturer.
> > > > > > > The stitching looks robust, reinforcing the overall structure. It suggests that the strap is designed for reliability and wear resistance.
> > > > > > > Use Case:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The design, materials, and hardware suggest this is part of an adaptive or positioning device, possibly for custom wheelchair seating or mobility aids. Given the D-rings, it may be used for securing or adjusting something in place, such as a seatbelt or positioning strap.
> > > > > > > Condition:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The strap looks new or barely used, with no visible signs of wear, fraying, or damage.
> > > > > > > This type of strap is typically found in applications where both strength and adjustability are required, such as custom seating solutions for mobility devices or rehabilitation equipment. It’s likely part of a larger system designed for secure, comfortable, and adjustable support.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >

(a similar response was provided in the chat as to what you provided in this email)

That’s not really that helpful, folks. I’m directing my requests to gpt4o from both the API and ChatGpt side. The hesitant language returning from the model indicates that it was not able to access the link, yet this link was created to be publicly accessible with no authentication and I tested the link immediately prior to calling the API. You saw the differences between the API and ChatGpt’s responses, right? Night and day.

I am charging customers for doing AI based image interpetation that must be done via API, I’m following the cookbook and getting unacceptable results. Please try this for yourself and get back to me with some real answers.

I’ve refreshed one of my Dropbox image links, please try it yourself in your browser, and invoke an image analysis request via API in the same manner I did, and see if it’s successfully returning a meaningful and believable visual analysis.

[https://www.dropbox.com/scl/fi/l2upgcpjsdbnbarq5c8at/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A00-2C000.png?rlkey=j92zuc5iiyrnw57xp9oobqur2&st=gso0nzih&dl=0](https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.dropbox.com%2Fscl%2Ffi%2Fl2upgcpjsdbnbarq5c8at%2Fkeyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A00-2C000.png%3Frlkey%3Dj92zuc5iiyrnw57xp9oobqur2%26st%3Dgso0nzih%26dl%3D0&data=05%7C02%7C%7C5fcfa75fc3cd47ba47df08dcf5f2af5f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638655669919887813%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=x0LIBvURuYY8n0lstfr%2FhmZy8eKVDgxpIanODSECVFQ%3D&reserved=0)

Seriously, this is a quick test, it probably took you longer to type your response then to actually try it and help me debug this. Please.

mdear · October 27, 2024, 11:56pm

Correction, my design specifies the use of the raw link variant, namely : https://www.dropbox.com/scl/fi/l2upgcpjsdbnbarq5c8at/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A00-2C000.png?rlkey=j92zuc5iiyrnw57xp9oobqur2&st=gso0nzih&raw=1

And, for the record, I’m already providing additional context in my prompt that come from the spoken words of the vocational skill trainer. As stated in the cookbook, by enhancing that detail with detail elicited from the visual frame grabs localized in time close to the verbal remarks, a better consolidated description of the action may be constructed from the multimodal sources. That’s the secret sauce I need to sell to my customer.

October 25
Any reply? My customer is waiting.
2h ago. Not seen yet

Thanks,

mPwrWare (11994808) Canada Inc

(613) 703-4152

On 2024-10-24 7:54 a.m., OpenAI from OpenAI wrote:

Hi Myles,

Thank you for reaching out and providing detailed information about the issue you’re encountering with the API-based multimodal image interpretation. Based on the information you’ve shared, it seems like you’ve followed the steps correctly as outlined in the OpenAI Cookbook for utilizing GPT-4 with Vision capabilities. However, the discrepancy in the level of detail between the API response and the ChatGPT response is indeed concerning.

The difference in responses could be attributed to several factors, including the specific model version used, the detail parameter setting for the image processing, or even the way the image URL is being processed by the API. Given that you’ve already ensured the Content-Disposition of the Dropbox link aligns with the requirements and the image loads directly in the browser, the issue might lie elsewhere.

Here are a few steps you can take to troubleshoot and potentially resolve the issue:

Model Version: Ensure that you’re using the same model version in both the API call and ChatGPT. Different versions of the model may have varying capabilities or interpret the prompts differently.

Detail Parameter: Experiment with the “detail” parameter in your API call. Setting it to “high” might yield a more detailed analysis, similar to what you’re experiencing with ChatGPT.

Image Accessibility: Double-check that the image URL is publicly accessible and not restricted. Sometimes, sharing settings or permissions can affect how the API retrieves and processes the image.

API Key Permissions: Verify that your API key has the necessary permissions and hasn’t reached any usage limits that might restrict its functionality.

Review API Documentation: Revisit the Vision - OpenAI API documentation to ensure there haven’t been any updates or changes to the API that might affect how image URLs are processed.

If after trying these steps you’re still facing issues, it might be helpful to directly reach out to OpenAI support with specific details of your API request and the responses you’re receiving. They might be able to provide more insight into what’s happening behind the scenes or if there’s a specific aspect of the image processing that’s causing the discrepancy.

Please let us know how it goes, and if there’s anything else we can assist you with.

Best,
OpenAI Team

100x271180×320 9.52 KB

sps · October 28, 2024, 1:00am

The reason you’re not able to pass more than one image is because you’re not even passing the single base64 encoded image as image content block, you’re passing it as text, which is why you’re getting such a high token count.

Here’s python boilerplate code from docs to help you:

import base64
from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image?",
        },
        {
          "type": "image_url",
          "image_url": {
            "url":  f"data:image/jpeg;base64,{base64_image}"
          },
        },
      ],
    }
  ],
)

print(response.choices[0])

I highly recommend reading vision docs before you continue.

_j · October 28, 2024, 1:06am

The quality is not impaired by the rate limit mechanism.

The rate limiter is a simple estimate. The rate limiter is like a firewall. The only function is to block API requests from reaching AI models if the limit set or the limits of an organization are exceeded.

The language tokens are an estimate that are close but not the actual amount.
The images have a fixed rate consumption regardless of any settings: 771 tokens per image

Because it must block excessive requests, neither deep inspection nor advanced computation is used on the API request. That means that yes, some requests impact accumulated rate more than their AI consumption, while others have a lesser impact than true usage.

Despite all your posted evidence for some unknown theory, the rate limiter and you being rejected for exceeding it has nothing to do with the quality of vision or output.

When you are making requests correctly, in many cases, you can send multiple images in a single user message to obtain unique classifications or descriptions. The quality degrades after five or ten.

mdear · October 31, 2024, 1:39am

Thanks, I’ll make sure to limit each inference request with somewhere between five and ten. Advice appreciated.

mdear · October 31, 2024, 1:40am

Great catch! That’s why other sets of eyes are so important. When I sent the message blocks as native types rather than json text, I was able to make my requests as expected. Thanks so much for noticing this. Respect.

Topic		Replies	Views
Max_tokens not set, truncated return with "finish_reason": "stop" API gpt-4 , api	9	6747	April 24, 2024
Markdown Formatting Issues with GPT-5 API	18	7326	September 1, 2025
New 4-turbo model has a unique limit? Or is this a bizarre hallucation? API	18	4732	January 26, 2024
GPT-4 Vision - Maximum Amount of Images? API	5	18323	December 28, 2023
Custom GPT hallucination issues in my GPT GPT builders prompt , prompt-engineering , custom-gpt , custom-gpts , gpt5	15	435	September 29, 2025

Image based inference to gpt4o appears to be severely impaired due to excessively naive token counting

Related topics