Using the Batch API for mass text generation

I’m planning a large-scale project to generate marketing texts for approximately 600,000 products using their EAN codes. For each product, I need to create several types of texts: prospectus headlines, advertising copy, and three different poster texts, each with specific character limits. I’m considering using the OpenAI Batch API with GPT-4o or GPT-4o-mini for this task. My questions are:

  • Is the Batch API suitable for processing this volume of data efficiently?
  • What would be the estimated cost for processing 600,000 EANs, assuming about 500 tokens per product?
  • Are there any best practices or recommendations for handling such a large-scale text generation project?
  • How can I ensure consistent quality across all generated texts?
  • Are there any rate limits or other technical considerations I should be aware of?
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "Du bist ein Assistent für Produktbeschreibungen und Werbetexte. Generiere für jedes Produkt die angeforderten Texte unter Berücksichtigung der angegebenen Zeichenbeschränkungen."
    },
    {
      "role": "user",
      "content": "Generiere für jedes Produkt folgende Texte:\n1. Prospektüberschrift (max. 70 Zeichen)\n2. Prospektwerbetext (max. 215 Zeichen)\n3. Plakattext1 (max. 70 Zeichen)\n4. Plakattext2 (max. 35 Zeichen)\n5. Plakattext3 (max. 215 Zeichen)\n\nVerwende die gegebenen EAN und Artikeltexte als Basis für deine Generierung."
    }
  ],
  "requests": [
    {
      "ean": "4001796411748",
      "artikeltext": "Gipsplattenschrauben FG 3,9x25, 1000 Stk."
    }
  ]
}


and, as a test, the output was delivered like this:

{
  "responses": [
    {
      "ean": "4001796411748",
      "artikeltext": "Gipsplattenschrauben FG 3,9x25, 1000 Stk.",
      "texts": {
        "prospekt_ueberschrift": "Hochwertige Gipsplattenschrauben FG - 3,9 x 25mm, 1000 Stück Packung",
        "prospekt_werbetext": "Unsere FG Gipsplattenschrauben 3,9x25mm bieten höchste Qualität für professionelle Anwendungen. Mit 1000 Stück pro Packung sind Sie bestens ausgestattet für Ihre Projekte. Erleben Sie effizientes und präzises Arbeiten!",
        "plakattext1": "FG Gipsplattenschrauben: Professionelle Qualität, 1000 Stück, 3,9x25mm",
        "plakattext2": "FG Schrauben: Profi-Qualität!",
        "plakattext3": "FG Gipsplattenschrauben 3,9x25mm: Ihre erste Wahl für professionelle Trockenbauarbeiten. 1000 Stück pro Packung garantieren effizientes Arbeiten. Erleben Sie höchste Qualität und Präzision für Ihre anspruchsvollen Projekte!"
      }
    }
  ]
}

This dataset was just picked at random, but that level of quality would be good enough for my purposes.

Any insights or advice from those who have experience with similar large-scale projects would be greatly appreciated.

To be fair, I have never worked with the OpenAI API, so don’t mock my code; it may not work. These are just samples of how I would like the results to be “delivered”.

Hi and welcome to the Forum!

As a starting point, I’d strongly recommend reading through the detailed batch guide if you have not already.

Regarding your specific questions:

Yes, in general it’s suitable for this.

This depends on the model. Here’s the link to the latest pricing, which includes the pricing for the batch API.

https://openai.com/api/pricing/
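To turn the pricing into a concrete number, the arithmetic is simple: total tokens times the per-token batch rate, split into input and output. A minimal sketch follows; the dollar rates and the input/output split below are placeholder assumptions, not actual OpenAI prices — substitute the current figures from the pricing page.

```python
# Back-of-the-envelope cost estimate -- a sketch only.
# The per-1M-token rates used below are PLACEHOLDER ASSUMPTIONS;
# look up the current batch prices before relying on the result.

def estimate_cost(num_products, tokens_per_product,
                  input_share, price_in_per_1m, price_out_per_1m):
    """Rough cost: splits the total token budget into input/output."""
    total_tokens = num_products * tokens_per_product
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens / 1_000_000 * price_in_per_1m
            + output_tokens / 1_000_000 * price_out_per_1m)

# 600,000 EANs at ~500 tokens each = 300M tokens total.
# Assuming (hypothetically) $0.50 per 1M input tokens, $1.00 per 1M
# output tokens, and a 40/60 input/output split:
cost = estimate_cost(600_000, 500, 0.4, 0.50, 1.00)
print(f"~${cost:,.2f}")  # ~$240.00 under these assumed rates
```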

You may want to consider one- or two-shot prompting to give the model a clearer idea of the desired output.
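One-shot prompting just means inserting one worked input/output pair into the message list before the actual request. A sketch of what that could look like, reusing the screws product from your post — the exact output shape and the second product line are illustrative assumptions, not a prescribed format:

```python
import json

# One-shot prompting sketch: show the model a single worked example
# (input -> desired output) before the real product.

system_prompt = (
    "You are an assistant for product descriptions and advertising copy. "
    "For each product, generate the requested texts while respecting the "
    "stated character limits."
)

example_input = "EAN 4001796411748: Gipsplattenschrauben FG 3,9x25, 1000 Stk."
example_output = json.dumps({
    "prospekt_ueberschrift": "...",  # fill with a real, hand-written example
    "prospekt_werbetext": "...",
    "plakattext1": "...",
}, ensure_ascii=False)

def build_messages(product_line):
    """Assemble a one-shot message list for a single product."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_input},        # shot: input
        {"role": "assistant", "content": example_output},  # shot: output
        {"role": "user", "content": product_line},         # actual request
    ]

messages = build_messages("EAN 0000000000000: Beispielartikel")  # dummy EAN
```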

Yes. As the documentation also describes, there are two types of limits in place. A single batch cannot contain more than 50k requests. Additionally, there is a batch queue limit, which refers to the total number of tokens that can be enqueued at any point in time. This limit depends on your Usage Tier.
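Concretely, the 50k-per-batch cap means 600,000 products must be split across at least 12 batch files. A minimal chunking sketch:

```python
# Split requests into batches no larger than the per-batch cap (50k),
# since a single batch cannot exceed that many requests.

def chunk(requests, batch_size=50_000):
    """Yield successive slices of at most batch_size requests."""
    for i in range(0, len(requests), batch_size):
        yield requests[i:i + batch_size]

# 600,000 products -> 12 batches of 50,000 each.
all_requests = list(range(600_000))  # stand-in for real request objects
batches = list(chunk(all_requests))
print(len(batches))  # 12
```

You would then submit each batch file separately, keeping an eye on the queued-token limit for your tier before enqueueing the next one.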


Couple of other points:

  • Ensure that you format your requests in line with the prescribed format (see Batch Guide).

  • Models cannot reliably count characters. Hence, there is little value in providing such specific character counts for each output. As per above, examples are one suitable way to guide the model on the style and length of the output, in addition to the specific instructions in your prompt.
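On the prescribed format: the batch input file is JSONL — one self-contained request object per line, each with a `custom_id`, `method`, `url`, and the request `body`. A sketch of building such a line is below; check the Batch Guide for the authoritative field list. Using the EAN as `custom_id` is my suggestion for matching responses back to products, not a requirement.

```python
import json

# Sketch of one line of the batch input file (JSONL). Each line is a
# complete chat-completions request wrapped with routing metadata.

def to_batch_line(ean, artikeltext):
    request = {
        "custom_id": ean,  # echoed back with the response, for matching
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "..."},  # your system prompt
                {"role": "user", "content": artikeltext},
            ],
        },
    }
    return json.dumps(request, ensure_ascii=False)

with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    f.write(to_batch_line("4001796411748",
                          "Gipsplattenschrauben FG 3,9x25, 1000 Stk.") + "\n")
```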
