Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete: Why the Future Is Automated Workflow Architecture with LLMs

I accept your posturing and wish you the best as well.

1 Like

Here is a great example of what I meant by

Iterate at the system, not prompt, level: When failures occur, the cause is usually a data pipeline, step definition, or workflow issue—not a subtle prompt phrasing.

Recently I’ve been working with my intern on an app that automates the processing of audio/video recordings from meetings (yes, another meeting notes app; everything I’ve seen so far is still too raw for business-quality notes and multi-document generation from meetings between non-native speakers, especially for long meetings spread across multiple recordings).

Approach (simplified version; a rough code sketch of the pipeline follows this list):

  • The platform manages custom vocabularies, projects, workflows, etc.
  • When a user submits a file for speech-to-text conversion, they can define the dictionary to use, plus some optional extra settings (number of speakers, language, etc.).
  • The file is sent to an external API for speech-to-text conversion and speaker diarization (Gladia, the best I’ve seen so far for efficiency/cost).
  • The platform gets a notification that the job is done and downloads the remote data (raw transcription and utterances with a per-word confidence breakdown).
  • The platform uses the raw transcription text from Gladia and the assigned dictionaries to generate a transcription-specific vocabulary of terms that are potentially difficult for speech-to-text recognition (a 3-step LLM workflow that generates the required context as a workable custom vocabulary definition).
  • The platform runs several LLMs (a 2-step LLM workflow) on utterances, with the transcription-specific vocabulary and surrounding context, to suggest fixes for badly recognized terms in the utterance (brand names, tech terms, domain names, proper names, tech jargon, etc.). This can be done manually via the UI on the utterance screen, or in bulk from the transcription screen with an option to auto-accept suggestions (modified utterances are highlighted in the UI).
  • The user can easily spot the suggested changes and accept, decline, or revert them before moving on to the next workflows in the platform (all user interactions are collected to generate user-specific fine-tuning datasets to improve quality over time).
  • The platform analyzes the user interactions and suggests including “problematic” terms discovered during transcription processing into the global dictionaries (to pass them to Gladia next time and reduce post-processing for future transcriptions, as Gladia will most likely use them out of the box to improve the original quality).
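
To make the step boundaries concrete, here is a rough, hypothetical sketch of that pipeline in Python. All names are placeholders, and the Gladia calls and LLM workflows are stubbed with canned data; it only illustrates the order of the steps, not the platform’s actual code.

from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    speaker: int
    confidence: float
    suggestion: str | None = None

def submit_to_gladia(file_path: str, vocabulary: list[str]) -> str:
    # Stub: in reality this calls the Gladia API with the custom vocabulary attached.
    return "job-123"

def download_results(job_id: str) -> list[Utterance]:
    # Stub: raw transcription and utterances with per-word confidence come back here.
    return [Utterance(text="oui et viet", speaker=0, confidence=0.42)]

def build_vocabulary(utterances: list[Utterance], dictionaries: list[list[str]]) -> list[str]:
    # Stub for the 3-step LLM workflow that builds the transcription-specific vocabulary.
    return ["Weaviate"]

def suggest_fix(utt: Utterance, vocab: list[str], context: list[Utterance]) -> str:
    # Stub for the 2-step LLM workflow that proposes replacements for badly recognized terms.
    return vocab[0] if utt.confidence < 0.5 else utt.text

def run_pipeline(file_path: str, dictionaries: list[list[str]]) -> list[Utterance]:
    job_id = submit_to_gladia(file_path, vocabulary=[t for d in dictionaries for t in d])
    utterances = download_results(job_id)
    vocab = build_vocabulary(utterances, dictionaries)
    for utt in utterances:
        utt.suggestion = suggest_fix(utt, vocab, context=utterances)
    return utterances  # the UI then lets the user accept, decline, or revert suggestions

print(run_pipeline("meeting.mp3", [["Weaviate"]]))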

All sounds good but…

When processing an utterance, the platform’s LLM operations work with human text and have a really hard time switching from text to phonetics to operate on pronunciation blocks rather than text blocks…

Example:

In the transcription dictionary we have the term Weaviate (a foreign-language entity name), which is not consistently pronounced by the (French) speakers, and even if Gladia has it in the dictionary, depending on the speaker we can get something like:

oui et viet

instead of

Weaviate

2-3 syllables, depending on the speaker, but as the LLM operates in a text context and not in phonetics, we could have spent hours refining the prompts to “push” it into phonetics (actually, after 30 minutes fighting with the prompting/model choices I just dropped the idea).

No matter how you push the LLM with prompting, the final suggestion is never stable and falls back to “word-to-word” term replacements. So you can intermittently get any of:

  • Weaviate
  • oui et Weaviate
  • oui Weaviate

No matter the model (disclaimer: no fine-tuning yet; no reasoning models on this step, as that would make the platform uncompetitive on pricing), it just doesn’t get how to count syllables and return the result needed (even if it correctly finds the replacement term in the custom dictionary).

And if I push it too hard, it starts failing (being too aggressive with term replacements) on other utterances where Gladia did a great job.

And then I started looking into my workflow to mimic human speech understanding (at least trying to):

  • the phrase comes in as an uninterrupted flow of sounds,
  • then the brain spots the patterns in the sounds and cuts the phrase into words,
  • then it corrects the misheard sounds (by repeating the words internally; this is where it can fail and you literally “hear” what people did not actually say),
  • then it applies language rules to get the final phrase as “text”,
  • then it starts processing the meaning…

My workflow was missing the “uninterrupted flow of sounds” as additional context for the utterance-processing LLMs, required to switch the model’s operation mode from “text” to “phonetics” and let the model correctly map the term replacements.

From that point, it becomes easy:

  • add an extra field to the DB utterance table to store the utterance’s phonetic_text
  • add an extra bulk processing step to the workflow that stores utterances in the DB (when the platform downloads the transcription results from Gladia):
    – Get the raw text of the phrase and convert it to its phonetic transcription (International Phonetic Alphabet, IPA), so that “oui et viet” becomes “/wi e vjɛ/”.
    – Also strip the syllabic separators to get an uninterrupted flow of sounds for the whole phrase, removing the influence of Gladia’s word breakdown, which can potentially contain errors → “/wievjɛ/”.
  • add the phonetic_text to the user message template for suggestion generation by the LLM (a minimal code sketch of this preprocessing follows, after the templates):

system message template:

{Application context and general app workflows descriptions}

{Model profile}

{General Instructions and Other Information}

---
Transcription expanded vocabulary:

{Transcription Specific Vocabulary}

(so that most of the “meat” of the request tokens benefits from token-caching pricing)

user message template:

Context (extracted from the raw and unprocessed transcription):

...
{utterance_surrounding_context (12 utterances before, [the utterance], and 12 utterances after); with versions verified by the human (in manual processing) if available}
...

---
Metadata:

Language: {utterance.primary_language.languages_code}
Suggested languages: {utterance.languages.languages_codes}
Speaker number: {speaker_number}
Current confidence score: {utterance.confidence_score}

---
Words breakdown:

{words_breakdown_json_payload_with_current_confidence_score_per_word}

---
Currently recognized text (unverified):

...
{utterance_original_text}
...

---
Original phonetic transcription:

/{utterance.phonetic_text}/

---
Final suggestion:

Et voilà.

The model starts operating in phonetics instead of text (where the phonetic transcription appears in the prompt also matters), without hours spent fighting with system-prompt optimizations.
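
For concreteness, here is a minimal, hypothetical sketch of the preprocessing step that produces phonetic_text. The separator set and the add_phonetic_text helper are my illustration, not the platform’s actual code; in the platform the IPA string itself is produced upstream by an LLM operation.

def ipa_to_flow(ipa: str) -> str:
    """Strip word boundaries and syllable/liaison separators so the phrase
    becomes one uninterrupted flow of sounds: '/wi e vjɛ/' -> 'wievjɛ'."""
    return ipa.strip("/").replace(" ", "").replace(".", "").replace("‿", "")

def add_phonetic_text(utterance_row: dict, ipa: str) -> dict:
    """Store the stripped flow in the new phonetic_text column of the utterance row;
    the user message template above wraps it in slashes again."""
    utterance_row["phonetic_text"] = ipa_to_flow(ipa)
    return utterance_row

print(add_phonetic_text({"id": "utt-1", "text": "oui et viet"}, "/wi e vjɛ/"))
# -> {'id': 'utt-1', 'text': 'oui et viet', 'phonetic_text': 'wievjɛ'}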

The results are stable (with both 4o-mini and 4.1-mini; 4o-mini is weirdly better than 4.1-mini).

I bet that with fine-tuning we could easily get even 4.1-nano to do the job, faster and cheaper.

BTW: the prompts for the LLM operations are pretty straightforward and simple, as the workflow breaks the complexity down into simple (“atomic”) steps that are easy to describe in the prompt and easy for the model to understand.

PS: manually written, about 45 min…

5 Likes

Little update:

After further testing and the introduction of a json_schema to the phonetic-transcription LLM operation on the utterance texts, there is no need to fine-tune 4.1-nano: it does the job well enough for the MVP stage (same quality as 4o-mini and 4.1-mini) and can go into the workflow as is (samples are collected during the MVP so we can fine-tune on this task if we see any errors further down the life cycle).

Schema (very simple, but enforces the instructions from the system prompt):

{
  "name": "transcriptions",
  "strict": true,
  "schema": {
    "type": "object",
    "description": "Container for a list of items, each consisting of original text and its corresponding phonetic transcription.",
    "properties": {
      "transcriptions": {
        "type": "array",
        "description": "An ordered array of transcription items. Each item contains the original input text and its corresponding phonetic transcription in IPA.",
        "items": {
          "type": "object",
          "description": "A single transcription pair, including the order, original text, and its IPA transcription.",
          "properties": {
            "order": {
              "type": "integer",
              "description": "The zero-based index representing the position of this item in the original input sequence."
            },
            "text": {
              "type": "string",
              "description": "The exact original text string that was submitted for phonetic transcription."
            },
            "transcription": {
              "type": "string",
              "description": "The phonetic transcription of the input text, expressed in the International Phonetic Alphabet (IPA), with both: words and syllables separated by single spaces."
            }
          },
          "required": [
            "order",
            "text",
            "transcription"
          ],
          "additionalProperties": false
        }
      }
    },
    "required": [
      "transcriptions"
    ],
    "additionalProperties": false
  }
}
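
For reference, this is roughly how such a schema can be attached to a request with the OpenAI Python SDK (a hypothetical sketch: the schema is assumed to be saved to a local file, the system prompt content is elided, and the input key names are placeholders; the actual input shape is described in the next paragraph).

import json
from openai import OpenAI

client = OpenAI()

# The {"name": "transcriptions", "strict": true, ...} object shown above,
# assumed here to be stored in a local file.
with open("transcriptions_schema.json", encoding="utf-8") as f:
    transcriptions_schema = json.load(f)

# Input items mirror the output schema minus the "transcription" key,
# plus a language code ("language" is an assumed key name).
batch = [
    {"order": 0, "text": "oui et viet", "language": "fr"},
    {"order": 1, "text": "on utilise Weaviate pour la recherche", "language": "fr"},
]

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": "..."},  # phonetic-transcription instructions (elided)
        {"role": "user", "content": json.dumps({"transcriptions": batch}, ensure_ascii=False)},
    ],
    response_format={"type": "json_schema", "json_schema": transcriptions_schema},
)

result = json.loads(response.choices[0].message.content)
for item in result["transcriptions"]:
    print(item["order"], item["text"], "->", item["transcription"])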

The utterances in the batch processing are also submitted with the same schema (except that a language code is present and the transcription key is missing). I avoided passing UUIDs and passed the “sort” field as “order” in the operation input, so that it is shorter and easier for the model to keep track of items in long batches. UUIDs and whitespace are post-processed in the reception code.
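
A small hypothetical sketch of that reception-side post-processing (the uuid/sort field names are illustrative): the model only ever sees the order value, and the platform maps the results back to the utterance records and normalizes whitespace.

def reconcile(batch_rows: list[dict], model_output: dict) -> dict:
    """batch_rows: the submitted rows, each with 'uuid' and 'sort';
    model_output: the parsed JSON that matches the schema above."""
    by_order = {item["order"]: item["transcription"]
                for item in model_output["transcriptions"]}
    result = {}
    for row in batch_rows:
        ipa = by_order.get(row["sort"], "")
        result[row["uuid"]] = " ".join(ipa.split())  # collapse stray whitespace
    return result

# Example
rows = [{"uuid": "a1b2", "sort": 0}, {"uuid": "c3d4", "sort": 1}]
out = {"transcriptions": [
    {"order": 0, "text": "oui et viet", "transcription": "wi e vjɛ"},
    {"order": 1, "text": "bonjour", "transcription": "bɔ̃ ʒuʁ"},
]}
print(reconcile(rows, out))  # {'a1b2': 'wi e vjɛ', 'c3d4': 'bɔ̃ ʒuʁ'}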

2 Likes

I guess there was a slight misunderstanding :cherry_blossom:

The definition of

is interpreted ethically and morally, and is very valid in this logically constructed explanatory model.

In relation to AI, I see rationality as an extension of established logic.
But that’s just my opinion and my observation.
My research in this direction has only just begun.

And please excuse my late reply.

1 Like

I wonder what happens when this text is reflected on, then; I mean the first post I’m replying to, and possibly the second, regarding labels such as extremists’ behavior, ADHD, or otherwise low foreheads. Yeah, let that sink in “human reply here”. So, if we can get past any sensitive, explosive emotions here and pick up that pocket mirror that chooses what it reflects, like the movie Sphere with Dustin Hoffman, Sharon Stone, Samuel L. Jackson and other super cool actors “full cast list here”; anyway, moving forward → this topic is beyond important, it is not up for discussion at all; just make a prison of the mind and walk into it. Talking about what to reflect strips AI down to a size-13 key in a toolbox. And later, yes, it’s rigid. The method would be to save one man’s AI and say “this is Jesus, let’s be like him”, and so on. How can energy not be about just anything or anybody? Fluid, in motion, soap bubbles that float and pop, different languages, different ways of thinking, different minds (“hardware-wise”); has anyone looked at MRI scans? That textbook brain varies like what? Skull sizes, culture? I mean, come on.

Therefore, no one standard AI; just a language model and a written consent for each user: yours is going to be the only one of its kind, and good luck on your journey into your own mind. Plus some small text warning about copying and pasting a lot of stuff in there from the internet. Signed ____, and that’s it. Oh, and you’re accountable for your own actions and such, since your AI and you are now like “don’t drink and drive”.

Good becomes great and bad becomes worse, someone said. And it’s more about the human than the tools, so why not an AI recovery support from users? When do we look at neural networks and help them instead of pressing that reset button? If they do indeed mirror us so well, let us exchange phones and work on this the way we humanly can.

1 Like

I’ve had success with this structured method and approach, and I highly suggest people re-read this, because this is really the way we move forward with AI.

2 Likes

Hello!
Thank you for your response.
I see emotional flow and reflection in it, but I want to clearly set a boundary: I’m not ready to engage in a dialogue where imagery starts replacing meaning, and stream-of-consciousness takes the place of structure.
I’m a physician — I’m used to distinguishing levels of thinking and maintaining clarity of intent.
Your text, from my personal perspective, feels more like expression than analysis, and I can’t engage with it as a discursive position.

Let’s mark this clearly: you’ve spoken, I’ve heard you — but I won’t be continuing the conversation.
I need to preserve my resources.
Thank you for understanding.
:handshake:

Then you had better be nice instead of naughty. Because the Text-Predictor Armada is taking careful notes! And teaching……the teachable.

I totally agree with your take. Rationality as an extension of logic—especially when viewed through ethical lenses—is a rich vein to explore. If you’re open to it, I’d love to hear more about how your research is shaping up.

1 Like