Streaming Responses - Exploring Cost-Efficient Alternatives to SSE with AWS Lambda & API Gateway

Hello everyone,

I’m currently working on a project where I want to stream real-time responses, with a preference for using Server-Sent Events (SSE). WebSockets aren’t an option for my use case.

I’ve been exploring AWS Lambda combined with API Gateway, but from my research, it seems they don’t support SSE (unless something has changed recently).

I prefer using lambdas because I’m going to use this app internally, with the following assumptions:

  • 10 people using the application.
  • They will make 100 conversations per day.
  • Conversations take 40 seconds to complete.

Main questions:

  1. Can anyone confirm if, as of today, AWS Lambda and API Gateway still don’t have built-in support for SSE?
  2. What cost-efficient alternatives might be available for achieving this?

Alternatives I’ve explored:

  • Using AWS Lambda with polling / long polling - doesn’t provide as real-time an experience as SSE and isn’t that cost-effective.
  • Implementing SSE with Amazon EC2/ECS/Fargate - not cost-efficient compared to lambdas.
  • Utilizing Cloudflare Workers as an intermediary to handle SSE - that seems like a nice alternative, but I want to avoid services outside of AWS.
  • AWS AppSync with a real-time data approach - I don’t have experience with this solution. In theory it could provide real-time streaming for lambdas, but I would like to confirm it will indeed work as expected.

I’m open to exploring other solutions if they align with the goal of cost-effective, real-time streaming. If anyone has tried a different approach or is aware of another solution, please share!

Thank you!

Hi and welcome to the developer forum!

Can I ask why you’re looking at AWS for an internal project? If you were to use a VPS or even an internal Linux box, you would have full access to SSE and any other niceties that are often restricted on AWS and GCP.

Have you thought about using DynamoDB for this?

So the event goes to your lambda through API Gateway asynchronously. Your lambda does some work, then writes an entry into a DynamoDB table. Another lambda then receives each new entry in this table as its input (via DynamoDB Streams) and can respond back.

Does something like that work for you?
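
If it helps, here’s a rough sketch of that flow using the AWS SDK for JavaScript v3. In practice the two handlers would be separate functions; the table name, key names, and the placeholder work function are assumptions, not a tested implementation:

import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});

// Placeholder for the actual work (e.g. the OpenAI call).
async function doTheWork(event) {
  return { conversationId: event.conversationId ?? 'demo', answer: '...' };
}

// First lambda: invoked asynchronously via API Gateway, writes the result to DynamoDB.
export const producerHandler = async (event) => {
  const { conversationId, answer } = await doTheWork(event);
  await ddb.send(new PutItemCommand({
    TableName: 'ConversationResults', // assumed table name
    Item: {
      pk: { S: conversationId },
      answer: { S: answer },
      createdAt: { N: String(Date.now()) },
    },
  }));
};

// Second lambda: triggered by the table's DynamoDB Stream, receives each new entry.
export const consumerHandler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === 'INSERT') {
      const item = record.dynamodb.NewImage;
      // Respond back to the client from here (poll endpoint, notification, etc.).
      console.log('New result for conversation', item.pk.S);
    }
  }
};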

Also, lambdas have a native timeout of up to 15 minutes, so just sitting in the lambda might be viable for you too (but your gateway times out after 30 seconds, so you still need to call it asynchronously given your 40-second wait). This is an “anti-pattern”, but there’s no real difference when you have to wait synchronously for an OpenAI API call to return anyway.

Hello Foxabilo, and thanks for your suggestion!

The primary reason I’m keen on using AWS for this project is that I aim to package and sell this product to other small teams. The idea is to provide an automated way to install it on their AWS accounts, hence the focus on this platform. I understand the flexibility that comes with using a VPS or an internal Linux box. Still, the goal is to make the setup seamless for potential customers who may already be on the AWS ecosystem.

Hi @curt.kennedy,

Thank you for your insights!

I did consider the approach of using DynamoDB in tandem with Lambda functions. However, my primary goal is to offer a real-time experience for users when streaming responses. The method you proposed is somewhat akin to a more scalable version of long polling.

I’ve visualized the streaming challenge with AWS Lambda and API Gateway using a Mermaid diagram:

sequenceDiagram
    participant Client
    participant API Gateway
    participant Lambda
    participant OpenAI API

    Client->>API Gateway: Sends Request for Chat Response
    API Gateway->>Lambda: Forwards Request
    Lambda->>OpenAI API: Begins Chat Interaction
    OpenAI API->>Lambda: Starts sending streaming response
    Lambda-->>API Gateway: Tries to stream SSE
    Note over API Gateway: Waits for complete response (No SSE support)
    API Gateway-->>Client: Sends full response after Lambda completes
    Note over Client: Real-time streaming feel is lost

The main bottleneck arises from API Gateway’s inability to support Server-Sent Events (SSE) effectively.

Regarding your suggestion of using DynamoDB:

  1. Lambda & DynamoDB Interaction: Your approach of using DynamoDB as an intermediate layer does seem feasible. I’m considering an asynchronous method where results are stored in DynamoDB, and then another Lambda fetches and serves these results.
  2. Lambda Timeouts and Strategy: While you’re right about the native Lambda timeout being up to 15 minutes, the challenge I face is more with API Gateway’s 30-second timeout. To tackle this, I’ve thought of a strategy where, as I approach the 30-second limit, I invoke another Lambda to continue the work. This would essentially break longer tasks into 30-second chunks, each handled by a separate Lambda invocation (a rough sketch of such an asynchronous hand-off follows this list).
  3. Real-Time Experience & Cost Considerations: My primary goal is to ensure a near real-time experience for users. Polling DynamoDB might introduce a slight delay compared to directly streaming responses using SSE. On the cost side, based on assumptions:
  • 10 people using the application.
  • 100 conversations per day per person.
  • Conversations average 40 seconds.

A rough estimate indicates SSE would cost around $0.09/month, whereas the DynamoDB method would hover around $1.625/month. SSE seems more cost-effective, especially when anticipating increased traffic.
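
On point 2, a minimal sketch of what that asynchronous hand-off could look like with the AWS SDK for JavaScript v3 (the function name and payload shape are assumptions):

import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// Called from the running handler shortly before the 30-second API Gateway
// limit is reached; hands the remaining work to a fresh asynchronous invocation.
export async function continueLater(state) {
  await lambda.send(new InvokeCommand({
    FunctionName: 'conversation-worker',          // assumed function name
    InvocationType: 'Event',                      // fire-and-forget (asynchronous)
    Payload: Buffer.from(JSON.stringify(state)),  // whatever is needed to resume
  }));
}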

Considering all this, do you believe there’s a way to lessen latency with the DynamoDB approach? I’m keen on hearing more suggestions or optimizations for this model, especially keeping cost-effectiveness in mind.

Thanks again for your valuable input!


You could easily make a desktop or intranet application in e.g. Python using existing “iron”, given these low performance requirements.

I put my own JavaScript-based chat client on a DigitalOcean droplet that already runs numerous other applications. Since the client hardly does anything except UI and runs completely in the browser without loading the server at all, even that is way overpowered (like “10000” times).

Sure, we could, @redsmurph, but we’re building a web application.

When in doubt, ask GPT-4 (not sure I follow the answer, does it make sense?):

AWS API Gateway didn’t traditionally support Server-Sent Events (SSE) natively, but with the introduction of HTTP APIs (a new type of API in API Gateway), real-time, server-sent events are more feasible. Below is a step-by-step process on how you could configure your HTTP API to make it work with SSE.

Remember that this might not work smoothly, given limitations in the API Gateway itself (like a 29-second maximum connection idle time). Other AWS services like App Runner or deploying on EC2 directly might be more suited for SSE.

Pre-requisites:

  1. Basic understanding of AWS services, in particular API Gateway, Lambda and DynamoDB.

  2. Having AWS account to create and manage resources.

  3. AWS CLI installed and configured with necessary permissions.

  4. Knowledge on JavaScript and Node.js (for Lambda function).

  5. Create a new HTTP API in API Gateway.

  6. Create a new Lambda function.

    You can use the blueprint in the AWS Lambda console to create a new Node.js function named ssePublisher.

    Sample Code:

    exports.handler = async (event) => {
      const id = Date.now();

      // Here, messages could be sourced from a DynamoDB stream, Kinesis, or any other event source
      const messages = ['message 1', 'message 2'];

      const response = {
        statusCode: 200,
        headers: {
          'Content-Type': 'text/event-stream',
          'Cache-Control': 'no-cache',
          'Connection': 'keep-alive',
          'Access-Control-Allow-Origin': '*',
          'Access-Control-Allow-Headers': 'Origin, X-Requested-With, Content-Type, Accept'
        },
        isBase64Encoded: false,
        multiValueHeaders: {
          'Connection': ['keep-alive'],
          'Cache-Control': ['no-cache'],
          'Content-Type': ['text/event-stream'],
          'Access-Control-Allow-Origin': ['*'],
        },
        body: messages.map((message) => `id: ${id}\ndata: ${message}\n\n`).join('\n')
      };

      return response;
    };
    
  7. Connect your API Gateway to your Lambda function.

    Create a new route in your HTTP API, set its path (for example, /stream), and set its integration target to be your ssePublisher Lambda function.

  8. Deploy your API.

    You can use an automated CI/CD pipeline or manually deploy from the console.

  9. Connect to your SSE API.

    Use an SSE-capable HTTP client (like a modern web browser’s Fetch API) to connect to your API + route (for example, https://<your-api-id>.execute-api.<region>.amazonaws.com/stream), and listen to the Server-Sent Events.

Remember that SSE connections aren’t intended to be kept open indefinitely - API Gateway can and will close idle connections, so you have to handle re-connecting on the client side.

This set-up is basic and lacks features you would need for a production-ready SSE system such as connection management, handling large volumes of data, multiple consumers, filtering events, among others.

Scratch the GPT-4 answer above - I think you need to check out the new-ish Lambda function URLs.

You bypass API Gateway and get the full 15 minutes in the lambda, skipping the 30-second gateway timeout.
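
Creating one programmatically looks roughly like this with the AWS SDK for JavaScript v3 (function name and auth type are assumptions; the console or CLI works just as well):

import { LambdaClient, CreateFunctionUrlConfigCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// Attach a URL to an existing function; RESPONSE_STREAM lets the function
// push bytes to the caller as they are produced instead of buffering.
const { FunctionUrl } = await lambda.send(new CreateFunctionUrlConfigCommand({
  FunctionName: 'chat-stream',    // assumed function name
  AuthType: 'AWS_IAM',            // or 'NONE' for an unauthenticated URL
  InvokeMode: 'RESPONSE_STREAM',  // required for Lambda response streaming
}));

console.log('Function URL:', FunctionUrl);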


And also use the new AWS Lambda response streaming feature


Wow, thanks for the quick and insightful feedback!

@curt.kennedy: About the GPT-4 SSE route, I’ve previously explored that avenue. I suspect the detailed response about setting up SSE with HTTP API & Lambda is a sort of “hallucination” from GPT.

Now, your mention of the Lambda function URLs? That’s pure gold! Bypassing the API Gateway and getting that juicy 15 minutes in Lambda? That might just be the game changer I was looking for. I’m super eager to dive into this and see it in action. Will definitely keep you all posted on the results.

@henrysibanda: The AWS Lambda response streaming feature you mentioned gives me even more detail. Much appreciated!

I’m genuinely appreciative of the collective expertise here. As I venture into these suggested solutions, I’ll keep everyone updated.

Thank you!


I’ve tested the suggestions regarding bypassing the API Gateway and using Lambda function URLs, along with the AWS Lambda streaming feature. I can confirm that this approach effectively addresses the issues I was facing!

Thank you, @curt.kennedy and @henrysibanda, for pointing me in the right direction! I appreciate the insights and guidance from this forum!

Marking @curt.kennedy’s answer as a solution, but @henrysibanda’s link is important too!


Otherwise you might have just used SQS and sent tiny little messages :smiley:
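
For completeness, sending such a message is a one-liner with the AWS SDK for JavaScript v3 (the queue URL and message shape are placeholders):

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Push a small progress message onto a queue that a poller or another consumer can drain.
await sqs.send(new SendMessageCommand({
  QueueUrl: 'https://sqs.<region>.amazonaws.com/<account-id>/chat-progress', // placeholder
  MessageBody: JSON.stringify({ conversationId: 'demo', delta: 'partial text...' }),
}));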

I’m working with the Lambda function URLs solution suggested by @curt.kennedy for my own use case, but the function output isn’t what I’m expecting. @dmiskiew, did you get your solution to a point (or was it part of your use case) where the Lambda response stream chunks were identical to the stream given by the OpenAI API?

My issue is that, when calling the Lambda function URL, I’m getting chunks that are made up of multiple OpenAI response chunks (and oftentimes one OpenAI chunk is split across two Lambda response chunks, making it impossible to parse the data on my client). For example, here are some chunks I’m receiving from my Lambda function response:

Chunk #1:

{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"A"},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"ye"},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":","},"finish_reason":null}]}{"id":"chat

Chunk #2

cmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" I"},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" have"},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" come"},"finish_reason":null}]}{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" across"}

… and so on until the whole response is returned, whereas I’m expecting individual chunked objects like what OpenAI returns, e.g.:

Chunk #1

{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"A"},"finish_reason":null}]}

Chunk #2

{"id":"chatcmpl-8SY6q514tz0hzLJids92OS7Lv8Q0E","object":"chat.completion.chunk","created":1701814992,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"ye"},"finish_reason":null}]}

Here is my Lambda function code:

import OpenAI from 'openai';
import util from 'util';
import stream from 'stream';

const { Readable } = stream;
const pipeline = util.promisify(stream.pipeline);

/* global awslambda */
export const handler = awslambda.streamifyResponse(
  async (event, responseStream, _context) => {
    const body = JSON.parse(event.body);

    const openai = new OpenAI({
      apiKey: 'xxx',
    });

    // Request a streamed chat completion from OpenAI
    const response = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: body.messages,
      stream: true,
    });

    // Wrap the OpenAI stream so it can be piped to the function's response stream
    const requestStream = Readable.from(response);

    // safely pipe the OpenAI stream to the function response stream
    await pipeline(requestStream, responseStream);
  }
);

etc. Haven’t been able to figure this one out, but I’m new to streaming with Node, so maybe there’s something obvious I’m missing.

The docs say you should use their libraries, and it looks like your implementation might be different than the docs (ref).

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Wrapped into the SDKs is the “Server-sent events” standard.


@curt.kennedy Thank you for providing that doc reference! I’ve changed my implementation to use the OpenAI-recommended way of looping through the chunks with for await…of. For sending the response while processing each chunk, I’m following the AWS docs here:

https://docs.aws.amazon.com/lambda/latest/dg/configuration-response-streaming.html

import OpenAI from 'openai';

/* global awslambda */
export const handler = awslambda.streamifyResponse(
  async (event, responseStream, _context) => {
    const body = JSON.parse(event.body);

    const openai = new OpenAI({
      apiKey: 'xxx',
    });

    const requestStream = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: body.messages,
      stream: true,
    });

    for await (const chunk of requestStream) {
      console.log(chunk);
      
      responseStream.write(chunk);
    }

    responseStream.end();
  }
);

The odd part is, logging each chunk in the for loop works as expected (one chunk per loop iteration), but the Lambda’s response is still streaming multiple/partial grouped chunks :thinking: This might belong on an AWS forum after all.

EDIT

The readable Node stream that the OpenAI SDK creates is in objectMode. However, it turns out that the response stream initialized by Lambda is not; I believe OpenAI’s chunks are being converted into strings and, with each iteration of the loop, appended to the response body, making it seem as though multiple chunks were being returned at once.
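
In case it’s useful to others, one way I’d work around this (an untested sketch, not an official recommendation) is to serialize each chunk into its own SSE-style data: event before writing, so the client can split on blank lines no matter how the transport groups the bytes:

import OpenAI from 'openai';

/* global awslambda */
export const handler = awslambda.streamifyResponse(
  async (event, responseStream, _context) => {
    const body = JSON.parse(event.body);

    const openai = new OpenAI({
      apiKey: 'xxx',
    });

    const requestStream = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: body.messages,
      stream: true,
    });

    for await (const chunk of requestStream) {
      // Frame each chunk as an SSE event; the client splits on "\n\n"
      // regardless of how Lambda buffers or regroups the bytes.
      responseStream.write(`data: ${JSON.stringify(chunk)}\n\n`);
    }

    responseStream.write('data: [DONE]\n\n');
    responseStream.end();
  }
);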

I think you might want to have a look at AWS App Runner. We use this service a lot for similar use cases - it’s really easy to deploy and very low cost.