Calling TTS from a Swift app

I saw that OpenAI published an endpoint for text-to-speech, but I could only find a sample for Node.js and one for Python, both needing extra installs on my Mac. I didn't want that :wink:

So, I wrote a bit of code in Swift that produces a file with the spoken text.
No guarantees! But it works for me!

  • Please add your own error handling and remove the print statements!
  • Please give the file a proper name; right now it gets a generated one.
  • The file is saved as a temporary file; with the settings below it is an MP3.
  • Or use the stream to play the resulting speech in your app directly.
  • You can remove the organisation header if you are a one-person team.
import Foundation

class OpenAITTS {
    
    private enum constants {
        enum openAI {
            static let url = URL(string: "https://api.openai.com/v1/audio/speech")
            static let apiKey = "<your apiKey here>"
            static let organisation = "<your organisation ID here>"
        }
    }
    
    private var urlSession: URLSession = {
        let configuration = URLSessionConfiguration.default
        let session = URLSession(configuration: configuration)
        return session
    }()
    
    func speak(_ text: String) {
        guard let request = self.request(text) else {
            print("No request")
            return
        }
        self.send(request: request)
    }
    
    private func send(request: URLRequest) {
        
        let task = self.urlSession.downloadTask(with: request) { urlOrNil, responseOrNil, errorOrNil in
            if let error = errorOrNil {
                print(error)
                return
            }

            if let response = responseOrNil as? HTTPURLResponse {
                print(response.statusCode)
            }
            
            guard let fileURL = urlOrNil else { return }

            do {
                let documentsURL = try FileManager.default.url(for: .documentDirectory,
                                                               in: .userDomainMask,
                                                               appropriateFor: nil,
                                                               create: false)
                let savedURL = documentsURL.appendingPathComponent(fileURL.lastPathComponent)
                print(savedURL)
                try FileManager.default.moveItem(at: fileURL, to: savedURL)
            } catch {
                print("file error: \(error)")
            }
        }

        task.resume()
    }
    
    private func request(_ text: String) -> URLRequest? {
        guard let baseURL = Self.constants.openAI.url else {
            return nil
        }
        
        var request = URLRequest(url: baseURL)
        request.httpMethod = "POST"

        let parameters: [String: Any] = [
            "model": "tts-1",
            "voice": "nova",
            "response_format": "mp3",
            "speed": 0.98,  // hidden feature in OpenAI TTS! Range: 0.25 - 4.0, default 1.0
            "input": text
        ]

        request.addValue("Bearer \(Self.constants.openAI.apiKey)", forHTTPHeaderField: "Authorization")
        request.addValue(Self.constants.openAI.organisation, forHTTPHeaderField: "OpenAI-Organization") // Optional
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        if let jsonData = try? JSONSerialization.data(withJSONObject: parameters) {
            request.httpBody = jsonData
        }

        return request
    }
}
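To go with the bullet about giving the file a proper name: a minimal sketch that moves the temporary download to a directory of your choice under a name you pick. `saveSpeechFile` and its defaults are my own naming, not part of any API.

```swift
import Foundation

// Sketch: move the temporary file the download task hands you to a chosen
// name in a chosen directory. On iOS you would typically pass the Documents
// directory; the "speech.mp3" default is just an example.
func saveSpeechFile(from tmpURL: URL,
                    as name: String = "speech.mp3",
                    in directory: URL) throws -> URL {
    let destination = directory.appendingPathComponent(name)
    // Overwrite a previous file with the same name, if any.
    if FileManager.default.fileExists(atPath: destination.path) {
        try FileManager.default.removeItem(at: destination)
    }
    try FileManager.default.moveItem(at: tmpURL, to: destination)
    return destination
}
```

In the download completion handler you would call this with `fileURL` and the Documents directory instead of reusing the generated `lastPathComponent`.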

Oh, nice find on the speed. I'll have to add that to my system.


Hi Ben,

I found one more argument. I updated the code sample.

I do not see what was added, haha. Either that, or I knew it already and didn't recall it wasn't there to begin with :slight_smile:

"response_format": "mp3",

Ah, that is another good find. That one I had used before when playing with streaming the voice. I look forward to the voice-cloning system down the road, which will be fun.

Hi Ben,
Are there any other arguments/parameters?

Not that I have seen yet; I pretty much just know what they had in the documents. I've only been using OpenAI's voice for the last few months. Before that I used a lot of other systems during testing: ElevenLabs, gTTS, Edge TTS, Open.ai, and a few other less realistic voices. ElevenLabs is my favourite, but $$$ to run on a real-time bot. OpenAI is almost as good, and once they get the voice-clone tech going it will get better, IMO, since we can then source other voice options to match the projects.
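For reference, these are the request-body fields I know to be documented for the speech endpoint at the time of writing; treat the comments as a sketch and check the current API docs for the up-to-date lists.

```swift
import Foundation

// Documented request-body fields for /v1/audio/speech, as I understand them.
let parameters: [String: Any] = [
    "model": "tts-1",          // "tts-1-hd" is the higher-quality variant
    "input": "Text to speak",
    "voice": "nova",           // alloy, echo, fable, onyx, nova, shimmer
    "response_format": "mp3",  // opus, aac and flac are also documented
    "speed": 1.0               // range 0.25 ... 4.0, default 1.0
]
let body = try! JSONSerialization.data(withJSONObject: parameters)
```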

For a free voice, Edge TTS is the best. It's processed locally and is almost instant. While not amazing, it's pretty much Google Assistant/Siri/Alexa quality, but it lacks emotion on the level of the top systems in terms of range, etc.

Also, ElevenLabs is now testing sound generators, which are pretty sweet. I think OpenAI will come out with something like that down the road as well, to go with the Sora system. The AI bubble is just starting :slight_smile:

Is it possible to get the status of the TTS generation once you call it? I find that when I call the TTS, it often takes 5 - 10 seconds before I get a response, depending on how long the text is. Is there a way to determine the time it will take to generate the audio or a way to get real-time updates on the generation? Appreciate any help

You can stream it, depending on your setup. The cause of the delay is building the file and then playing it back, whereas a stream goes straight to an output.

OpenAI supports it, and I think ElevenLabs does too, but I have not played with it yet. Edge TTS is almost instant; while not a Ferrari, it's pretty good.

You could build an algorithm to predict the approximate time based on the number of words and some mapped timings.
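A sketch of that idea: a fixed overhead plus a per-word cost. Both constants below are invented placeholders, not measured values; time a few real generations and fit your own numbers.

```swift
import Foundation

// Estimate TTS generation time as fixedOverhead + words * secondsPerWord.
// The default constants are made up -- calibrate them against real requests.
func estimatedGenerationTime(for text: String,
                             secondsPerWord: Double = 0.04,
                             fixedOverhead: Double = 2.0) -> Double {
    let wordCount = text.split(whereSeparator: \.isWhitespace).count
    return fixedOverhead + Double(wordCount) * secondsPerWord
}
```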