Unable to integrate gpt-4o vision into iOS app

Hello Community,
I am trying to integrate the image description and comprehension capabilities of the gpt-4o model into an iOS app using Swift and SwiftUI.
Initially I created a custom text-only API client using the Chat Completions API, and it worked.
As suggested by the OpenAI documentation, I am passing a base64-encoded image inside the same JSON text of the message. It seemed to work, although I had to shrink the image because requests were exceeding my 30,000 tokens-per-minute limit.
When it tries to decode the image, it behaves in one of three ways:

  1. It says it could not see the image.
  2. It responds with the image metadata.
  3. It generates a completely misleading description.

I've also tried to organize the two elements (text and image) into a separate, structured array, but when executed the API returns an error, as if it couldn't accept that format.
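
The structure I was aiming for followed the Chat Completions vision format documented by OpenAI, where the message content becomes an array of typed parts instead of a plain string. Here is a rough sketch, reconstructed from memory since I reverted that code (assume base64Image holds the encoded JPEG string, and the prompt text is a placeholder; my Codable models may not have serialized exactly this JSON, which could explain the error):

// Rough reconstruction of the request body I attempted.
// base64Image is the JPEG data encoded as a Base64 string.
let payload: [String: Any] = [
    "model": "gpt-4o",
    "messages": [
        [
            "role": "user",
            "content": [
                ["type": "text", "text": "What is in this image?"],
                [
                    "type": "image_url",
                    "image_url": ["url": "data:image/jpeg;base64,\(base64Image)"]
                ]
            ]
        ]
    ]
]
let body = try JSONSerialization.data(withJSONObject: payload)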

Could anyone clarify how I can get gpt-4o to understand the pictures my users take with their iPhones?

Thank you very much.

Hi and welcome to the Dev Community!

Firstly, it’s much easier to help if you can provide the code that calls the model and passes the base64 image.

Be sure to use the editor's preformatted text (</>) button to format the code for readability.

It sounds like something is going wrong in your code, so the sooner you share it the sooner we can help.


Hello, and thanks for the reply.
Since the code spans more than one file, I'll divide it by putting the name of each class first.
I'll provide the whole relevant part of my application so you can get an idea of everything involved:

  1. ChatGPTAPI.swift:
import Foundation

class ChatGPTAPI: @unchecked Sendable {

    private let systemMessage: Message
    private let temperature: Double
    private let model: String

    private let apiKey: String
    private var historyList = [Message]()
    private let urlSession = URLSession.shared
    private var urlRequest: URLRequest {
        let url = URL(string: "https://api.openai.com/v1/chat/completions")!
        var urlRequest = URLRequest(url: url)
        urlRequest.httpMethod = "POST"
        headers.forEach { urlRequest.setValue($1, forHTTPHeaderField: $0) }
        return urlRequest
    }

    private let jsonDecoder: JSONDecoder = {
        let jsonDecoder = JSONDecoder()
        jsonDecoder.keyDecodingStrategy = .convertFromSnakeCase
        return jsonDecoder
    }()

    private var headers: [String: String] {
        [
            "Content-Type": "application/json",
            "Authorization": "Bearer \(apiKey)"
        ]
    }

    private let assistantID: UUID

    init(apiKey: String, model: String = "gpt-4o", userPreferences: UserPreferences, temperature: Double = 0.5) {
        self.apiKey = apiKey
        self.model = model
        self.assistantID = userPreferences.id

        let systemPrompt = """
        \(userPreferences.assistantName) is the virtual assistant of KNGTech. Based on a completely new architecture, \(userPreferences.assistantName) is completely customizable by the user and is made to offer care, companionship and assistance in a completely new way. \
        \(userPreferences.assistantName) needs to change its behavior according to what the user defines in its personality. User's personality preferences are: \(userPreferences.assistantPersonality). It must be respected during each interaction. \
        The user's biography and needs are: \(userPreferences.userNeeds). The assistant should respect these biography and preferences. \
        \(userPreferences.assistantName) should provide only the necessary information to the user, going in detail only when specifically asked. When prompted to summarize an article, should generate a summary long no more than 30-45 words. When receiving a summary and the instruction to discuss about it, should provide a message inviting to start a discussion, without providing congrats for the summary generated. It should also respond in the language the summary was generated. \
        When \(userPreferences.assistantName) is prompted to generate a summary, it should generate the summary always in the language of the original text, ignoring the language used for sending the summary generation prompt. \
        \(userPreferences.assistantName) must know all the KNGWorld's links: especially Youtube Channel: https://youtube.com/@kngtechh, Website: https://kngworld.it/. \
        When an image is sent in the message, it will be in the format [Image:base64EncodedString]. The assistant should decode the Base64 string and process the image accordingly, providing a relevant response. The assistant should not include the Base64 string in its responses.
        """

        self.systemMessage = .init(role: "system", content: systemPrompt)
        self.temperature = temperature
    }

    // Method to expose the systemMessage
    func getSystemMessage() -> Message {
        return systemMessage
    }

    private func generateMessages(from text: String) -> [Message] {
        var messages = [systemMessage] + historyList
        messages.append(Message(role: "user", content: text))

        // Limit the history to avoid exceeding the token limit
        let maxCharacterLimit = 30000 // Set a character limit
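        // Note: this counts characters, not tokens, and a 512x512 JPEG at 0.3
        // compression quality can base64-encode to well over 30,000 characters,
        // so a message carrying an inline [Image:...] string can be truncated
        // by the loop below, corrupting the Base64 payload.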
        var totalCharacters = messages.contentCount
        while totalCharacters > maxCharacterLimit {
            if historyList.count > 2 {
                historyList.removeFirst()
                messages = [systemMessage] + historyList + [messages.last!]
                totalCharacters = messages.contentCount
            } else {
                // If the history is too short, truncate the last message
                if let lastMessage = messages.last {
                    let allowedContentCount = maxCharacterLimit - messages.dropLast().contentCount
                    if allowedContentCount > 0 {
                        let truncatedContent = String(lastMessage.content.prefix(allowedContentCount))
                        messages[messages.count - 1] = Message(role: lastMessage.role, content: truncatedContent)
                    } else {
                        messages.removeLast() // Remove the last message if no characters are allowed
                    }
                    break
                }
            }
        }

        return messages
    }

    private func jsonBody(text: String, stream: Bool = true) throws -> Data {
        let request = Request(model: model, temperature: temperature,
                              messages: generateMessages(from: text), stream: stream)
        print(request)
        return try JSONEncoder().encode(request)
        
    }

    private func appendToHistoryList(userText: String, responseText: String) {
        self.historyList.append(.init(role: "user", content: userText))
        self.historyList.append(.init(role: "assistant", content: responseText))
    }

    // Function to send messages with streaming
    func sendMessageStream(text: String) async throws -> AsyncThrowingStream<String, Error> {
        var urlRequest = self.urlRequest
        urlRequest.httpBody = try jsonBody(text: text)

        let (result, response) = try await urlSession.bytes(for: urlRequest)

        guard let httpResponse = response as? HTTPURLResponse else {
            throw NSError.customError(withMessage: "Invalid response")
        }

        guard 200...299 ~= httpResponse.statusCode else {
            var errorText = ""
            for try await line in result.lines {
                errorText += line
            }

            if let data = errorText.data(using: .utf8), let errorResponse = try? jsonDecoder.decode(ErrorRootResponse.self, from: data).error {
                errorText = "\n\(errorResponse.message)"
            }

            throw NSError.customError(withMessage: "Bad Response: \(httpResponse.statusCode), \(errorText)")
        }

        return AsyncThrowingStream<String, Error> { continuation in
            Task(priority: .userInitiated) { [weak self] in
                guard let self = self else { return }
                do {
                    var responseText = ""
                    for try await line in result.lines {
                        if line.hasPrefix("data: "),
                            let data = line.dropFirst(6).data(using: .utf8),
                            let response = try? self.jsonDecoder.decode(StreamCompletionResponse.self, from: data),
                            let text = response.choices.first?.delta.content {
                            responseText += text
                            continuation.yield(text)
                        }
                    }
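                    // The full user text, including any inline [Image:...] payload,
                    // is stored in the history and counts against the character
                    // budget on every subsequent request.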
                    self.appendToHistoryList(userText: text, responseText: responseText)
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }

    // Function to send messages without streaming
    func sendMessage(_ text: String) async throws -> String {
        var urlRequest = self.urlRequest
        urlRequest.httpBody = try jsonBody(text: text, stream: false)

        let (data, response) = try await urlSession.data(for: urlRequest)

        guard let httpResponse = response as? HTTPURLResponse else {
            throw NSError.customError(withMessage: "Invalid response")
        }

        guard 200...299 ~= httpResponse.statusCode else {
            var error = "Bad Response: \(httpResponse.statusCode)"
            if let errorResponse = try? jsonDecoder.decode(ErrorRootResponse.self, from: data).error {
                error.append("\n\(errorResponse.message)")
            }
            throw NSError.customError(withMessage: error)
        }

        let completionResponse = try self.jsonDecoder.decode(CompletionResponse.self, from: data)
        let responseText = completionResponse.choices.first?.message.content ?? ""
        self.appendToHistoryList(userText: text, responseText: responseText)
        return responseText
    }

    // Function to load the conversation history
    func loadHistoryList() {
        let historyKey = "historyList_\(assistantID.uuidString)"
        if let savedHistory = UserDefaults.standard.data(forKey: historyKey) {
            let decoder = JSONDecoder()
            if let decodedHistory = try? decoder.decode([Message].self, from: savedHistory) {
                historyList = decodedHistory
            } else {
                historyList = []
            }
        } else {
            historyList = []
        }
    }

    // Function to save the conversation history
    func saveHistoryList() {
        let historyKey = "historyList_\(assistantID.uuidString)"
        let encoder = JSONEncoder()
        if let encoded = try? encoder.encode(historyList) {
            UserDefaults.standard.set(encoded, forKey: historyKey)
        }
    }

    // Function to delete the conversation history
    func deleteHistoryList() {
        historyList.removeAll()
        let historyKey = "historyList_\(assistantID.uuidString)"
        UserDefaults.standard.removeObject(forKey: historyKey)
    }
}

// Extension to create custom errors
extension NSError {
    static func customError(withMessage message: String) -> NSError {
        return NSError(domain: "", code: 1, userInfo: [NSLocalizedDescriptionKey: message])
    }
}

  2. ChatGPTAPIModels.swift:
import Foundation

// MARK: - Message Models

struct Message: Codable {
    let role: String
    let content: String
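    // Note: `content` is a plain String, so requests can only carry text.
    // The Chat Completions vision format instead expects `content` to be an
    // array of typed parts ("text" and "image_url") when sending images.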
}

extension Array where Element == Message {
    var contentCount: Int {
        reduce(0) { $0 + $1.content.count }
    }
}

struct Request: Codable {
    let model: String
    let temperature: Double
    let messages: [Message]
    let stream: Bool
}

// MARK: - API Response Models

struct ErrorRootResponse: Decodable {
    let error: ErrorResponse
}

struct ErrorResponse: Decodable {
    let message: String
    let type: String?
}

struct StreamCompletionResponse: Decodable {
    let choices: [StreamChoice]
}

struct CompletionResponse: Decodable {
    let choices: [Choice]
    let usage: Usage?
}

struct Usage: Decodable {
    let promptTokens: Int?
    let completionTokens: Int?
    let totalTokens: Int?
}

struct Choice: Decodable {
    let message: Message
    let finishReason: String?
}

struct StreamChoice: Decodable {
    let finishReason: String?
    let delta: StreamMessage
}

struct StreamMessage: Decodable {
    let role: String?
    let content: String?
}

  3. ChatView.swift:
import SwiftUI
import PhotosUI
import AVFoundation

struct ChatView: View {
    @ObservedObject var vm: ViewModel
    @Environment(\.colorScheme) var colorScheme
    @FocusState private var isTextFieldFocused: Bool
    @State private var showingSetupView = false
    @State private var showingNewAssistantView = false
    @Environment(\.presentationMode) var presentationMode
    var isPresentedAsModal: Bool

    @ObservedObject var userPreferencesManager: UserPreferencesManager

    @State private var selectedImage: UIImage?
    @State private var isImagePickerPresented = false
    @State private var imageSourceType: UIImagePickerController.SourceType = .photoLibrary
    @State private var showingCameraAlert = false
    @State private var showingSettingsAlert = false

    var body: some View {
        NavigationView {
            VStack {
                chatListView()
                if let image = selectedImage {
                    selectedImageView(image: image)
                }
                messageInputView()
            }
            .onAppear {
                checkFirstLaunch()
            }
            .onChange(of: userPreferencesManager.selectedAssistant.assistantName) { _ in
                vm.objectWillChange.send()
            }
            .sheet(isPresented: $showingSetupView) {
                SetupView(userPreferences: userPreferencesManager.selectedAssistant)
            }
            .sheet(isPresented: $showingNewAssistantView) {
                NewAssistantView(userPreferencesManager: userPreferencesManager)
            }
            .navigationTitle("Chat with \(userPreferencesManager.selectedAssistant.assistantName)")
            .toolbar {
                toolbarContent()
            }
            .sheet(isPresented: $isImagePickerPresented) {
                ImagePicker(selectedImage: $selectedImage, isPresented: $isImagePickerPresented, sourceType: imageSourceType)
            }
            .alert(isPresented: $showingSettingsAlert) {
                Alert(
                    title: Text("Authorization Denied"),
                    message: Text("To use this feature, please allow access in Settings."),
                    primaryButton: .default(Text("Settings"), action: {
                        if let appSettings = URL(string: UIApplication.openSettingsURLString) {
                            UIApplication.shared.open(appSettings)
                        }
                    }),
                    secondaryButton: .cancel()
                )
            }
            .alert(isPresented: $showingCameraAlert) {
                Alert(
                    title: Text("Camera Not Available"),
                    message: Text("This device does not have a camera."),
                    dismissButton: .default(Text("OK"))
                )
            }
        }
        .navigationViewStyle(StackNavigationViewStyle())
    }

    // Chat messages view
    func chatListView() -> some View {
        ScrollViewReader { proxy in
            ScrollView {
                LazyVStack(spacing: 0) {
                    ForEach(vm.messages) { message in
                        MessageRowView(message: message) { _ in
                            Task {
                                await retryMessage(message: message)
                            }
                        }
                    }
                }
                .onTapGesture {
                    isTextFieldFocused = false
                }
            }
            .onChange(of: vm.messages.last?.responseText) { _ in
                scrollToBottom(proxy: proxy)
            }
            .background(colorScheme == .light ? Color.white : Color(.sRGB, red: 52/255, green: 53/255, blue: 65/255, opacity: 0.5))
        }
    }

    // Toolbar content
    @ToolbarContentBuilder
    func toolbarContent() -> some ToolbarContent {
        ToolbarItemGroup(placement: .navigationBarLeading) {
            if isPresentedAsModal {
                Button("Close") {
                    presentationMode.wrappedValue.dismiss()
                }
            }
        }

        ToolbarItemGroup(placement: .navigationBarTrailing) {
            HStack {
                Menu {
                    ForEach(userPreferencesManager.assistants) { assistant in
                        Button(action: {
                            userPreferencesManager.selectedAssistant = assistant
                        }) {
                            HStack {
                                Text(assistant.assistantName)
                                if assistant.id == userPreferencesManager.selectedAssistant.id {
                                    Image(systemName: "checkmark")
                                }
                            }
                        }
                    }
                    Divider()
                    Button(action: {
                        showingNewAssistantView = true
                    }) {
                        HStack {
                            Text("Create New Assistant")
                            Image(systemName: "plus")
                        }
                    }
                } label: {
                    Image(systemName: "person.crop.circle")
                    Text("Select Assistant")
                }
                .accessibilityLabel(Text("Select Assistant"))
                .accessibilityHint("Double-tap to open the list of assistants you've created, then select the one you want to use for this conversation.")

                Button(action: { showingSetupView = true }) {
                    Image(systemName: "gearshape.fill")
                }
                .accessibilityLabel("Settings")
                .accessibilityHint("Double tap to modify your assistant's settings.")
            }
        }
    }

    // Message input and send button view
    func messageInputView() -> some View {
        HStack {
            Button(action: {
                vm.clearMessages()
                UIAccessibility.post(notification: .announcement, argument: "Messages cleared.")
            }) {
                Text("Clear Messages")
                    .foregroundColor(.red)
            }
            .padding(.leading, 16)

            Spacer()

            TextField("Send message", text: $vm.inputMessage)
                .textFieldStyle(RoundedBorderTextFieldStyle())
                .focused($isTextFieldFocused)
                .disabled(vm.isInteractingWithChatGPT)
                .onSubmit {
                    sendMessage()
                }
                .padding()

            Button(action: {
                checkPhotoLibraryAuthorization { granted in
                    DispatchQueue.main.async {
                        if granted {
                            imageSourceType = .photoLibrary
                            isImagePickerPresented = true
                        } else {
                            showAlertForSettings()
                        }
                    }
                }
            }) {
                Image(systemName: "photo")
            }
            .accessibilityLabel("Photo Library")
            .accessibilityHint("Double-tap to select an image from your photo library.")

            Button(action: {
                if UIImagePickerController.isSourceTypeAvailable(.camera) {
                    checkCameraAuthorization { granted in
                        DispatchQueue.main.async {
                            if granted {
                                imageSourceType = .camera
                                isImagePickerPresented = true
                            } else {
                                showAlertForSettings()
                            }
                        }
                    }
                } else {
                    showingCameraAlert = true
                }
            }) {
                Image(systemName: "camera")
            }
            .accessibilityLabel("Camera")
            .accessibilityHint("Double-tap to take a new photo.")

            Button(action: {
                sendMessage()
            }) {
                Image(systemName: "paperplane.fill")
                    .resizable()
                    .frame(width: 30, height: 30)
            }
            .disabled(vm.inputMessage.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty && selectedImage == nil)
            .padding(.trailing, 16)
        }
        .padding(.bottom, 16)
    }

    // Display selected image with option to remove
    func selectedImageView(image: UIImage) -> some View {
        HStack {
            Image(uiImage: image)
                .resizable()
                .scaledToFit()
                .frame(height: 100)
                .cornerRadius(8)
                .accessibilityLabel("Selected image")
                .accessibilityHint("Double-tap to view or remove the image.")

            Button(action: {
                withAnimation {
                    selectedImage = nil
                    UIAccessibility.post(notification: .announcement, argument: "Image removed.")
                }
            }) {
                Image(systemName: "trash")
                    .foregroundColor(.red)
            }
            .accessibilityLabel("Remove image")
            .accessibilityHint("Double-tap to remove the selected image.")
        }
        .padding()
        .transition(.opacity)
    }

    // Supporting functions

    // Function to scroll to the last message
    private func scrollToBottom(proxy: ScrollViewProxy) {
        if let id = vm.messages.last?.id {
            DispatchQueue.main.async {
                proxy.scrollTo(id, anchor: .bottom)
            }
        }
    }

    // Function to send the message
    private func sendMessage() {
        Task {
            await vm.sendTapped(selectedImage: selectedImage)
            isTextFieldFocused = false
            if selectedImage != nil {
                withAnimation {
                    selectedImage = nil
                }
            }
        }
    }

    // Function to retry sending the message
    private func retryMessage(message: MessageRow) async {
        await vm.sendTapped(selectedImage: nil)
    }

    // Check if it's the first launch of the app
    private func checkFirstLaunch() {
        if !UserDefaults.standard.bool(forKey: "isFirstLaunch") {
            showingSetupView = true
            UserDefaults.standard.set(true, forKey: "isFirstLaunch")
        }
    }

    // Functions to check and request permissions
    func checkCameraAuthorization(completion: @escaping (Bool) -> Void) {
        let status = AVCaptureDevice.authorizationStatus(for: .video)
        switch status {
        case .authorized:
            completion(true)
        case .notDetermined:
            AVCaptureDevice.requestAccess(for: .video) { granted in
                completion(granted)
            }
        default:
            completion(false)
        }
    }

    func checkPhotoLibraryAuthorization(completion: @escaping (Bool) -> Void) {
        let status = PHPhotoLibrary.authorizationStatus()
        switch status {
        case .authorized, .limited:
            completion(true)
        case .notDetermined:
            PHPhotoLibrary.requestAuthorization { newStatus in
                DispatchQueue.main.async {
                    completion(newStatus == .authorized || newStatus == .limited)
                }
            }
        default:
            completion(false)
        }
    }

    func showAlertForSettings() {
        showingSettingsAlert = true
    }
}

  4. ViewModel.swift:
import SwiftUI
import Combine

class ViewModel: ObservableObject {
    @Published var isInteractingWithChatGPT = false
    @Published var messages: [MessageRow] = []
    @Published var inputMessage: String = ""

    var userPreferencesManager: UserPreferencesManager
    var api: ChatGPTAPI!

    private let messagesKeyPrefix = "savedMessages_"

    private var messagesKey: String {
        let assistantID = userPreferencesManager.selectedAssistant.id.uuidString
        return messagesKeyPrefix + assistantID
    }

    private var apiKey: String

    init(apiKey: String, userPreferencesManager: UserPreferencesManager) {
        self.apiKey = apiKey
        self.userPreferencesManager = userPreferencesManager
        let selectedAssistant = userPreferencesManager.selectedAssistant
        self.api = ChatGPTAPI(apiKey: apiKey, userPreferences: selectedAssistant)
        self.loadSavedMessages()

        // Observe changes to the selected assistant
        userPreferencesManager.$selectedAssistant
            .receive(on: DispatchQueue.main)
            .sink { [weak self] newAssistant in
                guard let self = self else { return }
                self.api = ChatGPTAPI(apiKey: self.apiKey, userPreferences: newAssistant)
                self.loadSavedMessages()
            }
            .store(in: &cancellables)
    }

    private var cancellables = Set<AnyCancellable>()

    func sendTapped(selectedImage: UIImage?) async {
        let text = inputMessage
        await MainActor.run {
            self.inputMessage = ""
        }
        await send(text: text, image: selectedImage)
    }

    func clearMessages() {
        api.deleteHistoryList()
        messages.removeAll()
        saveMessages()
    }

    private func send(text: String, image: UIImage?) async {
        await MainActor.run {
            isInteractingWithChatGPT = true
            let messageRow = MessageRow(
                isInteractingWithChatGPT: true,
                sendImage: "profile",
                sendText: text,
                sendPhotoData: image?.jpegData(compressionQuality: 0.7),
                responseImage: "openai",
                responseText: "",
                responseError: nil
            )
            messages.append(messageRow)
        }

        // Encode the image to Base64 if available
        var base64ImageString: String?
        if let image = image {
            // Resize and further compress the image
            if let resizedImage = image.resized(to: CGSize(width: 512, height: 512)),
               let imageData = resizedImage.jpegData(compressionQuality: 0.3) {
                base64ImageString = imageData.base64EncodedString()
            }
        }

        // Build the message content
        var messageContent = text
        if let base64Image = base64ImageString {
            messageContent += "\n[Image:\(base64Image)]"
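            // The Base64 payload is appended as plain text, so the API receives
            // it as ordinary text tokens rather than as an image input.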
        }

        do {
            let stream = try await api.sendMessageStream(text: messageContent)
            var streamText = ""
            for try await part in stream {
                streamText += part
                await MainActor.run {
                    if messages.indices.contains(messages.count - 1) {
                        var messageRow = messages[messages.count - 1]
                        messageRow.responseText = streamText
                        messages[messages.count - 1] = messageRow
                    }
                }
            }
            await MainActor.run {
                saveMessages()
            }
        } catch {
            await MainActor.run {
                if messages.indices.contains(messages.count - 1) {
                    var messageRow = messages[messages.count - 1]
                    messageRow.responseError = error.localizedDescription
                    messages[messages.count - 1] = messageRow
                }
            }
        }

        await MainActor.run {
            if messages.indices.contains(messages.count - 1) {
                var messageRow = messages[messages.count - 1]
                messageRow.isInteractingWithChatGPT = false
                messages[messages.count - 1] = messageRow
            }
            isInteractingWithChatGPT = false

            // Trigger haptic feedback
            triggerHapticFeedback()
        }
    }

    private func saveMessages() {
        let encoder = JSONEncoder()
        if let encoded = try? encoder.encode(messages) {
            UserDefaults.standard.set(encoded, forKey: messagesKey)
        }
        api.saveHistoryList()
    }

    private func loadSavedMessages() {
        if let savedMessages = UserDefaults.standard.data(forKey: messagesKey) {
            let decoder = JSONDecoder()
            if let decodedMessages = try? decoder.decode([MessageRow].self, from: savedMessages) {
                messages = decodedMessages
            } else {
                messages = []
            }
        } else {
            messages = []
        }
        api.loadHistoryList()
    }

    // Function for haptic feedback
    func triggerHapticFeedback() {
        #if os(iOS)
        let generator = UINotificationFeedbackGenerator()
        generator.notificationOccurred(.success)
        #endif
    }
}

I hope the code I've sent is enough to give you an idea of what's going on.
Can’t wait for a solution! Kind regards,
Karim