OpenAI’s GPT-4o Model Revolutionizes Multi-Modal AI Interactions with Sub-Second Response Times

Rahul Somvanshi

OpenAI's GPT-4o

OpenAI has announced its new flagship model, GPT-4o (pronounced “four-o”). The model handles text, voice, and images seamlessly and is characterized by fast response times. It will be rolled out to all ChatGPT users in the coming weeks, with paid ChatGPT Plus and Team users getting access first, followed by Enterprise users.

The “o” in OpenAI’s GPT-4o stands for “omni.” The model handles text, voice, and images seamlessly, with significantly improved understanding of images and audio, and it integrates these modalities naturally within a single conversation. In a demo centered on voice conversation, GPT-4o showed off its range: it picks up on human breathing sounds and facial expressions and interprets what they mean. When told “I’m nervous, teach me how to relax!” followed by exaggerated, heavy breathing, it responded humorously, “You’re not a vacuum cleaner!”, recognizing that the breathing was too rough. It then guided a relaxation breathing exercise: “Take a slow, deep breath… now breathe in… and out…”

Additionally, when shown a mathematical equation in an image, GPT-4o reads and understands it. Asked, “Don’t just give me the answer for x, help me solve it step by step,” it provides hints accordingly. It also naturally interprets an image reading “I (heart) GPT” as “I love GPT.” Furthermore, it can read human facial expressions to infer emotions, and it adjusts its voice tone as instructed, whether emotional, dramatic, or robotic.

GPT-4o’s speed in responding to voice is notable and contributes to its naturalness: according to OpenAI, it responds to audio input in an average of 320 milliseconds, close to human conversational reaction time. Real-time translation (demonstrated between English and Italian) also works smoothly, with support for more than 50 languages. The model supports natural real-time voice conversations and will eventually allow real-time video interactions; for example, users could show ChatGPT a sports broadcast and have it explain the rules. The new voice mode will be released as an alpha version in the coming weeks.
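
The translation shown in the demo ran through the new voice mode, which is not yet publicly available, but a rough text-only approximation is already possible by prompting GPT-4o as an interpreter through the existing chat API. The sketch below assumes the official openai Python SDK (v1+) and an OPENAI_API_KEY environment variable; the example sentence is purely illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Text-only approximation of the English <-> Italian interpreter from the demo;
# the live demo itself used the not-yet-released voice mode.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a real-time interpreter. If the user writes in English, "
                "reply with the Italian translation; if they write in Italian, "
                "reply with the English translation. Return only the translation."
            ),
        },
        # Illustrative input, not taken from the actual demo.
        {"role": "user", "content": "Buongiorno! Come sta andando la presentazione?"},
    ],
)
print(response.choices[0].message.content)
```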

The presenters spoke over ChatGPT quite a bit during the announcement, yet the model dropped into a listening mode even when interrupted mid-sentence, which felt very human-like. OpenAI researchers Barret Zoph and Mark Chen walked through several applications of GPT-4o, most notably its live conversational abilities. If a user interrupts the model mid-response, it stops, listens, and recalibrates the conversation’s direction. The ability to change GPT-4o’s tone was also demonstrated: Chen asked the model to read a bedtime story about “robots and love” and then requested a more dramatic delivery. The model’s tone grew progressively more theatrical until CTO Mira Murati asked for a “convincing robotic voice.”

These GPT-4o capabilities are also being made available to free users. The rollout begins today with text and image functionality, while the voice mode will reach ChatGPT Plus users as an alpha version within a few weeks. GPT-4o is also available via the API: the text and image capabilities are live now, while voice and video will launch to a small group of trusted partners in the coming weeks.
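
As a sketch of what such an API call might look like, the snippet below sends a text instruction together with an image to GPT-4o through the OpenAI Python SDK, mirroring the step-by-step math-tutoring prompt from the demo. The image URL is a placeholder, and exact parameters may differ across SDK versions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Combined text + image request, echoing the "help me solve it step by step" demo.
# The image URL below is a placeholder, not a real asset.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Don't just give me the answer for x; "
                            "help me solve this equation step by step.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/handwritten-equation.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The voice and video functionality mentioned in the announcement is not represented here, since at launch it is limited to select partners.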

Alongside GPT-4o, OpenAI has also launched a ChatGPT desktop app, currently available only for macOS but accessible to free users; it will eventually support GPT-4o’s advanced voice and video capabilities. In the demo, the desktop app handled tasks such as holding a voice conversation while the user copied and pasted development code, making ChatGPT feel more like a seamless assistant. It also showed strong image understanding, such as reading a graph image and pointing out its key aspects.
