OpenAI has announced a significant leap forward in artificial intelligence with the introduction of GPT-4o. The new model represents a major step towards more natural human-computer interaction, reasoning across audio, vision, and text in real time.
Last month, we also published a feature explaining OpenAI’s speech cloning and five things to know about synthetic voice tech, which is worth a look as well.
New features OpenAI has introduced
Real-time response and enhanced performance
One of GPT-4o’s key strengths is its ability to respond to audio prompts in as little as 232 milliseconds, with an average response time of 320 milliseconds. That is remarkably close to human response times in conversation, paving the way for more fluid and natural interactions.
Additionally, GPT-4o matches the performance of GPT-4 Turbo on English text and code while showing marked improvement on non-English languages. Notably, it achieves this while being significantly faster and more cost-effective than its predecessor.
It can now understand audio and visual data
A defining characteristic of GPT-4o is its ability to understand and respond to information presented in various forms. While previous models focused primarily on text, GPT-4o excels at comprehending and generating audio and visual data. This opens doors to a wide range of applications, including:
- Enhanced customer service: Imagine a customer service representative who can understand your questions and respond in a natural voice, translating languages in real-time if needed.
- Creative collaborations: GPT-4o’s ability to generate visuals based on text descriptions could empower artists and designers by fostering a more interactive creative process.
- Educational tools: Imagine a language-learning app that not only translates speech but can also point to objects and translate their names, creating a more immersive learning experience (a rough sketch of this idea follows the list).
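To make that last idea concrete, here is a minimal sketch (our illustration, not code from OpenAI’s announcement) that uses the official OpenAI Python client to send GPT-4o an image alongside a text prompt. The image URL, prompt, and target language are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical language-learning call: ask GPT-4o to name the objects
# in a photo in Spanish. The image URL is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the objects in this photo and give the Spanish name for each."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/classroom.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```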
How OpenAI addresses safety and the limitations of the new features
OpenAI acknowledges the potential safety risks associated with GPT-4o, particularly in its audio capabilities. To mitigate these risks, the model is trained with filtered data and undergoes post-training refinements. Additionally, safety systems are in place to regulate voice outputs. OpenAI has conducted thorough evaluations and external testing to identify and address potential risks.
While GPT-4o represents a significant advancement, the developers acknowledge some limitations. They encourage users to provide feedback on tasks where GPT-4 Turbo might still outperform GPT-4o, allowing them to further refine the model.
Who gets to use it, and when
OpenAI plans to introduce GPT-4o’s functionalities iteratively. Text and image capabilities are already being integrated into ChatGPT, with availability in both the free tier and the Plus tier (with increased message limits). A new alpha version of Voice Mode powered by GPT-4o is also planned for release within ChatGPT Plus in the coming weeks.
Developers can access GPT-4o’s text and vision functionalities through the API, experiencing faster processing speeds, lower costs, and higher rate limits compared to GPT-4 Turbo. The rollout of audio and video functionalities within the API is anticipated for a limited group of partners in the near future.
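For developers already on GPT-4 Turbo, switching is largely a matter of changing the model name in the same chat completions call. Below is a minimal sketch with the official Python client; the prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same chat completions endpoint as GPT-4 Turbo; only the model name changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": "Summarise GPT-4o's new capabilities in two sentences."},
    ],
)
print(response.choices[0].message.content)
```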
What an analyst had to say about the new GPT
Simon Baxter, principal analyst at TechMarketView, commented on the new features: “Now I don’t want to get too carried away with the hype, but I must admit on first glance it truly is an impressive step forward. The short demo from OpenAI showed how GPT-4o can converse so naturally compared to the slow and largely unhelpful voice assistants we are used to. This new version of ChatGPT is capable of real-time conversational speech, which includes the ability for you to interrupt it, ask it to change tone, and react to user emotions.”
He continued, “During the demo we saw ChatGPT make up a bedtime story, demonstrating the ability to sound not just natural but dramatic and emotional; it could also sing and tell the story with varying degrees of intensity. Language is an area it seems to really excel at, seamlessly translating between Italian and English in real time. It could also be a game changer when it comes to education, acting as an invaluable personal tutor. During the demo it used a combination of new vision capabilities and conversational AI to walk the user through how to solve an equation, adapting to what was written but without just giving the answer. It could also view and analyse code, describe potential issues and even explain in layman’s terms what the code actually does.”
Baxter added, “This type of conversational AI is what many of us (certainly myself) have been waiting so long for. It has echoes of the fantastic 2013 movie ‘Her’, and yes, while there are many kinks to work out, it paints a picture of the future state of AI we are heading towards. Give it another five years (maybe even less), when it is fully embedded across our devices, cars and other technologies, and I expect conversing with AI assistants will become such a natural day-to-day occurrence that many of us will wonder how we ever did without them. There are already rumors that Apple is in talks to incorporate OpenAI’s models and ChatGPT into its products; if true, bringing the capabilities seen in GPT-4o to our smartphones would be a significant next step towards that reality.”
Our take
The arrival of GPT-4o marks a significant advancement in the field of AI. Its ability to process and generate information seamlessly across multiple modalities paves the way for more intuitive and interactive human-computer interactions. As OpenAI continues to refine and expand GPT-4o’s capabilities, we can expect a surge of innovative applications across sectors.