OpenAI President Greg Brockman has shared on his X account the first publicly available image generated by the company’s new GPT-4o model. The image depicts a person writing on a chalkboard about ‘cross-modal transfer.’ The illustration looks realistic, showcasing the model’s accuracy at rendering text and clearly surpassing the quality of DALL-E 3. However, GPT-4o’s image generation features are not yet available to the general public.
The image looks photorealistic: a person in a black t-shirt with the OpenAI logo writes in chalk on a blackboard: ‘Cross-Modal Transfer. Suppose we directly model P(text, pixels, sound) with a single large autoregressive transformer. What are the pros and cons?’
The new GPT-4o model, unveiled on May 13th, is an improved version of the previous GPT-4 lineup (GPT-4, GPT-4 Vision, and GPT-4 Turbo). It surpasses them on several fronts: processing speed, processing cost, and the ability to retain more information from input audio and video streams.
This became possible because OpenAI took a different approach. Previous GPT-4 models chained several separate models together, converting other data formats, such as audio and images, into text and back. GPT-4o was instead trained natively on multimodal tokens, allowing it to analyze and interpret visual and auditory information directly, bypassing the text-conversion stage; a sketch of what such a single-model setup can look like follows below.
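To make the architectural idea concrete, here is a minimal PyTorch sketch of one autoregressive transformer over a shared multimodal token vocabulary. Everything in it is an assumption for illustration only: the vocabulary sizes, the use of a VQ-style image tokenizer and a neural audio codec, and the model dimensions are invented, and OpenAI has not published GPT-4o’s actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical shared token space: each modality gets a disjoint ID range.
# These sizes are illustrative, not OpenAI's configuration.
TEXT_TOKENS  = 50_000   # e.g. a BPE text vocabulary
IMAGE_TOKENS = 8_192    # e.g. codes from a VQ image tokenizer
AUDIO_TOKENS = 4_096    # e.g. codes from a neural audio codec
VOCAB_SIZE   = TEXT_TOKENS + IMAGE_TOKENS + AUDIO_TOKENS

class MultimodalLM(nn.Module):
    """A single autoregressive transformer over interleaved multimodal tokens.

    Text, image, and audio tokens live in one sequence, so the model learns
    P(text, pixels, sound) jointly instead of routing each modality through
    a separate specialist model with text as the intermediate format.
    """
    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE, bias=False)

    def forward(self, tokens):                       # tokens: (batch, seq)
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier tokens,
        # regardless of which modality those tokens came from.
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)                       # next-token logits

# Usage: a text prompt followed by image-token targets in one sequence.
model = MultimodalLM()
text  = torch.randint(0, TEXT_TOKENS, (1, 16))                          # prompt
image = torch.randint(TEXT_TOKENS, TEXT_TOKENS + IMAGE_TOKENS, (1, 32))
seq   = torch.cat([text, image], dim=1)
logits = model(seq)   # (1, 48, VOCAB_SIZE): a prediction for every next token
```

This is exactly the trade-off the board question alludes to: one causal model over one interleaved sequence lets the same weights capture cross-modal structure, at the cost of very long sequences once pixels and audio are tokenized.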
Judging by the image, the new approach marks a significant improvement over DALL-E 3, OpenAI’s latest image generation model, introduced in September 2023. A VentureBeat journalist ran a similar prompt through DALL-E 3 in ChatGPT; the resulting image lagged well behind GPT-4o in quality, photorealism, and text-rendering accuracy. A sketch for reproducing that comparison follows below.
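For readers who want to try the same comparison, this sketch sends a similar prompt to DALL-E 3 through OpenAI’s images API using the official openai Python SDK. The prompt string is a paraphrase of the board text from Brockman’s image; the exact prompt the journalist used was not published.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrased from the board text in Brockman's image; not the
# journalist's exact prompt, which was not published.
result = client.images.generate(
    model="dall-e-3",
    prompt=(
        "Photorealistic photo of a person in a black OpenAI t-shirt "
        "writing on a chalkboard: 'Cross-Modal Transfer. Suppose we "
        "directly model P(text, pixels, sound) with a single large "
        "autoregressive transformer. What are the pros and cons?'"
    ),
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # link to the generated image
```

Rendering a long, exact chalkboard sentence is precisely where DALL-E 3 tends to garble text, which is what made Brockman’s GPT-4o sample notable.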
However, the image generation capabilities of GPT-4o are not yet available to the general public. In his post, Brockman wrote: ‘The team is working diligently to make them available to the world.’