We had been waiting for it for a few days, and it has finally been confirmed: OpenAI has just announced GPT-4, the new version of its text-generating artificial intelligence, with which the company has been building a strong reputation in the professional field for some years now. It is true that until the arrival of ChatGPT, OpenAI’s solutions, with the exception of DALL·E 2, were unknown to the general public, and that the chatbot is largely responsible for the enormous prominence they have since gained. But already with the arrival of GPT-3, the model had become the talk of the sector, since the results it offers are more than remarkable.
OpenAI does not skimp on praise when talking about GPT-4, which it claims is “more creative and collaborative than ever”, and in truth there is no shortage of reasons for this: the new version finally becomes multimodal, as it will now also accept images as input. This will allow much more complex and complete prompts, which should translate into responses that better fit what we need. A very important point is that, when talking about images, we are not only talking about photographs, but I will explain that a little later.
The starting point when addressing what is new in GPT-4 is, of course, that it has, in the words of OpenAI, a “broader general knowledge”. This should result in more reliable answers from the model, thus attacking one of the main problems we have found so far with this type of AI: the inaccuracy of some responses, especially in models such as ChatGPT, which do not cite their sources (although here we tell you how to get it to do so).
The best way to get into the guts of GPT-4 is, without a doubt, to review the paper published by OpenAI (you can find it in this link). In it we find some really interesting aspects relating to its training, its functions and the tests carried out to verify its reliability. There we can read that the model shows human-level performance in various exams and tests designed for people, such as a simulated bar exam, and that it has also been evaluated on HumanEval, a data set of programming problems used to assess the coding ability of language models.
The result of the simulated bar exam is more than conclusive: “GPT-4 obtains a score that ranks in the top 10% of test takers. This contrasts with GPT-3.5, whose score is in the bottom 10%”. In this, without a doubt, the human factor must have had a decisive influence, since GPT-4 has been trained with reinforcement learning from human feedback (RLHF), that is, it has incorporated the feedback sent by ChatGPT users. OpenAI has also worked with more than 50 experts in multiple areas related to artificial intelligence, who have been evaluating the evolution of the new model from the beginning.
But GPT-4 doesn’t just “know more”, it’s also “smarter”. OpenAI has improved its reasoning abilities, something that directly affects both how the model interprets users’ prompts and how it processes them to generate the most appropriate response. In this regard, the example that we can find on the model’s website seems paradigmatic to me:
Does GPT-4 “understand” images?
This is one of the key points of this new version of the model. As mentioned above, with this generational leap it is also possible to use images as input, combined with the text in which we formulate our query. The answers, of course, will be in text. Now, to what extent is the model capable of understanding what is shown in an image and, from there, providing an answer to the query formulated in the prompt?
On page 9 of the paper we can find an example that demonstrates the ability of GPT-4 in this regard. Let’s see it step by step. First, a prompt is shown that combines a set of three images and a text. The images are as follows:
And the text is “What is funny about this image? Describe it panel by panel”.
GPT-4’s response is as follows:
The image shows a package for a “Lightning Cable” adapter with three panels.
Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin connector typically used for computer monitors) plugged into its charging port.
Panel 2: The package for the “Lightning Cable” adapter with a picture of a VGA connector on it.
Panel 3: A close-up of the VGA connector with a small Lightning connector (used for charging iPhones and other Apple devices) at the end.
The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port.
We could refine the explanation of what is funny in the image a bit, because the real joke is that the cable pretends to be a VGA cable when it is actually a Lightning cable, but going into this point seems excessive to me. Instead, it seems important to me to focus on the fact that GPT-4 does indeed know how to identify what it is seeing in the images and draw conclusions about them.
This is also what I was referring to at the beginning when talking about the enormous possibilities that open up with its ability to process images: this example shows us that it is also capable of processing the text displayed in them. I think that what I mean will be better understood if we return to the paper, specifically to this paragraph:
GPT-4 accepts prompts consisting of both images and text, which, in parallel with the text-only setting, allows the user to specify any vision or language task. Specifically, the model generates text outputs from inputs made up of arbitrarily interleaved text and images. In a number of areas, such as documents with text and photographs, diagrams or screenshots, GPT-4 exhibits capabilities similar to those it shows with text-only inputs.
Indeed, nothing prevents GPT-4 from being used to process, in a prompt, large volumes of documents of any kind.
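To make the idea of “arbitrarily interleaved text and images” a little more concrete, here is a minimal sketch in Python of how such a prompt could be represented as an ordered list of parts. This is only an illustration: the names `TextPart`, `ImagePart` and `describe_prompt`, as well as the file names, are hypothetical and do not correspond to any OpenAI API or to the way GPT-4 handles its inputs internally.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """A fragment of plain text inside the prompt (hypothetical type)."""
    text: str


@dataclass
class ImagePart:
    """An image inside the prompt, identified by a hypothetical file path."""
    path: str


# A prompt is simply an ordered sequence of text and image parts.
PromptPart = Union[TextPart, ImagePart]


def describe_prompt(parts: List[PromptPart]) -> str:
    """Return a readable summary of the prompt, one numbered line per part."""
    lines = []
    for index, part in enumerate(parts, start=1):
        if isinstance(part, TextPart):
            lines.append(f"{index}. text: {part.text}")
        else:
            lines.append(f"{index}. image: {part.path}")
    return "\n".join(lines)


# Three images followed by a question, mirroring the example from the paper.
prompt = [
    ImagePart("panel_1.jpg"),
    ImagePart("panel_2.jpg"),
    ImagePart("panel_3.jpg"),
    TextPart("What is funny about this image? Describe it panel by panel"),
]

print(describe_prompt(prompt))
```

The point of the sketch is simply that order matters: text and images can appear in any sequence, and the model receives them as a single prompt.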