The launch of GPT-4, the new OpenAI language model, is stealing all the headlines these days. And Sam Altman's startup doesn't hesitate to show off the capabilities of the technology, which is already available in ChatGPT Plus and more than a dozen other apps and services. In fact, the company has published a document showing that its new artificial intelligence is even better than ChatGPT at passing university and postgraduate exams.
The GPT-4 technical report dedicates a substantial section to the language model's performance on a large number of academic tests. In most cases, the new OpenAI technology surpasses the results achieved by GPT-3.5.
Thus, the developers of the artificial intelligence have shared a table with the results obtained on exams such as the Law School Admission Test (LSAT), the SAT standardized college entrance exam, and the Graduate Record Examinations (GRE), among many others.
Most of the results obtained by GPT-4 were better than those of GPT-3.5, and in some cases above the average score. As The Princeton Review notes, for example, the highest score that can be obtained on the LSAT is 180, while the average is 152. To reach the latter, around 60 questions must be answered correctly, out of a total that is usually between 99 and 102. In this case, OpenAI's artificial intelligence achieved a score of 163, against the 149 of its predecessor.
GPT-4 keeps improving on college and graduate exams
When facing the bar exam, GPT-4 obtained a score of 298 out of 400. It is worth noting that, in this case, the results comprise three different exams: the Multistate Bar Examination (MBE), the Multistate Essay Examination (MEE) and the Multistate Performance Test (MPT). Each is administered in a different format, such as multiple-choice questions or tasks that must be completed within a set amount of time.
On the SAT sections for math and evidence-based reading and writing, GPT-4 also performed very well, obtaining scores of 700 and 710 out of 800, respectively. That is a clear improvement over GPT-3.5, which had achieved 590 and 670 out of 800.
On the GRE, meanwhile, GPT-4 stood out in the verbal and quantitative sections, but it was unable to improve its performance on the writing section. In these postgraduate exams it achieved scores of 169/170 (verbal), 163/170 (quantitative) and 4/6 (writing). For comparison, GPT-3.5's results had been 154/170, 147/170 and 4/6 in the same sections.
OpenAI asserts that the exams its new language model took were the same ones any human must face at the corresponding academic levels, and it maintains that no specific training was carried out on those tests. "A minority of the problems in the exams were seen by the model during training. For each exam we run a variant with these questions removed and report the lower score of the two. We believe the results are representative," the startup indicates.
The AI evolves, but keeps its known problems
Beyond the evolution that GPT-4 represents — in some respects it already makes the original version of ChatGPT look primitive — the model still has known issues. OpenAI has mentioned that the limitations of its new language model remain similar to those of its predecessor. This is especially noticeable when it "invents" facts in its answers, which undermines their reliability.
Despite its capabilities, GPT-4 has limitations similar to those of previous GPT models. Most importantly, it is still not fully reliable (it "hallucinates" facts and makes reasoning errors). Great care should be taken when using language model output, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of the specific application.
OpenAI, on the limitations of GPT-4.
Now, returning to the subject of academic tests, the hype over GPT-4's "ability" to pass them was quick to arrive. But we come back to the same point we raised when ChatGPT did the same with medicine and law exams: the AI passing them proves very little.
We fall back into the old habit of wanting to anthropomorphize artificial intelligence. For the umpteenth time, no: GPT-4 passing admission exams does not mean it could apply as a student to Stanford or any other renowned university in the United States.
Joshua Levi, an AI expert, offered a very interesting take on the matter. "GPT-4 passing the LSAT or GRE is incredibly impressive. At the same time, I think we need a reminder of a logical fallacy we'll see a lot this week: just because software can pass a test designed for humans doesn't mean it has the same abilities as humans who pass the same test. Human exams don't test abilities that most or all humans have. What they test are the skills that are most difficult for them," he tweeted.