There is no doubt that ChatGPT has taken the world by storm! In the education sector in particular, we see students handing in assignments where it is difficult to judge whether they wrote the text themselves or were helped by artificial intelligence.
There are plenty of examples on the web of ChatGPT passing exams and writing texts that are almost indistinguishable from human-written ones.
The Turing test
Some even believe that ChatGPT can pass the Turing test, a test devised by Alan Turing to assess whether a machine can imitate human thinking well enough to be mistaken for a person. For example, experiments have been conducted in which a GPT-based Twitter profile chatted with many people for a long time without them realizing that there was no human being behind it.
(Source: Oxford Internet Institute).
When we ask ChatGPT to write an essay about the issues related to high school students using AI, it is easy to be fooled as a reader. At first glance, we have a hard time telling whether it was written by a machine or a human:
But is ChatGPT as smart as many think? There is still some way to go before we reach strong artificial intelligence!
https://viden.ai/hvad-er-kunstig-intelligens-egentligt-for-noget/
Testing with logical tasks
Let's try some logical tasks first:
This is where things go completely wrong. ChatGPT even explains how it arrives at its answer, just not correctly. Another, slightly more difficult task does not go much better.
Here it clearly does not understand the text: it cannot compare the pieces of information.
It also fails at a small, simple math task:
These examples make it clear that ChatGPT does not understand the context, because it has no inner model of the world or ontology. It simply does not know what it is talking about.
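The concrete task is only shown in the screenshot above, so as a purely hypothetical illustration: a language model predicts plausible-looking text rather than computing, which is why it can stumble on arithmetic that an ordinary program gets right every time. A minimal Python sketch of checking a claimed answer against the ground truth (the numbers and the claimed answer below are made up for the illustration):

```python
# Hypothetical illustration: verify a model's arithmetic claim against the
# ground truth. The numbers and the claimed answer are invented for this example.
a, b = 4817, 263
claimed_answer = 1262871   # whatever answer a model might have given (made up)
correct_answer = a * b     # 1266871, computed deterministically

print(f"{a} x {b} = {correct_answer}")
print("The claimed answer is", "correct" if claimed_answer == correct_answer else "wrong")
```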
Winograd schemas
Perhaps a better way than the Turing test to evaluate artificial intelligence is the Winograd schema, named after Terry Winograd.
A Winograd schema consists of two or three sentences that differ by only one or two words and contain an ambiguity. Resolving this ambiguity requires knowledge, logical thinking (and the ability to identify which earlier noun a pronoun refers to). An example could be (translated from English):
Bob collapsed on the sidewalk. He quickly saw Carl coming to help. He was very sick/worried. Who does the last "he" refer to?
A language model like ChatGPT cannot resolve these types of sentences correctly.
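To make the structure concrete, here is a minimal Python sketch of the schema above as a sentence pair. The data structure and field names are our own illustration, not part of any standard benchmark format:

```python
# A minimal sketch of a Winograd schema pair, based on the example above.
# The data structure and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    template: str        # the sentences, with a slot for the word that is swapped
    pronoun: str         # the ambiguous pronoun to resolve
    candidates: tuple    # the two possible antecedents
    variants: dict       # swapped word -> correct antecedent

schema = WinogradSchema(
    template=("Bob collapsed on the sidewalk. He quickly saw Carl coming to help. "
              "He was very {word}."),
    pronoun="he",
    candidates=("Bob", "Carl"),
    variants={
        "sick": "Bob",      # if "he" was very sick, "he" refers to Bob
        "worried": "Carl",  # if "he" was very worried, "he" refers to Carl
    },
)

# Swapping a single word flips which antecedent the pronoun refers to,
# which is exactly what makes these pairs hard for surface statistics alone.
for word, answer in schema.variants.items():
    print(schema.template.format(word=word), "->", answer)
```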
Here's another example:
And here's one more example:
These types of sentences are relatively easy for humans to figure out but very difficult for machines to solve. They show that machines do not understand the sentences, let alone the context, and cannot reason logically.
However, ChatGPT seems to be able to handle the simplest Winograd schemas consisting of just one or two sentences:
ChatGPT handles this simpler Winograd sentence quite well, but that may be because it has been trained specifically on some of these sentences with human feedback. That seems likely, because in the blog post "ChatGPT is not strong AI" on Version 2 from 15 December 2022, it could not pass this particular test.
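For anyone who wants to repeat this kind of test programmatically rather than through the chat interface, a small script along the following lines could send both variants of a schema to a model. This is only a sketch under assumptions: the openai Python package and the gpt-3.5-turbo model name are assumptions about the setup, and the post itself used the ChatGPT web interface.

```python
# Sketch: query a chat model with both variants of a Winograd schema.
# Assumes the "openai" Python package (pre-1.0 interface) and a valid API key;
# the blog post itself tested through the ChatGPT web interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

TEMPLATE = ("Bob collapsed on the sidewalk. He quickly saw Carl coming to help. "
            "He was very {word}. Who does the last 'he' refer to, Bob or Carl?")

for word in ("sick", "worried"):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": TEMPLATE.format(word=word)}],
    )
    print(word, "->", response.choices[0].message["content"].strip())
```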
Example from chemistry
A final example of ChatGPT's limitations comes from chemistry. First, we ask ChatGPT to complete a simple combustion reaction:
It is not entirely wrong, but when we then ask ChatGPT to balance the reaction, the following happens:
It describes how to balance the reaction well, but it still gets it wrong: the number of sodium atoms is not the same on both sides!
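The exact reaction only appears in the screenshot, but assuming it is the combustion of sodium, which the mention of sodium atoms suggests, a correctly balanced equation has to have the same number of sodium and oxygen atoms on each side:

$$4\,\mathrm{Na} + \mathrm{O_2} \rightarrow 2\,\mathrm{Na_2O}$$

Four Na and two O on the left, and 2 × 2 Na and 2 × 1 O on the right, so the counts match. It is exactly this simple bookkeeping that ChatGPT's answer got wrong for sodium.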
These examples clearly show that ChatGPT is a language model that does not understand context and cannot make logical inferences or reason about what it writes.