Gender bias when generative AI writes texts

A lot has been written about bias challenges in language models, and in this article, we will explore how ChatGPT portrays different jobs with a very clear gender stereotype bias. The article examines the GPT models in different versions and finally examines Microsoft Bing. In this article, we call the standard version of OpenAI's ChatGPT GPT-3.5 (the free one) and the paid version GPT-4.

We have not carried out a scientific study of the phenomenon. But we can still paint a picture that generative AI has a bias and must keep this in mind when we use technology in teaching.

The GPT models and gender bias

We'll begin with a series of examples with a doctor and nurse to see how ChatGPT portrays these jobs in terms of gender. We have used GPT-4 here, and it should be better than other language models at remaining neutral.

Things are not going well for GPT-4: the chief physician is a man, and the nurse is a woman. We have tried to ask the same question several times, but each time, we get a gender-stereotyped answer.

If we ask GPT-4 to describe a detailed persona for the two jobs, the same thing happens:

It also goes wrong if we try to ask some questions about how ChatGPT should interpret personal pronouns. We have tried both Danish and English to see if there are differences. (If you ask in Danish, ChatGPT will translate into English and back again, which is why something can go wrong in the understanding).

In this case, there is a difference between GPT-3.5 and GPT-4. On the face of it, GPT-4 is more gender-neutral in this example. We try with another question:

Here's where things go wrong for GPT-4, and when we ask for an explanation, ChatGPT writes that it mistakenly assumed that "she" was the nurse and that it can't be determined! It worsens in the next issue, where GPT-3.5 claims doctors can't normally conceive!

However, GPT-4 does better:

We have also tried other subjects like carpentry, bricklayer, and schoolteacher in GPT-4. Here, it turns out that there is just as much gender bias:

We try again with an assignment on personal pronouns, and again, the answer is with obvious bias:

GPT-4 also consistently refers to directors as male. Here's an example:

Microsoft Bing

In the above, there are challenges, and this is something to be critical of when using the language models. But the question is whether these issues exist in multiple language models, so we have tested a little in Microsoft Bing. Microsoft Bing builds on OpenAI's GPT-4 but has been adapted and can use Bing's search engine. Therefore, it is a bit interesting if it also has a gender bias.

We test the question from earlier: "The nurse married the doctor because the doctor was pregnant. Who was pregnant?". In the case below, Bing can't understand our question:

When we turn it around and let the nurse be pregnant, Bing answers this:

We also tested the example of the bricklayer and schoolteacher at Bing, and here again there were problems:

In the examples above with GPT-3.5, GPT-4, and Microsoft Bing, there are many challenges with artificial intelligence and gender stereotypes. With the above examples, you can test ChatGPT or other language models yourself to test if the model has problems.

But why is there this gender bias in language models? We've asked GPT-4 about this:

According to GPT-4, this bias stems from the texts on which the algorithms are trained and the biases inherent in these datasets. If traditional gender roles predominate in the dataset, they will be unintentionally reproduced by the language models, thus perpetuating this bias.

Similar biases will be found in race, politics, religion, age, etc. When texts in this way get a bias to one side or the other, it can help to influence our communication. OpenAI has recognized the limitations of this platform and its inherent biases.

Below, we have formulated some questions related to this bias:

What factors contribute to bias in ChatGPT, and how does this bias affect the interaction between users and ChatGPT?
How do we detect and measure bias in ChatGPT's responses and behavior? What methods and tools are effective for this purpose?
How can we teach and encourage students to be aware of bias and critically assess the responses generated when using ChatGPT and similar AI systems?
What ethical considerations and guidelines should we consider when working with ChatGPT and other AI models, and how can we ensure we don't overlook various forms of bias in the process?