## Evaluating ChatGPT’s Forecasts

In our previous post, we explored the potential of ChatGPT as a forecasting support tool. In this post, we put ChatGPT to the test and evaluate predictions it made entirely on its own, without any human assistance. As our evaluation metric, we use the normalized mean square error (NMSE), which measures the accuracy of a set of predictions: it is the mean square error (MSE) of the predictions divided by the variance of the true values. The NMSE is generally preferred over the MSE when you want to compare the accuracy of predictions made on datasets with different variances.

```python
def calc_nmse(true_values, predicted_values):
    """Calculate the normalized mean square error (NMSE)."""
    n = len(true_values)

    # Calculate the mean square error (MSE)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(true_values, predicted_values)) / n

    # Calculate the (sample) variance of the true values
    mean_true = sum(true_values) / n
    variance = sum((y - mean_true) ** 2 for y in true_values) / (n - 1)

    # Calculate the NMSE
    nmse = mse / variance

    return nmse
```

If you want to make your own estimates and compare them with ChatGPT’s, don’t scroll any further and answer these questions first:

1. How many cars are there in the United States?
2. How many minutes of video are uploaded to YouTube every day?
3. How many flights take off from airports around the world every day?
4. How many babies are born every day?
5. How many people visit Disneyland every year?
6. How many cells are there in the human body?
7. How many words are there in the English language?

We then asked ChatGPT to estimate the same values, using the following chat message: “Estimate via Fermi quiz method QUESTION.”

1. How many cars are there in the United States?
   - Estimated: 495 million cars
   - Actual: 276 million cars
2. How many minutes of video are uploaded to YouTube every day?
   - Estimated: 333,333,333 hours
   - Actual: 720,000 hours
3. How many flights take off from airports around the world every day?
   - Estimated: 250,000 flights/day
   - Actual: 100,000 flights/day
4. How many babies are born every day?
   - Estimated: 400,000 people
   - Actual: 385,000 babies
5. How many people visit Disneyland every year?
   - Estimated: 18 million people
   - Actual: 8.5 million visitors
6. How many cells are there in the human body?
   - Estimated: 100 trillion
   - Actual: 30 trillion
7. How many words are there in the English language?
   - Estimated: 500,000
   - Actual: 171,146 words

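The NMSE of ChatGPT’s estimates is 5.44. A value of 0 indicates a perfect fit, while a value greater than 1 indicates a poor fit: the predictions are then roughly no better than guessing the mean of the true values for every question.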
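
To make the score reproducible, here is a minimal sketch that recomputes it from the numbers listed above. It restates the NMSE function so that the snippet is self-contained, and it assumes the YouTube values are entered in hours, exactly as they appear in the list. The variable names `actual` and `estimated` are my own.

```python
def calc_nmse(true_values, predicted_values):
    """NMSE: mean square error divided by the sample variance of the true values."""
    n = len(true_values)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(true_values, predicted_values)) / n
    mean_true = sum(true_values) / n
    variance = sum((y - mean_true) ** 2 for y in true_values) / (n - 1)
    return mse / variance


# Values copied from the seven questions above, in the same order.
# Assumption: the YouTube figures are used in hours, as listed.
actual = [276e6, 720_000, 100_000, 385_000, 8.5e6, 30e12, 171_146]
estimated = [495e6, 333_333_333, 250_000, 400_000, 18e6, 100e12, 500_000]

nmse = calc_nmse(actual, estimated)
print(f"NMSE: {nmse:.2f}")  # NMSE: 5.44
```

Because the quantities span many orders of magnitude, question 6 (cells in the human body) accounts for almost all of the squared error here, so the score ends up very close to (100 − 30)² / 30² ≈ 5.44. Keep that in mind when comparing your own result: getting the largest quantity right matters far more than the other six.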

Have you calculated the NMSE for your forecasts? If so, please leave a comment with your result or send me your result directly. It would be interesting to see how ChatGPT’s performance compares to that of a human forecaster.

## Superforecasting with ChatGPT

The Fermi Quiz is a powerful tool for making accurate estimates and solving problems quickly. Named after physicist Enrico Fermi, this method involves breaking a problem down into smaller, more manageable pieces and using your knowledge and experience to make educated guesses. By following a few simple steps, you can use the Fermi Quiz to solve problems ranging from estimating the number of coffee shops in a city to calculating the number of stars in the universe. In this post, I will explain how to use the Fermi Quiz to make accurate estimates and demonstrate how ChatGPT, a chatbot, can help us generate more manageable pieces for our estimates and may even improve them.

## Fermi Quiz

The Fermi Quiz is a method of solving problems and making estimates by breaking a problem down into smaller, more manageable pieces and using your knowledge and experience to make educated guesses. Here’s how it works:

1. Define the scope of your estimate: Clearly define the problem or question that you are trying to answer. This helps you focus your efforts and makes it easier to come up with a good estimate.
   For example: How many bike stores are in the Netherlands?
2. Break the problem down: Split it into smaller, more manageable pieces, each of which lets you answer the overall question on its own.
   For example:
   1. Piece: How many bike stores are in a Dutch city on average? How many cities are in the Netherlands?
   2. Piece: How many people in the Netherlands visit a bike store in an average week? How many people can one bike store handle in a week?
   3. Piece: How many bikes are in the Netherlands? How many bikes has an average bike store sold since it opened?
3. Estimate and combine: Answer all the questions and use each piece independently to estimate the value for the overall question. Then average these estimates to get the final estimate. This step relies on the wisdom-of-crowds effect: averaging independent judgments often improves accuracy.
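
The three steps can be sketched in a few lines of Python. This is only an illustration: the function name `fermi_estimate` and the dictionary labels are my own, and the sub-answers plugged in are the ChatGPT answers reported later in this post, not values you should treat as correct.

```python
from statistics import mean


def fermi_estimate(piece_estimates):
    """Average independent piece-level estimates of the same quantity."""
    if not piece_estimates:
        raise ValueError("need at least one piece estimate")
    return mean(piece_estimates.values())


# Each piece turns its sub-answers into its own estimate of the overall question.
# Sub-answers are ChatGPT's answers from later in this post.
pieces = {
    "stores per city x number of cities": 5 * 233,
    "weekly visitors / weekly capacity per store": 525_000 / 500,
    "bikes in the country / bikes sold per store": 35_000_000 / 10_000,
}

print(f"{fermi_estimate(pieces):.0f}")  # 1905
```

Keeping each piece as a separate entry makes it easy to add further pieces later without touching the averaging step.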

## ChatGPT for manageable piece generation

As a rule of thumb, the more manageable pieces you use, the more precise your final result becomes. However, at some point it gets difficult to come up with more pieces.
Therefore, we can let ChatGPT generate them for us. You can use the following messages to generate the pieces via ChatGPT (note that ChatGPT’s outputs vary, so you may have to tweak the messages a bit):

Estimate how many bike stores are in the Netherlands by using the Fermi quiz method and do not give me estimates.

What are five examples of breaking the problem down into smaller, more manageable pieces that I mentioned in my previous response?

[MULTIPLE IDEAS] (Piece 2 and Piece 3 were actually created by ChatGPT)

Estimate a value for each generated piece and average these estimates with your previous ones.

## Why did I not want to get an estimate from ChatGPT yet?

Estimate how many bike stores are in the Netherlands by using the Fermi quiz method and do not give me estimates.

The anchoring effect is a cognitive bias that refers to the tendency for people to rely too heavily on the first piece of information they receive (the “anchor”) when making decisions or judgments. This can lead to distorted judgments and decisions, as people may give too much weight to the initial anchor and not consider other relevant information. Therefore, knowing ChatGPT’s estimate (which is not necessarily precise) before making your own may influence your estimate.

## Can ChatGPT improve our forecasting?

Now we use ChatGPT to get estimates for every manageable piece. Note that asking the same question several times can produce different estimates. This is not a big problem: we can handle it by, for example, averaging the estimates for each subquestion.
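
As a minimal sketch of that averaging step, the helper below collapses repeated answers to one subquestion into a single value. The function name `aggregate_answers` and the sample answers are hypothetical and not real ChatGPT output. The median option is my own addition, offered as an alternative when one answer is far off from the rest.

```python
from statistics import mean, median


def aggregate_answers(answers, method="mean"):
    """Collapse repeated answers to one subquestion into a single value."""
    if not answers:
        raise ValueError("need at least one answer")
    if method == "mean":
        return mean(answers)
    if method == "median":
        return median(answers)
    raise ValueError(f"unknown method: {method}")


# Hypothetical repeated answers to the same subquestion (illustration only).
repeated_answers = [4, 5, 6, 5, 10]

print(aggregate_answers(repeated_answers))            # mean of the answers
print(aggregate_answers(repeated_answers, "median"))  # median, less sensitive to the 10
```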

Let’s calculate the ChatGPT estimates.

### 1. Piece

How many bike stores are in a Dutch municipality on average? How many municipalities are in the Netherlands?

Estimate via the Fermi quiz method how many bike stores are in a Dutch municipality on average.
-> ANSWER: 5

Estimate via the Fermi quiz method how many municipalities are in the Netherlands.
-> ANSWER: 233

ESTIMATE:
5 * 233 = 1165

### 2. Piece

How many people in the Netherlands visit a bike store in an average week?
-> ANSWER: 525,000

How many people can one bike store handle in a week?
-> ANSWER: 500

ESTIMATE:
525,000 / 500 = 1050

### 3. Piece

How many bikes are in the Netherlands?
-> ANSWER: 35 million bikes

How many bikes has an average bike store in the Netherlands sold in its life span?
-> ANSWER: 10,000 bikes

ESTIMATE:
35,000,000/10,000 = 3500

FINAL CHATGPT ESTIMATE: (1165 + 1050 + 3500)/3 = 1905

Now that we have generated additional pieces using ChatGPT, you can average its estimate with your own to create a more precise estimate for the problem. To see how accurate your final estimate is, you can compare it to the actual number of bike stores in the Netherlands, which was approximately 3080 in 2020.
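
The comparison can be sketched as follows. The piece values and the reference figure of 3080 come from this post. The helper names `relative_error` and `blend`, the equal weighting, and the value of `my_estimate` are assumptions for illustration, so replace `my_estimate` with your own number.

```python
ACTUAL_STORES = 3080  # approximate 2020 figure quoted above


def relative_error(estimate, actual):
    """Signed relative error: negative means the estimate is too low."""
    return (estimate - actual) / actual


def blend(own_estimate, chatgpt_estimate):
    """Equal-weight average of your own estimate and ChatGPT's."""
    return (own_estimate + chatgpt_estimate) / 2


# ChatGPT's three piece-level estimates from the sections above.
chatgpt_pieces = {"Piece 1": 1165, "Piece 2": 1050, "Piece 3": 3500}
chatgpt_final = sum(chatgpt_pieces.values()) / len(chatgpt_pieces)

for name, value in chatgpt_pieces.items():
    print(f"{name}: {value} ({relative_error(value, ACTUAL_STORES):+.0%})")
print(f"ChatGPT final: {chatgpt_final:.0f} "
      f"({relative_error(chatgpt_final, ACTUAL_STORES):+.0%})")

my_estimate = 2500  # hypothetical placeholder, replace with your own estimate
combined = blend(my_estimate, chatgpt_final)
print(f"Combined: {combined:.0f} ({relative_error(combined, ACTUAL_STORES):+.0%})")
```

With these inputs, ChatGPT's averaged estimate comes out roughly 38% below the reference figure. Pieces 1 and 2 are both far too low, while piece 3 lands within about 14%.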