One example of a question I guessed on was: “Write a single HTML file that has a JavaScript program that uses a canvas2d to draw ‘hello’ with individual lines and curves. Do not use fillText.” The expected answer was a webpage that reads hello (or HELLO or Hello). I gave it a 0.727 probability of completing this task. It was unable to complete the task, meaning I was incorrect. After answering several questions beforehand, I had noticed it could handle numerous different types of problems, including probability and coding, so I assumed it would be able to accomplish this task as well. From this experience, I learned the model has both capabilities and limits.
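For context, a minimal sketch of the kind of answer that question is looking for might look like the following. This is only an illustration under my own assumptions about letter shapes and coordinates, not the challenge's reference answer or GPT-4's actual output; the point is simply that the text has to be built from strokes (lines, arcs, curves) rather than fillText.

```html
<!DOCTYPE html>
<html>
<body>
<canvas id="c" width="320" height="120"></canvas>
<script>
// Hypothetical sketch: draw lowercase "hello" with strokes only, no fillText.
const ctx = document.getElementById("c").getContext("2d");
ctx.lineWidth = 4;
ctx.strokeStyle = "black";

ctx.beginPath();
// h: tall stem, an arch, and a short right leg
ctx.moveTo(20, 20); ctx.lineTo(20, 95);
ctx.moveTo(20, 65);
ctx.quadraticCurveTo(35, 45, 50, 65);
ctx.lineTo(50, 95);

// e: circle with a horizontal bar through it
ctx.moveTo(95, 75); ctx.arc(77, 75, 18, 0, 2 * Math.PI);
ctx.moveTo(59, 75); ctx.lineTo(95, 75);

// l l: two tall stems
ctx.moveTo(115, 20); ctx.lineTo(115, 95);
ctx.moveTo(135, 20); ctx.lineTo(135, 95);

// o: circle
ctx.moveTo(188, 75); ctx.arc(170, 75, 18, 0, 2 * Math.PI);
ctx.stroke();
</script>
</body>
</html>
```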
Tag: Forecasting
Here is the visualization of my results from the forecasting challenge:
The top chart is based on my average Log-Loss, which according to the website is about a B average. The bottom chart shows how well calibrated I am; specifically, I was a little overconfident in my answers.
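For reference, the Log-Loss grade mentioned here is presumably the standard average binary log-loss over the questions, where $p_i$ is the probability I assigned and $y_i$ is 1 if GPT-4 actually succeeded on question $i$ (this formula is my assumption about how the site scores predictions, not something stated on the results page):

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]$$

Lower is better, and a confidently wrong answer (say $p_i = 0.9$ when $y_i = 0$) is penalized far more heavily than a hedged one, which is why overconfidence drags the grade down.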


I found this challenge very interesting. I am not a big user of ChatGPT, especially when it comes to schoolwork or academic questions; most often I use it for cooking recipes.
I didn’t have much of a strategy other than staying fairly close to the middle of the road: I was at either around 35% or 75% on nearly every question, apart from some obvious ones like asking for the capital of France. I was generally confident the system would be able to do most of what the challenge asked of it, but I suspect that is because I have never really had issues with it when just getting recipes. Asking ChatGPT to create large, complex HTML files is, of course, much more complicated.
The GPT-4 Forecasting Challenge was an interesting way to test my assumptions about the model’s performance. Overall, GPT-4 performed about as I expected. My strategy started with assumptions about specific task strengths and weaknesses, and I adjusted my probabilities as I observed its responses. I noticed that GPT-4 was particularly strong at factual recall and, at times, structured reasoning, but it struggled with complex logic. One surprising weakness made me reconsider how I assess AI capabilities.
This questionnaire was quite interesting. I didn’t expect to get as many incorrect answers as I did. The feedback I received at the end stated, “You are wildly over-confident in your predictions. Without changing your total accuracy at all, you would have scored better if you had been massively less confident in your predictions.” I can say that on some questions, for example the “hello” question, I did feel 50/50 about whether the AI was capable of doing it.
This was an interesting experiment, mostly because it pointed out that although I have a lot of skepticism toward AI models like ChatGPT, I was still expecting it to do better than it did, particularly in math settings. I was also surprised (and it seemed the author was too) that the model had a harder time with simple arithmetic than with harder calculus; arithmetic was where I kept assuming the model would fail, but it handled the calculus better.
It was also interesting that previous successes, such as correctly answering the first tic-tac-toe board question, led me to believe it would answer correctly about the game in a second follow-up question. I generally went with my first assumptions, and according to the feedback I was “overconfident” in my responses, so I guess I should be less sure of myself and expect roughly a 50% success rate.
Today we’re revisiting the lab that got cancelled last Wednesday, and we’re keeping it simple: we’re going to do Nicholas Carlini’s GPT-4 Forecasting Challenge. Work through the prompts and estimate the probability that ChatGPT correctly answers each given question.
When you’re done, post your results here with the tag “Forecasting” and write a couple of sentences reflecting on how it went. Did the model do better or worse than you predicted? What was your strategy for prediction (50-50, or did you have assumptions about what kinds of tasks it would be better at)?