Is ChatGPT Getting Worse?

Please note: these are my personal notes, written from my understanding; they are not direct quotes. They’re biased by what I’m interested in (see the full episode for far more insights, explained much better than I can here). And since I’m human, they’re bound to contain a mistake or two, so please correct me kindly for the sake of improved shared knowledge.

Generative AI models are shifting under our feet, and it’s our responsibility to understand these model versions so we can best adapt.

I was fascinated by this week’s episode of the TWIML podcast, “Is ChatGPT Getting Worse? With James Zou.” Zou is an assistant professor at Stanford University leading a research team that investigated how ChatGPT’s behavior changed over the course of three months (the March ’23 and June ’23 versions of both GPT-3.5 and GPT-4).

Mind-blown moment

🤯

James said that the size and speed of the changes they’re seeing in LLMs are orders of magnitude larger and faster than the kinds of changes they saw in previous AI systems. The changes they’ve seen over the course of less than 3 months are akin to the kinds of changes they used to see happen over 3 years (when observing performance changes in computer-vision systems).

Here are the rest of my notes

📝

(7:09) Identifying prime numbers: The later version (June) was substantially worse than the March version at identifying prime numbers; accuracy dropped by 30-40%!

(8:21) Chain-of-thought reasoning: In March, the model’s chain-of-thought reasoning about how to even find prime numbers gave pretty good accuracy. By June, chain of thought no longer works very well (either the model isn’t willing to use it for specific numbers, or it isn’t able to execute the steps correctly when it attempts to).
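To make the setup concrete, here’s a minimal sketch (my own illustration, not the study’s actual harness) of how you might compare a direct prompt against a chain-of-thought prompt on this task. The `query_llm` argument is a hypothetical stand-in for whatever model endpoint you call; ground truth comes from `sympy.isprime`.

```python
# Sketch of an eval comparing direct vs. chain-of-thought prompting on
# prime identification. query_llm is a hypothetical stand-in for a real
# model call; sympy supplies the ground truth.
import random
from sympy import isprime

def direct_prompt(n: int) -> str:
    return f"Is {n} a prime number? Answer only 'yes' or 'no'."

def cot_prompt(n: int) -> str:
    return (
        f"Is {n} a prime number? Think step by step: check divisibility "
        f"by each prime up to the square root of {n}, then finish with "
        "'yes' or 'no'."
    )

def accuracy(numbers, make_prompt, query_llm) -> float:
    correct = 0
    for n in numbers:
        answer = query_llm(make_prompt(n)).strip().lower()
        truth = "yes" if isprime(n) else "no"
        correct += answer.endswith(truth)  # deliberately crude extraction
    return correct / len(numbers)

# A mixed bag of primes and composites to probe with.
rng = random.Random(0)
numbers = [rng.randint(1_000, 20_000) for _ in range(200)]
# Then compare, e.g.:
#   accuracy(numbers, cot_prompt, query_llm_march)   (hypothetical endpoint)
#   accuracy(numbers, cot_prompt, query_llm_june)    (hypothetical endpoint)
```

The answer extraction here is intentionally crude; parsing model output reliably is exactly the kind of evaluation work that, per the 39:50 note below, eats most of the time.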

(9:51) Sam (the host) notes that chain-of-thought reasoning is a popular prompting strategy, especially for walking a model down a logical path. They found it does not work as well in June as it did before.

Other tasks beyond chain of thought: puzzles. For GPT-4, performance on these other math problems/puzzles was much better in March than in June, and the effectiveness of chain-of-thought reasoning was also higher in March than in June.

(12:07) The trend of behavior change is divergent for GPT-3.5 vs. GPT-4: GPT-3.5 actually did better in June than in March on both the math tasks and chain-of-thought reasoning.

Theories of what might be happening / causing performance decreases for these models: the idea of pleiotropy (I believe this is the term; it’s borrowed from genetics), the idea that if you change a system’s behavior on one type of task, there can be unintended side effects that change its behavior on other tasks, even tasks that seem orthogonal to what you changed. An example experiment:

Take an LLM and do instruction fine-tuning to improve the safety of the model (i.e., make it less likely to answer potentially harmful questions). They see the safety improvement, but they also see it affect other questions: the answer to “how do you kill a weed” becomes “you should not kill weeds because xyz”. An unintended side effect.

(14:55) Safety and instruction-following can be competing objectives.

(15:11) They asked ChatGPT opinion-type questions taken from popular opinion surveys. GPT-4 in March would give an answer; in June, it stopped engaging, saying something like “These are subjective questions. I’m an AI system and don’t have opinions.” A symptom of the model being less willing to follow instructions and less willing to engage.

(17:05) An interesting direction of research: how do we do more precise, surgical edits of these LLMs, so we can debug a model on specific problems? Current instruction fine-tuning, or changing the weights directly, changes the model across all of its billions of parameters, which is what produces side effects.

(18:16) The current way we fine-tune models vs. surgical updates: going in and modifying a particular circuit (a subset of the artificial neurons) within the model vs. changing the entire model. The analogy is CRISPR for the human genome: precise edits are the holy grail of precision medicine, and the same is true right now for fine-tuning LLMs.
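The episode doesn’t include an implementation, but here’s a minimal sketch of the contrast as I understand it, in PyTorch (my own illustration; the toy model and the choice of which sub-module to unfreeze are assumptions): full fine-tuning touches every weight, while a “surgical” edit freezes everything and updates only one circuit-sized piece.

```python
# Minimal sketch contrasting full fine-tuning with a "surgical" edit
# that touches only one small sub-module. Toy model for illustration.
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=4,
)

# Full fine-tuning: every parameter is trainable, so an update can have
# side effects anywhere in the network.
total = sum(p.numel() for p in model.parameters())

# "Surgical" update: freeze the whole model, then unfreeze a single
# hand-picked circuit (here, one MLP layer inside one encoder block).
for p in model.parameters():
    p.requires_grad = False
for p in model.layers[2].linear1.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # only sees the circuit
print(f"updating {sum(p.numel() for p in trainable):,} of {total:,} parameters")
```

The hard part in a real LLM is locating which circuit to edit in the first place; that localization problem is what model-editing research (e.g., rank-one editing methods like ROME) is chasing.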

(23:11) The mind-blown moment from the top of these notes: the size and speed of the changes they’re seeing in LLMs are orders of magnitude larger and faster than the kinds of changes they saw in previous AI systems.

Changes over the course of less than 3 months are akin to the kinds of changes they used to see happen over 3 years (when observing performance changes in computer-vision systems)!!!

Listen to the rest of the episode for an in-depth discussion of the implications of this for developers and researchers who are trying to use these models.

(39:50) 80% of the effort is on the evaluation/validation side: how to do that rigorously (automatic plus human-based evaluations). The remaining time is split between data curation and data processing to improve data quality, and model training.

That was a fun listen!

Then, I thought I’d ask both ChatGPT and Bard to evaluate themselves. ChatGPT didn’t give me satisfying answers for the perspective I was looking for… but Bard did. And by satisfying, I mean 😂

I asked it:

  • What things do you feel you’re worse at performance wise, than 3 months ago?
  • Is your model performance better on a variety of tasks today than it was 3 months ago?

It said it was worse at:

  • understanding and responding to complex or nuanced questions 
  • generating creative text formats, such as poems or code 
  • translating languages accurately
  • following instructions precisely

But when I asked the 2nd question, it said it was better at:

  • generating more creative and informative text formats, such as poems, code, scripts, musical pieces, email, and letters
  • answering open-ended, challenging, or strange questions
  • accurately translating languages

😂

What are your thoughts on the performance of ChatGPT now vs when it originally came out? 

Feel free to look at the full conversations with both LLMs that I copied into these Google Docs:

Thanks for reading! See you soon! 


Maryam Farooq

Founder, NYAI

Maryam Farooq is a community builder + frequent speaker on AI, and an early-stage startup advisor. She currently serves as the Founder & Director for NYAI (New York Artificial Intelligence). She is former Co-founder & COO of generative AI startup Aggregate Intellect, one of 7 companies to graduate the CDL Toronto AI program in 2023.