Steps to reproduce:
- Find a fairly content-heavy document, e.g. https://arxiv.org/pdf/2307.09009.pdf
- Summarize it with Universal Summarizer in "Summary" mode
- Observe a long, detailed summary that includes key takeaways (a programmatic reproduction sketch follows these steps)
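For reproducing this outside the web UI, a minimal sketch is below. It assumes the summarizer is reachable over an HTTP API that takes a document `url` and a summary-mode parameter; the endpoint, parameter names, and auth header here are illustrative assumptions, not the documented interface.

```python
# Minimal reproduction sketch. The endpoint and parameter names below are
# assumptions for illustration; substitute the documented Universal Summarizer
# API values and a real API token before running.
import os
import requests

API_URL = "https://example.com/api/summarize"  # hypothetical endpoint
API_TOKEN = os.environ["SUMMARIZER_TOKEN"]     # hypothetical auth token

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bot {API_TOKEN}"},
    json={
        "url": "https://arxiv.org/pdf/2307.09009.pdf",  # content-heavy document from the report
        "summary_type": "summary",  # the "Summary" mode named in the steps above
    },
    timeout=120,
)
response.raise_for_status()
# Inspect whether the result is a short paragraph or a long, takeaway-style summary
print(response.json())
```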
Expected behavior:
I expected a short paragraph summary, as it produces for other documents and articles.
Debug info:
US East
Image/Video (summary output pasted as text):
The performance and behavior of large language models like GPT-3.5 and GPT-4 can vary significantly over time, even within a few months. The authors evaluated the March 2023 and June 2023 versions of these models on 4 tasks:
1) Solving math problems
2) Answering sensitive questions
3) Code generation
4) Visual reasoning
They found substantial differences in the models' performance and behavior across the two releases:
• GPT-4's accuracy in solving math problems dropped from 97.6% to 2.4% from March to June. GPT-3.5's accuracy improved from 7.4% to 86.8%.
• GPT-4 was less likely to answer sensitive questions directly in June compared to March, indicating a stronger safety layer. However, it provided less rationale for refusing to answer questions.
• The percentage of directly executable code generated by GPT-4 dropped from 52% to 10% from March to June. GPT-3.5 also showed a large drop.
• GPT-4 and GPT-3.5 showed only marginal improvements in visual reasoning, around 2% from March to June.
The findings highlight the need to continuously monitor and evaluate large language models in production to ensure stable performance over time. The behavior of "the same" model can change substantially with updates.
In summary, the key impact is that large language models require continuous monitoring and evaluation to account for performance drifts over time, even with minor updates. Without such oversight, integrating these models into applications can become unstable and unpredictable.