I have been testing DeepSeek R1 today. It was awesome to see it added to Assistant so quickly, and I'm very grateful for that. Great work!
However, in my testing it appears to have been massively overhyped. In every thread I've used it in, it has run out of output tokens while engaging in circular and mistaken reasoning. It rarely manages to complete an answer. I would link some examples, but the failure is so common you can just try it yourself and watch it exhaust its tokens without getting anywhere.
It's a fun development for LLMs, but I might suggest considering dropping it until DeepSeek has fine-tuned the model further. Right now it's burning a lot of tokens running itself in circles.
These are just my first impressions; I'd be interested to hear how other users are finding it.
Expected behaviour: constructive chain-of-thought reasoning that matches ChatGPT o1 as claimed.
Observed behaviour: circular nonsense that quickly exhausts the output token limit without making progress on the problem.
Reproduction: My testing was based on a Reddit post from today in which three poker players were all dealt a hand of the same value. Commenters were asking what the probability of this was (e.g. all three are dealt A-8 offsuit, or an equivalent offsuit hand). This is harder to model than you might initially expect, and it's very unlikely to appear in an LLM's training data. None of the Assistant models did well at this task (Claude, ChatGPT 4o, Gemini), but DeepSeek R1 was the only one to fail catastrophically, making no progress at all.
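For anyone who wants to try the same prompt, here's a rough Monte Carlo sketch of the problem as I understand it. It assumes Texas hold'em hole cards and treats "same value hand" as the same rank pair with the same suited/offsuit status; the function names are mine, not from any model's output, and this is an estimate of the question rather than a closed-form answer:

```python
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [(r, s) for r in RANKS for s in SUITS]

def hand_class(cards):
    """Classify a two-card hold'em hand by its (sorted) ranks and
    whether it is suited, ignoring the specific suits otherwise."""
    (r1, s1), (r2, s2) = cards
    return tuple(sorted((r1, r2))), s1 == s2

def estimate(trials=200_000, seed=0):
    """Estimate P(all three players are dealt the same hand class)
    by dealing six cards at random and comparing the three hands."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        six = rng.sample(DECK, 6)  # three 2-card hands, dealt together
        classes = {hand_class(six[i:i + 2]) for i in (0, 2, 4)}
        hits += (len(classes) == 1)
    return hits / trials
```

Even a quick run shows the event is very rare (well under one in ten thousand), which is part of why the exact combinatorics trip models up: all-three-of-a-pair is impossible (only four cards per rank), and the suited/offsuit matching across three hands takes careful case analysis.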