Xytronix Please test the Phi-4 models from Microsoft https://huggingface.co/microsoft/Phi-4-reasoning https://huggingface.co/microsoft/Phi-4-reasoning-plus
partlycloudy Newbie here. I'm having a hard time reading the benchmark table. Which model is recommended for general purpose use, web access, research on topics, vacation itinerary, general searches, light programming tasks etc?
RoxyRoxyRoxy partlycloudy I personally love DeepSeek R1; it’s quite good for general-purpose use while also being cheap, and it has a reasoning mode. For programming tasks I’d recommend the premade Code custom Assistant, or o3. Sonnet 3.7 and Ki also get a lot of use from me, but I believe those are more expensive (if not, my bad; I haven’t looked at the cost sheet in a hot minute 😅)
Xytronix Doubao Seed 1.6 Doubao Seed 1.6 Thinking Doubao Seed 1.6 Flash https://mp.weixin.qq.com/s/CiN0XRWQc3hIV9lLLS0rGA https://www.volcengine.com/docs/82379/1544106
Gaeilgeoir Perhaps the updated DeepSeek R1 05/28, if it hasn't already been tested under the same DeepSeek R1 name? o3 Pro would also be interesting, to see how well it compares to Gemini 2.5 Pro and others.
Gaeilgeoir nichu42 Yeah, I think that would be really useful. Especially in today's trend of updating existing models at "checkpoints", instead of releasing entirely new iterations (e.g. llama)
JJ Magistral small and medium. At present this seems to be missing from all benchmarking groups that I can see.
Thibaultmol Updated the table with the requested models. It's been a while since the last benchmark. @yiwei-1 will hopefully do another one soon. @JJ Magistral was recently included in Kagi Assistant but was taken out again because its thinking process seems to get stuck in a literal loop and then fail. So it would probably fail the benchmark as well until that issue is fixed, unfortunately.
fs1010 I think Moonshot’s Kimi K2 would be a good addition to this list. And Grok 4 if you can get around the rate limit 😛
tauon The open model DeepSeek V3.1 seems to outperform a lot of closed-source competitors on price-to-quality right now, based on the initial reactions I've read.