Suggest LLMs to benchmark

batuhan

Xytronix Oh, I didn't know Qwen 3 was out! Thanks for the heads-up.

Xytronix

Please test the phi4 models from Microsoft

https://huggingface.co/microsoft/Phi-4-reasoning
https://huggingface.co/microsoft/Phi-4-reasoning-plus

partlycloudy

Newbie here. I'm having a hard time reading the benchmark table. Which model is recommended for general purpose use, web access, research on topics, vacation itinerary, general searches, light programming tasks etc?

RoxyRoxyRoxy

partlycloudy I personally love DeepSeek R1, it’s quite good for general purpose while also being cheap, and has a reasoning mode. For programming tasks I’d recommend the premade Code custom Assistant, or o3. Sonnet 3.7 and Ki also get a lot of use from me, but I believe those are more expensive (if not my bad, haven’t looked at the cost sheet in a hot minute 😅)

Xytronix

Doubao Seed 1.6
Doubao Seed 1.6 Thinking
Doubao Seed 1.6 Flash

https://mp.weixin.qq.com/s/CiN0XRWQc3hIV9lLLS0rGA
https://www.volcengine.com/docs/82379/1544106

Gaeilgeoir

Perhaps the updated DeepSeek R1 (05/28), if it hasn't already been tested under the same DeepSeek R1 name?

It would also be interesting to see how well o3 Pro compares to Gemini 2.5 Pro and others.

nichu42

We need to know which version was tested. And also which versions are live in the Assistant.

Gaeilgeoir

nichu42 Yeah, I think that would be really useful. Especially in today's trend of updating existing models at "checkpoints", instead of releasing entirely new iterations (e.g. llama)

JJ

Magistral Small and Medium. At present these seem to be missing from all benchmarking groups that I can see.

youngji22

Hunyuan A13B

Thibaultmol

Updated the table with the requested models. It's been a while since the last benchmark. @yiwei-1 will hopefully do another one soon.

@JJ Magistral was recently included in Kagi Assistant but was taken out again because its thinking process seems to get stuck in a literal loop and then fail. So it would probably fail the benchmark as well until that issue is fixed, unfortunately.

fs1010

I think Moonshot’s Kimi K2 would be a good addition to this list. And Grok 4 if you can get around the rate limit 😛

Searcher23

Perhaps worth trying GLM4.5 as well - https://z.ai/blog/glm-4.5

tauon

The open model DeepSeek V3.1 seems to outperform a lot of closed-source competitors on price-to-quality right now, based on the initial reactions I've read.

kg4maz

Thibaultmol How about Apertus from Switzerland? It seems good so far!

eltaco

Grok 4.1 Fast and Claude 4.5 Opus (each reasoning and non-reasoning).
