Kagi does its own benchmarking for LLMs. There are many LLMs at this point, with new ones being released every week or so.
https://help.kagi.com/kagi/ai/llm-benchmark.html
Note: models that have already been tested are no longer listed in this table.
| Model | Reason for adding | Status | Notes |
| --- | --- | --- | --- |
| Mistral: Codestral 25.01 | Announcement - "Only EU-based LLM designed for coding." (nichu42) - "There is an open source version of Codestral but it's less good I assume" (Thibaultmol) | | |
| Mistral: Ministral 8B | Announcement - "Efficient intermediaries for function-calling in multi-step agentic; potentially interesting for Ki." (nichu42) | | |
| bitnet_b1_58-3B | Announcement - "Potentially low cost coding 'lite' model?" - Thibaultmol | | |
| bitnet_b1_58-large | Announcement - "Potentially low cost coding 'lite' model?" - Thibaultmol | | |
| MiniMax‑VL‑01 | GitHub | | |
| Phi‑4 Reasoning | Hugging Face | | Reasoning Plus was tested |
| Doubao Seed 1.6 | Announcement - All-in-one comprehensive model, first domestic model supporting 256K context with thinking capabilities. Supports deep thinking, multimodal understanding, and GUI operations with adaptive thinking modes. | | |
| Doubao Seed 1.6 Thinking | Announcement - Enhanced version focused on deep thinking capabilities with improvements in code, mathematics, and logical reasoning. Supports 256K context. | | |
| Doubao Seed 1.6 Flash | Announcement - Ultra-fast version with extremely low latency (TPOT of only 10 ms). Supports deep thinking, multimodal understanding, and 256K context, with vision capabilities matching flagship models. | | |
| Hunyuan A13B | GitHub Release - Mixture of Experts (MoE) model with 80B total parameters and 13B active parameters, delivering high performance with efficient resource usage. Released June 27, 2025. | | |
| Qwen3‑8B | Hugging Face – “Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture‑of‑experts (MoE) models.” (Xytronix) | | |
| Apertus | Developed by Swiss research institutes, including EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS). Designed with a strong emphasis on transparency and privacy, making all development artifacts and data processes available for independent review. | | |
What are some models you think Kagi should benchmark? For each suggestion, explain why the model would make sense for Kagi to add to The Assistant (a model might be very fast and very cheap but not score well; it could still be a valid addition). Please include:
- Exact model name(s)
- Relevant links (not API listings; just the model's announcement or general product page)
- Reason why you think it should be benchmarked or considered for Kagi Assistant (can just be "curious how it performs", but might also be "it's cheap and fast")