This feature would introduce a "General Usage Accuracy" benchmark that evaluates LLMs specifically on everyday, web-search-style queries rather than complex problem-solving tasks. The benchmark would test each model with 100 questions spanning general knowledge, current events, sports results, news, product information, and other real-world topics that typical users search for. It would measure accuracy in retrieving, processing, and presenting factual information, addressing common issues such as LLMs confusing similar events (e.g. different games between the same teams) or providing outdated information. The score would be displayed alongside the existing technical benchmarks, helping users choose models based on real-world search accuracy rather than reasoning ability alone. That would reduce frustration from inaccurate responses and help non-power users pick models that excel at the search-replacement tasks they actually need, making Kagi more accessible and reliable for mainstream users.
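To make the idea concrete, here is a minimal sketch of how such a scoring run might look. Everything in it is hypothetical: the `Question` fields, the keyword-match grading, and the generic `ask_model` callable are illustrative assumptions, not a description of Kagi's actual benchmark harness.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shape of a benchmark item; field names are illustrative only.
@dataclass
class Question:
    category: str                 # e.g. "sports results", "product information"
    prompt: str                   # the everyday query posed to the model
    accepted_answers: list[str]   # strings counted as a correct response

def is_correct(response: str, accepted_answers: list[str]) -> bool:
    """Naive grading: the response must contain one of the accepted answers."""
    response = response.lower()
    return any(answer.lower() in response for answer in accepted_answers)

def score_model(ask_model, questions: list[Question]) -> dict[str, float]:
    """Return per-category accuracy plus an overall score for one model.

    `ask_model` is any callable that takes a prompt string and returns the
    model's answer as a string (e.g. a wrapper around an API call).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.category] += 1
        if is_correct(ask_model(q.prompt), q.accepted_answers):
            correct[q.category] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```

Per-category results would also show where a model is strong (e.g. product information) but weak elsewhere (e.g. live sports scores), rather than hiding that behind a single number.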
These high-scoring general-usage models could also be highlighted in the "Recommended LLMs" dropdown list I have suggested elsewhere on here, to simplify the currently somewhat overcluttered and overwhelming model selection within the Assistant, especially for non-power users.
Users would primarily consult this benchmark before selecting LLMs for everyday information queries. Casual users searching for "Lakers game last night score" or "iPhone 15 price in different stores" would choose models with high General Usage scores to avoid getting information about the wrong game or outdated pricing. Those researching current events would prioritize accuracy-rated models over pure reasoning models.
This would add a new table to the current LLM benchmarks, labeled "General Usage Accuracy", with percentage scores. Users could filter and sort models by this metric as well as by cost, speed, etc., and Kagi could recommend high-accuracy models when it detects search-type queries. The benchmark could run alongside the present reasoning benchmark and be refreshed regularly with questions about current events, so that the accuracy scores stay relevant and users can see which models excel as search-engine replacements versus at creative, technical, or reasoning tasks.
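As a rough illustration of the filter/sort idea, the snippet below ranks models that clear an accuracy floor by a secondary metric. The column names and all numbers are invented placeholders, not real results or a proposed schema.

```python
# Hypothetical benchmark rows; numbers are placeholders, not real results.
results = [
    {"model": "model-a", "general_usage_accuracy": 0.91, "cost_per_1k": 0.015, "avg_latency_s": 2.1},
    {"model": "model-b", "general_usage_accuracy": 0.84, "cost_per_1k": 0.004, "avg_latency_s": 1.3},
    {"model": "model-c", "general_usage_accuracy": 0.77, "cost_per_1k": 0.001, "avg_latency_s": 0.9},
]

def rank_models(rows, min_accuracy=0.8, sort_key="cost_per_1k"):
    """Drop models below an accuracy floor, then sort by a secondary metric."""
    eligible = [r for r in rows if r["general_usage_accuracy"] >= min_accuracy]
    return sorted(eligible, key=lambda r: r[sort_key])

for row in rank_models(results):
    print(f'{row["model"]}: {row["general_usage_accuracy"]:.0%} accurate, '
          f'${row["cost_per_1k"]}/1k tokens, {row["avg_latency_s"]}s avg')
```

The same filter could back a "recommended for everyday searches" shortlist in the Assistant's model picker.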