This feature would introduce a "General Usage Accuracy" benchmark that evaluates LLMs specifically on everyday, web-search-style queries rather than complex problem-solving tasks. The benchmark would test each model with 100 questions spanning general knowledge, current events, sports results, news, product information, and other real-world topics that typical users search for. It would measure accuracy in retrieving, processing, and presenting factual information, addressing common issues such as LLMs confusing similar events (e.g. different games between the same teams) or providing outdated information. The score would be displayed alongside the existing technical benchmarks, helping users choose models based on real-world search accuracy rather than reasoning ability alone. That would reduce frustration from inaccurate responses and help non-power users pick models that excel at the search-replacement tasks they actually need, making Kagi more accessible and reliable for mainstream users.
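To make the idea concrete, here is a minimal sketch of how such a scoring run might look. Everything in it is hypothetical: the `Question` fields, the keyword-match grading, and the generic `ask_model` callable are illustrative assumptions, not a description of Kagi's actual benchmark harness.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shape of a benchmark item; field names are illustrative only.
@dataclass
class Question:
    category: str                 # e.g. "sports results", "product information"
    prompt: str                   # the everyday query posed to the model
    accepted_answers: list[str]   # strings counted as a correct response

def is_correct(response: str, accepted_answers: list[str]) -> bool:
    """Naive grading: the response must contain one of the accepted answers."""
    response = response.lower()
    return any(answer.lower() in response for answer in accepted_answers)

def score_model(ask_model, questions: list[Question]) -> dict[str, float]:
    """Return per-category accuracy plus an overall score for one model.

    `ask_model` is any callable that takes a prompt string and returns the
    model's answer as a string (e.g. a wrapper around an API call).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.category] += 1
        if is_correct(ask_model(q.prompt), q.accepted_answers):
            correct[q.category] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```

Per-category results would also show where a model is strong (e.g. product information) but weak elsewhere (e.g. live sports scores), rather than hiding that behind a single number.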
These high-scoring general-usage models could also be highlighted in the "Recommended LLMs" dropdown list I have suggested elsewhere on here, to simplify the currently somewhat overcluttered and overwhelming model selection within the Assistant, especially for non-power users.
Users would primarily consult this benchmark before selecting LLMs for everyday information queries. Casual users searching for "Lakers game last night score" or "iPhone 15 price in different stores" would choose models with high General Usage scores to avoid getting information about the wrong game or outdated pricing. Those researching current events would prioritize accuracy-rated models over pure reasoning models.
This would add a new table to the current LLM benchmarks, labeled "General Usage Accuracy", with percentage scores. Users could filter and sort models by this metric as well as by cost, speed, etc., and Kagi could recommend high-accuracy models when it detects search-type queries. The benchmark could run alongside the present reasoning benchmark and be refreshed regularly with questions about current events, so that the accuracy scores stay relevant and users can see which models excel as search-engine replacements versus at creative, technical, or reasoning tasks.
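As a rough illustration of the filter/sort idea, the snippet below ranks models that clear an accuracy floor by a secondary metric. The column names and all numbers are invented placeholders, not real results or a proposed schema.

```python
# Hypothetical benchmark rows; numbers are placeholders, not real results.
results = [
    {"model": "model-a", "general_usage_accuracy": 0.91, "cost_per_1k": 0.015, "avg_latency_s": 2.1},
    {"model": "model-b", "general_usage_accuracy": 0.84, "cost_per_1k": 0.004, "avg_latency_s": 1.3},
    {"model": "model-c", "general_usage_accuracy": 0.77, "cost_per_1k": 0.001, "avg_latency_s": 0.9},
]

def rank_models(rows, min_accuracy=0.8, sort_key="cost_per_1k"):
    """Drop models below an accuracy floor, then sort by a secondary metric."""
    eligible = [r for r in rows if r["general_usage_accuracy"] >= min_accuracy]
    return sorted(eligible, key=lambda r: r[sort_key])

for row in rank_models(results):
    print(f'{row["model"]}: {row["general_usage_accuracy"]:.0%} accurate, '
          f'${row["cost_per_1k"]}/1k tokens, {row["avg_latency_s"]}s avg')
```

The same filter could back a "recommended for everyday searches" shortlist in the Assistant's model picker.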