Ability to mark a site as AI generated and ignore those tagged
Magess A domain you want to block might seem useful to others.
I think it is a good idea to make it easy to report sites for review by humans at Kagi, perhaps with a short list of reasons to pick from (e.g. spam or malware). Integrating it into the search results would remove the friction of loading up kagifeedback.org, signing in, and submitting a bug report for every spam site.
I see the use case, but ultimately, this is hard to get right. I do not want Kagi or anyone to meddle with my search results. Let me do this myself.
Option: Something like uBlocklist for Kagi (see GitHub, there is already a ticket)
If a manual solution is not in the cards:
- These AI generated sites often look similar to one another. You might be able to make a heuristic that checks for telltale stylings and ads.
- These sites generally lack inline screenshots, pictures, videos, or other rich content.
- These sites are often re-hashings of official documentation, for example pydoc. It could be helpful to track official sources of documentation for libraries, and promote those over copycat sites.
- These sites are often rife with ads. It could help to demote results that rely heavily on ads.
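To make the heuristics above concrete, here is a rough sketch of how they could be combined into one score. Everything in it is an assumption for illustration: the `PageSignals` fields, the weights, and the "10 ads per 1000 words" cutoff are made up, and the `duplicates_official_docs` flag would have to come from a separate similarity check that is not shown.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page signals; none of these names come from Kagi."""
    ad_slots: int                   # number of ad containers found on the page
    rich_media: int                 # inline screenshots, pictures, videos
    word_count: int                 # words in the main body text
    duplicates_official_docs: bool  # assumed output of a separate similarity check


def spam_score(page: PageSignals) -> float:
    """Return a score in [0, 1]; higher means more likely a low-quality page."""
    score = 0.0

    # Ad density: scale linearly up to an arbitrary cap of 10 ads per 1000 words.
    words = max(page.word_count, 1)
    ads_per_1000_words = page.ad_slots * 1000 / words
    score += 0.4 * min(ads_per_1000_words / 10.0, 1.0)

    # No rich content at all is treated as a warning sign.
    if page.rich_media == 0:
        score += 0.3

    # Re-hash of official documentation (for example pydoc).
    if page.duplicates_official_docs:
        score += 0.3

    return min(score, 1.0)


suspicious = PageSignals(ad_slots=20, rich_media=0, word_count=1000,
                         duplicates_official_docs=True)
clean = PageSignals(ad_slots=0, rich_media=5, word_count=1000,
                    duplicates_official_docs=False)
print(spam_score(suspicious), spam_score(clean))
```

A score like this would only be useful for demoting results, not blocking them outright, since each signal on its own also matches plenty of legitimate sites.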
Historically, the best way to detect AI generated content was to calculate the document's perplexity. However, this is expensive now that there are dozens of models that can generate content, and some of them require a license to use.
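For anyone unfamiliar with the term: perplexity is the exponential of the average negative log-probability a language model assigns to each token, so text the model finds very predictable gets a low value. A minimal sketch, assuming you already have per-token natural-log probabilities from whichever model you are scoring against (getting those is the expensive part and is not shown):

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    token_logprobs is assumed to come from some language model;
    this function only does the final arithmetic.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# Toy input: every token was assigned probability 0.5 by the model.
print(perplexity([math.log(0.5)] * 4))
```

The cost problem is that this has to be repeated per candidate model, since a document that looks predictable to one model may not look predictable to another.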
I'd love to hear others' thoughts as well.
steventrouble Do you know an example of a website/service/library that can reliably detect AI generated stuff (by such means)?
As far as I know, such detectors are not very good and might erroneously flag too many valid websites.
There's always GPTZero, but I doubt that will work at your scale and it'd probably be too expensive. It has about a 10% false positive rate.
Given you don't want to let users flag domains, those spam detection steps I mentioned above are the best I can think of right now. I'll keep you posted if I think of anything else.
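To put the 10% false positive rate in perspective, here is a back-of-the-envelope sketch of how many flagged sites would actually be AI generated. The detection rate and the share of AI generated sites are assumed numbers for illustration only, not measurements of GPTZero or of any index.

```python
def flagged_precision(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of flagged pages that really are AI generated (Bayes' rule)."""
    true_flags = true_positive_rate * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    return true_flags / (true_flags + false_flags)


# Assumptions: 5% of pages are AI generated and the detector catches 90% of them.
# Only the 10% false positive rate comes from the post above.
print(flagged_precision(prevalence=0.05,
                        true_positive_rate=0.9,
                        false_positive_rate=0.10))
```

Under those assumed numbers only about a third of flagged pages would be real hits, which is why a detector like this seems better suited to feeding a human review queue than to blocking automatically.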
We have taken a few steps in this direction by working on an LLM content detector. It should soon be available as an API from us for everyone to try.
It's time to stop the bullshit generators. There are a lot of LLM-regurgitated sites, packed with ads to the brim and tailored to a specific query, no matter who picks them up, Google or Kagi.
I'd like to see some sort of LLM risk score for a site, whether derived from human feedback or not, and the ability to block all results above a certain threshold.
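Roughly what I have in mind, as a sketch. The result shape, the `risk_by_domain` mapping and the default threshold are all hypothetical; nothing here reflects an existing Kagi API.

```python
def filter_results(results, risk_by_domain, threshold=0.8):
    """Drop results whose domain has an LLM risk score above the threshold.

    results:        list of dicts with at least a "domain" key (assumed shape)
    risk_by_domain: dict mapping domain -> score in [0, 1] (assumed source)
    Domains without a score are treated as risk 0.0 and always kept.
    """
    kept = []
    for result in results:
        risk = risk_by_domain.get(result["domain"], 0.0)
        if risk <= threshold:
            kept.append(result)
    return kept


results = [
    {"domain": "docs.python.org", "title": "pydoc"},
    {"domain": "example-spam.test", "title": "pydoc explained"},
    {"domain": "unknown.test", "title": "some blog"},
]
risk_by_domain = {"docs.python.org": 0.05, "example-spam.test": 0.95}

for result in filter_results(results, risk_by_domain, threshold=0.8):
    print(result["domain"])
```

Making the threshold a per-user setting would keep the decision with the user, which addresses the concern raised earlier about anyone else meddling with search results.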
Lately I have started noticing really poor quality results. For example, I searched for "router bits explained" and one of the top results was on "homesthetics.net", which is entirely AI generated and full of misinformation. It's becoming difficult to find trustworthy content now that AI makes it so easy to generate junk content, these days even with photos scraped from Google Images or similar sources.