Ability to mark a site as AI generated and ignore those tagged

Magess · Jan 9, 2024

BeNice Hiding domains for yourself is already possible, right?

Yeah, but that doesn't help other people get better search results, which seems like a worthy goal to me.

Fernandez · Jan 9, 2024

Magess A domain you want to block might be useful for others.

mhersh · Jan 9, 2024

I think it is a good idea to make it easy to report sites for review by humans at Kagi, perhaps with a short list of reasons to pick from (e.g. spam or malware). Integrating it into the search results would remove the friction of loading up kagifeedback.org, signing in, and submitting a bug report for every spam site.

KagiForMe · Jan 9, 2024

I see the use case, but ultimately, this is hard to get right. I do not want Kagi or anyone to meddle with my search results. Let me do this myself.

Option: Something like uBlocklist for Kagi (see GitHub, there is already a ticket)

Vlad · Jan 10, 2024

mhersh We do not have humans at Kagi who can do this (most humans here write code). So any system has to be automated.

Vlad · Jan 10, 2024

Magess That is correct, it can lead to abuse, for example flagging your competitors' websites into oblivion. Preventing that would require manual intervention, which we do not have the resources for.

steventrouble · Jan 13, 2024

If a manual solution is off the table:

These AI generated sites often look similar to one another. You might be able to make a heuristic that checks for telltale stylings and ads.
These sites generally lack inline screenshots, pictures, videos, or other rich content.
These sites are often re-hashings of official documentation, for example pydoc. It could be helpful to track official sources of documentation for libraries, and promote those over copycat sites.
These sites are often rife with ads. It could help to demote results that rely heavily on ads.
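To make the signals above concrete, here is a rough sketch of how they could be combined into a single score. Everything in it is an assumption for illustration: the `PageSignals` fields, the weights, and the ad-density cutoff are made up, and a real system would need to tune them against labeled pages.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page signals; none of these come from a real crawler API."""
    word_count: int
    rich_media_count: int           # inline screenshots, pictures, videos
    ad_slot_count: int
    matches_known_template: bool    # styling similar to already-flagged sites
    duplicates_official_docs: bool  # rehash of e.g. official library docs


def spam_score(page: PageSignals) -> int:
    """Return a 0-100 score; higher means more likely to be junk.

    The weights (20/20/30/30) and the ads-per-1000-words cutoff
    are arbitrary placeholders, not measured values.
    """
    score = 0
    if page.matches_known_template:
        score += 20
    if page.rich_media_count == 0:
        score += 20
    ads_per_1000_words = page.ad_slot_count * 1000 / max(page.word_count, 1)
    if ads_per_1000_words > 5:
        score += 30
    if page.duplicates_official_docs:
        score += 30
    return score
```

A score like this would only be a demotion signal, since any single check (no images, many ads) also matches plenty of legitimate pages.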

Historically, the best way to detect AI generated content was to calculate the document's perplexity. However, this is expensive now that there are dozens of models that can generate content, and some of them require a license to use.
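For anyone unfamiliar with the term, perplexity is the exponential of the average negative log-probability a language model assigns to each token. A minimal sketch, assuming you already have per-token natural-log probabilities from some model (obtaining those is the expensive part and is not shown):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Low perplexity means the scoring model found the text very predictable,
    which is the signal that was used to suspect machine-generated text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_neg_logprob)
```

The result depends entirely on which model produced the log-probabilities, which is why having to score against many different models gets costly.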

I'd love to hear others' thoughts as well.

BeNice · Jan 13, 2024

steventrouble Do you know an example of a website/service/library that can reliably detect AI generated stuff (by such means)?
As far as I knew such detectors were not very good and might erroneously flag too many valid websites.

steventrouble · Jan 23, 2024

There's always GPTZero, but I doubt that will work at your scale and it'd probably be too expensive. It has about a 10% false positive rate.

Given you don't want to let users flag domains, those spam detection steps I mentioned above are the best I can think of right now. I'll keep you posted if I think of anything else.
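To show why a 10% false positive rate matters at search scale, here is a quick base-rate calculation. The prevalence and detection-rate numbers in the usage note are assumptions picked for illustration, not measurements of GPTZero or of the web.

```python
def flagged_precision(prevalence: float,
                      true_positive_rate: float,
                      false_positive_rate: float) -> float:
    """Fraction of flagged pages that really are AI generated (Bayes' rule)."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)
```

With an assumed 5% of pages being AI generated and an assumed 90% detection rate, `flagged_precision(0.05, 0.9, 0.10)` comes out at about 0.32, so roughly two out of three flagged pages would be legitimate sites.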

Vlad · Feb 1, 2024

We have taken a few steps in this direction by working on an LLM content detector. It should soon be available as an API from us for everyone to try.

Orhideous · Mar 6, 2024

It's time to stop the bullshit generators. There are a lot of LLM-regurgitated sites, packed with ads to the brim and tailored to a specific query, no matter who picks them up, Google or Kagi.

I'd like to see some sort of LLM risk score for a site, whether derived from human feedback or not, and the ability to block all results above a certain threshold.
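Something like the following is what I have in mind, as a sketch only. The `llm_risk` field and the 0 to 1 scale are hypothetical; nothing here reflects how Kagi actually represents results.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    llm_risk: float  # hypothetical score: 0.0 = surely human, 1.0 = surely LLM


def filter_by_risk(results: list[SearchResult],
                   max_risk: float) -> list[SearchResult]:
    """Keep only results at or below the user's chosen risk threshold,
    preserving the original ranking order."""
    return [r for r in results if r.llm_risk <= max_risk]
```

The threshold would be a per-user setting, so people who don't mind LLM content can leave it at 1.0 and see everything.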

MightyPork · Nov 12, 2024

Lately I have started noticing really poor-quality results. For example, I searched for "router bits explained" and one of the top results was on "homesthetics.net", which is entirely AI generated and full of misinformation. It's becoming difficult to find trustworthy content now that AI makes it so easy to generate junk content, even with photos scraped from Google Images or the like.