
Trying to find information worth reading is very difficult right now, given the number of worthless AI-generated ad traps. A way to report sites as AI-generated, and the ability to hide or flag them in search results, would be amazing. If this could be done automatically, even better.

You make a search for a specific topic, say "best search engines 2024", and see a load of websites offering what look, at first glance, like interesting long-read articles. You open one of these sites and it is immediately obvious that no human was behind it: it is just boring ChatGPT output designed to get clicks and farm ad views. Overall, a massive waste of time. You could then return to the search results and, in the options menu for that result, click something like "this looks like AI-generated crap". If the site is flagged enough times, or after some sort of confirmation on the search engine's side, it is either hidden from results, ranked lower, or shown with a warning banner. This would save everyone's precious time and greatly increase the quality of search results.

    Hiding domains for yourself is already possible, right? Maybe there are some lists of AI spam domains on the Internet, maybe on GitHub

    Deamonized and if the site is flagged enough times

    I see this immediately getting abused.

      BeNice Hiding domains for yourself is already possible, right?

      Yeah, but that doesn't help other people get better search results, which seems like a worthy goal to me.

        Magess A domain you want to block might seem useful to others.

          I think it is a good idea to make it easy to report sites for review by humans at Kagi, perhaps with a short list of reasons to pick from (e.g. spam or malware). Integrating it into the search results would remove the friction of loading up kagifeedback.org, signing in, and submitting a bug report for every spam site.

          • Vlad replied to this.

            I see the use case, but ultimately, this is hard to get right. I do not want Kagi or anyone to meddle with my search results. Let me do this myself.

            Option: something like uBlacklist for Kagi (see GitHub, there is already a ticket)

            mhersh We do not have humans at Kagi who can do this (most humans here write code). So any system has to be automated.

              Magess Correct, it can lead to abuse: flagging your competitors' websites into oblivion, for example. So it would require manual intervention, for which we do not have the resources.

              If a manual solution is off the table:

              • These AI generated sites often look similar to one another. You might be able to make a heuristic that checks for telltale styling and ad placement.
              • These sites generally lack inline screenshots, pictures, videos, or other rich content.
              • These sites are often re-hashings of official documentation, for example pydoc. It could be helpful to track official sources of documentation for libraries, and promote those over copycat sites.
              • These sites are often rife with ads. It could help to demote results that rely heavily on ads.
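              As a rough illustration of the last point, here's a minimal Python sketch of an ad-density heuristic. The marker list and threshold are made up for illustration; a real ranker would work on the rendered page and use a far larger, maintained list of ad-network signatures.

```python
import re

# Hypothetical markers that commonly appear in ad-network markup.
AD_MARKERS = [
    r"adsbygoogle",        # AdSense script name
    r"doubleclick\.net",   # common ad-serving domain
    r"data-ad-slot",       # AdSense slot attribute
]

def ad_density(html: str) -> float:
    """Ad markers per kilobyte of visible text; higher suggests an ad farm."""
    hits = sum(len(re.findall(m, html, re.IGNORECASE)) for m in AD_MARKERS)
    text = re.sub(r"<[^>]+>", " ", html)  # strip tags, keep visible text
    kb = max(len(text) / 1024, 0.001)     # avoid division by zero
    return hits / kb

def demote(results, threshold: float = 5.0):
    """Stable-sort results so pages above the threshold sink to the bottom."""
    return sorted(results, key=lambda r: ad_density(r["html"]) > threshold)
```

The point of dividing by text length is that a long, genuinely useful article with a couple of ad units scores low, while a thin page wrapped in ad markup scores high.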

              Historically, the best way to detect AI generated content was to calculate the document's perplexity. However, this is expensive now that there are dozens of models that can generate content, and some of them require a license to use.
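              To make the perplexity idea concrete, here's a toy sketch using a unigram word model with Laplace smoothing. Real detectors score documents with an LLM's token probabilities (which is exactly where the cost comes from); the metric is the same, exp of the negative mean log-probability.

```python
import math
from collections import Counter

def unigram_perplexity(train_text: str, test_text: str) -> float:
    """Perplexity of test_text under a unigram model fit on train_text.
    Laplace (add-one) smoothing keeps unseen words from zeroing the probability."""
    train = train_text.lower().split()
    test = test_text.lower().split()
    counts = Counter(train)
    vocab = len(counts) + 1  # one extra slot for unseen words
    total = len(train)
    log_prob = 0.0
    for w in test:
        p = (counts[w] + 1) / (total + vocab)  # smoothed word probability
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test))
```

Text that looks like what the model expects gets low perplexity; surprising text gets high perplexity. The detection intuition is the mirror image: text an LLM finds *too* unsurprising was likely written by one.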

              I'd love to hear others' thoughts as well.

                steventrouble Do you know of a website/service/library that can reliably detect AI-generated content (by such means)?
                As far as I know, such detectors are not very good and tend to erroneously flag too many valid websites.

                  10 days later

                  There's always GPTZero, but I doubt that will work at your scale and it'd probably be too expensive. It has about a 10% false positive rate.

                  Given you don't want to let users flag domains, those spam detection steps I mentioned above are the best I can think of right now. I'll keep you posted if I think of anything else.

                    9 days later

                    We have taken a few steps in this direction by working on an LLM content detector. It will soon be available as an API from us for everyone to try.

                    a month later

                    It's time to stop the bullshit generators. There are a lot of LLM-regurgitated sites, packed to the brim with ads and tailored to specific queries, no matter which engine picks them up, Google or Kagi.

                    I'd like to see some sort of LLM risk score for a site, whether derived from human feedback or not, and the ability to block all results above a certain threshold.
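                    A client-side version of that threshold could be very simple. The scores and domains below are hypothetical; in practice they would come from whatever detector API the engine exposes.

```python
# Hypothetical per-domain LLM risk scores in [0, 1]; in practice these
# would be fetched from a detector API, not hard-coded.
RISK_SCORES = {
    "spamfarm.example": 0.93,
    "docs.python.org": 0.02,
}

def filter_results(results, max_risk: float = 0.5):
    """Drop results whose domain scores above the user's risk threshold.
    Unknown domains default to 0.0 (i.e. they are kept)."""
    return [r for r in results
            if RISK_SCORES.get(r["domain"], 0.0) <= max_risk]
```

Exposing `max_risk` as a user setting would let cautious users block aggressively while others keep everything and just see the score.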

                      8 months later

                      Lately I have started noticing really poor-quality results. For example, I searched for "router bits explained" and one of the top results was on "homesthetics.net", which is entirely AI-generated and full of misinformation. It's becoming difficult to find trustworthy content now that AI has made junk content so easy to generate at scale, often even with photos scraped from Google Images or similar.
