
dysiode Thanks for the thoughtful comment!

We have all watched this video this morning. Kagi does better, as one would expect, but there is room for improvement. Note that very few results are actually AI generated - it is just good old content marketing spam. What makes our usual ways of dealing with this (penalizing sites with lots of ads on them) less effective is that these results appear to be content marketing for products - and as such do not have ads on them.

So I think this is rather a case of this one (type of) query doing badly overall; by no means is Kagi overwhelmed with AI slop.

We obviously have a keen interest in this topic and welcome community input. Will sticky this for visibility.

    Vlad changed the title to Preventing content marking in search results.
      Vlad stickied the discussion.

        I think more niche queries are likely to be vulnerable, but they're also where search engines become more important tools. I'm sure content marketing and AI are merging into one giant blob of awful.

        If forum posts, Wikipedia, Reddit, Stackoverflow etc. can be consistently ranked above these kinds of content-filled sites in most cases, that might be a good start?
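
        Purely to make "consistently ranked above" concrete, here's a minimal sketch of a trusted-source boost. The domain set, boost factor, and function name are all invented for illustration - this is not how Kagi actually ranks:

        ```python
        # Hypothetical sketch: boost a hand-picked set of community/reference
        # domains so they sort above content-farm pages with similar scores.
        TRUSTED = {"en.wikipedia.org", "reddit.com", "stackoverflow.com"}

        def boost_trusted(results: list[tuple[str, float]]) -> list[tuple[str, float]]:
            """results: (domain, relevance score) pairs, returned re-sorted."""
            boosted = [(d, s * (1.5 if d in TRUSTED else 1.0)) for d, s in results]
            return sorted(boosted, key=lambda x: x[1], reverse=True)

        print(boost_trusted([("3d-ace.com", 0.9), ("stackoverflow.com", 0.7)]))
        # [('stackoverflow.com', 1.05), ('3d-ace.com', 0.9)]
        ```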

        This is so wild: I'm legitimately here because I just watched the same video and wanted to see if Kagi could do something about this as well. I think at the end of the day, there will be no "solution" to this, but I think there are ways it can be mitigated.

        My thought was, what if a "moderation lists" feature like Bluesky's was implemented? Or something like that: something community-built. Maybe there could be a few community-contributed lists that are specifically for things like the AI slop sites. At that point they could even work for other content people may want to avoid. Contributing to the list (like voting a website as bad) could lower that website's ranking for everyone who uses the list.

        I think these websites try to cover so many bases that this method would actually be effective. It wouldn't be reliant on everyone searching for the same thing. Hit from all sides, in a way.
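
        To sketch how the voting part could feed into ranking (every name and weight here is made up for illustration, not anything Kagi does):

        ```python
        from collections import defaultdict

        class CommunityList:
            """Hypothetical community-maintained list of content-marketing domains."""
            def __init__(self):
                self.votes = defaultdict(int)  # domain -> number of "bad" votes

            def report(self, domain: str) -> None:
                self.votes[domain] += 1

            def penalty(self, domain: str, per_vote: float = 0.05, cap: float = 0.5) -> float:
                # Cap the penalty so a burst of malicious votes can't bury a site.
                return min(self.votes[domain] * per_vote, cap)

        def rerank(results: list[tuple[str, float]], cl: CommunityList):
            """results: (domain, score) pairs; apply penalties and re-sort."""
            adjusted = [(d, s * (1.0 - cl.penalty(d))) for d, s in results]
            return sorted(adjusted, key=lambda x: x[1], reverse=True)

        cl = CommunityList()
        for _ in range(8):                      # eight users vote a site as bad
            cl.report("example-spam-blog.com")
        print(rerank([("example-spam-blog.com", 0.9), ("en.wikipedia.org", 0.6)], cl))
        # [('en.wikipedia.org', 0.6), ('example-spam-blog.com', 0.54)]
        ```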

        Many websites are now just generated text and contain nothing of value. While it’s not easy to detect these automatically, it would be great if users could report them and have an option exposed to filter out these sites.

        robin856 changed the title to Filter out sites consisting of ai generated content.
          Merged 3 posts from Filter out sites consisting of ai generated content.

            Thibaultmol Interesting post; gave it a read, and a few of the comments down too. I do kinda agree with the tracking complaint, and I also think I would rather it be less automatic and more deliberate by the community. But I'm glad multiple people are feeling this way.

            Thibaultmol Keep in mind that Kagi already indirectly downranks these sites. A lot of 'ai slop' websites are in it for the money, so they are littered with ads and trackers, and Kagi automatically ranks those sites down.

            Vlad pointed out above that these specific sites in this thread are sadly not even full of ads, due to their nature. It's good that we at least have the ad-ridden sites downranked. Didn't know that coming into this thread 🙂

            The video goes into how many current mass-market search engines return pages consisting almost entirely of AI slop/marketing for a simple search of "glb file format". Unfortunately, Kagi returns many of the same sites for that search.

            This is of course only a single example; the same is true for many other kinds of technical searches.

            As a person trying to find real information produced by humans on the internet, I would like Kagi to automatically and severely downrank sites that are known to do this; ideally, they would only appear after a break/heading/whatever that warns that all results following it are in this category.

            1) In the 99% case, the user just benefits from some kind of default-block or severe downrank, and can be more confident that the results they get are "real" and human-written, because Kagi takes the time to find these sites and mark them appropriately, and the software automatically pushes them down (or out completely).

            2) In the case where the user is for some reason interested in these sites (possibly for research into those sites specifically), they can disable this automatic blocking/downranking, perhaps on a per-search basis via a "some known AI slop was excluded from the results, click here to show those sites" link.
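
            A tiny sketch of what those two modes could look like; the set name, the flag, and the divider text are all hypothetical:

            ```python
            # Hypothetical curated set of known content-marketing / AI slop domains.
            KNOWN_SLOP = {"example-content-farm.com"}

            def filter_results(results: list[str], show_excluded: bool = False) -> list[str]:
                kept = [d for d in results if d not in KNOWN_SLOP]
                dropped = [d for d in results if d in KNOWN_SLOP]
                if show_excluded and dropped:
                    # Case 2: per-search override - the excluded sites reappear, but
                    # only after a visible break that labels everything below it.
                    return kept + ["--- known AI slop / content marketing below ---"] + dropped
                # Case 1 (the 99% case): slop is pushed out entirely by default.
                return kept
            ```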

            Merged 1 post from Downrank (or default-block) known AI slop sites.
              Thibaultmol marked this discussion as a potential duplicate. A moderator will investigate shortly.

                Oh hey, there was already a topic! I guess my title didn't bring this up. But yeah, I am glad this is already getting traction, and I do wonder what kinds of things we might do about this. Catching up on this discussion now πŸ™‚

                  Vlad What makes our usual ways of dealing with this (penalizing sites with lots of ads on them) less effective is that these results appear to be content marketing for products - and as such do not have ads on them

                  These sites "do not have ads on them", because as you say, they are "content marketing for products" -- that is, the entire article is an ad, and (I would bet a large some of money that) usually every search result you could possibly get from these domains is an ad. That's a debatable conclusion, but I come to it because of what I believe about the incentives that drive the creators of these kinds of sites.

                  Because of all that, and my understanding of Kagi's motivations/incentives, I think blanket downranks/default-blocks of these domains are a good start, personally.

                    Is the title of this supposed to say "marketing" rather than "marking"?

                    Thibaultmol changed the title to Preventing content marketing in search results.

                      Totally spitballing here, but I've noticed that many of these sorts of spam sites typically use the same repetitive language, often to an inhuman degree. Take this page for example (sourced from Google): https://3d-ace.com/blog/glb-files-about-the-format-and-how-to-use-it-for-your-business/

                      This is undoubtedly one of these spam marketing sites; it clearly has a product to sell and goes on (and on and on and on) about glb file formats in annoying marketer-speak. Looking closer at the paragraph and sentence structure, however, a pattern can be seen. Nearly every sentence contains some variation of the phrase "glb file format"! No human speaks in this pattern, and from my uneducated-about-SEO perspective it is probably a keyword-stuffing attempt. I've noticed this language pattern on many other sites like it, and I've gotten good at just tuning it out.

                      With this in mind, maybe it would be possible to develop some sort of heuristic classifier to identify this sort of speech pattern? Language stuff like this is something modern AI is pretty decent at (I think). Perhaps a model/algorithm could be put together to identify repetitive, near-verbatim mentions of the term being searched for. Typically even text about a topic varies up the way in which it says things, which makes this sort of pattern all the more off-putting. Because of the potential for false positives, however, I would downrank or consolidate these pages a la listicles, rather than outright block them.

                      As a disclaimer, I'm far from experienced in the realm of search and have no idea if this would work; just putting the idea out there based on the patterns I've noticed with how these sorts of spam sites operate.

                      Edit to add: Assuming this could be made and is effective enough, it could act as a much more scalable solution than community-sourced reporting and "whack-a-moleing" new spam domains.
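
                      For what it's worth, the crudest version of that heuristic - the fraction of sentences repeating the query phrase near-verbatim - fits in a few lines. This is a naive sketch with an invented cutoff, not a claim about how a real classifier would work:

                      ```python
                      import re

                      def stuffing_score(text: str, phrase: str) -> float:
                          """Fraction of sentences that contain the phrase near-verbatim.
                          Human prose rarely repeats an exact phrase in most sentences;
                          keyword-stuffed marketing pages often do."""
                          sentences = [s for s in re.split(r"[.!?]+\s*", text.lower()) if s.strip()]
                          if not sentences:
                              return 0.0
                          pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
                          return sum(1 for s in sentences if pattern.search(s)) / len(sentences)

                      # Invented cutoff: downrank (never block) above ~0.5, because
                      # legitimate reference pages can also repeat a term heavily.
                      spam = ("The glb file format is great. Use the glb file format today. "
                              "Businesses love the glb file format.")
                      print(stuffing_score(spam, "glb file format"))  # 1.0
                      ```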

                RoxyRoxyRoxy At runtime Kagi typically sees only the snippet - not the entire page (as we do not have a crawl of the entire web). And the snippet for this looks "fine"?
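
                Which would explain why a sentence-repetition heuristic like the one suggested above wouldn't fire: on a one- or two-sentence snippet the stuffing signal mostly vanishes. A self-contained toy example (the snippet text is made up):

                ```python
                import re

                def phrase_sentence_ratio(text: str, phrase: str) -> float:
                    sents = [s for s in re.split(r"[.!?]+\s*", text.lower()) if s.strip()]
                    p = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
                    return sum(1 for s in sents if p.search(s)) / max(len(sents), 1)

                # A full spam page repeats "glb file format" in nearly every sentence,
                # but the runtime snippet might be just this - and it scores low:
                snippet = ("GLB is a binary file format for 3D scenes. "
                           "It bundles geometry, textures, and animations into one file.")
                print(phrase_sentence_ratio(snippet, "glb file format"))  # 0.0 - looks "fine"
                ```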