Merged 3 posts from Filter out sites consisting of ai generated content.

    Thibaultmol Interesting post; I gave it a read along with a few of the comments further down. I do kinda agree with the tracking complaint, and I also think I would rather it be less automatic and more deliberate by the community. But I'm glad multiple people are feeling this way.

    Thibaultmol Keep in mind that Kagi already indirectly downranks these sites. A lot of 'ai slop' websites are in it for the money, so they are littered with ads and trackers, and Kagi automatically ranks those sites down.

    Vlad pointed out above that, due to their nature, the specific sites in this thread apparently aren't even full of ads. It's good that we at least have the ad-ridden sites downranked; I didn't know that coming into this thread 🙂

    The linked video goes into how many current mass-market search engines return pages consisting almost entirely of AI slop/marketing for a simple search of "glb file format". Unfortunately, Kagi returns many of the same sites for that search.

    This is of course only a single example; the same is true for many other kinds of technical searches.

    As a person trying to find real information produced by humans on the internet, I would like Kagi to automatically and severely downrank sites that are known to do this; ideally, they would only appear after a break/heading/whatever that warns that all results following it fall into this category.

    1) In the 99% case, the user just benefits from some kind of default-block or severe downrank, and can be more confident that the results they get are "real" and human-written, because Kagi takes the time to find these sites and mark them appropriately, and the software automatically pushes them down (or out completely).

    2) In the case where the user is for some reason interested in these sites (possibly for research into those sites specifically), they can disable this automatic blocking/downranking, perhaps on a per-search basis via a "some known AI slop was excluded from the results, click here to show those sites" link.
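    Putting 1) and 2) together, here is a very rough sketch of the behaviour I have in mind. To be clear, this is purely illustrative: the domain list, penalty size, and flag name are all made up, and this is not how Kagi's ranking actually works.

    ```python
    # Hypothetical known-slop list and penalty -- not real Kagi data or behaviour.
    KNOWN_SLOP_DOMAINS = {"slop.example", "content-mill.example"}
    SLOP_PENALTY = 1000  # big enough to push matches to the bottom of the results

    def rerank(results, include_known_slop=False):
        """results: list of (domain, score) pairs; returns them re-sorted.

        By default, domains on the known-slop list get a heavy penalty (case 1).
        Passing include_known_slop=True disables the penalty for that one search
        (case 2), e.g. via a "show excluded sites" link.
        """
        def adjusted(item):
            domain, score = item
            if not include_known_slop and domain in KNOWN_SLOP_DOMAINS:
                return score - SLOP_PENALTY
            return score

        return sorted(results, key=adjusted, reverse=True)
    ```

    The break/heading idea is then just presentation: render everything above the penalty cut-off normally, and everything below it under a "known AI slop" divider.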

    Merged 1 post from Downrank (or default-block) known AI slop sites.

      Oh hey, there was already a topic! I guess my title didn't bring this up. But yeah, I am glad this is already getting traction, and I do wonder what kinds of things we might do about this. Catching up on this discussion now 🙂

        Vlad What makes our usual ways of dealing with this (penalizing sites with lots of ads on them) not as efficient, is that these results appear to be content marketing for products - and as such do not have ads on them

        These sites "do not have ads on them", because as you say, they are "content marketing for products" -- that is, the entire article is an ad, and (I would bet a large some of money that) usually every search result you could possibly get from these domains is an ad. That's a debatable conclusion, but I come to it because of what I believe about the incentives that drive the creators of these kinds of sites.

        Because of all that, and my understanding of Kagi's motivations/incentives, I personally think blanket downranks/default-blocks of these domains are a good start.

          Is the title of this supposed to say "marketing" rather than "marking"?

          Thibaultmol changed the title to Preventing content marketing in search results.

            Totally spitballing here but I've noticed that many of these sort of spam sites typically use the same repetitive language, often to an inhuman degree. Take this page for example (sourced from Google): https://3d-ace.com/blog/glb-files-about-the-format-and-how-to-use-it-for-your-business/

            This is undoubtedly one of these spam marketing sites: it clearly has a product to sell and goes on (and on and on and on) about glb file formats in annoying marketer-speak. Looking closer at the paragraph and sentence structure, however, a pattern can be seen. Nearly every sentence contains some variation of the phrase "glb file format"! No human speaks in this pattern, and from my uneducated-about-SEO perspective it is probably a keyword-stuffing attempt. I've noticed this language pattern on many other sites like it, and I've gotten good at just tuning it out.

            With this in mind, maybe it would be possible to develop some sort of heuristic classifier to identify this sort of speech pattern? Language stuff like this is something modern AI is pretty decent at (I think). Perhaps a model/algorithm could be put together to identify repetitive, near-verbatim mentions of the term being searched for. Even text about a single topic typically varies up the way in which it says things, which makes this sort of pattern all the more off-putting. Because of the potential for false positives, however, I would downrank or consolidate these pages a la listicles rather than outright block them.
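            To make the idea a bit more concrete, here's a toy sketch of the simplest possible version of such a heuristic: count how many sentences on a page repeat the query phrase nearly verbatim. The function name, sample text, and 0.5 threshold are all just illustrative assumptions on my part; a real classifier would obviously need to be far more robust.

            ```python
            import re

            def keyword_stuffing_score(page_text, query):
                """Fraction of sentences containing the query phrase verbatim
                (case-insensitive). Human writing varies its wording, so a high
                fraction hints at keyword stuffing."""
                sentences = [s for s in re.split(r"[.!?]+\s+", page_text) if s.strip()]
                if not sentences:
                    return 0.0
                query_re = re.compile(re.escape(query), re.IGNORECASE)
                return sum(1 for s in sentences if query_re.search(s)) / len(sentences)

            sample = ("The GLB file format is compact. Businesses love the GLB file format. "
                      "Our experts convert the GLB file format fast.")
            if keyword_stuffing_score(sample, "glb file format") > 0.5:  # threshold is a guess
                print("looks stuffed -> downrank/consolidate, don't hard-block")
            ```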

            As a disclaimer I'm far from experienced in the realm of search and have no idea if this would work, just putting the idea out there based on the patterns I've noticed with how these sorts of spam sites operate.

            Edit to add: Assuming this could be made and is effective enough, it could act as a much more scalable solution than community-sourced reporting and "whack-a-moleing" new spam domains.

              RoxyRoxyRoxy At runtime Kagi typically sees only the snippet, not the entire page (as we do not have a crawl of the entire web). And the snippet for this looks "fine"?

                Vlad Shoot, you're right, I totally took for granted having a complete copy of the page text. I would assume Quick Answer is able to retrieve the entire page text to generate its answers, but doing something similar for every search would probably hammer your servers pretty hard. Maybe it could be done with page text sourced from Quick Answer? It would rely on Quick Answer usage and therefore be slow at building the "bad page" index, but maybe better than nothing. Idk lol, you obviously know the intricacies of how this works much better than me.

                  I also believe it'd be too much effort for everyone to maintain such a list of personalized result rankings by themselves. Would be great as an (optional) lens to narrow down search results. Maybe “Clean Web” (building on the existing “Small Web”)?

                  Holmer's video informed my decision to subscribe to Kagi after playing with the free tier. The "Academic" search option has helped somewhat with more technical searches, but if I look for anything on commercial software like MS Office, it is slop city (or at least content-farm junk). I think Kagi does better than DDG, my prior default, and Google is so far gone I don't bother comparing anymore. Kagi has been a good investment so far.

                  At home I have uBlock Origin and uBlacklist installed on Firefox and Orion (Macbook) which helps me get more out of Kagi. When letsblock.it was active, I benefitted from the community-supplied sloplists for uBO. I found one for uBlacklist and am on the hunt for more. Community-sponsored lists seem to be the way to go but as with so many things, a few people will burn out supporting the vast many.

                  My problem is worse at work. Large company, Windows, locked-down PCs and browser choices, no extensions allowed. I am blocking sites right and left on Kagi but am aware of the 1,000-site limit and the lack of wildcards. For example, I think any site with "geek" in the title is 99.5% likely to be trash, but unless I have uBlacklist, I have no way to block them all.

                  I think we'll wind up having to use AI to detect AI but danged if I know how to implement it.

                  Finally, I would like to compliment @dysiode. I've been online for nearly forty years. That post reminded me of the old Usenet where people wrote careful, detailed posts on their issues along with what they tried and possible options. That is very rare these days.

                  Vlad I wonder if it's not worth automatically crawling pages that show up in the top ~20 search results at least ~10 times.¹ It won't help with the first appearances, but it would hopefully prevent Kagi from showing poor-quality links over time, and it seems like a more strategic way to build an index than just trying to crawl everything.


                  ¹ Of course, a more dynamic, velocity-based approach might be better: keep a dynamically ranked queue of results to crawl based on frequency with a time-decay algo (more time since a page last showed up in search results means lower priority) and let the crawlers pluck jobs off the top as they have resources available.
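
                  For what it's worth, here's a minimal sketch of what I mean by that queue, assuming an exponential time decay; the class, the one-week half-life, and the batch-pop interface are all just illustrative choices on my part.

                  ```python
                  import heapq
                  import math
                  import time

                  DECAY_HALF_LIFE = 7 * 24 * 3600  # assumed half-life of one week

                  class CrawlQueue:
                      """Priority queue of URLs to crawl: appearance count decayed by recency."""

                      def __init__(self):
                          self._stats = {}  # url -> (appearance_count, last_seen_timestamp)

                      def record_appearance(self, url, ts=None):
                          ts = ts if ts is not None else time.time()
                          count, _ = self._stats.get(url, (0, ts))
                          self._stats[url] = (count + 1, ts)

                      def _priority(self, url, now):
                          count, last_seen = self._stats[url]
                          age = now - last_seen
                          return count * math.exp(-math.log(2) * age / DECAY_HALF_LIFE)

                      def pop_batch(self, n):
                          """Hand the n highest-priority URLs to the crawlers."""
                          now = time.time()
                          best = heapq.nlargest(n, self._stats, key=lambda u: self._priority(u, now))
                          for url in best:
                              del self._stats[url]  # re-enters the queue if it shows up again
                          return best
                  ```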

                    a month later

                    On the topic of doing better crawling, the approach outlined in this article sounds intriguing:

                    Crucially, I would not have [a hypothetical FOSS search engine] crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers. Users who want to find, say, a blog post about a subject rather than the documentation on that subject, would have to be more specific: “$subject blog posts”.

                    Sounds like this approach could work for Kagi to cut out AI-generated content slop, since those sites would not show up as "authoritative" or "high-quality" sources. Of course, the hard part would be doing this in a scalable way across the many kinds of technical queries.
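
                    In case it helps picture the idea, a toy sketch of the tiering step follows; the MAX_TIER value and the stand-in fetch_links() link graph are my own assumptions, not anything from the article or from Kagi.

                    ```python
                    from collections import deque

                    MAX_TIER = 3  # the article's "arbitrary N tiers"; 3 is just an example

                    # Toy link graph standing in for a real crawler's fetch step.
                    LINKS = {
                        "docs.example.org": ["blog.example.net"],
                        "blog.example.net": ["forum.example.com"],
                    }

                    def fetch_links(url):
                        return LINKS.get(url, [])

                    def assign_tiers(tier1_urls):
                        """Breadth-first walk: whitelist is tier 1, linked pages get tier N+1."""
                        tier = {url: 1 for url in tier1_urls}
                        queue = deque(tier1_urls)
                        while queue:
                            url = queue.popleft()
                            if tier[url] >= MAX_TIER:
                                continue  # don't expand beyond the tier limit
                            for link in fetch_links(url):
                                if link not in tier:  # keep the lowest (best) tier seen
                                    tier[link] = tier[url] + 1
                                    queue.append(link)
                        return tier  # lower tier -> weighted upwards in ranking

                    print(assign_tiers(["docs.example.org"]))
                    # {'docs.example.org': 1, 'blog.example.net': 2, 'forum.example.com': 3}
                    ```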
