Preventing content marketing in search results

dysiode · 3 Jan

I'm not sure if this is a "bug" or a "suggestion" or what, but it's on my mind and it's frustrating and I don't have a solution so I don't think it's a "feature" I'm asking for.

So I watched Freya Holmer's video about her experience searching for information about the .glb file format from the perspective of wanting to write a parser for the 3D modeling software they're writing. tl;dw - they searched "glb file format" in Google and 80% of the first page was AI generated marketing pages. Bing was worse.

I immediately wanted to see how Kagi handled the situation and the results definitely left me wanting. It's better, but marginally. The side card has a bunch of good information (because it's just Wikipedia), and there's a few more real results (including the random r/godot post), but it's still mostly AI generated marketing.

"glb file format"

To be clear, I understand that "glb file format" is a high level prompt that you'd expect to get introductory results for, but it's an inherently technical topic and the actual spec was the 20th search result.

It's easier to list the non-generated links from the results that might be useful (in the order they appear):

Wikipedia (obviously, but I also have it pinned)
FileFormat
Reddit
Khronos Spec Page
Loaders.gl Docs

That's it. I also get there's always been the largely unhelpful "what is this random file format" pages, but at least those you knew wouldn't be helpful, the "GLB File: A Comprehensive Guide" marketing pages are much harder to know to avoid.

"glb file format spec"

I can see arguing that "if you want to implement the format you want the spec, so search for it." There are definitely more useful results for "glb file format spec" but they're hidden amongst all the same AI generated marketing pages.

The previous links are higher up here, again in order, links omitted for previous entries:

Wikipedia
FileFormat
Khronos Spec Page
Loaders.gl
Spec Github
StackOverflow
Mirror of old spec, maybe
BabylonJS forum post
Reddit
Just Solve the File Format Problem Wiki (My personal favorite hidden gem that should definitely be high up on the more general result)

I think mostly I expected to have real world information and not a wall of marketing slop. I get that it's a hard problem to solve, and being able to adjust rankings is a great tool in helping make search more useful, but it's unreasonable to expect each individual to derank/block each site that markets using AI.

One thing Freya notes in their video though is the lack of any author in the "blog" posts. Some of that comes down to "corporate marketing blog post" type stuff, but you still see authors listed in older pages, even if it's just "company team."

I had the thought "I guess we'll just have to assume anything published after 2022 is AI slop" and tried filtering by date (the "date" input in the advanced search dialog is misleading, btw, "before date" is really "start date" even though it sounds like it should be "end date") and I actually had some surprising results!

The "correct" date range (to 2022)

https://kagi.com/search?q=glb+file+format&to_date=2022-01-01

This is maybe it's own bug but... all the results are undated, and still have the same AI marketing pages that showed up in the unrestricted search results.

Fully bounded ranges

(1990 - 2022)

https://kagi.com/search?q=glb+file+format&from_date=1990-01-01&to_date=2022-01-01

These results have a lot more forum posts, which are going to be more all over the place, but at least humans explaining things.

(1900 - 2022)

https://kagi.com/search?q=glb+file+format&from_date=1900-01-01&to_date=2022-01-01

The difference between the 1990 results surprised me, it feels a lot more like old Google to me. It's much less forum driven and more company blog focused, but they're obviously human written and tend to have something useful about them, and at least often provide an actual concise meaningful overview of what a glb file is.

I think a medium between the two would be nice, but honestly, I feel a lot more comfortable in the 1900 results. They provide enough of an introduction to what the .glb format is about without going too off the rails like the 1990 results do.

Unfortunately it loses out on having access to Wikipedia and still useful but recent results (like the Just Solve the File Format Problem wiki which was recently updated putting it in 2024)

I hope that provides the shape of my frustrations and hopes and why I feel like it's a "bug" in the search space.

Vlad · 3 Jan

dysiode Thanks for the thoughtful comment!

We have all watched this video this morning. Kagi does better as one would expect but there is room for improvement. Note that very few results are actually AI generated - it is just good old content marketing spam. What makes our usual ways of dealing with this (penalizing sites with lots of ads on them) not as efficient, is that these results appear to be content marketing for products - and as such do not have ads on them.

So I think this is a case of rather this one (type of) query doing overall badly, by no means is Kagi overwhelmed with AI slop.

We obviously have keen interest in this topic and are also welcoming community input. Will sticky this for visibility.

BioTurboNick · 3 Jan

I think more niche queries are likely to be vulnerable, but they're also where search engines become more important tools. I'm sure content marketing and AI are merging into one giant blob of awful.

If forum posts, Wikipedia, Reddit, Stackoverflow etc. can be consistently ranked above these kinds of content-filled sites in most cases, that might be a good start?

Cctoliver94 · 3 Jan

This is so wild- I'm legitimately here because I just watched the same video and wanted to see if Kagi could do something about this as well. I think at the end of the day, there will be no "solution" against this, but I think there are ways it can be mitigated.

My thought was, what if a "moderation lists" feature like what Bluesky has was implemented? Or something like that- something community built. Maybe there could be a few community-contributed lists that are specifically for things like the AI slop sites. At that point they could even work for other content people may want to avoid. Contributing to the list (like, voting a website as bad) could lower the result of that website in the list for everyone who uses it.

I think these websites try and cover so many bases that it would actually make this method effective. It wouldn't be reliant on everyone searching the same thing. Hit from all sides in a way.

Rrobin856 · 3 Jan

Many websites are now just generated text and contain nothing of value. While it’s not easy to automatically detect these it would be great if users could report them and an option exposed to filter out these sites.

Thibaultmol · 3 Jan

I think this is closely related to https://kagifeedback.org/d/4502-opt-in-collaborative-shared-result-ranking
Keep in mind that Kagi already indirectly downranks these sites. A lot of 'ai slop' websites are in it for the money so are littered by ads and trackers, so Kagi automatically ranks those sites down.

Cctoliver94 · 5 Jan

Thibaultmol Interesting post, gave it a read and few comments down. I do kinda agree with the tracking complaint and I also think I would rather it be less automatic and more deliberate by the community. But I'm glad multiple people are feeling this way.

Thibaultmol Keep in mind that Kagi already indirectly downranks these sites. A lot of 'ai slop' websites are in it for the money so are littered by ads and trackers, so Kagi automatically ranks those sites down.

Vlad above pointed out that apparently these specific sites in this thread are sadly not even full of ads due to their nature. It's good that we at least have the ad-ridden sites downranked. Didn't know that coming into this thread

rrwoods · 7 Jan

goes into how many current mass-market search engines return pages consisting of almost entirely AI slop/marketing for a simple search of "glb file format". Unfortunately Kagi returns many of the same sites for that search.

This is of course only a single example; the same is true for many other kinds of technical searches.

As a person trying to find real information produced by humans on the internet, I would like Kagi to automatically severely downrank sites that are known to do this; at best, they would ideally only appear after a break/heading/whatever that warns that all results following it are in this category.

1) In the 99% case, the user just benefits from some kind of default-block or severe downrank, and can be more confident that the results they get are "real" and human-written, because Kagi takes the time to find these sites and mark them appropriately, and the software automatically pushes them down (or out completely).

2) In the case where the user is for some reason interested in these sites (possibly for research into those sites specifically), they can disable this automatic blocking/downranking, perhaps on a per-search basis via a "some known AI slop was excluded from the results, click here to show those sites" link.

rrwoods · 7 Jan

Oh hey, there was already a topic! I guess my title didn't bring this up. But yeah, I am glad this is already getting traction, and I do wonder what kinds of things we might do about this. Catching up on this discussion now

rrwoods · 7 Jan

Vlad What makes our usual ways of dealing with this (penalizing sites with lots of ads on them) not as efficient, is that these results appear to be content marketing for products - and as such do not have ads on them

These sites "do not have ads on them", because as you say, they are "content marketing for products" -- that is, the entire article is an ad, and (I would bet a large some of money that) usually every search result you could possibly get from these domains is an ad. That's a debatable conclusion, but I come to it because of what I believe about the incentives that drive the creators of these kinds of sites.

Because of all that, and my understanding of Kagi's motivations/incentives, I think blanket downranks/default-blocks of these domains is a good start, personally.

rrwoods · 8 Jan

Is the title of this supposed to say "marketing" rather than "marking"?

RoxyRoxyRoxy · 10 Jan

Totally spitballing here but I've noticed that many of these sort of spam sites typically use the same repetitive language, often to an inhuman degree. Take this page for example (sourced from Google): https://3d-ace.com/blog/glb-files-about-the-format-and-how-to-use-it-for-your-business/

This is undoubtedly one of these spam marketing sites, it clearly has a product to sell and and goes on (and on and on and on) about glb file formats in annoying marketer-speak. Looking closer at the paragraph and sentence structure however, a pattern can be seen. Nearly every sentence contains the some variation of the phrase "glb file format"! No human speaks in this pattern and from my uneducated-about-SEO perspective is probably a keyword-stuffing attempt. I've noticed this language pattern on many other sites like it and I've gotten good at just tuning it out.

With this in mind, maybe it would be possible to develop some sort of heuristic classifier to identify this sort of speech pattern? Language stuff like this is something modern AI is pretty decent at (I think). Perhaps a model/algorithm could be put together to identify repetitive, near verbatim mentions of the term being searched for. Typically even text about a topic varies up the way in which it says things, which makes this sort of pattern all the more off-putting. Just because of the potential for false positives however, I would downrank or consolidate these pages a la listicles, rather than outright block them.

As a disclaimer I'm far from experienced in the realm of search and have no idea if this would work, just putting the idea out there based on the patterns I've noticed with how these sorts of spam sites operate.

Edit to add: Assuming this could be made and is effective enough, it could act as a much more scalable solution than community sourced reporting and "whack-a-moleing" new spam domains.