
Steps to reproduce:
I searched for "typescript query builder":
https://kagi.com/search?q=typescript+query+builder&r=ie&sh=f7ZezEwOUqrTG0cCB1tlXg

For me, 2 of the top 3 results are exactly the same:

GitHub doesn't care about capitalization in its URLs. Interestingly, it declares the canonical URL in the HTML head: <link rel="canonical" href="https://github.com/Ff00ff/mammoth" data-turbo-transient>

Expected behavior:
Kagi Indexer / Search engine should deduplicate such entries by either:

  • comparing site content between URLs that are identical when lowercased,
  • or respecting the canonical link element (NOTE: maybe it does now; the screenshot shows that the result was indexed in 2018)
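
The two options above could be combined roughly as follows. This is only a sketch, not how Kagi works: the `SearchResult` shape, the `CASE_INSENSITIVE_HOSTS` set, and the lowercased `ff00ff` variant of the URL are all assumptions made for illustration. It skips the content comparison and simply trusts a known-case-insensitive host list instead.

```typescript
// Hypothetical result shape; the real data model is not described in this thread.
interface SearchResult {
  url: string;
  title: string;
  canonicalUrl?: string; // value of <link rel="canonical">, when known
}

// Assumption: hosts whose paths are known to be case-insensitive.
const CASE_INSENSITIVE_HOSTS = new Set<string>(["github.com"]);

// Build a comparison key: prefer the canonical URL, else the result URL.
// The path is lowercased only for hosts on the list above, because
// paths are case-sensitive in general.
function dedupKey(result: SearchResult): string {
  const parsed = new URL(result.canonicalUrl ?? result.url);
  const path = parsed.pathname.replace(/\/+$/, ""); // ignore trailing slashes
  const key = parsed.host + path + parsed.search;
  return CASE_INSENSITIVE_HOSTS.has(parsed.host) ? key.toLowerCase() : key;
}

// Keep the first result for each key and drop later duplicates.
function dedupe(results: SearchResult[]): SearchResult[] {
  const seen = new Set<string>();
  return results.filter((r) => {
    const key = dedupKey(r);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const demoResults: SearchResult[] = [
  { url: "https://github.com/Ff00ff/mammoth", title: "mammoth" },
  { url: "https://github.com/ff00ff/mammoth", title: "mammoth" }, // assumed variant
];
console.log(dedupe(demoResults).length); // 1
```

Keeping the first occurrence preserves the ranking order; choosing the canonical spelling as the surviving entry would need the canonical URL to be available for at least one of the duplicates.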

Debug info:
Browser: Safari for macOS
Region: Ireland

    7 months later

    I found a case where there are two identical search results back to back at the top of the search results list. Seems like this should be easy enough to filter out.

    Searched for "x1c screen show more than 8 files print cache" with no lenses, but I probably have Reddit results slightly preferred on the customization.

    • z64 replied to this.

      coreyward The issue is that the links have different query parameters, which are not shown with your current display options. From Kagi's POV, we don't know which query parameters are safe to remove or deduplicate, so this is hard to solve in the general case: we could easily break some results if we merge them incorrectly.

        Merged 2 posts from Duplicate search results.

          z64 Saw that, but I assumed Reddit set canonical tags. It seems they're all-in on web components and use a "shreddit-canonical-url-updater" tag to set this instead, which doesn't seem standards-compatible at all. Ugh.

          • z64 replied to this.

            coreyward We are also not crawling the results ourselves; we only receive basic structured data about the site, usually little more than what you see on your page above. Crawling on demand just to resolve this would incur significant latency penalties.

            So, for now we rely on user reports to figure out what exceptions are worth our time to manually implement rules for, basically.
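
A manually maintained rule of the kind described above might look like the sketch below. It is an illustration under assumptions, not Kagi's implementation: the host `forum.example.com`, the parameter names `ref` and `share_id`, and the helpers `normalizeUrl` and `sameResult` are all hypothetical. Hosts without a rule are left alone, which reflects the concern about merging results incorrectly.

```typescript
// Hypothetical rule table: host -> query parameters assumed safe to ignore.
// Each entry would be added by hand in response to a user report.
const IGNORABLE_PARAMS: Record<string, string[]> = {
  "forum.example.com": ["ref", "share_id"],
};

// Strip only the parameters listed for this host; hosts without a rule
// keep every parameter, since dropping an unknown one could change the page.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  const ignorable = IGNORABLE_PARAMS[url.hostname] ?? [];
  for (const name of ignorable) {
    url.searchParams.delete(name);
  }
  return url.toString();
}

// Two results count as duplicates when their normalized URLs match.
function sameResult(a: string, b: string): boolean {
  return normalizeUrl(a) === normalizeUrl(b);
}

console.log(
  sameResult(
    "https://forum.example.com/t/123?ref=home",
    "https://forum.example.com/t/123?share_id=abc",
  ),
); // true
```

This works on the URL string alone, so it needs no crawling and adds no request latency; the cost is that every host needs its own entry.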

              2 months later

              Another example of a duplicate. Looking at the canonical meta tag, the lower-case version should be primary.
