Filter out or mark punycode domains

nyjan

Kagi currently serves punycode domains in the search results like any other domain. This enables attackers to impersonate well-known domains.

Here's an example:

The query to find this result in the results list is "punycode demo".

Kagi decodes and formats the domain name correctly, including punycode characters, which is not always helpful in this special case.

I have the following suggestions for how one might go about resolving this issue:

Allow users to filter out all punycode domains via a dedicated toggle in the settings. This would only require filtering out all domains that contain a label beginning with "xn--".
Since internationalized domain names might include punycode for legit reasons (like having regional characters in the name), users might prefer not to filter out all such domains. In this case, a search result should display both the decoded and undecoded domain name:

bücher.tld (xn--bcher-kva.tld)

or even better:

xn--bcher-kva.tld (bücher.tld)

Punycode domains that include both Latin characters and characters commonly (mis-)used for IDN homograph attacks could be marked with an exclamation mark icon or something similar to alert users of possible deception taking place.
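
To make the three suggestions above a bit more concrete, here is a rough Python sketch. It is only an illustration, not how Kagi works internally: all function names are made up, the script check is a crude heuristic based on Unicode character names, and the set of "suspicious" scripts (Cyrillic and Greek) is my own assumption.

```python
import unicodedata

ACE_PREFIX = "xn--"  # prefix that marks a punycode-encoded label
# Assumption: scripts often abused for Latin lookalikes; not an exhaustive list.
SUSPICIOUS_SCRIPTS = {"CYRILLIC", "GREEK"}


def is_punycode_domain(domain: str) -> bool:
    """True if any label of the domain is punycode-encoded (suggestion 1)."""
    return any(label.lower().startswith(ACE_PREFIX) for label in domain.split("."))


def decode_label(label: str) -> str:
    """Decode one label with Python's built-in punycode codec."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return label  # malformed punycode: leave the label untouched


def decode_domain(domain: str) -> str:
    return ".".join(decode_label(label) for label in domain.split("."))


def display_name(domain: str) -> str:
    """Show the undecoded name first, decoded name in parentheses (suggestion 2)."""
    decoded = decode_domain(domain)
    return domain if decoded == domain else f"{domain} ({decoded})"


def looks_mixed_script(domain: str) -> bool:
    """Flag labels mixing Latin letters with a suspicious script (suggestion 3)."""
    for label in decode_domain(domain).split("."):
        # The first word of a character's Unicode name is used as a rough script tag.
        scripts = {unicodedata.name(ch, "").split(" ")[0] for ch in label if ch.isalpha()}
        if "LATIN" in scripts and scripts & SUSPICIOUS_SCRIPTS:
            return True
    return False


print(display_name("xn--bcher-kva.tld"))  # xn--bcher-kva.tld (bücher.tld)
```

Note that a check like this would not catch a name written entirely in lookalike characters of a single script, so it could only ever be one signal among several.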

Of course, these are only the ideas I came up with in a few minutes; maybe there are better ways to handle this.
The domains in question should also rarely show up in search results because of their low relevance, but still: I think this is an issue for less experienced users.

Vlad

nyjan Do you have any examples of how other search engines handle this?

nyjan

Vlad

Sure, here are some examples:

Google/Bing/Startpage: Currently have the same behavior as Kagi (no special handling)

This is also the behavior of Ecosia and Qwant, but I didn't want to include more screenshots.
DuckDuckGo: Does not decode punycode
Brave Search: Does not decode punycode and seemingly heavily downranks punycode domains, or at least some of them (I could only get an example result by searching for the undecoded domain name; it would not show up for any organic query I could come up with)
SearX(NG): Uses the format provided by the underlying search engine (amusingly, on Qwant's own website the name is decoded, in contrast to their results in aggregators, which are undecoded)
Nona: Decodes the domain name but seemingly uses a font where homograph attacks are easier to perceive. I do not know if this is intentional or a happy accident.

blosh

Would it perhaps be feasible to make a translator (using a dictionary of homoglyphs and their corresponding Latin characters)?

If a punycode domain translates to another known non-punycode domain with significantly higher rating, it would trigger a notice for the user.
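
Something along these lines, as a very rough sketch: the homoglyph table below is a tiny hand-picked sample, and the `ratings` dictionary, the `min_ratio` threshold and the function names are all hypothetical stand-ins for whatever ranking data the search engine actually has.

```python
from typing import Optional

# Tiny illustrative sample of lookalike characters; a real table would be far larger.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
    "\u0443": "y",  # Cyrillic у
    "\u0456": "i",  # Cyrillic і
    "\u03bf": "o",  # Greek ο (omicron)
}


def to_latin_skeleton(decoded_domain: str) -> str:
    """Replace every known homoglyph with its Latin counterpart."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in decoded_domain.lower())


def impersonation_target(
    decoded_domain: str, ratings: dict, min_ratio: float = 10.0
) -> Optional[str]:
    """Return the known domain this one imitates, or None if there is no match.

    `ratings` maps domain names to a hypothetical relevance score.
    """
    name = decoded_domain.lower()
    skeleton = to_latin_skeleton(name)
    if skeleton == name:
        return None  # no homoglyphs were translated, so nothing is being imitated
    target_rating = ratings.get(skeleton)
    if target_rating is None:
        return None  # the translated name is not a known domain
    own_rating = ratings.get(name, 0.0)
    # Only warn when the imitated domain is rated significantly higher.
    return skeleton if target_rating > own_rating * min_ratio else None


ratings = {"apple.com": 100.0}
print(impersonation_target("\u0430pple.com", ratings))  # apple.com
```

The input here is the already decoded (Unicode) domain name, so the punycode decoding step would have to happen first.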

warpspin

Somewhat on-topic, the algorithm Chromium uses to handle the problem: https://chromium.googlesource.com/chromium/src.git/+/master/docs/idn.md