Support unicode urls filtering #3450

mbaruh · 2025-12-12T21:19:07Z

No description provided.

bot/exts/filtering/_filters/domain.py

decorator-factory · 2025-12-13T01:34:34Z

Unquoting the entire URL can change the meaning of the URL quite severely.
For example, http://example.com%2F:%2F@malicious-site.example.org/ is a URL whose host is malicious-site.example.org, with the username example.com/ and the password /. But if you just unquote the entire thing, you get http://example.com/:/@malicious-site.example.org/, which has the hostname example.com.

Another example: https://example.com/apple/banana/%2e%2e%2f%2e%2e%2fcherry semantically has the path /apple/banana/../../cherry, but if you unquote it, it will have the path /cherry. (This example isn't necessarily relevant here, because we are only filtering the domain.)
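
The userinfo example above can be checked with the standard library. This is only a minimal sketch: it uses urllib.parse rather than whatever parser the filter ends up using, and the variable names are made up for illustration.

```python
from urllib.parse import unquote, urlsplit

# URL from the example: the "/" characters in the userinfo are percent-encoded.
encoded = "http://example.com%2F:%2F@malicious-site.example.org/"

# Parse first: the host is whatever follows the "@".
before = urlsplit(encoded)

# Unquote first, then parse: the decoded "/" now ends the authority early.
after = urlsplit(unquote(encoded))

print(before.hostname, before.username, before.password)
print(after.hostname, after.path)
```

Parsing before decoding keeps malicious-site.example.org as the host, while decoding before parsing makes example.com look like the host and pushes everything else into the path.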

decorator-factory · 2025-12-13T05:00:26Z

To summarize some discussion with mbaruh from Discord:

Discord likely uses JavaScript's URL class (new URL/URL.parse etc.) to figure out what part of the message should be a hyperlink.
JavaScript's URL parser will accept any number of extra slashes after http:// and discard them, so e.g. http://////////////////example.com is considered the same as http://example.com.
RFC 3986 and the WHATWG URL Standard both allow domain names to be percent-encoded, so http://%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com should be interpreted the same as http://банан.com.

For now, we can probably replace http(s?)://+ with http\1://, use yarl.URL to parse the URL, and manually percent-decode the host.
(yarl doesn't percent-decode the host; see the GitHub discussion.)
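
A rough sketch of that interim approach is below. To keep it self-contained it swaps in urllib.parse.urlsplit from the standard library where the proposal says yarl.URL, so it is not the code from this PR; extract_host and _EXTRA_SLASHES are hypothetical names.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

# Matches the scheme plus any run of slashes, e.g. "http://////".
_EXTRA_SLASHES = re.compile(r"^http(s?)://+", re.IGNORECASE)


def extract_host(url: str) -> Optional[str]:
    """Return the percent-decoded host of an http(s) URL, or None if absent."""
    # Collapse the extra slashes so the parser sees a normal authority.
    normalized = _EXTRA_SLASHES.sub(r"http\1://", url)
    # Parse first so that userinfo, port and path are split off correctly.
    host = urlsplit(normalized).hostname
    if host is None:
        return None
    # Only the host component is percent-decoded, never the whole URL.
    return unquote(host)


print(extract_host("http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"))
```

Decoding only the host after parsing avoids the userinfo problem described earlier in the thread, since the "@" and "/" delimiters are resolved before any percent-decoding happens.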

In the future, since we really want to parse URLs in the same way JavaScript does, we could use something that explicitly parses URLs according to the whatwg rules. There's a whatwg-url package on PyPI (which is "archived", but it's just one file so we can simply vendor it) that seems to fit our purposes:

>>> import whatwg_url
>>> banana = "http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"
>>> whatwg_url.parse_url(banana)
<Url scheme='http' hostname='xn--80aab3cb.com' port=None path='/' query=None fragment=None>
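
The xn-- hostname in that output is the Punycode (ASCII) form of банан.com. If the filter ends up comparing ASCII domains, which is an assumption here, a decoded Unicode host can be brought to the same form with Python's built-in idna codec. Note that this codec implements the older IDNA 2003 rules (RFC 3490) rather than the processing the WHATWG spec prescribes, so it may disagree on edge cases; to_ascii_host is a hypothetical helper name.

```python
def to_ascii_host(host: str) -> str:
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    # The built-in "idna" codec encodes each label separately and
    # leaves labels that are already ASCII unchanged.
    return host.encode("idna").decode("ascii")


print(to_ascii_host("банан.com"))
```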

As an alternative, we could use the Rust url crate, which is maintained by Servo and also implements the WHATWG spec. That has the added bonus of being blazingly fast (hopefully) in case we want to process a lot of URLs (e.g. if spam bots intentionally put a lot of URLs in a message to try DoSing the spam filters).

mbaruh added 2 commits December 12, 2025 23:11

Support unicode URLs in domain filtering

472f539

Replace deprecated tldextract registered_domain

5b81dbe

L3viathan reviewed Dec 12, 2025


bot/exts/filtering/_filters/domain.py

L3viathan approved these changes Dec 12, 2025
