@mbaruh
Member

@mbaruh mbaruh commented Dec 12, 2025

No description provided.

@mbaruh mbaruh added labels on Dec 12, 2025: t: bug (Something isn't working); p: 1 - high (High Priority); a: filters (Related to message filters: antimalware, antispam, filtering, token_remover); t: enhancement (Changes or improvements to existing features); s: needs review (Author is waiting for someone to review and approve)
@decorator-factory
Member

decorator-factory commented Dec 13, 2025

Unquoting the entire URL can change the meaning of the URL quite severely.
For example, http://example.com%2F:%2F@malicious-site.example.org/ is a URL with the host malicious-site.example.org, the username example.com/ and the password /. But if you just unquote the entire thing, you get http://example.com/:/@malicious-site.example.org/, which has the hostname example.com.

Another example: https://example.com/apple/banana/%2e%2e%2f%2e%2e%2fcherry semantically has the path /apple/banana/../../cherry, but if you unquote it and then normalize the dot segments, it will have the path /cherry. (This example isn't necessarily relevant here, because we are only filtering on the domain.)
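
Both effects can be reproduced with the standard library. This is only a sketch: urllib.parse is used here for illustration and is not necessarily what the filter uses, and the variable names are made up for the example.

```python
import posixpath
from urllib.parse import unquote, urlsplit

# Credentials example: the percent-encoded slashes belong to the userinfo part.
tricky = "http://example.com%2F:%2F@malicious-site.example.org/"
parsed_first = urlsplit(tricky)             # parse with the escapes left in place
unquoted_first = urlsplit(unquote(tricky))  # unquote everything, then parse

print(parsed_first.hostname)    # malicious-site.example.org
print(unquoted_first.hostname)  # example.com

# Path example: unquoting turns the escaped dot segments into real ones,
# which then collapse when the path is normalized.
dotted = "https://example.com/apple/banana/%2e%2e%2f%2e%2e%2fcherry"
raw_path = urlsplit(dotted).path
collapsed_path = posixpath.normpath(unquote(raw_path))

print(raw_path)        # /apple/banana/%2e%2e%2f%2e%2e%2fcherry
print(collapsed_path)  # /cherry
```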

@decorator-factory
Member

To summarize some discussion with mbaruh from Discord:

  • Discord likely uses JavaScript's URL class (new URL/URL.parse etc.) to figure out what part of the message should be a hyperlink.
  • JavaScript's URL parser will handle any amount of extra slashes after http:// and discard them, so e.g. http://////////////////example.com is considered the same as http://example.com.
  • RFC 3986 and the WHATWG URL Standard both allow domain names to be percent-encoded, so http://%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com should be interpreted the same as http://банан.com.

For now, we can probably replace http(s?)://+ with http\1://, use yarl.URL to parse the URL and manually percent-decode the host.
(yarl doesn't percent-decode the host; see the GitHub discussion.)
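
A rough sketch of that interim approach, to make the order of operations concrete. Assumptions: urllib.parse.urlsplit stands in for yarl.URL so the snippet has no third-party dependency, and extract_host is a hypothetical helper name, not something in the codebase. The key point is that only the host is percent-decoded, and only after parsing.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit


def extract_host(url: str) -> Optional[str]:
    """Return the percent-decoded, lowercased host of an http(s) URL, or None."""
    # Collapse any run of slashes after the scheme: http://///x -> http://x
    collapsed = re.sub(r"^http(s?)://+", r"http\1://", url, flags=re.IGNORECASE)
    # Parse first, so escaped delimiters in other parts cannot move the host.
    host = urlsplit(collapsed).hostname
    if host is None:
        return None
    # Decode only the host component (stand-in for the manual step with yarl).
    return unquote(host).lower()


print(extract_host("http://////////////////example.com"))               # example.com
print(extract_host("http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"))    # банан.com
print(extract_host("http://example.com%2F:%2F@malicious-site.example.org/"))
# malicious-site.example.org
```

This does not convert the host to its punycode form, so a comparison against a domain list would still need to agree on one representation.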

In the future, since we really want to parse URLs in the same way JavaScript does, we could use something that explicitly parses URLs according to the whatwg rules. There's a whatwg-url package on PyPI (which is "archived", but it's just one file so we can simply vendor it) that seems to fit our purposes:

>>> import whatwg_url
>>> banana = "http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"
>>> whatwg_url.parse_url(banana)
<Url scheme='http' hostname='xn--80aab3cb.com' port=None path='/' query=None fragment=None>

As an alternative, we could use the Rust url crate, which is maintained by Servo and also implements the WHATWG spec. That has the added bonus of being blazingly fast (hopefully) in case we want to process a lot of URLs (e.g. if spam bots intentionally put a lot of URLs in a message to try DoSing the spam filters).

