Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should file URLs have opaque hostnames? #599

Open
karwa opened this issue May 15, 2021 · 16 comments
Open

Should file URLs have opaque hostnames? #599

karwa opened this issue May 15, 2021 · 16 comments
Labels
topic: file Aren't file: URLs the best? topic: parser

Comments

@karwa
Copy link
Contributor

karwa commented May 15, 2021

I've been trying to port Chromium's file path <-> file URL utilities to a project conforming to the latest standard.

As far as I have been able to tell (it's a large codebase and I'm not at all familiar with it), Chromium turns Windows UNC paths in to file URLs with hostnames, and those hostnames may include percent-encoding (e.g. \\some computer\foo\bar.txt becomes file://some%20computer/foo/bar.txt).

That is not allowed by this standard: the hostnames of file URLs are domains (they are even encoded with IDNA), and may not contain raw spaces or percent-encoding.

I don't think it is expected that a file URL's host must only be a domain or IP address; it might be better for them to have opaque hostnames, so they can contain percent-encoding and other characters as allowed by whatever mechanism resolves them.

@annevk
Copy link
Member

annevk commented May 17, 2021

@TimothyGu
Copy link
Member

TimothyGu commented May 17, 2021

Node.js's url.pathToFileURL function, when run on Windows, calls domainToASCII on the provided NetBIOS machine name, which would return the empty string if the NetBIOS name has a space. That's probably not the correct behavior, though no one seems to have complained so far.

Would someone be able to test the Windows UrlCreateFromPath function? I'd be curious to know how it treats \\some computer\foo\bar.txt.

This should do.
#include <iostream>
#include <shlwapi.h>
#include <winerror.h>
#include <wininet.h>

int main() {
    char out[INTERNET_MAX_URL_LENGTH + 1];
    DWORD out_len = INTERNET_MAX_URL_LENGTH;
    HRESULT res = UrlCreateFromPathA("\\\\some computer\\vol\\file", out, &out_len, NULL);
    std::cout << res == S_OK << std::endl;
    out[out_len] = '\0';
    std::cout << out << std::endl;
}

RFC 8089 provides some useful background information on file URLs, but unfortunately doesn't consider the question of weird characters in hostname or IDNs. RFC 8089 uses the grammar in RFC 3986 to describe file hosts though, and the reg-name production does not allow spaces to exist but does allow percent-encoded bytes.

I'd be inclined to allow some sort of opaque hostnames for file URLs as they already have a lot of exceptions in the spec, and hostname in file URLs is basically a legacy feature to support Windows anyway.

@TimothyGu
Copy link
Member

I also found Microsoft's naming conventions for NetBIOS computer names. Unfortunately it doesn't describe exactly whether spaces are allowed (it's not an alphanumeric character which is explicitly allowed, but also doesn't appear in the list of disallowed characters). This ambiguity is also documented on Microsoft's page NetBIOS Name Syntax.

However, RFC 1002 has an example for the NetBIOS name "FRED " with the space at the end, which suggests that spaces are indeed allowed in the protocol at least.

@TimothyGu
Copy link
Member

Interestingly, Chrome actually doesn't treat file URLs differently from http/https. That is, it allows both percent-encoding and IDNs in URLs:

new URL('file://félicit ations.fr/test.txt').href // ⇒ file://xn--flicit%20ations-bnb.fr/test.txt
new URL('https://exam ple.com').href              // ⇒ https://exa%20mple.com/

That's certainly one way around giving file URLs special handling, but no one other than Chrome seems to support this…

@karwa
Copy link
Contributor Author

karwa commented May 31, 2021

So I finally had time to set up a Windows VM, and I've found that URLCreateFromPathA does indeed produce URLs with percent-encoding in the hostname:

\\some computer\vol\file -> file://some%20computer/vol/file
\\?\vol\file -> file://%3F/vol/file

Unfortunately, any browser or other application which conforms to this standard would fail to parse these URLs.

Since NetBIOS hostnames are documented as being case-sensitive, we need to be conservative and preserve the hostname as it was given (i.e. no IDNA). It might be okay to detect and canonicalise IP addresses, but I'm not sure, and in the worst case we can just preserve it in the path and let the system deal with it.

So yes, I'm quite sure that file URLs must support opaque hostnames. If we can find a reliable way to decide that it should be interpreted as an IP address or domain, that would be a nice improvement, but they at least need to support percent-encoding and uppercase characters.

@alwinb
Copy link
Contributor

alwinb commented Jul 20, 2021

Just as an aside, the standard does ask to parse file hosts as opaque (non-special) in step 3.1. of the file host state. But there are some tests that disagree.

@karwa
Copy link
Contributor Author

karwa commented Jul 21, 2021

@alwinb I don't think that is true - whilst it does say "Let host be the result of host parsing buffer with url is not special.", that is the same formulation used in the regular host/hostname state. "url is not special" is a link, or a computed property of url; it does not mean that you should parse the hostname as though the URL were not special. For further confirmation, see the scheme/host matrix.

When I first tried to implement the spec, it took me far longer to parse that line than I'd like to admit. It's very awkwardly worded.

@alwinb
Copy link
Contributor

alwinb commented Jul 21, 2021

Aah now I see! That makes sense, thank you.

@dandclark
Copy link

I am doing some work in the Chromium project to try to bring the handling of space characters more in line with the URL parsing standard, and to complete that it would be very helpful if we had an idea of the way this issue is likely to be resolved.

As @TimothyGu noted here, Chromium allows spaces in hostnames for both http/https and in file URLs. Ideally we would change the Chromium behavior to match other browsers by banning spaces for all special URLs. However, this is risky to do for file because of the issue raised in this thread of potential spaces in NetBIOS names; and if this issue is resolved to allow file URLs to have opaque hostnames, then changing Chromium to treat space as error in file URLs could be a wrong step.

I've been trying to work around this by drafting a Chromium change that only disallows spaces for non-file special URLs. However the complexity of scoping the change in that way is higher than we'd like, especially since we're not sure how file URLs will end up being handled based on the resolution of this issue.

So to decide how we should proceed in bringing Chromium closer to standards compliance, it would be helpful if this issue could be moved towards resolution. Would it be possible for the URL standards experts to comment on which way this should go?

@domenic
Copy link
Member

domenic commented Aug 22, 2024

I want to be clear I am speaking as neither a URL standard editor nor in my capacity as a Chromium engineer. So I don't have any relevant decision power for this. But maybe I am a URL standard expert...

I personally find the arguments in this thread compelling that the current standard is not good. I know Chromium engineers (/cc @ricea @hayatoito) have already been reluctant to change file: URL handling (e.g. excluding it from Interop 202X efforts) and I suspect this sort of issue would not help them overcome that reluctance.

It's less clear to me what the path forward is. If I understand the issue correctly, we have a few options:

  • Treat file: URL hostnames as domains, which will e.g. canonicalize IP addresses (maybe good?), but also disallow spaces (bad).
  • Treat file: URL hostnames as opaque, which will not canonicalize IP addresses, will allow spaces... what other consequences, good or bad?
  • Treat file: URL hostnames the same as http:/https:, per the comment on Should file URLs have opaque hostnames? #599 (comment). This will canonicalize IP addresses and allow spaces, but also do punycode stuff and case-folding, which I think is bad?

I'm hearing that on balance @karwa believes that treating file: URL hostnames as opaque is the best option, even if that means we don't get canonicalization for IP addresses. Do others agree with that?

@annevk, how do you feel about this issue, especially given WebKit's position as a browser that has successfully shipped the URL Standard's parsing?

@karwa
Copy link
Contributor Author

karwa commented Aug 22, 2024

When it comes to host canonicalisation in file URLs, it's worth pointing out that this is only relevant on Windows, and while I expect it is could be useful for IPv4, the canonicalisation by the standard isn't even correct on Windows. I can give two examples:

  1. Stripping localhost in a file URL can be destructive. See Preserve "localhost" in file URLs? #618
  2. Windows also allows IPv6 addresses to be specified using a DNS-compatible syntax because their path syntax does not allow the colons in IPv6 addresses. See: https://en.wikipedia.org/wiki/IPv6_address#Literal_IPv6_addresses_in_UNC_path_names

With regards to the latter, file://fe80::1ff:fe23:4567:890a/... and file://fe80--1ff-fe23-4567-890a.ipv6-literal.net/... point to the same host according to the OS, and your URL may get converted in to the latter form by converting to/from a file path. If applications want to robustly handle IPv6 addresses as hostnames in file paths on Windows, they will need some custom logic to parse/normalise this anyway.

So yes, I think it's better to just say these things are opaque from the URL standard's perspective, which allows for this kind of implementation-defined/platform-specific logic.

I also think we should document how browsers and other applications are supposed to convert file URLs <-> paths on the major operating systems, since there's lots of divergence there as well. That's filed as whatwg/fetch#1338

Treat file: URL hostnames the same as http:/https:, per the comment on #599 (comment). This will canonicalize IP addresses and allow spaces, but also do punycode stuff and case-folding, which I think is bad?

This would not allow spaces. https://example com fails to parse.

(The comment showed that happening in Chrome because Chrome's handling of spaces is yet to be aligned with this standard. It fails on the live viewer: Live viewer)

@achristensen07
Copy link
Collaborator

I wouldn't mind changing new URL("file://host with spaces/path") from a parsing failure to returning file://host%20with%20spaces/path.

@valenting
Copy link
Collaborator

I wouldn't mind changing new URL("file://host with spaces/path") from a parsing failure to returning file://host%20with%20spaces/path.

Percent encoding spaces (and potentially other characters) in file hostnames is something that appeals to me.
We haven't yet shipped file hostnames in Firefox yet, but we'll be watching this issue as we make progress on the implementation.

@dandclark
Copy link

@annevk, how do you feel about this issue, especially given WebKit's position as a browser that has successfully shipped the URL Standard's parsing?

@annevk friendly ping, did you have any thoughts on this?

@annevk
Copy link
Member

annevk commented Sep 25, 2024

As Alex said above it might be okay to change this. My main worry is that we still have quite a few file: URL issues outstanding and I'd rather tackle them comprehensively in one go. Especially if that came with some guarantee that we'd then all try to align on those changes and never revisit it again (modulo deployment fallout).

aarongable pushed a commit to chromium/chromium that referenced this issue Oct 8, 2024
The standards-compliant behavior is to forbid space characters
in URLs, but currently Chromium allows them.

This causes Chromium to fail several tests included in the
Interop2024 'HTTPS URLs for WebSocket' [1] and 'URL' [2] focus
areas.

A difficulty with removing support for spaces altogether is that
they are used in the host part in Windows file: URLs [3].

So in this CL, forbid spaces only for non-file URLs. Since the
scheme is not available during host parsing, this requires some
plumbing changes, adding a new CanonMode to special-case file: URLs.

In order to ensure that behavior for file: URLs is not changed,
it was necessary to audit all callers of the URL canonicalization
functions that could be passed a file: URL. The results of this are
documented in [4]. In most cases the callers do not use file: URLs.
In a few cases it was necessary to add a special check to handle file
URLs.

The URLPattern code is a special case in that there was previously
an attempt to implement it with more standards-based URL behavior;
a call to ContainsForbiddenHostnameCodePoint [5] prevents URLPatterns
being created for hostnames containing space characters,
regardless of URL scheme. However this check is not applied
consistently, so in some scenarios URLPatterns can succeed in
matching against URLs with invalid characters; see the new test cases
added to urlpatterntestdata.json in [6]. That issue will be fixed
for spaces when this change is enabled, and applying this new behavior
even for file: URLs should be a smaller compat risk for URLPattern
since URLPattern already attempts to exclude spaces for all URLs in
most cases.

In this CL, the changes are kept behind a flag. See [6] for the full
test fallout from enabling the flag. In order to include some test
coverage along with this CL, the flag is enabled in a
ScopedFeatureList in url_canon_unittest.cc.

[1] https://wpt.fyi/results/websockets?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2024-websockets
[2] https://wpt.fyi/results/url?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2023-url
[3] whatwg/url#599
[4] https://docs.google.com/document/d/1VwnOOCa2LZR6hsqz64m-VrJMiWmPLHl_Cbfy1XS3JaA/edit?usp=sharing
[5] https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern/url_pattern_util.cc;l=142;drc=b7979b7a71fa422fd46d72f6c4a6efd6bd121863
[6] https://chromium-review.googlesource.com/c/chromium/src/+/5753305

Bug: 40256677, 40124263, 325979102
Change-Id: I0ccd3a4791b89774ec7027a1d8d5910665955c9c
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5908422
Reviewed-by: Ted Choc <tedchoc@chromium.org>
Reviewed-by: Hayato Ito <hayato@chromium.org>
Reviewed-by: Maks Orlovich <morlovich@chromium.org>
Commit-Queue: Dan Clark <daniec@microsoft.com>
Cr-Commit-Position: refs/heads/main@{#1365710}
ubonass pushed a commit to ubonass/google_url that referenced this issue Nov 10, 2024
The standards-compliant behavior is to forbid space characters
in URLs, but currently Chromium allows them.

This causes Chromium to fail several tests included in the
Interop2024 'HTTPS URLs for WebSocket' [1] and 'URL' [2] focus
areas.

A difficulty with removing support for spaces altogether is that
they are used in the host part in Windows file: URLs [3].

So in this CL, forbid spaces only for non-file URLs. Since the
scheme is not available during host parsing, this requires some
plumbing changes, adding a new CanonMode to special-case file: URLs.

In order to ensure that behavior for file: URLs is not changed,
it was necessary to audit all callers of the URL canonicalization
functions that could be passed a file: URL. The results of this are
documented in [4]. In most cases the callers do not use file: URLs.
In a few cases it was necessary to add a special check to handle file
URLs.

The URLPattern code is a special case in that there was previously
an attempt to implement it with more standards-based URL behavior;
a call to ContainsForbiddenHostnameCodePoint [5] prevents URLPatterns
being created for hostnames containing space characters,
regardless of URL scheme. However this check is not applied
consistently, so in some scenarios URLPatterns can succeed in
matching against URLs with invalid characters; see the new test cases
added to urlpatterntestdata.json in [6]. That issue will be fixed
for spaces when this change is enabled, and applying this new behavior
even for file: URLs should be a smaller compat risk for URLPattern
since URLPattern already attempts to exclude spaces for all URLs in
most cases.

In this CL, the changes are kept behind a flag. See [6] for the full
test fallout from enabling the flag. In order to include some test
coverage along with this CL, the flag is enabled in a
ScopedFeatureList in url_canon_unittest.cc.

[1] https://wpt.fyi/results/websockets?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2024-websockets
[2] https://wpt.fyi/results/url?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2023-url
[3] whatwg/url#599
[4] https://docs.google.com/document/d/1VwnOOCa2LZR6hsqz64m-VrJMiWmPLHl_Cbfy1XS3JaA/edit?usp=sharing
[5] https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern/url_pattern_util.cc;l=142;drc=b7979b7a71fa422fd46d72f6c4a6efd6bd121863
[6] https://chromium-review.googlesource.com/c/chromium/src/+/5753305

Bug: 40256677, 40124263, 325979102
Change-Id: I0ccd3a4791b89774ec7027a1d8d5910665955c9c
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5908422
Reviewed-by: Ted Choc <tedchoc@chromium.org>
Reviewed-by: Hayato Ito <hayato@chromium.org>
Reviewed-by: Maks Orlovich <morlovich@chromium.org>
Commit-Queue: Dan Clark <daniec@microsoft.com>
Cr-Commit-Position: refs/heads/main@{#1365710}
NOKEYCHECK=True
GitOrigin-RevId: dc916849a599a4864ef5e140af15932bff94cf9f
@annevk
Copy link
Member

annevk commented Dec 3, 2024

As an update, driven by the exciting increase in interop as part of Interop 2024 (thanks everyone for your efforts!), I think we can resolve #599 (this issue), #618 (preserve localhost), and #805 (permit @ and spaces, though escape them) together.

Some thoughts on the specifics:

  • We should continue to parse IPv6, but don't try to normalize "DNS-based" IPv6 to it. RFC 8089 expects IPv6 and we have it for all URL schemes so far. (I hope @karwa forgot to include [ and ] above and there's not file: URL specific IPv6 syntax on Windows.
  • I would be okay with trying to detect IPv4 and parsing it in the exact same way we do today for special schemes. In fact, now that we are all much more aligned on non-special schemes we could consider doing that across all of them for IPv4. (This might be too ambitious and maybe we should not change this aspect of opaque host parsing at all, be it for file: or other schemes. Open to input.)
  • In the "file host state", we should percent-encode @ and space before appending them to buffer. We should evaluate if there are other code points that need a similar treatment.
  • I'm a little bit concerned about localhost. According to https://www.w3.org/Bugs/Public/show_bug.cgi?id=26387 Internet Explorer did strip localhost. Perhaps this only happened under limited circumstances? I would love input from @dandclark on this what would be best for Windows. Also looping in @SimonSapin as the original reporter.

I think we can even make changes here before Interop 2024 is frozen (February 2025) as all the file: URL tests should be excluded, so thoughts welcome and let's see if we can finally settle file: URL hosts.

That still leaves several Windows drive letter issues that I think we should also seek to resolve those, but tackling those independently seems okay. @dandclark if you could generally review all topic: file Aren't file: URLs the best? issues from a Microsoft/Windows perspective I'd greatly appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: file Aren't file: URLs the best? topic: parser
8 participants