User talk:Dvasil

From Ext4
Revision as of 06:26, 21 October 2008 by Dvasil (Talk | contribs)


Hi, Dhimitrios.

You have an interesting proposal, but I'd like to take issue (respectfully, of course) with some of its points.

A lesser issue is that your character repertoire still isn't restricted enough. What do you gain from permitting the non-breaking space (U+00A0)? It can only be confused with the regular space (U+0020), which is a disadvantage. The XML suitability guidelines don't fare well as filename guidelines: you don't have line breaks in filenames, and even BiDi text is displayed on an LTR screen. I personally think ZWNJ (for Farsi) is one of the few formatting characters that are really necessary.

There's a much bigger issue, though: your choice of UTF-8. I'm all for exclusive use of Unicode, and the restriction to the ASCII range when dealing with foreign filesystems looks well-reasoned to me too. However, I think UTF-8 has a lot of serious disadvantages:

  1. The use of UTF-8 requires draconian enforcement against illegal sequences. Control characters and slash can be encoded in UTF-8 surreptitiously by using overlong sequences (C0 AF or E0 80 AF for the slash). The security risk is considerable.
  2. Cutting the available number of characters down to a third of the bytes is a huge waste. The limit is little better than Joliet's (only 21 more characters; people often reach more than that in their filenames, particularly when they save web page titles with source website name and date of authorship).
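The overlong-sequence risk in point 1 can be illustrated with a minimal Python sketch. The helper names `is_strict_utf8` and `naive_decode_2byte` are hypothetical, and the "naive" decoder is a deliberately lax stand-in for a careless implementation, not any real library's behaviour.

```python
# Overlong encodings of "/" (U+002F) quoted in point 1.
OVERLONG_SLASH = [b"\xc0\xaf", b"\xe0\x80\xaf"]

def is_strict_utf8(raw: bytes) -> bool:
    """Return True only if raw is well-formed UTF-8 (overlong forms are rejected)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def naive_decode_2byte(raw: bytes) -> str:
    """Hypothetical lax decoder: applies the 2-byte bit layout with no range check."""
    return chr(((raw[0] & 0x1F) << 6) | (raw[1] & 0x3F))

for seq in OVERLONG_SLASH:
    print(seq.hex(" "), "accepted by strict decoder:", is_strict_utf8(seq))

# The lax decoder turns C0 AF into a slash, which is the attack being described.
print(repr(naive_decode_2byte(b"\xc0\xaf")))
```

A strict decoder rejects both byte sequences, so the enforcement asked for in point 1 amounts to never accepting a name that fails such a check.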

I think UTF-16 is a much better choice than UTF-8. You rightly note how widely supported Joliet is. You could object that Joliet is limited to 64 characters, but that's only because it needs to be compatible with ISO 9660. If you're doing a filesystem proposal from scratch (as I gather is the case here), then you can go for a maximum length of 127 characters. I agree with you on prohibiting characters outside the BMP, though: outlaw all surrogate code points and you're golden. --Captain Obsequious 05:28, 21 October 2008 (UTC)
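The length figures in this comment follow from simple arithmetic, sketched below in Python. The 255-byte name budget, the worst case of 3 bytes per BMP character in UTF-8, and the fixed 2 bytes per BMP character in UTF-16 are the assumptions; the function names are hypothetical.

```python
NAME_BYTES = 255   # assumed per-name byte budget
JOLIET_CHARS = 64  # Joliet limit mentioned in the discussion

def worst_case_chars(bytes_per_char: int) -> int:
    """Guaranteed number of characters that fit when each needs bytes_per_char bytes."""
    return NAME_BYTES // bytes_per_char

def is_bmp_only(name: str) -> bool:
    """True if the name has no surrogate code points and nothing above U+FFFF."""
    return all(
        ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
        for ch in name
    )

utf8_limit = worst_case_chars(3)   # BMP characters take at most 3 bytes in UTF-8
utf16_limit = worst_case_chars(2)  # BMP characters take exactly 2 bytes in UTF-16
print(utf8_limit, utf16_limit, utf8_limit - JOLIET_CHARS)
```

Under these assumptions the UTF-8 guarantee is 85 characters (21 more than Joliet) and the UTF-16 guarantee is 127.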

Welcome Captain!

You know the history? UTF-16 was not safe for the filesystem. You had to re-engineer the whole OS and apps to read and write it natively, otherwise you got a whole bunch of control characters and the slash, as part of nearly every character in existence. You tell me how our Russian friend is going to use the capital letter ya (Я, U+042F)? It comes with a free gift of a control character (the logout character, Ctrl-D, no less) and the slash on a reader that isn't configured to cope with UTF-16. Not good.
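The byte-level problem can be checked with a short Python sketch, taking the Cyrillic capital Я (U+042F) as the example. The function name `risky_bytes` is hypothetical, and "risky" is assumed here to mean any byte below 0x20 (C0 controls, including NUL) or the slash 0x2F.

```python
def risky_bytes(name: str, encoding: str = "utf-16-be") -> list:
    """List the control bytes (< 0x20) and slash bytes (0x2F) in the encoded name."""
    raw = name.encode(encoding)
    return [b for b in raw if b < 0x20 or b == 0x2F]

# UTF-16 (big-endian) encodes U+042F as 04 2F: Ctrl-D followed by a slash.
print([hex(b) for b in risky_bytes("Я", "utf-16-be")])

# UTF-8 encodes the same letter as D0 AF: no byte falls in the ASCII range.
print([hex(b) for b in risky_bytes("Я", "utf-8")])
```

The same check on plain ASCII text in UTF-16 flags the 0x00 half of every character, which is why a byte-oriented reader cannot handle such names without being rewritten.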

What about endianness? You think the industry can be trusted to make the right move and decide on byte order, or will we get the same mess as TIFF and the Unicode standard itself? Let us be on the safe side and choose an encoding that doesn't have this issue.
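A small Python sketch of the byte-order hazard, assuming a name written in one byte order and read back in the other with no byte order mark to disambiguate:

```python
name = "Я"  # U+042F

big = name.encode("utf-16-be")     # bytes 04 2F
little = name.encode("utf-16-le")  # bytes 2F 04
print(big.hex(" "), "|", little.hex(" "))

# Reading big-endian bytes as little-endian silently yields a different character.
misread = big.decode("utf-16-le")
print(f"U+{ord(misread):04X}")

# UTF-8 is defined as a byte sequence, so there is only one possible encoding.
print(name.encode("utf-8").hex(" "))
```

The misread succeeds without any error, so a reader guessing the wrong byte order gets a valid but wrong name instead of a failure it could detect.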

Short of having all 255 bytes (and even of that I am not certain), the limit will always be passed by some user who thinks the filename is where he puts his curriculum vitae. What is reasonable? We can agree 8.3 is not reasonable. If you say the old Mac limitation of 31 characters is unreasonable, then OK too. But when I do a Joliet CD and hit the 64-character barrier, on second inspection it's obvious I've been verbose. So if people complain about an 85-character limit, then they are verbose by nature, and extending the limit to 127 characters would not be much more help to them.

I agree with you about the need to cut out illegal UTF-8 sequences. Like I said, though, it's trivial compared to re-engineering everything to work with UTF-16.

Your reasoning against all formatting characters fails to hold with me. If spoofing concerns you, then you have hardly begun work: you are going to have to prohibit Greek lowercase omicron (looks the same as the Latin small o), Greek capital alpha and Cyrillic capital A, and a whole host of those. That would cripple usability. Please. We're not on the IDN list here. Dhimitrios Vasiliadhou 06:26, 21 October 2008 (UTC)
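The look-alike letters named above can be listed with a short Python sketch using the standard `unicodedata` module. The `LOOKALIKES` table and the `describe` helper are hypothetical names for illustration; the pairs are the ones mentioned in this comment.

```python
import unicodedata

# (Latin character, look-alike from another script)
LOOKALIKES = [
    ("o", "\u03bf"),  # Greek small omicron
    ("A", "\u0391"),  # Greek capital alpha
    ("A", "\u0410"),  # Cyrillic capital A
]

def describe(ch: str) -> str:
    """Format a character as its code point and Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, other in LOOKALIKES:
    # NFKC normalization leaves these letters distinct; only the glyphs coincide.
    same_after_nfkc = unicodedata.normalize("NFKC", other) == latin
    print(describe(latin), "vs", describe(other), "| merged by NFKC:", same_after_nfkc)
```

None of the pairs is merged by normalization, so stopping this kind of spoofing would mean banning ordinary letters outright, which is the usability cost being argued here.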
