User talk:Dvasil

Hi, Dhimitrios.

You have an interesting proposal, but I'd like to take issue (respectfully, of course) with some of its points.

A lesser issue is that your character repertoire still isn't restricted enough. What do you gain from permitting the non-breaking space (U+00A0)? It can only be confused with the regular space (U+0020), which is a disadvantage. The XML suitability guidelines don't fare well as filename guidelines: you don't have line breaks in filenames, and even BiDi text is displayed on an LTR screen. I personally think ZWNJ (for Farsi) is one of the few formatting characters that are really necessary.
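
Compare for yourself (a trivial Python check, purely illustrative):

    # Two names that render identically but differ in one code point.
    a = "my file.txt"       # U+0020, the regular space
    b = "my\u00A0file.txt"  # U+00A0, the non-breaking space
    print(a == b)           # False, though the two look the same on screen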

There's a much bigger issue, though: your choice of UTF-8. I'm all for exclusive use of Unicode, and the restriction to the ASCII range when dealing with foreign filesystems looks well-reasoned to me too. However, I think UTF-8 has a lot of serious disadvantages:

  1. The use of UTF-8 requires draconian enforcement against illegal sequences. Control characters and the slash can be encoded in UTF-8 surreptitiously by using overlong sequences (C0 AF or E0 80 AF for the slash); see the sketch after this list. The security risk is considerable.
  2. Cutting the available number of characters down to a third of the byte count is a huge waste. The resulting limit is little better than Joliet's (only 21 more characters; people often exceed that in their filenames, particularly when they save web page titles together with the source website's name and the date of authorship).
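
To make point 1 concrete, here's a minimal Python sketch (the naive_decode_2byte helper is my own invention, purely illustrative) of how a lax decoder turns the overlong sequence C0 AF into a slash, while a conforming decoder rejects the same bytes:

    # A lax decoder that only masks the payload bits will happily turn
    # the illegal two-byte sequence C0 AF into U+002F, the slash.
    def naive_decode_2byte(b1: int, b2: int) -> str:
        # Assumes lead byte 110xxxxx plus continuation byte 10xxxxxx,
        # without checking that the shortest possible encoding was used.
        return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

    print(naive_decode_2byte(0xC0, 0xAF))  # '/' -- the path separator

    # A conforming decoder must reject the same bytes outright:
    try:
        b"\xC0\xAF".decode("utf-8")
    except UnicodeDecodeError:
        print("strict decoder rejects C0 AF")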

I think UTF-16 is a much better choice than UTF-8. You rightly note how widely supported Joliet is. You could object that Joliet is limited to 64 characters, but that's only because it needs to be compatible with ISO 9660. If you're doing a filesystem proposal from scratch (as I gather is the case here), then you can go for a maximum length of 127 characters. I agree with you on prohibiting characters outside the BMP, though: outlaw all surrogate code points and you're golden. --Captain Obsequious 05:28, 21 October 2008 (UTC)

Welcome Captain!

You know the history? UTF-16 was not safe for the filesystem. You had to re-engineer the whole OS and its applications to read and write it natively; otherwise you got a whole bunch of control characters and the slash as part of nearly every character in existence. You tell me how our Russian friend is going to use the capital letter ya (Я)? It comes with a free gift of a control character (the logout character, Ctrl-D, no less) and the slash on a reader that isn't configured to cope with UTF-16. Not good.
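
Just to spell out the bytes (a quick Python illustration, nothing more):

    # The UTF-16 (big-endian) encoding of Cyrillic capital ya, U+042F,
    # as seen by a reader that treats the name as a plain byte stream.
    data = "Я".encode("utf-16-be")
    print(data.hex(" "))  # 04 2f
    # 0x04 is EOT (Ctrl-D) and 0x2F is '/': a control character and
    # the path separator hiding inside one innocent letter.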

Dhimitrios, you can't point to the success of Joliet and then claim it can't be applied to Unix-like systems. Linux has been able to mount UTF-16 filesystems for years - not just Joliet but NTFS too, at least for reading in the latter case. Filesystem operations are transparent, and there've been no problems. --Captain Obsequious 06:40, 21 October 2008 (UTC)

What about endianness? You think the industry can be trusted to make the right move and decide on a byte order, or will we get the same mess as TIFF and the Unicode standard itself? Let us be on the safe side and choose an encoding that doesn't have this issue.
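
See for yourself what the two byte orders look like (a small Python illustration):

    # The same name in the two byte orders; a reader that guesses
    # wrong sees entirely different characters.
    name = "Файл"
    print(name.encode("utf-16-be").hex(" "))  # 04 24 04 30 04 39 04 3b
    print(name.encode("utf-16-le").hex(" "))  # 24 04 30 04 39 04 3b 04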

Speaking of endianness, here again Joliet provides the answer to your argument: Microsoft decided on big-endian, and that was that. If you're designing a filesystem from scratch, you get to decide on a byte order once and for all. Just do it like Joliet: decide on big-endian UTF-16 with the restriction to the BMP, and you're all set. --Captain Obsequious 06:40, 21 October 2008 (UTC)

Short of allowing all 255 bytes (and even of that I am not certain), the limit will always be exceeded by some user who thinks the filename is where he puts his curriculum vitae. What is reasonable? We can agree 8.3 is not reasonable. If you say the old Mac limitation of 31 characters is unreasonable, then OK too. But when I make a Joliet CD and hit the 64-character barrier, on second inspection it's obvious I've been verbose. So if people complain about an 85-character limit, then they are verbose by nature, and extending the limit to 127 characters would not be much more help to them.
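
For reference, here is the arithmetic behind the 85 and the 127 (a trivial Python snippet, assuming the worst case of 3 UTF-8 bytes per BMP character):

    # Characters that fit in a 255-byte name, by encoding:
    LIMIT = 255
    print(LIMIT // 3)  # 85  -- UTF-8, worst case 3 bytes per BMP character
    print(LIMIT // 2)  # 127 -- UTF-16, always 2 bytes per BMP character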

The way I see it, you don't restrict the filename length more than you really have to. As I said, Joliet had to put the limit at 64 characters because the CD-ROM standard demanded it. The only reason you've decided on the 85-character limit is your unwise choice of UTF-8. Do away with that choice and you get the most you can within the 255-byte limit: 127 characters. More than that would really be the user's unbridled verbosity. --Captain Obsequious 06:40, 21 October 2008 (UTC)

I agree with you about the need to cut out illegal UTF-8 sequences. Like I said, though, it's trivial compared to re-engineering everything to work with UTF-16.

It's not the same. Consider the risk: UTF-8 illegal sequences pass as legitimate ASCII ones, while UTF-16 could never be confused with ASCII. --Captain Obsequious 06:40, 21 October 2008 (UTC)

Your reasoning against all formatting characters doesn't hold up for me. If spoofing concerns you, then you have hardly begun work: you are going to have to prohibit Greek lowercase omicron (it looks the same as Latin small o), Greek capital alpha and Cyrillic capital A (both indistinguishable from Latin capital A), and a whole host of others. That would cripple usability. Please. We're not on the IDN list here. Dhimitrios Vasiliadhou 06:26, 21 October 2008 (UTC)

Spoofing isn't my concern; I'm worried that you're permitting characters that aren't relevant to filenames. --Captain Obsequious 06:40, 21 October 2008 (UTC)

So just tell me, Captain, why didn't we move everything to UTF-16? Why do we have UTF-8 at all? Are you aware that grepping for a filename on a mounted NTFS partition operates on UTF-8, just as it does for a word in a text file on your local ext3 partition? Please don't be illogical. Even Microsoft hasn't liquidated all its ANSI codepages.

Whether a decision on byte order succeeds is a throw of the dice. File formats with fourCCs are big-endian (since fourCCs are meant to be human-readable), but some idiot little-endian parser always comes along to screw them up. NTFS partitions are little-endian UTF-16, unlike Joliet. The reason you don't see the difference is that Linux converts both to UTF-8 for display.
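
To illustrate (a Python sketch of the effect; the kernel's real NLS layer works differently in detail):

    # Joliet stores big-endian, NTFS little-endian; once each is decoded
    # with its own byte order, userspace sees identical UTF-8.
    joliet = "Я".encode("utf-16-be")  # b'\x04/'
    ntfs = "Я".encode("utf-16-le")    # b'/\x04'
    print(joliet.decode("utf-16-be").encode("utf-8").hex(" "))  # d0 af
    print(ntfs.decode("utf-16-le").encode("utf-8").hex(" "))    # d0 af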

You want out of restrictions? You're not being radical enough: just trash that 255-byte limit and you get still more characters! Say, with 511 bytes per filename, my proposal gives you 170 characters to play with, and UTF-16 gives you 255 characters. Think outside the box!

If you want illegal UTF-8 sequences, you need to write them deliberately. What's more, you need a low-level disk editor for entering the hex codes just so (C0 AF instead of 2F). It's intentional activity, and you plug it in one place (the UTF-8 encoder and decoder). But with UTF-16, unless you've re-engineered each and every application to handle wide characters, you get control characters and slashes just like that, with no malice intended. For you to plug that would be like plugging a "hole" in your fence which is actually all the rest of the world not covered by your fence.
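
In other words, one strict check at the boundary suffices (a sketch; the filename_ok helper is my own invention):

    # A single strict check at the filesystem boundary rejects every
    # illegal sequence, overlong or otherwise.
    def filename_ok(raw: bytes) -> bool:
        try:
            name = raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
        return "/" not in name and "\x00" not in name

    print(filename_ok(b"\xC0\xAF"))          # False -- the overlong slash
    print(filename_ok("Я".encode("utf-8")))  # True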

If I recall correctly, you said the non-breaking space carried the risk of confusion with the regular space. That is most certainly a spoofing concern. As for relevance to filenames, all these characters are permitted in XML markup because they affect both display and semantics. You want to call LRM irrelevant? Go talk with an Arabic user who's had the meaning of his text (yes, even in a filename) changed because he can't put an LRM there. The combining grapheme joiner (U+034F) is important for sorting. Etc. I think you have a case only with regard to the invisible mathematical operators. Dhimitrios Vasiliadhou 06:53, 21 October 2008 (UTC)

(Why can't people be bothered to learn how to use indent-quoting? Alright, I'll play along. Sigh...)

I think maximizing interoperability with one fairly influential operating system would be a smart move. I'm not saying all of Linux should be moved to UTF-16, of course. But if you're designing something new, like your filesystem proposal, let's do the smart thing.

You gave just one example. TCP/IP is a successful big-endian standard by design. So is the PNG image format. TIFF and Unicode are a mess, but that's by design too: their designers decided to provide for both byte orders. You're just proving my point about the need to decide on a single byte order. And by the way: the problem with fourCCs you mention lies with the parsers, not with the fourCCs themselves. You actually say that yourself!

Going outside the 255-byte restriction isn't feasible in the GNU/Linux world any more than Joliet could have disregarded ISO 9660. You're being either ignorant or disingenuous. The 255-byte limit is a hard reality; now let's work with it, and while doing that, let's not restrict ourselves more than we need to. The maximum filename length we can get is 127 characters, so that's where we should go.

OK, my concern is partially about spoofing. But only partially, and I'm not going to tell you to stop using Greek small omicron. But I question the use of characters that are much more suited to word processors than character-cell displays. --Captain Obsequious 07:01, 21 October 2008 (UTC)

Have the designers of that fairly influential operating system decided on a uniform encoding? I repeat: NTFS and Joliet don't use the same byte order. You are going to have to translate no matter which system-dominant encoding you use.

See above. If there's more than one way to do something, people will do it more than one way. UTF-8, in terms of endianness, presents only one way to do it, so I prefer it.

That's one advantage of additional characters, gained in exchange for a load of disadvantages. And you have still not convinced me that a user who finds 85 characters insufficient will not bang himself against the wall of 127 characters. People who name their files like "This is a summary of the meeting I attended at the Teachers' Association on the 31st of May 2008" are uneducated not just in computers but in organisation skills in general.

Filenames are usually manipulated nowadays through a GUI file manager. Even on a character-cell display, though, formatting issues like bidirectional text are visible to the user. Monoglot Americans should not think that all those things are typographical frills. How many Arabic-speaking computer users have you come to know in your work? Dhimitrios Vasiliadhou 07:11, 21 October 2008 (UTC)

Dhimitrios, I don't see any point in trying to convince you further. It's obvious you're set in your thinking and aren't prepared to accept different views. --Captain Obsequious 07:15, 21 October 2008 (UTC)

My duty is to hear what you say; to accept what you say is only my right. You come taking issue with some points of my proposal, and I listen to your arguments. But what can I do if I haven't found your arguments persuasive? Thank you. Dhimitrios Vasiliadhou 07:18, 21 October 2008 (UTC)
