User:Dvasil

Hello! Welcome to the user page of Dhimitrios Vasiliadhou. Ever since the days of struggling with the limitations of the 8.3 filename scheme, I have been interested in the good implementation of the user-visible features of filesystems. The low-level details are important, and I have never claimed otherwise, but what the user sees can have even more far-reaching effects.

Getting ahead

It is easy to swing from one extreme to the other. After 8.3, anything goes, and nobody cares that permitting every character except null and slash poses a risk. On sober reflection, the character-set agnosticism of most UNIX filesystems is a liability.

I give my utmost thanks to Gary McLellan for copy-editing what is below, without which it would be in the same dreadful English as the rest of the page.

§ 3.4. Filenames

Specification

Character Encoding

Filenames shall use the UTF-8 encoding of the ISO 10646 (Unicode) standard exclusively. Software that writes to the filesystem shall convert characters from any other encoding to this encoding. Software that mounts a different filesystem upon this filesystem shall temporarily convert characters from any other encoding to this encoding for display. Software that mounts a different filesystem upon this filesystem for writing shall confine itself to the US-ASCII range, encoded in either UTF-8 or UTF-16 depending on the different filesystem in question.
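The conversion rule above can be sketched as follows. This is only an illustration, not part of the specification: the function names `to_utf8_filename` and `safe_for_foreign_write` are hypothetical, and the source encoding is assumed to be known to the caller, since raw filename bytes carry no record of the encoding that produced them.

```python
def to_utf8_filename(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a filename from a legacy encoding to UTF-8.

    `source_encoding` is an assumption supplied by the caller.
    """
    # Decode strictly so that undecodable bytes raise an error
    # instead of being silently replaced.
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


def safe_for_foreign_write(name: str) -> bool:
    """True if the name stays within the US-ASCII range, as required
    when writing through a mounted foreign filesystem."""
    return name.isascii()


# A Greek name stored in ISO 8859-7 becomes UTF-8 on the way in.
legacy = "αβγ.txt".encode("iso8859_7")
converted = to_utf8_filename(legacy, "iso8859_7")
```

Strict decoding is a design choice here: a name that cannot be decoded is rejected rather than stored with replacement characters.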

Character Repertoire

All printing characters in the Basic Multilingual Plane of the ISO 10646 (Unicode) standard are permitted except for the slash (U+002F). All control and formatting characters in the Basic Multilingual Plane of the ISO 10646 standard are prohibited except those that are allowed in XML markup, according to Unicode Technical Report #20: U+00A0, U+00AD, U+034F, U+0600, U+0601, U+0602, U+0603, U+06DD, U+070F, U+0F0C, U+115F..U+1160, U+180B..U+180E, U+200B, U+200C..U+200D, U+200E..U+200F, U+2011, U+202F, U+2044, U+2060, U+2061..U+2064, U+2FF0..U+2FFB, U+303E, U+FF80, U+FE00..U+FE0F. All characters outside the Basic Multilingual Plane of the ISO 10646 standard are prohibited. Software that mounts a different filesystem upon this filesystem for writing shall confine itself to the 95 printing characters of the US-ASCII standard (U+0020..U+007E), except slash.
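A minimal sketch of a repertoire check under the rules above. The exception ranges are copied from the list in the specification (adjacent entries merged). Treating the Unicode general categories Cc, Cf, Cs, Zl and Zp as "control and formatting characters" is an assumption of this sketch, and the function names are hypothetical.

```python
import unicodedata

# Exceptions copied from the specification; adjacent entries merged.
ALLOWED_RANGES = [
    (0x00A0, 0x00A0), (0x00AD, 0x00AD), (0x034F, 0x034F),
    (0x0600, 0x0603), (0x06DD, 0x06DD), (0x070F, 0x070F),
    (0x0F0C, 0x0F0C), (0x115F, 0x1160), (0x180B, 0x180E),
    (0x200B, 0x200F), (0x2011, 0x2011), (0x202F, 0x202F),
    (0x2044, 0x2044), (0x2060, 0x2064), (0x2FF0, 0x2FFB),
    (0x303E, 0x303E), (0xFF80, 0xFF80), (0xFE00, 0xFE0F),
]

# Assumption: these general categories stand in for "control and
# formatting characters"; the specification does not name categories.
PROHIBITED_CATEGORIES = {"Cc", "Cf", "Cs", "Zl", "Zp"}


def char_allowed(ch: str) -> bool:
    cp = ord(ch)
    if cp > 0xFFFF:      # outside the Basic Multilingual Plane
        return False
    if ch == "/":        # directory separator
        return False
    if any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES):
        return True
    return unicodedata.category(ch) not in PROHIBITED_CATEGORIES


def filename_chars_allowed(name: str) -> bool:
    return all(char_allowed(ch) for ch in name)


def foreign_write_chars_allowed(name: str) -> bool:
    # The 95 printing US-ASCII characters (U+0020..U+007E), except slash.
    return all(0x20 <= ord(ch) <= 0x7E and ch != "/" for ch in name)
```

For example, `filename_chars_allowed("μουσική 01.mp3")` passes, while a name containing a tab, a slash, or a character outside the Basic Multilingual Plane does not.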

Filename Length

The maximum length of a filename shall be 85 characters. Software that writes to the filesystem shall enforce this limit. Software that mounts a different filesystem upon this filesystem shall keep longer filenames for display, but shall enforce the limit when writing, even when the different filesystem supports longer filenames.
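The limit can be illustrated with a short check. This sketch assumes that "characters" means Unicode code points, which is what Python's `len` counts for a `str`; the names `MAX_FILENAME_CHARS` and `fits_limit` are hypothetical.

```python
MAX_FILENAME_CHARS = 85


def fits_limit(name: str) -> bool:
    # len() on a str counts code points, not encoded bytes.
    return len(name) <= MAX_FILENAME_CHARS


longest_han = "中" * 85      # 85 characters of 3 bytes each
too_long_ascii = "a" * 86    # only 86 bytes, but one character too many

han_bytes = len(longest_han.encode("utf-8"))
```

Note that the limit is counted in characters, so an 86-character ASCII name is rejected even though it would occupy far fewer than 255 bytes.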

Rationale

Character Encoding

A uniform character encoding guards against complexity. ISO 10646, being a superset of all other character sets, removes the need to maintain more than one lookup table for conversion. UTF-8 has the advantage of being impervious to the issue of endianness, as well as maintaining ASCII compatibility. The ASCII range is encoded unchanged in UTF-8, and is converted to UTF-16 by simple null-padding.
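The ASCII compatibility and null-padding claims are easy to verify. The following sketch uses only Python's built-in codecs; the sample name is arbitrary.

```python
text = "readme.txt"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# UTF-16 exists in two byte orders; UTF-8 has only one.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Null-padding each ASCII byte by hand gives the same result.
padded_le = b"".join(bytes([b, 0]) for b in ascii_bytes)
padded_be = b"".join(bytes([0, b]) for b in ascii_bytes)

same_as_ascii = utf8_bytes == ascii_bytes
le_is_padding = utf16_le == padded_le
be_is_padding = utf16_be == padded_be
```

The two UTF-16 results differ only in where the null byte goes, which is the endianness issue that UTF-8 avoids.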

Character Repertoire

The slash is prohibited by its nature as the directory separator. Control characters have potential for malicious coding, in addition to the less dangerous but no less problematic issue of their handling by applications. Only those control characters that are not stateful, that do not require additional data and that are not deprecated in the Unicode standard (all as per Unicode TR #20) are suitable for use in filenames. Characters outside the Basic Multilingual Plane of ISO 10646 are prohibited because of their interference with the filename length limit (see Filename Length below); this can safely be done because they are rare even as data, so no use of them in filenames can be foreseen.

Filename Length

The UTF-8 encoding of the ISO 10646 standard allots a different number of bytes to characters according to their range: 1 byte for characters in the US-ASCII range, 2 bytes mainly for alphabetic characters (Greek, Arabic), 3 bytes mainly for Han characters but also for the Indic abugidas and some alphabetic scripts (for example Georgian), and 4 bytes for characters outside the Basic Multilingual Plane. The maximum storage for each filename in this filesystem is 255 bytes, which would translate to a maximum length of 255, 127, 85 and 63 characters respectively for each range, but in practice characters from multiple ranges can be used in any filename (example: spaces in the name of a music file in Greek). In order to avoid the complexity introduced by maintaining a variable maximum length for filenames, and the user confusion upon enforcing it, the maximum filename length is fixed to what it would be if only 3-byte characters were used: 85 characters. For this reason, characters outside of the Basic Multilingual Plane of ISO 10646 are prohibited, for otherwise the maximum filename length would have to be 63 characters, a limitation not worth the practically nonexistent gains from allowing the use of such infrequent characters.
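The arithmetic behind these figures can be reproduced in a few lines. The sample characters and the Greek filename are chosen here purely for illustration.

```python
MAX_BYTES = 255

# One sample character per UTF-8 width.
samples = {
    "ascii": "a",               # US-ASCII
    "greek": "α",               # 2-byte alphabetic
    "han": "中",                # 3-byte Han
    "georgian": "ა",            # 3-byte alphabetic script
    "non_bmp": "\U0001D11E",    # outside the BMP (a musical symbol)
}
widths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

# Longest name that fits in 255 bytes if every character has width n.
max_chars = {n: MAX_BYTES // n for n in (1, 2, 3, 4)}

# A mixed name: Greek letters take 2 bytes each, while the spaces and
# the ".mp3" extension take 1 byte each.
mixed = "Το τραγούδι μου.mp3"
mixed_bytes = len(mixed.encode("utf-8"))
```

Because the mixed name combines 1-byte and 2-byte characters, its byte size falls between its character count and twice that count, which is why a single byte-based limit would translate to a different character limit for every filename.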

CD-ROM support

It may not seem to have anything to do with the ext* filesystems, but CD-ROM filesystem support does bear on this question.

Microsoft's standard, Joliet, seems to be more widely supported than Rock Ridge. (Yes, Mac OS X now counts, but that is recent, and only because it is UNIX; go back to OS 9 and see how things are there.) Joliet leaves nothing to chance, while Rock Ridge imposes POSIX and nothing else, leaving the rest to the discretion of implementor and user. So what character set did you encode that Rock Ridge CD in again? :-)

Lending an ear, even to a newcomer, goes a long way. To the reader: thank you!
