Item 43470180

usrnm • 5 days ago

Not all Unicode whitespace characters take up exactly one byte when encoded in UTF-8. Not even talking about other possible encodings, just good old UTF-8. Let that sink in a bit, and you'll realize what a can of worms it is in a language where strings are just byte sequences.
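
To make the byte-length point concrete, here is a minimal C++17 sketch. The helper names (`leading_space_bytes`, `trim_left_bytewise`, `trim_left_utf8`) are hypothetical, and the whitespace table is deliberately incomplete: it hard-codes only four of the Unicode whitespace characters, and it assumes the input is valid UTF-8.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: if `s` starts with one of a few hard-coded Unicode
// whitespace characters, return the byte length of its UTF-8 encoding;
// otherwise return 0. The table is illustrative, not complete.
std::size_t leading_space_bytes(std::string_view s) {
    static constexpr std::string_view spaces[] = {
        " ",             // U+0020 SPACE: 1 byte
        "\xC2\xA0",      // U+00A0 NO-BREAK SPACE: 2 bytes
        "\xE2\x80\x83",  // U+2003 EM SPACE: 3 bytes
        "\xE3\x80\x80",  // U+3000 IDEOGRAPHIC SPACE: 3 bytes
    };
    for (std::string_view sp : spaces) {
        if (s.substr(0, sp.size()) == sp) return sp.size();
    }
    return 0;
}

// Byte-at-a-time trimming with std::isspace. Assuming the default "C"
// locale, this stops at the first byte of any multi-byte space.
std::string_view trim_left_bytewise(std::string_view s) {
    while (!s.empty() && std::isspace(static_cast<unsigned char>(s.front())))
        s.remove_prefix(1);
    return s;
}

// Trimming that removes whole encoded characters from the table above.
std::string_view trim_left_utf8(std::string_view s) {
    while (std::size_t n = leading_space_bytes(s)) s.remove_prefix(n);
    return s;
}
```

Given an input that starts with an ASCII space followed by an EM SPACE, `trim_left_bytewise` removes only the first byte, while `trim_left_utf8` removes both characters (four bytes in total). A real implementation would need the full, version-dependent Unicode whitespace set, which is where the can of worms starts.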

criddell • 5 days ago

Because it's tricky is exactly why it should be in the standard library.

The C++ standard library should just incorporate ICU by reference IMHO.

account42 • 5 days ago

ICU is an unreasonably large dependency for something that many users won't need. Its behavior also changes with new Unicode versions, which makes it incompatible with something that cares as much about backward compatibility as the C++ standard library.

criddell • 5 days ago

That’s the nature of Unicode: it’s complicated and a moving target.

As far as it being a large dependency, the beauty of C++ is that if you don’t use it, it won’t affect your build.

If ICU is too large, complex, and unstable for the C++ committee, then regular users don’t stand a chance.

account42 • 5 days ago

> As far as it being a large dependency, the beauty of C++ is that if you don’t use it, it won’t affect your build.

That's the theory. In practice, you have things like iostreams pulling in tons of locale machinery (which is really significant for static builds) even if you never use a locale other than "C". That locale machinery will include gigantic functions for formatting monetary amounts even if you never do any formatting.
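
A small sketch of what that coupling looks like from the source side. The function names here are hypothetical, and the code only shows that the locale attached to `std::cout` carries the monetary facets; it does not measure how much code a given toolchain actually links in, which varies by implementation and build settings.

```cpp
#include <iostream>
#include <locale>
#include <string>

// Name of the locale std::cout is imbued with. In a program that never
// touches locales, this is the classic "C" locale.
std::string cout_locale_name() { return std::cout.getloc().name(); }

// The standard requires every locale, including "C", to contain the
// monetary facets, so the stream's locale carries them whether or not
// the program ever formats a monetary amount.
bool cout_has_money_facets() {
    const std::locale loc = std::cout.getloc();
    return std::has_facet<std::moneypunct<char>>(loc)
        && std::has_facet<std::money_put<char>>(loc)
        && std::has_facet<std::money_get<char>>(loc);
}
```

In a program that never changes the locale, `cout_locale_name()` returns `"C"` and `cout_has_money_facets()` returns `true`.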

> If ICU is too large, complex, and unstable for the C++ committee, then regular users don’t stand a chance.

Regular users have more specific requirements and can handle binary compatibility breaks better if those aren't coupled with other unrelated functionality.