Wow. Considering PHP is used primarily for generating HTML, UTF-8 is the de facto standard encoding for Unicode on the web, and PHP files are usually ASCII or UTF-8, going for UTF-16 seems like a phenomenally bad idea. I can see how UTF-16 looked promising back when it was still UCS-2 and there were no surrogate pairs. These days, though, UTF-8 and maybe UTF-32 seem to be the realistic choices when working from scratch; UTF-32's advantage in some areas is probably too weak to make it a real contender unless your strings are literally linked lists, not codepoint arrays (i.e. you don't care that it uses 2-4 times as much memory or storage).
Java's use of UTF-16 is now generally considered a design flaw. It takes more memory than UTF-8, but still has the complexity of a variable-length character encoding. UTF-16 was a reasonable choice for Java at the time the decision was made, but you would think the PHP developers would have learned from that mistake.
As a Java developer, I'm very happy with the compromise that UTF-16 strikes. For day-to-day Unicode use, UTF-16 covers all the bases. In the rare cases I need to step outside of UTF-16 to the higher planes, it would work transparently for me up to the point where I start slicing strings naively.
To be honest, I've never actually had to use anything outside of the Basic Multilingual Plane. That's not from lack of breadth in my day-to-day job either - I wrote the first implementation of our web crawler/search engine at DotSpots and had to deal with fetching and indexing pages in languages from most of the BMP (mainly English, Russian and Chinese).
Anyways, I strongly disagree with your assertion that Java was "generally considered" to have "made a mistake" by using UTF-16. It's made my life far easier and the memory costs aren't an issue on today's machines. We don't live in an ASCII world most of the time and storing strings internally as UTF-8 completely ignores this fact.
[edit]
Just so it's clear: UTF-8 only wins when you are representing pure ASCII text. For everything else, it either breaks even or loses. For most Chinese text, UTF-8 is 50% larger:
For characters <= U+007F (e.g. ABC), you win by one byte.
For characters >= U+0080 but <= U+07FF (e.g. Ȁɐ), you break even at two bytes.
Everything higher than U+07FF in the BMP, < U+10000 (e.g. 丂且⬄☃), you lose (by one byte).
For characters >= U+10000 you break even again at four bytes.
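If anyone wants to check the size comparison above for themselves, here is a minimal Python sketch. The sample characters and the helper name encoded_sizes are my own picks for illustration, not anything from the article.

```python
# One sample character from each of the four ranges discussed above.
samples = ["A", "\u0250", "\u2603", "\U0001F600"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8, utf16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

The boundary cases are worth trying too: U+007F is the last code point that fits in one UTF-8 byte, and U+0080 is the first that needs two.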
If you really want to optimize for ASCII text in Java, there's always byte[] and you're free to wrap a CharSequence around it.
Have you actually worked with code points and surrogate pairs for the java.lang.String class? It's a huge hassle, and doing it correctly requires at least three times as much code as doing it wrong. The problem is that earlier versions of Java used UCS-2, so a single real-world character was always represented by a single Java char value. But then the Java maintainers later changed the string representation to UTF-16, so now a single real-world character can take up either one or multiple Java char values. A lot of new code is still being written which implicitly, and incorrectly, assumes that 1 character == 1 char. I even intentionally do it wrong sometimes myself because dealing with code points leads to such code bloat.
I've worked with the codePointAt and various traversal methods, yes. They are a pain to get right and most Java code that uses substrings is broken in this regard.
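To make the char-versus-code-point mismatch concrete without pulling in a JVM, here is a rough Python sketch that counts UTF-16 code units (which is what Java's String.length() reports) next to code points. The function names and sample strings are made up for illustration.

```python
def utf16_code_units(s):
    """Number of 16-bit code units the string occupies in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    """Number of Unicode code points (what len() gives for a Python str)."""
    return len(s)

bmp_only = "\u65e5\u672c\u8a9e"    # three BMP characters
with_astral = "a\U0001D11Eb"       # 'a', MUSICAL SYMBOL G CLEF, 'b'

for s in (bmp_only, with_astral):
    print(utf16_code_units(s), code_points(s))
```

The two counts agree for the BMP-only string and differ by one for the string with a character outside the BMP, which is exactly the case that "1 character == 1 char" code never exercises.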
In general practice, however, this is a non-issue. Characters outside of the Basic Multilingual Plane are not in common use, especially on the web. It's not a perfect programming practice, but it's very pragmatic.
Making everyone pay the development tax of variable-sized characters for any sort of multi-lingual code just means that more code will be written incorrectly.
My team has to write code that will accept and parse data from sources we don't even know about today, and continue working with minimal maintenance over a 15+ year life cycle. I don't think that's an uncommon situation. There's a lot more to software development than just toy web applications.
"In the rare cases I need to step outside of UTF-16 to the higher planes, it would work transparently for me up to the point where I start slicing strings naively."
This is a red herring. Because of combining characters, it is rarely valid to slice between Unicode code points, regardless of encoding. Even in the BMP, a semantic symbol can be composed of multiple code points.
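A small Python sketch of the combining-character point, using only the standard unicodedata module. The sample word is my own choice.

```python
import unicodedata

decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # accent folded into one code point

# Both strings render the same, but they have different code point counts.
print(len(decomposed), len(composed))

# Slicing to four code points silently strips the accent from the decomposed form.
print(decomposed[:4])
print(composed[:4])
```

Every code point here is in the BMP, so no choice of encoding would have protected the slice; only a grapheme-aware split would.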
PHP developers don't like to reinvent the wheel. They were trying to use existing libraries that stored strings as UTF-16.
Unfortunately since PHP developers don't like to reinvent the wheel, PHP is also based on a large number of 3rd party libraries that aren't unicode aware.
I recall doing a quick assessment of the UTF encodings recently and coming to a similar conclusion:
UTF-8 is nearly ideal for transmission and storage and is fairly robust for manipulation, although it can be the least straightforward to implement (not that app developers actually have to implement it).
UTF-32 is probably most useful as an internal optimization for tasks that can really benefit from a straight scan/cut/paste over even-sized memory cells. You wouldn't store it in your database, but you might want to make use of it in a document editor, for example, to speed up search+replace type operations.
UTF-16 is still substantially more heavyweight than UTF-8, but it can't be optimized into straight memory cells like UTF-32 without breaking the spec. So - unless your needs are extremely specific and you discover a sweet spot in UTF-16 after extensive profiling - it's just not a likely candidate.
One advantage of UTF-16 is that unlike UTF-8, very few characters you encounter in Real Life invoke surrogate pairs. So for the cost of a one-bit flag per string you can assume two bytes per character in your string operations for the overwhelmingly general case.
"So for the cost of a one-bit flag per string you can assume two bytes per character in your string operations for the overwhelmingly general case."
Thank you for that clear and concise explanation of the dangers of using UTF-16.
Yes, I know that wasn't your intention, but it was the end result. One of the most dangerous library failures you can have is a function that works 99.99% of the time. Or in this case, 100% of the time on the input the English-speaking developer provides but distinctly less than 100% in the field.
In this specific case, you can't actually optimize anything because all your optimizations are bugs. You can't just divide by two for character count; that's not an optimization, it's a bug. You can't just multiply by two for a substring operation, because you might chop a character in half, that's a bug. And so on. You'd need a separate type that indicates you've scanned the string to verify it never has split chars and now you might as well be on UCS-2, and that has its own dangers w.r.t. working 99.99% of the time.
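Here is a hedged Python sketch of the "multiply by two" substring bug, plus the kind of up-front scan you would need before the shortcut is safe. The helper names is_bmp_only and naive_utf16_prefix are hypothetical, invented for this example.

```python
def is_bmp_only(s):
    """True if no character in s needs a surrogate pair in UTF-16."""
    return all(ord(ch) <= 0xFFFF for ch in s)

def naive_utf16_prefix(s, n_chars):
    """Take the first n_chars "characters" by assuming two bytes each."""
    return s.encode("utf-16-le")[: n_chars * 2]

text = "ab\U0001F600cd"   # one character outside the BMP

try:
    naive_utf16_prefix(text, 3).decode("utf-16-le")
    chopped = False
except UnicodeDecodeError:
    chopped = True   # the slice ended in the middle of a surrogate pair

print(is_bmp_only(text), chopped)
```

The shortcut only holds for strings where is_bmp_only returns True, and at that point you are effectively back on UCS-2.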
Much better to use UTF-8, where the dangers are much more apparent: all you have to do is leave the base ASCII case and you're testing UTF-8. Even I, an English-speaking developer, manage to test that case (once I know it exists, anyhow). There are still ways you can screw up, but you're off to a much better start.
Let's face it though, the main string operation in web apps (or most software) is concatenation. I strongly doubt there is any point in converting back and forth between UTF-16 and UTF-8 just for that tiny advantage in addressing individual code points - and even then, not quite 100% of the time.
For search-and/or-replace (or anything that can be done with regexes), I'm pretty sure that UTF-16 has no advantage over UTF-8, as you can make the state machine operate on the bytes directly.
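As a rough illustration of matching on the bytes directly, here is a Python sketch that runs the standard re module over raw UTF-8 without decoding first. The sample text is invented.

```python
import re

haystack = "naïve café au lait".encode("utf-8")
needle = "café".encode("utf-8")

# A bytes pattern makes the regex engine work on the raw UTF-8 bytes.
match = re.search(re.escape(needle), haystack)
byte_offset = match.start()

# The byte offset maps back to a character offset by decoding the prefix.
char_offset = len(haystack[:byte_offset].decode("utf-8"))

# Replacing at the byte level still yields valid UTF-8.
replaced = re.sub(re.escape(needle), "thé".encode("utf-8"), haystack)
print(byte_offset, char_offset, replaced.decode("utf-8"))
```

This works because a UTF-8 lead byte can never be mistaken for a continuation byte, so a match on a whole encoded character cannot start or end in the middle of another one.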
A file is very very very simple to convert once. Tell your developers, "If you don't encode in UTF-16, you will have a performance penalty. Set your file encodings as UTF-16 too. You weren't doing complex internationalization work before, it's really not that big a deal."
I worked with someone who had been on the ICU project, and he argued that UTF-16 is the best compromise for most cases. If you're working primarily in the western character set, UTF-8 is attractive, but that comes at the expense of others.
And frankly, if you don't roll it yourself, what are you going to use other than ICU?
I would guess file encoding is not at all the limiting factor since often times opcode caches mean that the file is only read once anyway. The problem is that you get input for all sorts of other areas like forms, databases, web services, etc. most of which aren't UTF-16.
I think the author has missed the larger problem. The PHP development community is completely dysfunctional. I don't think that a project of the magnitude of PHP 6 is possible without fixing that fundamental problem.
Why is it dysfunctional?
- every discussion leads to bikeshedding (and almost none of the bikeshedders actually commit code to the Zend engine)
- there are 'rules', but they don't apply to most people (e.g. the 5.4 thing in the article)
- no firm hand to guide them (Rasmus has deliberately not provided this)
- the mailing list has a complete lack of civility
- highest ratio of poisonous to non-poisonous people that I have ever seen
- votes for everything
- patches are not discussed, either pre or post commit, so the code is bad, and people won't work on it.
This is the root cause of the problem. Since it's PHP, you can't get really good developers to work on the core. The people who love PHP and the people who have both the desire and skill to do the work described are separate groups.
Considering that it is as widely adopted as it is though, you'd have to agree that they succeeded in spite of all these hurdles.
PHP is a band-aid, but as a band-aid it has served its niche remarkably well. Imagine if Clojure or some other better-designed language attracted such an enormous following and were so easy to deploy.
Even today mod_php runs rings around mod_wsgi in that respect (and it's already a lot better than mod_python).
PHP has tons of shortcomings, but it is relatively good at what it does, and that's what drives it forward, not the people behind the project. Say Python and everybody thinks 'Guido van Rossum'; say Clojure and 'Rich Hickey' jumps to the foreground.
As long as I've been using PHP, I would have a hard time coming up with the full name of its lead developer. That 'lack of personality' and the chaotic development process may actually contain some hidden benefit.
Absent a strong leader, there will be many people pushing and pulling in different directions. It may have gone too far, but there is a lesson in there somewhere.
I'm not discussing the success of PHP. It has done well. However, if you are suggesting that PHP has been successful _because_ of its lack of a leader, I think you would need to justify that.
One of the number one bugs in web apps is assuming that characters can just "flow through" your application, as the article claims is a common case. Sure, if everything is UTF-8, it might work. But the fun comes when some of your data is US-ASCII, some is ISO-8859-1, and some is UTF-8. Now treating your data like binary is going to result in a garbled web page. So don't do it: decode data from octets to characters when it comes into your program, manipulate it internally as character strings, and encode characters to octets when you output your data. Text is not binary!
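A minimal Python sketch of that decode-on-input, encode-on-output discipline. It assumes each source tells you its charset; the function names and the sample inputs are made up for illustration.

```python
def to_text(octets, charset):
    """Decode incoming octets using the charset declared by their source."""
    return octets.decode(charset)

def to_octets(text):
    """Encode text for output; UTF-8 is an assumed choice for the page."""
    return text.encode("utf-8")

incoming = [
    (b"plain ascii", "us-ascii"),
    (b"caf\xe9", "iso-8859-1"),
    (b"caf\xc3\xa9", "utf-8"),
]

# Decode at the edge, work on character strings, encode once on the way out.
parts = [to_text(octets, charset) for octets, charset in incoming]
page = to_octets(" | ".join(parts))

# Letting the octets "flow through" mixes two encodings in one page.
garbled = b" | ".join(octets for octets, _ in incoming)
try:
    garbled.decode("utf-8")
    garbled_is_valid = True
except UnicodeDecodeError:
    garbled_is_valid = False

print(page.decode("utf-8"), garbled_is_valid)
```

The properly decoded page comes out as consistent UTF-8, while the pass-through version is not valid in any single encoding.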
And if I were Zed Shaw, this is the part where I'd threaten to kill you if you don't meet my demands.
Actually, the real problem is mixing Windows-1252 and UTF-8, while working with tools that assume your Windows-1252 is really ISO-8859-1. ASCII, after all, is just a subset of UTF-8, so there's no handling required for it if you're already assuming UTF-8.
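For the curious, a short Python sketch of why the Windows-1252 versus ISO-8859-1 mix-up bites. The sample bytes are my own.

```python
raw = b"\x93quoted\x94"   # curly double quotes as Windows-1252 bytes

as_cp1252 = raw.decode("cp1252")        # real quotation marks, U+201C and U+201D
as_latin1 = raw.decode("iso-8859-1")    # invisible C1 control characters instead

# The same bytes are not valid UTF-8 at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Pure ASCII needs no special handling: it decodes identically either way.
ascii_bytes = b"hello"
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

print(repr(as_cp1252), repr(as_latin1), valid_utf8, same)
```

Note that the ISO-8859-1 decode never fails, which is what makes a tool with the wrong assumption so hard to spot.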
How on earth did they decide this was a good idea? The points about getting rid of register_globals and safe_mode are great, but why add a feature to a programming language that is highly likely just to result in lots of awful code?
There are ways to use goto effectively... in fact, the other day I was working with an awkwardly nested try/catch that would have read much more clearly as a goto (and run more efficiently; an empty Exception was being used to trigger the catch).
I've seen this problem "solved" with a do while false and a break, but isn't that even hackier and less expressive?
At this point I think goto-phobia is well understood enough that adding it for occasional use wouldn't ruin the language.
Ruby distinguishes between begin/rescue (analogous to Java's try/catch) and throw/catch. I never understood it until discovering that throw/catch is perfect for dealing with nested cases where rescue or break fail. Also, it works through function calls and isn't interpreted as an error, which makes it great for cases like Ramaze's redirect or render calls; unlike in Rails, they interrupt execution immediately.