Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President
UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective, it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.
The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."
Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word then copy pasted it as Word loves to change quotes and apostrophes ' into smart quotes and smart apostrophes with direction like this ’.
I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json
When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?
Content-Type: application/json; charset=utf-8
I blogged about the "Importance of being UTF-8" about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:
"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"
See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.
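If you don't have Charmap handy, here is a small Python sketch of the same lookup. The JSON fragment is a shortened copy of the tweet text above, and the variable names are mine, purely for illustration.

```python
import json
import unicodedata

# Shortened copy of the tweet JSON; the raw string keeps \u2019 as a literal escape.
raw = r'{"text": "binge that\u2019s left us deeper in debt"}'

# The JSON parser turns the \u2019 escape into a single character.
tweet = json.loads(raw)["text"]
apostrophe = tweet[len("binge that")]

print(tweet)
# Print the code point and its official Unicode name.
print(f"U+{ord(apostrophe):04X}", unicodedata.name(apostrophe))
```

The point is that the escape in the JSON is just a spelling of one code point. No bytes have been chosen for it yet; that only happens when the text gets encoded.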
If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?
...then load it into any free HEX editor (or even an online one!) I see this:
Note that the part where the ’ was is actually three full bytes! E2 80 99.
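You can check the three-byte claim without a hex editor. This is a minimal Python sketch, assuming Python 3.8 or later for the separator argument to `bytes.hex`; the variable names are just for illustration.

```python
text = "that\u2019s"            # "that’s" with the curly apostrophe
encoded = text.encode("utf-8")

# Six characters go in, eight bytes come out: the apostrophe alone takes three.
print(len(text), "characters ->", len(encoded), "bytes")

quote_bytes = "\u2019".encode("utf-8")
print(quote_bytes.hex(" ").upper())
```

Every plain ASCII letter still costs exactly one byte, which is why only the apostrophe stands out in the hex view.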
Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.
The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.
This value of U+2019 expands to 0010 0000 0001 1001, as I said, which then expands according to these rules:
zzzzyyyy yyxxxxxx ->
1110zzzz
10yyyyyy
10xxxxxx
Which gives us this
11100010 -> E2
10000000 -> 80
10011001 -> 99
hence, "that’s" is encoded as
74 68 61 74 E2 80 99 73
The E2 80 99 in the middle is the ’. Read those bytes back in, this time as Extended ASCII (the ANSI Windows-1252 code page), and the ’ expands into three characters:
that’s
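The zzzzyyyy yyxxxxxx expansion rule from a few lines up can be sketched in a handful of lines of Python. The function name `encode_3byte` is hypothetical, and this only handles the three-byte range (U+0800 through U+FFFF, minus the surrogates); a real encoder also has one-, two-, and four-byte forms.

```python
def encode_3byte(code_point: int) -> bytes:
    """Pack a code point in U+0800..U+FFFF into the three-byte UTF-8 form."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not in the three-byte UTF-8 range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")

    zzzz = (code_point >> 12) & 0b1111       # top 4 bits
    yyyyyy = (code_point >> 6) & 0b111111    # middle 6 bits
    xxxxxx = code_point & 0b111111           # low 6 bits

    return bytes([
        0b11100000 | zzzz,     # 1110zzzz
        0b10000000 | yyyyyy,   # 10yyyyyy
        0b10000000 | xxxxxx,   # 10xxxxxx
    ])


manual = encode_3byte(0x2019)
print(manual.hex(" ").upper())

# Sanity check against Python's built-in encoder.
print(manual == "\u2019".encode("utf-8"))
```

Comparing against the built-in encoder is the useful part: if the hand-rolled bit-twiddling ever disagrees with it, the hand-rolled version is the one that's wrong.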
Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.
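That one-sentence version is easy to reproduce. Here is a minimal Python sketch of the bug and of undoing it, assuming the wrong decoder really was Windows-1252 (Python calls it `cp1252`); the variable names are only for illustration.

```python
original = "that\u2019s"                 # "that’s" as the author typed it

# What goes over the wire: UTF-8 bytes.
wire_bytes = original.encode("utf-8")
print(wire_bytes.hex(" ").upper())

# The bug: reading those UTF-8 bytes with the wrong code page.
garbled = wire_bytes.decode("cp1252")
print(garbled)

# The repair: reverse the wrong step, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The repair only works because all three stray bytes happen to be defined in Windows-1252. A handful of byte values are unassigned in that code page, so don't count on this trick as a general fix; decode correctly the first time.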
Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here are some news articles about Mass Relevance and their "Tweet Curation."
Testing, testing, testing, my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)
UPDATE: The vendor said this in the comments. Very well said.
"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."
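The vendor says ISO-8859-1 while the post above says Windows-1252, and a short Python sketch shows how both can be true. Browsers treat the `iso-8859-1` label as Windows-1252 (that's in the WHATWG Encoding Standard), whereas a strict ISO-8859-1 decoder, like Python's, turns the last two bytes into invisible control characters. The variable names here are mine.

```python
utf8_bytes = "that\u2019s".encode("utf-8")

# Strict ISO-8859-1: 0x80 and 0x99 become C1 control characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(repr(as_latin1))

# Windows-1252, which is what browsers actually use for that label.
as_cp1252 = utf8_bytes.decode("cp1252")
print(as_cp1252)

# With the charset declared correctly, the text survives.
as_utf8 = utf8_bytes.decode("utf-8")
print(as_utf8)
```

So the fix the vendor describes (sending `charset=utf-8` in the Content-Type header, or a charset meta tag in the page) is just making sure the third decode is the one that runs.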
Text encoding is fun for all ages. Enjoy!
* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
FYI - The link to "Importance of being UTF-8" is dead.
Look around and you can see U+2010 ...
https://twitter.com/#!/travis/status/86851708755513344
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag.
So I'll post it then: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
I blogged about the "<a href="http://The%20Importance%20of%20being%20UTF-8">Importance of being UTF-8</a>" about five years ago.
Except for the URL which is incorrect (should've been "/blog/TheImportanceOfBeingUTF8.aspx") the quotes (i.e. ") should've been "'s. Also the live preview has trouble with < and < so I'm crossing my fingers when posting this comment...
I guess that US developers are a bit less used to handling strings with characters outside of the "common area" between charsets and encodings (0-9, A-Z, a-z, punctuation and such), as there are no diacritical marks (accents and similar) in your everyday language. The problem you describe here only surfaced due to the use of "nice typography" by some word processor (or OS).
In Portugal (where I live), almost any sentence in Portuguese has at least an accented character, and so it's much easier to face a test case where this exact problem arises. However, I should note that not many people in my team are able to tell what's going on at the byte level, and why sometimes these strange characters come up in some given tool output.
I'm about to make a presentation to my team at work about encodings... I really hope I can help to clear up some doubts, and I believe that this will be a perfect example of why the proper handling of character encodings can be so important in a project.
Hanselman's Blog
<a href="http://hanselman.com">Hanselman's Blog</a>
Only way is to put spaces... That's soooooo 1993 :-P
< a href="http://hanselman.com">Hanselman's Blog< /a>
Politics and Tech -- how little each side understands the other....
This has been a Public Service Announcement. Please code responsibly: http://validator.w3.org/
I'm a German developer and I have banged my head millions of times because of the wrong character sets on the database, the text and the browser.
The ä, ö, ü, and the ß are really painful in German texts and even in my name! Only one character set in the world would be the best. Praise to the Unicode.
More importantly: Was that what was actually typed, or did Office or the Mac or whatever "help" by changing from something small and compatible to something large and incompatible? "Smart quotes" just aren't.
Great write-up, thanks!
I was going to refer to Joel's article, but RobIII (hi rob!) beat me to it.
Why send curly quotes in the first place, though? Everyone should know by now that it'll cause a problem somewhere.
Because as developers, it is our job to make tools that work, no matter what. Real world text is not limited to ASCII either.
"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag." [emphasis mine]
And I say that as a writer with no dev background or responsibilities.
I wasn't asking why anyone should bother fixing character encoding (they should), I wanted to know whether the intent of the author was presumptuously altered in a destructive way.
Also check out this blog post about encoding :The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Console.WriteLine(Encoding.Default.GetString(Encoding.UTF8.GetBytes("’")));
:-)
This is just one among thousands (at least) of problems caused by MS Word (and likely other apps, but that's the big offender) doing this character replacement in the first place.
The funny thing is, it's already an option to toggle this behavior, it's just that the default is to do the replacement.
http://office.microsoft.com/en-us/word-help/change-curly-quotes-to-straight-quotes-and-vice-versa-HA010173242.aspx
As we've learned over the years, DEFAULTS MATTER. If we were to ask anyone involved in this tweet whether they ever intended a 'smart quote' in the sentence instead of a normal apostrophe (you know, that character on the keyboard they actually hit), I'm 100% sure none of them would have said "yes, it was important to have a 'smart quote' included!". More importantly, I'm equally confident that if the tweet creator(s) understood character sets *and* that an app was or had made the change, they would have said 'um, no thanks, keep the regular apostrophe that works in most every character set just fine'
This is the kind of decision where, even if you think smart quotes look better, the marginal cost to the users in the resultant pain has FAR outweighed any marginal style benefit.
Yes, it'd be great to live in a world where everything handled utf-8 just great, and certainly any string-handling apps should throw in some testing with 'odd' characters (esp. these smart quotes, ugh), but IMHO it's important to identify the root cause here as the apps that make such changes without the user asking for it, and the vast majority of the time, not even noticing the replacement happened.
Sure, fixing vNext of Word et al won't really fix this problem, but I think it's just as important of a lesson to take away (liberal on input, conservative on output, where smart quotes violate the latter :) as 'you should test with utf-8'.
</rant> :)
"Smart" apostrophes and quotation marks are the very face of Satan in the material world. Whatever numbnut thought them up should be dragged in manacles to the Hague to answer for his crimes.
And I say that as a writer with no dev background or responsibilities.
*Sigh*
I really wish developers and writers (who should know better) would stop calling them "smart" or "curly" apostrophes/quotation marks. They're simply apostrophes and quotation marks. Real ones. Actual ones. The characters ' and " are not actually apostrophes and quotation marks. They're the result of engineers who made the typewriter trying to save space by combining real apostrophes/quotes with primes. They come from a technical limitation in the technology at the time, not from typography (just like the double space after a period).
Primes are used for units of measure. For example 6′ 2″ (shown with real primes).
1) Yesterday I took a code sample from the web and pasted it into SSRS's function evaluator and I was confused as to why the text box was red-underlining a string literal. Turns out the code sample was using "smart" left-right quotes.
2) I just saw my list of pod-casts from "This American Life", and the title on the web:
"Father's Day 2011"
Appears as this on iTunes's track listing:
"#438: Father's Day 2011"
Although now that I think of it - I think that is just a form of HTML encoding that went out as plain text...
Text is hard.
I wonder whether John Boehner went out of his way to insert the "proper" apostrophe. Is he someone who has memorized the keyboard shortcuts? What a thought! (Well, why not? We're not widdling about with typewriters anymore. This is not some bloaty feature of Microsoft Word - when type was set by hand with metal blocks and things, we had distinct opening and closing quotation marks, and it's a bit more pleasant to read. In a lot of languages, they use « Guillemets ».)
Note that the Unicode database labels character 0x0027 as "APOSTROPHE". Also, we call the curly quotes Smart Quotes, because that's what Word (the manufacturer of 90% of the world's curly quotes) calls them.
Sure, x2032/′ and x2033/″ as you have typed are "PRIME" and "DOUBLE PRIME", but doesn't that undermine the point you were making, that x0027 is the prime character?
Yes, it's a combined-purpose character, and it'd be more attractive to many if we could use precise typography, but it isn't yet practical!
Just think of the keyboard that supported all EIGHT types of dash/hyphen! Would that even work on a phone?
In a general-purpose environment, I'd be happier to have an ‽ interrobang, ؟ irony mark, ⚠ warning, ☠ skull & xbones, ☡ caution, ☢ radioactive, ☣ biohazard, ☤ caduceus, and other practical symbols supported before I started putting too much energy into pure æsthetics. Can we have both? Even better.
Eight dashes/hyphens: ‧ hyphenation point (break letters/syllables) ‐ hyphen ⁃ hyphen bullet − minus ‒ figure dash (number separator) – en dash (range, "to") — em dash (parenthetical, break) ― quotation dash (horizontal bar)
"Hello Ivar ├â╞È├óΓé¼┬ªsell,"
Funny coincidence
/ Ivar Åsell
The responsibility lies with the vendor that created this system. They had absolutely all the information they needed to display it correctly and didn't. Would you have complained that everyone needs to learn to write English if a tweet in a different language had come through garbled?
Furthermore, MS Word is a tool for writing documents where an actual apostrophe/quote/etc. is likely to matter. It was not created for editing tweet text where it probably doesn't. It was a smart default for their tool and audience.
The point I was making was that there is another problem: whether a tool presumptuously made a destructive change that the author did not request.
That Word default made sense in a less-connected, Word-centric world, but has caused nothing but trouble as systems are still learning to cope with character encoding, particularly when content is mixed together from multiple sources at different layers. Word certainly doesn't market itself as a narrowly-targeted tool—its very name suggests that it wants to be your gateway to all text entry.
Word isn't a silo. It has to play in a world still grappling with encoding.
<a href="http://larud.net/Blog/archive/2011/07/11/razor-view-engine-amp-unicode.aspx" title"Razor View Engine & Unicode?">Razor View Engine & Unicode?</a>
Chrome browser fails to render the same text in tab.
Thanks
Anuj Pandey
Almost felt like reading a fairy-tale.
Thanks for the wonderful and in-depth narration!
Cheers!
"Word isn't a silo."
Pretty much every program aimed towards word processing or publishing uses curly quotes. So, no, Word isn't a silo - it behaves just as it's expected, given its culture.
This isn't just about curly quotes, either - that's just a distraction, really. Developers that fail to render curly quotes will also fail to render accented characters. They'll fail to render mathematical symbols (the proper ones, not x/*, +, -, and /.) They'll fail to render Greek, Japanese, Korean, and so on. Should Japanese people type in Romaji to deal with "a world grappling with encoding"? Should Word convert all Japanese text to Romaji when it's copied because some websites aren't written with Japanese people in mind?
If it's purely aesthetic, why should it matter if people use it? Modern software and websites aren't in an ASCII-only silo either.