CSI: Visual Studio - Unable to translate Unicode character at index X to specified code page
A customer emailed me a weird one. I tend to have a sense for when something is up and when an obscure thing will turn into something interesting.
The person says:
...mysteriously most of my projects refuse to build. "The build stopped unexpectedly because of an internal failure... something about unicode... blah blah"
There are a few messages out there on the web about it -- even a really old hot fix. What's the best way to proceed with the VS team / MS? Is there anyone actively interested in glitches like this?
My spidey-sense is tingling. First, when something says "internal failure" it means some fundamental expectation wasn't met. Garbage in perhaps? He says "most of my projects" which implies it's not a specific project. There's also the sense that this is a "suddenly things stopped working" type thing. Presumably it worked before.
I say:
"Have you checked all the source files to make sure one isn't filled with Unicode nulls or something?"
He says no, but sends a call stack (which is always nice when it's sent FIRST, but still):
Error 1 The build stopped unexpectedly because of an internal failure.
System.Text.EncoderFallbackException: Unable to translate Unicode character \uD97C at index 1321 to specified code page.
at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
at System.Text.UTF8Encoding.GetByteCount(Char* chars, Int32 count, EncoderNLS baseEncoder)
at System.Text.UTF8Encoding.GetByteCount(String chars)
at System.IO.BinaryWriter.Write(String value)
at Microsoft.Build.BackEnd.NodePacketTranslator.NodePacketWriteTranslator.TranslateDictionary(Dictionary`2& dictionary, IEqualityComparer`1 comparer)
at Microsoft.Build.Execution.BuildParameters.Microsoft.Build.BackEnd.INodePacketTranslatable.Translate(INodePacketTranslator translator)
at Microsoft.Build.BackEnd.NodePacketTranslator.NodePacketWriteTranslator.Translate[T](T& value, NodePacketValueFactory`1 factory)
at Microsoft.Build.BackEnd.NodeConfiguration.Translate(INodePacketTranslator translator)
at Microsoft.Build.BackEnd.NodeProviderOutOfProcBase.NodeContext.SendData(INodePacket packet)
...
OK, so it doesn't like a character. But a character in WHAT? Well, we'd assume a source file, but it's important to remember that there are other pieces of input to a compiler, like path names, environment variables, commands passed to the compiler as switches, etc.
It says Index 1321 which seems pretty far into a string before it gets mad. I asked a few people inside and Sara Joiner says:
It looks like the only place in BuildParameters that we call TranslateDictionary is when transferring the state of the environment [variables] across the wire.
Ah, so this is splitting up name-value pairs that are the environment variables! David Kean says "ask him what his PATH looks like." I ask and I get almost 2000 bytes of PATH! It's a HUGE path, it looks like it may even have been duplicated and appended to itself a few times.
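If you suspect your own environment, a few lines of script can do the hunting for you. Here's a minimal sketch in Python (an illustration only, not what MSBuild does internally). The helper name find_surrogates and the sample values are made up for this example. It flags any surrogate code point it finds, on the assumption that a healthy environment variable has none.

```python
import os

def find_surrogates(env):
    """Return (name, index, code point) for each surrogate code point in env values."""
    hits = []
    for name, value in env.items():
        for index, ch in enumerate(value):
            # U+D800..U+DFFF are surrogates; strict UTF-8 cannot encode them.
            if 0xD800 <= ord(ch) <= 0xDFFF:
                hits.append((name, index, f"U+{ord(ch):04X}"))
    return hits

# Hypothetical sample; pass dict(os.environ) to check the real environment.
sample = {"PATH": "C:\\bin\ud97c;C:\\Windows", "TEMP": "C:\\Temp"}
print(find_surrogates(sample))
```

Running it against dict(os.environ) instead of the sample would list the variable name and index of anything suspicious, which is the piece of information the original error message left out.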
Here's just a bit of the PATH in question. See anything?
\;C:\PROGRA~1\DISKEE~1\DISKEE~1\;C:\Program Files (x86)\Windows Kits\8.0\Windows
Performance Toolkit\;C:\Program Files\Microsoft SQL
Server\110\Tools\Binn\;C:\Program Files\Microsoft\Web Platform
Installer\;C:\Program Files\TortoiseSVN\binVN\???p??;C:\Program
Files\TortoiseSVN\bin;C:\PHP\;C:\progra~1\NVIDIA
Corporation\PhysX\Common;C:\progra~2\Common Files\Microsoft Shared\Windows
Live;C:\progra~1\Common Files\Microsoft Shared\Windows
Live;C:\q\w32;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;C:\progra~2\WIDCOMM\Bluetooth
Software\;C:\progra~2\WIDCOMM\Bluetooth
See those ??? marks? Those don't feel like real question marks to me. I open the result of "SET > env.txt" as a binary file in Visual Studio and the bytes are 3Fs, which are ? marks.
This makes me think that there's Unicode goo in the PATH that was converted to ANSI when it was piped. Phrased differently, this text file isn't reality.
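To see how real Unicode can turn into plain 3Fs, here's a small sketch in Python. It assumes the redirect did a lossy conversion to code page 1252, and that the damaged entry held a lone surrogate followed by the characters the Windows UI displayed. Both are guesses for illustration, not facts about the customer's machine.

```python
# Assumed contents of the damaged PATH entry: a lone surrogate plus
# characters that code page 1252 has no byte for.
real = "binVN\\\ud97c侱ᤣp䥠؉"

# errors="replace" mimics a best-effort conversion: every character the
# code page cannot express becomes "?" (byte 0x3F).
piped = real.encode("cp1252", errors="replace")

print(piped)
print(piped.hex(" "))
```

Six different characters go in and five identical 3F bytes plus a "p" come out, so the text file can't tell you what was really there.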
However, elsewhere in the Windows UI his PATH variable looks different.
C:\Program Files\TortoiseSVN\binVN\�侱ᤣp䥠؉;
Sometimes that corruption in the path looks like this and you might assume it's Chinese. No, it's corruption that's getting interpreted as Unicode. Interestingly the error said the naughty character was 0xD97C which is �xD97C; � which implies to me that something got stripped out at some point in processing and turned into the Unicode equivalent of 'uh...' Regardless, it's wrong and it needs to be removed.
I ask him if cleaning his PATH worked and the customer just sent me a one-line response via email...the best kind of response:
========== Build: 12 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
Yay! I hope this helps the next person who goes aGoogling for the answer and thought they were alone. Thanks to David Kean, Sara Joiner and Srinivas Nadimpalli for looking at the call stack and guessing at solutions with me!
Any insights, Dear Reader?
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
Fortunately, that's what ControlSet### are for.
Yes. If the path is garbage the build should fail. Because there is no way for the system to know if it's creating the correct output when part of the path is unreadable.
Otoh the error message should state clearly that the path is the one to blame.
1. My first suspect is TortoiseSVN.
The problem shows in between two TortoiseSVN entries, and the binVN seems to suggest the traces of yet another SVN before.
TortoiseSVN\binVN\???p??;C:\Program Files\TortoiseSVN\bin
Circumstantial evidence, I agree. But many applications ported from Linux (or Linux developers) tend to have problems with "char* is UTF-8" and "what the heck is this wchar_t abomination"
2. 0xD97C is indeed half a surrogate pair, but I would argue that nothing should fail.
The path is valid from the file system perspective (as NTFS is not surrogate-aware). I can really have that thing on disk, and things would run from there.
And for any application working with it using only wchar_t it would be no problem either. The only reason for this failure is that VS tried to convert it to UTF-8 (see UTF8Encoding).
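The UTF-8 point is easy to reproduce outside of .NET. This is a rough sketch in Python, not the MSBuild code: the padding and the trailing path are invented, and only the index 1321 and the character U+D97C come from the error message.

```python
# Ordinary text with a lone high surrogate at index 1321 (padding is made up).
value = "x" * 1321 + "\ud97c" + ";C:\\Windows"

try:
    value.encode("utf-8")  # strict by default, so a lone surrogate raises
except UnicodeEncodeError as err:
    bad_index = err.start
    bad_char = f"\\u{ord(value[err.start]):04X}"
    print(f"Unable to translate {bad_char} at index {bad_index}")

# Kept as UTF-16 code units, the same string round-trips without complaint.
round_trip = value.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "surrogatepass")
print(round_trip == value)
```

The string is harmless as long as it stays in 16-bit code units. It only becomes an error when something insists on strict UTF-8.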
The result is invalid? Yes and no. Application A should not fail because application B did something stupid. Normal path processing would be: split at ;, check each entry and see if it exists on disk, and if I can run something from there, ignore if not.
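That "normal path processing" could look something like the sketch below, written in Python for illustration. The function name usable_path_entries and its exists parameter are hypothetical, and the split on ";" is deliberately naive (it ignores quoted entries that contain a semicolon).

```python
import os

def usable_path_entries(path_value, exists=os.path.isdir):
    """Split a PATH-style string on ';' and keep only entries that pass `exists`."""
    kept = []
    for entry in path_value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        try:
            if exists(entry):
                kept.append(entry)
        except (ValueError, OSError):
            # A garbage entry is skipped instead of failing the whole lookup.
            continue
    return kept

# Demo with a fake existence check so it runs the same on any machine.
demo = usable_path_entries(
    "C:\\bad\ud97c;C:\\Windows;;",
    exists=lambda entry: entry == "C:\\Windows",
)
print(demo)
```

The bad entry is simply dropped and the good one survives, which is the behavior being argued for here.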
3. "character was 0xD97C which is �xD97C; �"
Nope, � is U+FFFD, used as replacement character when one encountered an invalid Unicode sequence (like for instance a broken surrogate pair :-)
http://www.fileformat.info/info/unicode/char/fffd/index.htm
So the text pasted in this blog went through yet another application that was UTF-16 aware.
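Here's a small Python sketch of that substitution. It assumes the intermediate application decoded UTF-16 leniently, which is a guess about its behavior rather than something known.

```python
# The two bytes of a lone U+D97C in UTF-16 little-endian, followed by "p".
raw = "\ud97c".encode("utf-16-le", "surrogatepass") + "p".encode("utf-16-le")

# U+D800..U+DBFF is the high (leading) half of the surrogate range.
is_high_surrogate = 0xD800 <= 0xD97C <= 0xDBFF

# A lenient decoder swaps the unpaired surrogate for U+FFFD and carries on.
shown = raw.decode("utf-16-le", errors="replace")

print(is_high_surrogate, f"U+{ord(shown[0]):04X}", shown[1])
```

The original U+D97C is gone after this step. Anything downstream only ever sees U+FFFD.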
Environmental variable from locals: Environment Count = 55
In theory this could have happened in any of those places. PATH is a likely choice. But it could have been anywhere. Without the debugger it would be a needle in the haystack.
Don't let the look deceive you :-)
Always go copy/paste or even better, bypass that (if you can), and hex dump.
If you do that with the character in your post, it is FFFD.
For instance
- curl -o badCp.html "http://www.hanselman.com/blog/CSIVisualStudioUnableToTranslateUnicodeCharacterAtIndexXToSpecifiedCodePage.aspx"
- hexdump -C badCp.html | less
Shows that the bytes after binVN\ are <EF BF BD>
That really is U+FFFD (see http://www.fileformat.info/info/unicode/char/fffd/index.htm)
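If curl and hexdump aren't handy, the same check can be done on a pasted string. This is a tiny sketch in Python; hex_bytes is a made-up helper name and it needs Python 3.8 or later for the separator argument to hex().

```python
def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()

print(hex_bytes("\ufffd"))  # the replacement character
print(hex_bytes("?"))       # a real question mark
```

The replacement character comes out as three bytes and a real question mark as one, so the two are easy to tell apart even when they look alike on screen.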
http://everythingfonts.com tries to render things as Unicode text, but the text is not valid, and various browsers will react to that in various ways (it is very likely that we don't even see the same thing :-)
So the browser takes U+D97C and tries to render it, but that is an invalid stand-alone surrogate (they should come in high/low pairs), so it uses U+FFFD instead. Take a look at anything between http://everythingfonts.com/unicode/0xD800 and http://everythingfonts.com/unicode/0xDFFF (the surrogate range) and you will see the same thing, a "black diamond with a white question mark". Same as http://everythingfonts.com/unicode/0xFFFD.
I usually recommend http://www.fileformat.info, as it uses SVGs to render the characters.
Otherwise there are too many layers trying to "fix" things (browser, OS text engine, ...)
See in this case http://www.fileformat.info/info/unicode/char/d97c/index.htm (it also says "U+D97C is not a valid unicode character.")
And http://www.fileformat.info/info/unicode/char/fffd/index.htm (the "real black diamond with white question mark" :-)
Mihai