You're just another carriage return line feed in the wall
I love getting pull requests on GitHub. It's such a lovely gift when someone wants to contribute their code to my code. However, it seems there are three kinds of pull requests that I get.
- Awesome, appreciated and wanted.
- Not so good, thanks for trying, but perhaps another time.
- THE WALL OF PINK
I'd like to talk about The Wall of Pink. This is a pull request that is possibly useful, possibly awesome, but I'll never know because 672 lines (GitHub tells me) changed because they used CRs and I used LFs or I used CRLF and they used LF, or I used...well, you get the idea.
There is definitely a problem here. But what's the problem? Well, it's kind of like endianness, except we're still talking about it in 2013.
"A big-endian machine stores the most significant byte first—at the lowest byte address—while a little-endian machine stores the least significant byte first." - Endianness
Did you know for a long time Apple computers were big endian and Intel computers were little endian? The Java VM is big endian. I wrote a shareware code generator 16 years ago that generated a byte array on an Intel PC that was later entered into a PalmPilot running a Motorola 68328. This was the last time I thought about endianness in my career. Folks working on lower-level stuff do think about this sometimes, admittedly, but the majority of folks don't sweat endianness day to day.
TCP/IP itself is, in fact, big endian. There was a time when we had to really think about the measurable performance hit involved in using TCP/IP on a little-endian processor. But we don't think about that anymore. It's there but the abstraction is not very leaky.
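To make the byte-order quote above concrete, here is a minimal Python sketch. The value 0x0A0B0C0D is just an illustrative number, not something from the original code generator.

```python
import struct

value = 0x0A0B0C0D  # arbitrary illustrative 32-bit value

# Big-endian: most significant byte first (also called network byte order).
big = value.to_bytes(4, byteorder="big")
# Little-endian: least significant byte first (what Intel x86 uses in memory).
little = value.to_bytes(4, byteorder="little")

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a

# In the struct module, the "!" prefix means network order, which is big-endian.
assert struct.pack("!I", value) == big
```

Same number, same four bytes, opposite order. That's the whole disagreement.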
It's years later, but CR/LF issues plague us weekly. That Wall of Pink I mentioned? It looks like this. I had to scroll 672 lines before I saw the +green where the added lines were added. Who knows what really changed here though? Can't tell since this diff tool thinks every line changed.
Sigh.
Whose fault is this?
Perhaps we blame Émile Baudot in 1870 and Donald Murray in 1899 for adding control characters to instruct a typewriter carriage to return to the home position plus a line feed to advance the paper on the roller. Or we blame Teletype machines. Or the folks at DEC, or perhaps Gary Kildall and CP/M for using DEC as a convention. Then the bastards at IBM who moved to ASCII from EBCDIC and needed a carriage return when punch-cards fell out of favor.
The text files we have today on Windows still have a CR LF (0D 0A) after every line. But Apple just uses a line feed (LF) character. There's no carriage to return, but there are lines to advance so it's a logical savings.
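If you want to see which convention a file actually uses, a rough Python sketch like this counts the endings in the raw bytes. The function name count_line_endings is made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR sequences in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
    }

windows_text = b"first\r\nsecond\r\n"
unix_text = b"first\nsecond\n"

print(windows_text.hex(" "))             # the 0d 0a pairs are visible here
print(count_line_endings(windows_text))  # {'CRLF': 2, 'LF': 0, 'CR': 0}
print(count_line_endings(unix_text))     # {'CRLF': 0, 'LF': 2, 'CR': 0}
```

To check a real file, read it in binary mode (open(path, "rb")) so Python doesn't translate the endings before you count them.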
Macs and PCs are sharing text more than ever. We live in a world where Git is FTP for code, we're up a level, above TCP/IP where Endianness is hidden, but still in text where CR LF's aren't.
We store our text files in different formats on disk, but later when the files are committed to Git, how are they stored? It depends on your settings and the defaults are never what's recommended.
You can set up a .gitattributes per repo to do things like this:
*.txt -crlf
Or you can do what GitHub for Windows suggests with text=auto.
# Auto detect text files and perform LF normalization
* text=auto
What's text=auto do?
This ensures that all files that git considers to be text will have normalized (LF) line endings in the repository. The core.eol configuration variable controls which line endings git will use for normalized files in your working directory; the default is to use the native line ending for your platform, or CRLF if core.autocrlf is set.
It uses the native line ending for your platform. But if you spend a few minutes googling around you'll find arguments several ways with no 100% clear answer, although most folks seem to believe GitHub has the right one.
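As a mental model only, the round trip described in that quote can be sketched in Python like this. This is not Git's implementation; the function names to_repository and to_working_tree are hypothetical, and the sketch ignores binary detection entirely.

```python
import os

def to_repository(text: bytes) -> bytes:
    """Normalize CRLF to LF, roughly what happens to text files on check-in."""
    return text.replace(b"\r\n", b"\n")

def to_working_tree(text: bytes, eol: bytes = os.linesep.encode()) -> bytes:
    """Convert text to the chosen working-directory ending on checkout.

    The default is the platform's native ending; pass eol explicitly to
    mimic a different setting.
    """
    normalized = to_repository(text)
    if eol == b"\n":
        return normalized
    return normalized.replace(b"\n", eol)

edited_on_windows = b"line one\r\nline two\r\n"
stored = to_repository(edited_on_windows)
print(stored)                             # b'line one\nline two\n'
print(to_working_tree(stored, b"\r\n"))   # b'line one\r\nline two\r\n'
```

The point is that the stored form is the same no matter who committed it, so the diff only shows real changes.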
If this is the right answer, why isn't it a default? Is it time to make this the default?
This is such a problem that GitHub for Windows has dedicated "normalize your repo's CRLF" code. Did you know that? They'll fix them all and make a one-time commit to fix the line endings.
I think a more complete solution would also include improvements to the online diff tool. If the GitHub repo and server know something is wrong, that's a great chance for the server to suggest a fix, proactively.
Solutions
Here are some possible solutions as I see them.
- Make Windows switch all text files and decades of convention to use just LF
- Git needs to install with correct platform-specific defaults without needing .gitattributes file
- Have the GitHub web application be more proactive in suggesting solutions and preventing badness
- Have the GitHub for Windows desktop application proactively notice issues (before I go to settings) and offer to help
- Make the diff tool CR/LF aware and "do the right thing" like desktop diff tools that can ignore line ending issues
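To show what "CR/LF aware" could mean for a diff tool, here's a minimal Python sketch using the standard difflib module. The function changed_line_count and its ignore_eol flag are hypothetical names for illustration, not anything GitHub exposes.

```python
import difflib

def changed_line_count(old: str, new: str, ignore_eol: bool = False) -> int:
    """Count lines a line-based diff would report as changed."""
    if ignore_eol:
        # Treat CRLF, CR and LF as the same line ending before comparing.
        old = old.replace("\r\n", "\n").replace("\r", "\n")
        new = new.replace("\r\n", "\n").replace("\r", "\n")
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed

original = "a\r\nb\r\nc\r\n"   # file as committed with CRLF
submitted = "a\nb\nX\n"        # contributor changed one line, but saved with LF

print(changed_line_count(original, submitted))                   # 3
print(changed_line_count(original, submitted, ignore_eol=True))  # 1
```

Three lines of pink versus one line of real change. Scale that up to 672.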
Until something is done, I'll always tense up when I see an incoming pull request and hope it's not a Wall of Pink.
Thoughts?
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
At worst it could be an option!
As you've pointed out, on many occasions, it's a diverse world out there, and, as GitHub can and does accept contributions from multiple platforms, ignoring minor differences in formatting when diffing files is a way of removing one of those annoyances we get when sharing things between OS's.
Imagine rejecting someone's submission because they'd revised the code under Mono on Linux and the file was originally edited in Visual Studio on Windows leading to the 'sea of red'. Sure, the line endings may be different, but they don't matter :-)
So my vote, in this world of lots of different people on lots of different operating systems, is to make the diff tool aware of these potential differences and to allow them to be ignored. Make it an on/off button to keep people (and Whitespace developers) happy. I don't really like the other options as they are more a "my platform is right, yours is wrong" viewpoint.
Dash
PS I really, really wish I could remove my first two comments :-D
http://stackoverflow.com/questions/553548/what-does-visual-studio-mean-by-normalize-inconsistent-line-endings
Besides that, any language where non-visual characters (indent included and mainly) are relevant for syntax is bad, to say the least.
When you notice CR/LF problems popping up: clean your code, rebase it to squash commits and force push to update your feature branch from where the pull request was made.
Then check your pull request again and repeat when needed.
When I create a pull request I want the receiving end to focus on my functional changes. And we have the tools to prevent this diff explosion from happening. Although I agree it would be nice if Github would catch this in their diff.
In a way it reminds me how Pascal would complain about a missing period '.' at the end of programs, but never default to putting it there for me ;)
Regarding languages with significant whitespace like Python: I would be surprised if any of these languages didn't have some kind of compiler fix for different line endings. The tabs should be the same on all platforms shouldn't they?
That said: I think that windows (which is my platform of choice) should stop using CRLF for line endings, it seems a bit silly, using a standard tied to an old mechanical platform.
Jesper Hauge
The tools should handle it. It shouldn't be the pull request-maker's job.
The "good" solution will be: Developers, configure all your tools to use LF. Any developer tool minimally mature is able to handle "the problem" and transform the code transparently...
The "reasonable" solution I'm afraid is to make GitHub aware of the differences. Ideally, it should be possible to "set your project" to use LF (or, gods forbid CR/LF)
At the very least, GitHub should give you the option to ignore LF differences in diff... But diff in GitHub is a joke. Curiously, I wrote about it earlier this week on my blog (http://wrongsideofmemphis.com/2013/02/19/github-for-reviewing-code/)
In short, the diff page and review needs a lot of work, and most of it is pretty basic (like side-by-side diff) and I don't know why GitHub hasn't done that yet...
The problem with that is that I can't see any movement around that recently in GitHub....
That's the way it should be, right? I mean, you can't expect all developers in the world to be aware of this, and change their development environment!
People saying windows should change to only LF, keep on dreaming, what about the million lines of legacy code?
I noticed something strange in Chrome lately, the error text from failed ASP.NET requests shows up as one big line of error text in the developer console, is this also related to these CRLFs?
In other words: if you're a windows developer posting an open source C# project on GitHub and forcing me to use an arcane version control system (from the commandline, no less) then you should consider yourself lucky you get any pull requests at all.
Why don't you fork git, fix the problem, and send them a pull request?
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
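For what it's worth, Python's str.splitlines() treats every terminator in the list above as a line boundary (along with a few others), which makes for a quick demonstration. The sample string is invented for illustration.

```python
# One sample string using each terminator from the list above.
sample = "LF\nVT\x0bFF\x0cCR\rCRLF\r\nNEL\x85LS\u2028PS\u2029end"

parts = sample.splitlines()
print(parts)
# ['LF', 'VT', 'FF', 'CR', 'CRLF', 'NEL', 'LS', 'PS', 'end']
```

Note that CR followed by LF counts as a single boundary, not two.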
But as a pragmatic approach setting the diff tool to be new line aware would be best in my opinion.
Or use Winmerge ;)
My thought is that normalization should occur on the server IF it makes it past the developer based on pre-configured defaults, i.e. enable the configuration of the Git to denote a default EoL and normalize to it on commit.
Outside of that, just making the "right" settings the "default" within the tools would probably solve a lot of headaches.
But CR and LF are different. CR moves the cursor to the beginning of the line, and LF moves it down a line.
That is the archaic meaning of CR and LF. I doubt that there is any modern use for using two characters to indicate a newline--everyone always wants the text to begin one line down, at the left side.
And developers should check their diffs before submitting a change: if you changed 20 lines of code but diff shows 600, don't initiate a pull request but fix it first. And when you receive a pull request like that you can kindly ask the sender to clear up the change first, maybe with a link to this blog entry or some related articles/man pages.
Set it....and forget it.
https://twitter.com/codinghorror/status/298913370646642689
https://github.com/blog/967-github-secrets
That is the archaic meaning of CR and LF. I doubt that there is any modern use for using two characters to indicate a newline--everyone always wants the text to begin one line down, at the left side.
The windows terminal still uses this convention, and for text based applications it makes sense (if you want to override the current line use CR, if you need a new line use CR+LF).
I'll second the link: https://github.com/blog/967-github-secrets
Incidentally, it is better than it used to be. Pre-OS X, Macs used <CR> for line-endings, with *nix on <LF>, and Windows using <CR+LF>.
Absolutely! I'm facing an issue now with imported data in my web app having a mixture of CR, CR+LF and LF which renders differently in text boxes in different browsers. Some show the text as one-line and others (correctly) split the text across lines.
However, even if I do replace CR+LF and CR to just LF in my import (or whatever I decide to use) - I still have the different browsers (IE, Chrome, FF, Opera, Safari etc.) submitting either CR, CR+LF, LF when the form is posted. There's no consistency across browser versions either.
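One way to cope with that kind of mixed input is to normalize both the imported data and every form post on the server before storing it. A minimal Python sketch, where normalize_newlines is a hypothetical helper name:

```python
import re

# Match CRLF first so it is replaced as one unit, then bare CR, then bare LF.
_EOL = re.compile(r"\r\n|\r|\n")

def normalize_newlines(text: str, eol: str = "\n") -> str:
    """Replace every CRLF, CR or LF in text with a single chosen ending."""
    return _EOL.sub(lambda _match: eol, text)

mixed = "one\rtwo\r\nthree\nfour"
print(repr(normalize_newlines(mixed)))          # 'one\ntwo\nthree\nfour'
print(repr(normalize_newlines(mixed, "\r\n")))  # 'one\r\ntwo\r\nthree\r\nfour'
```

It doesn't make the browsers agree, but at least what you store is consistent.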
@Scott - it must be National CRLF week, as I've literally hit this issue in the last 24 hours and my weekly checkup of your blog came up with this!
CR... LF........ pah!
Mal
https://github.com/blog/967-github-secrets
I think online repos like github and bitbucket just need to be more intelligent with treating whitespace. I don't have these issues in Winmerge or Beyond Compare.
That is the archaic meaning of CR and LF. I doubt that there is any modern use for using two characters to indicate a newline--everyone always wants the text to begin one line down, at the left side.
As someone who's built a telnet server and online multi-player RPG, I have to disagree with the statement made above as posted. Terminal emulation is still used in many places where CR and LF have very different meanings.
That being said, I think if we dropped the blanket "everyone always wants" and instead said for source code repo purposes we could all agree on one standardized end of line character, most would be agreeable.
This is normally not a problem for people who are all developing in the same environments, but this can become a much larger problem on projects where there are differing operating systems underneath the commits. With the rise of people working on Ruby and node in windows, this is an issue that more people need to be monitoring.
While it is very simple to add the proper .gitattributes file and re-normalize a repository, it's highly deterring to people wanting to contribute to be shot down only because of something as inconsequential as a line ending configuration.
A bit out of topic - but the code looks like It's open for SQL injection attack ?
unless of course, tablename, column and where were checked against schema.
cheers.
I'm not 100% sure that this fixes your problem, but Editor Config is a way of specifying this for your project.
You specify in a text file in the root of your project what should be used for line feeds as well as tabs and spaces.
There is a plugin for VS that works quite well. Other tools support it as well, for example Sublime Text.
Thanks,
Don
- "Why don't you just use my one-off personal fix to fix this just for yourself?" - 22
- "I don't have a fix to propose but I'm commenting anyway" - 10
- "I agree, we need a universal fix" - 6
Sometimes it seems like the majority of programmers are narrow-minded "I have hammer! Bang!" individuals who rather enjoy smashing the same nail over and over again, not realizing that millions of others are also wasting time smashing the same exact nail, each in their own way.
I liked Jaime Buelta's best/good/reasonable dichotomy. It turns out that "best" and "good" aren't actually reasonable in this case, causing people to despise the "reasonable" solution.
I also agree with Noah Coad that it would help if you made a declarative statement about the convention/fix. As evidenced by the responses to this post, this is a social/cultural problem as much as a technical problem.
There is no way a system will change convention. The barriers are only partially technical.
Which system should it switch? Browsing the comments I see quite a few "Windows should ..."
Why? Why would an OS with 92% market share change its ways? What is wrong with the Windows (and HTTP) convention?
And btw, Mac used CR until OS X, it only moved to LF when it switched to a Unix kernel. And for a while it was quite painful, as various files used/required different conventions on the same machine.
The pragmatic solution would probably be a mixture:
* decent defaults ("auto") for git and GitHub (few people change the defaults if they are not bothered by them)
* a diff/merge tool that ignores line ending type by default
* maybe some kind of "normalize everything to my settings before showing it to me" option, also by default
Msysgit is the "official" git for Windows. If all clients use the official offering like Gitextensions (and maybe TortoiseGit) do, we shouldn't have this problem.
It says:
The Server did not return properly formatted HTTP Headers. HTTP headers should be terminated with CRLFCRLF. These were terminated with LFLF.
Must be ironic
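The check behind that error message is simple enough to sketch in Python. The function name header_terminator and the sample responses below are invented for illustration.

```python
def header_terminator(raw: bytes) -> str:
    """Report how the header block of a raw HTTP message is terminated."""
    crlf = raw.find(b"\r\n\r\n")
    lf = raw.find(b"\n\n")
    if crlf != -1 and (lf == -1 or crlf < lf):
        return "CRLFCRLF"
    if lf != -1:
        return "LFLF"
    return "none"

proper = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sloppy = b"HTTP/1.1 200 OK\nContent-Type: text/plain\n\nhello"

print(header_terminator(proper))  # CRLFCRLF
print(header_terminator(sloppy))  # LFLF
```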
As a repository maintainer...
- I don't want commits that change every line. What the diff tool is capable of ignoring is beside the point.
- I don't want a mixture of CRLF and LF in the same file. That's just untidy.
- I don't want some files with CRLF and some files with LF line endings. That's also untidy.
As a developer...
- I don't want to think about line endings
- I don't care whether a file has CRLF or LF endings
- I want my tools (be it editor or source control system) to handle line endings automagically for me
Most editors are quite happy to open a file with any style of line endings and go with the flow. I'm actually quite surprised to find that the Wall of Pink turns up at all. However, I have a suspicion: On Windows, people use msysgit, and msysgit asks a very difficult question during install: What do you want to do about line endings?
http://uncod.in/images/msysgit7.png
How on earth should you know what to do here? There is a valid argument for all three options.
Personally, I would go for the option to check out as is, check in normalized LF. Editors should deal with files containing just LF and do the same when you edit the files, while Git should detect that you are trying to commit a text file with CRLF and normalize that to LF.
I know exactly what you mean with msysgit, and I always have to take a break and think when I see it :-)
==== Rant part, not targeted at @Ove =====
The clean thing (I think) is to "check-out using platform preferred style" and "I don't give a dime how you check-in". If git decides to convert all line-ending to Unicode line separator (U+2028) or something else, I don't care, as long as it is done consistently. A bit like the libraries dealing with TCP/IP, where you don't get to "see" the endianness that the TCP/IP headers use on the wire.
That would be the cleanest thing. There are some exceptions though, here are some examples:
* when you deal with other "smart" tools (for instance cygwin will ask you at install time what line ending to use, and if you choose UNIX, then some of the tools will choke on the Windows native conventions)
* sometimes this might break unit testing, if your unit test produces files that you compare in binary mode with reference files from source control
Also, you don't want to convert what looks like new-lines inside a .jpg file :-) So you need control by type, with local overrides at file level if the developer wants that.
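A common way tools guard against that is to look for NUL bytes near the start of the content before touching line endings. The Python sketch below assumes that heuristic; the 8000-byte sample size and both function names are assumptions for illustration, not a description of any particular tool's exact logic.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Guess that content is binary if a NUL byte appears in the first chunk."""
    return b"\x00" in data[:sample_size]

def normalize_if_text(data: bytes) -> bytes:
    """Convert CRLF to LF, but leave anything that looks binary untouched."""
    if looks_binary(data):
        return data
    return data.replace(b"\r\n", b"\n")

# Made-up bytes resembling the start of a JPEG, with a CRLF pair inside.
jpeg_like = b"\xff\xd8\xff\xe0\x00\x10JFIF\r\n\x00\x01"
source = b"int x = 1;\r\nint y = 2;\r\n"

print(normalize_if_text(jpeg_like) == jpeg_like)  # True, left alone
print(normalize_if_text(source))                  # b'int x = 1;\nint y = 2;\n'
```

A heuristic like this can guess wrong, which is exactly why a per-type or per-file override is still needed.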
This is something that other tools (i.e. perforce) solved long ago. Maybe because it was designed with cross-platform in mind, they make money from it, and don't come with the "we are right, Windows should change" attitude. Things are what they are, just think about it and solve it, don't pass the buck to thousands of developers by asking silly questions.
I think you're asking why you've never tried using Beyond Compare as your diff tool. You won't use the built in tools again after you have and no, I don't work for them.
2. is crlf a skeumorphism?
3. @Anthony Capone: the secret feature for github needs ?w=1 as parameter, not w=t
4. sometime i lie awake at night and dream about the idea that someone sometime fixes these kind of problems (utf vs. ascii, bom, cr/lf, ...). at least i can dream it
5. i am with @mihai:
The pragmatic solution would probably be a mixture:
* decent defaults ("auto") for git and GitHub (few people change the defaults if they are not bothered by them)
* a diff/merge tool that ignores line ending type by default
* maybe some kind of "normalize everything to my settings before showing it to me" option, also by default
<rant>Also, anyone who believes that Git is an "arcane version control system" and that command line tools are obsolete/implicitly hard to use should get out of the business, right now. A pointing device is not a tool you _want_ to use in development, it's for games and badly designed web sites/applications.</rant>
We're building an app on windows, but one of the systems we talk to is written in Java, presumably from a more *nix based world. The encryption config files it uses use LF for the end of line codes. When the build server pulled those files out of git it changed the line endings and the library couldn't read the files any more.
Whilst the library was clearly at issue, most of the solutions proposed here of change the source code as you commit it and pull it would have left us in a bad state if we couldn't override it somehow.
I know this problem only too well and completely agree ...
You could fork the git source find and replace all CRLF with LF / vice versa then check it back in again.
See how the git guys handle that ... lol.
Probably not worth wasting your time on it though ... smarter tooling is definitely the way to go ;)
"A problem with converting line-endings on checkout is that the checksums won't match. So you're debugging C++ in VS, and VS tells you that the .cpp doesn't match."
If the conversion happens at checkin and checkout, then VS will not notice anything (it was CRLF before checkin, and it is CRLF after checkout). It can be a problem if the checksum is verified cross-platform (do the checksum on Win with CRLF, submit, checkout on UNIX with LF, and then verify the checksum).
But I have never seen such a workflow. And I would probably hate to work with it :-) Why would I need a checksum that needs to be kept in sync with my sources?
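The cross-platform checksum mismatch is easy to reproduce. A minimal Python sketch, using SHA-256 purely as an example hash (the comment above doesn't say which checksum is involved) and a made-up source file:

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()

source_lf = b"int main() {\n    return 0;\n}\n"
source_crlf = source_lf.replace(b"\n", b"\r\n")

# Same code to a human, different bytes to a hash function.
print(digest(source_lf) == digest(source_crlf))  # False

# Normalizing before hashing makes the checksums agree again.
print(digest(source_lf) == digest(source_crlf.replace(b"\r\n", b"\n")))  # True
```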
We had a problem where the proposed solution of defaults actually ended up breaking our systems:
...
Whilst the library was clearly at issue, most of the solutions proposed here of change the source code as you commit it and pull it would have left us in a bad state if we couldn't override it somehow.
This is why the version control should allow type override at file level, like perforce does.
Perforce has "text files" (where line ending gets converted) and "binary files" (where line endings stay untouched, just a bunch of bytes)
Can be smarter, for instance binary, textCrLf, textCr, textLf, but the idea is the same: allow for file level override.
On top of decent defaults, of course.
==========
@DaveWill
I think if LF were treated as pure line feed and CR were treated as pure carriage return
++1;
Because they are. I have seen (and done it myself) the CR used to go back at the beginning of the line and override stuff (for a progress report), without LF.
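The progress-report trick looks roughly like this in Python. It's a sketch with made-up function names; how the CR is rendered depends on the terminal.

```python
import sys
import time

def progress_frames(total: int) -> list:
    """Build the strings to write; each starts with CR and has no LF."""
    return [f"\rProgress: {done}/{total}" for done in range(total + 1)]

def show_progress(total: int, delay: float = 0.0, out=sys.stdout) -> None:
    """Redraw a single line in place by returning the cursor with CR."""
    for frame in progress_frames(total):
        out.write(frame)
        out.flush()
        time.sleep(delay)
    out.write("\n")  # only now advance to the next line

if __name__ == "__main__":
    show_progress(5, delay=0.2)
```

Every frame overwrites the previous one because the cursor goes back to column zero without moving down.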
If everyone used "commit as is, checkout as is" this problem would never happen as all Windows devs would use CR/LF and Git would store it as such.
Anyway, I feel it's a configuration problem. I like GitHub's solution.
The VS2010 however keeps insisting on converting my coworkers' line endings every time I open a file... :)
All of these options could be presented automatically as part of the check-in and approval workflow, unless they have specifically been set as defaults by the repository owner. You could even have the option to normalize everything to the repository defaults on check-in by the developer. You could maybe even detect the code editor default on check-out, normalize to the editor settings, and then automatically revert to the repository default on check-in. None of these things are difficult features to implement, and many of us say good software would do a much better job of abstracting everything to make the user experience painless.
But this really all goes back to an old difference of ideology between UNIX and Windows, and Git has certainly gone with the UNIX ideology, giving you maximum flexibility with the tradeoff of minimum abstraction. This is the very reason I just don't use Git. I prefer the abstraction provided by my code repository of choice over the flexibility of Git. Unlike the Windows API battle, which appears to have finally declared a winner, I don't expect to see a clear winner anytime soon among code repository architecture. We've had this difference in ideology for at least the past thirty years. If you prefer the abstraction like me, it's probably not a good idea to use software built with the UNIX ideology like Git.
Now, I could be completely missing something, but I believe there is an organization working on exactly what you're asking for Git to implement in their repository solution. Someone finally realized that it's actually pretty simple to integrate the abstraction based architecture with the other architecture. There are others that have discovered this same thing and turned it into an opportunity.
Then the bastards at IBM who moved to ASCII from EBCDIC and needed a carriage return when punch-cards fell out of favor.
Actually, EBCDIC differentiates between the control characters CR (carriage return), LF (linefeed) and NL (new line) at 0x0D, 0x25 and 0x15 respectively.
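Python ships an EBCDIC codec, so the three code points can be checked directly. This sketch assumes code page 037 (cp037), which is only one of several EBCDIC variants.

```python
# The three EBCDIC control bytes mentioned above: CR, LF, NL.
ebcdic = bytes([0x0D, 0x25, 0x15])

# Decode with code page 037, one common EBCDIC variant (an assumption here).
decoded = ebcdic.decode("cp037")

for byte, char in zip(ebcdic, decoded):
    print(f"0x{byte:02X} -> U+{ord(char):04X}")
# 0x0D -> U+000D  (carriage return)
# 0x25 -> U+000A  (line feed)
# 0x15 -> U+0085  (next line, NEL)
```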
Myself, I use Beyond Compare and dial in the whitespace importance I want. One of the settings for whitespace importance is 'Compare Line Endings (PC/Mac/Unix)'. Unchecked, it treats them all as the same and pays them no nevermind, rather like the CLR does.
CR != LF. They mean *utterly* different things.
The biggest complaint I see against 'CR' is 'Well, we don't use typewriter carriages.' So then say 'Cursor Return,' if you want. It doesn't matter. If I ask for an LF at Col 47, then output 20 more characters, I should be able to assume I'm on Col 67, not Col 20.
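The column arithmetic in that example can be checked with a toy cursor model in Python. It treats CR and LF as strictly separate operations, as the comment argues; cursor_after is a hypothetical name, and real terminals have more cases (tabs, wrapping).

```python
def cursor_after(text: str, row: int = 0, col: int = 0) -> tuple:
    """Track a cursor where CR resets the column and LF only moves down."""
    for ch in text:
        if ch == "\r":
            col = 0        # carriage (or cursor) return: back to column zero
        elif ch == "\n":
            row += 1       # line feed: down one row, same column
        else:
            col += 1       # any other character advances the column
    return row, col

# Start at column 47, then emit 20 more characters.
print(cursor_after("\n" + "x" * 20, col=47))    # (1, 67) with LF only
print(cursor_after("\r\n" + "x" * 20, col=47))  # (1, 20) with CR+LF
```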
But newlines are a very specific problem, because in most cases they're super easy to detect, and they only affect one when they aren't handled properly. Otherwise they're completely transparent. That's why I see a chance to make things right someday (when everyone finally agrees on \n).
Frankly, I would like to see a common configuration file for all IDEs and all platforms that specifies common formatting for everything and the particularities of each language/format.
Part of the rub being that FTP *had* an ASCII transfer mode to mediate the CR-CRLF conflict, yes?
:)
http://manual.winmerge.org/Configuration.html