Why Git?

Data scientists are not software developers, but I’d argue that as time goes on they have more and more in common with them. Many data scientists ship data products into production systems, but even for those who only extract insights, a working knowledge of version control is critical.

I have explained in previous posts the central role of version control in executing the Microsoft Team Data Science Process (which I consider to be the best way to run a data science team).

There is also a strong argument for preserving your own sanity. If you have ever seen those directories that hold multiple versions of the same file, or if you are sending code via email for someone else to use, having a Git repository for your team could save you a lot of time and pain.

Overview of Git

Here is one of the best and most straightforward explanations of Git I have seen. The diagrams are great, and he covers a lot of ground very quickly. It is an excellent high-level overview of Git. Watch the first 30 minutes of this course:

CS 50 Web Development Course: Lecture 0

Simple Summary of Git

Then you should take a look at this great and hilarious summary of Git:

Git - the Simple Guide

More Detailed Summary of Git, GitHub and Setup

Of course, Hadley Wickham has put together a fantastic summary of Git and GitHub. There is a bit about RStudio, but push through it because there are command-line equivalents. He does a good job of introducing more complex topics like pull requests.

Hadley’s Git Guide

Practice with Git Commands

This website is more than just Git branching; it is a collection of practical exercises that can be run in your browser. It is well worth your time to go through them.

Learn Git Branching

Now, if you want to get your hands dirty outside the browser, I’d recommend Coursera’s Git Course.

Pro tips

  • Learn with the command line. It isn’t that hard, and if things really go wrong it is sometimes easier to have the command line as an option. GUIs are great, but the learning is a bit shallow.

  • Use the “git status” command liberally. It’s a great check of what is going on and what state your files are in.

  • Read the messages Git gives you. It is trying to help you, and it is pretty awesome.

  • Stay with it. Git can be frustrating at first.

  • Make your commit messages concise and descriptive. Don’t say “stuff” and “more stuff”.

  • Branch off when you work. Don’t push directly to master.
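
To make a few of these tips concrete, here is a minimal command-line sketch of the habit I have in mind: check “git status” often, write descriptive commit messages, and do your work on a branch. Everything named here is a placeholder I made up for illustration (the scratch directory, the file analysis.py, the branch feature/clean-data, and the user details), so treat it as one possible workflow rather than the only way to do it.

```shell
# Create a throwaway repository to practice in (hypothetical scratch directory).
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init --quiet

# Placeholder identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"

# Add a file and check what Git sees before committing.
echo "print('hello')" > analysis.py
git status --short
git add analysis.py
git commit --quiet -m "Add initial analysis script"

# Branch off before making changes instead of working on the default branch.
# (Newer Git versions also offer: git switch -c feature/clean-data)
git checkout --quiet -b feature/clean-data

# Make a change, check status again, then commit with a descriptive message.
echo "print('cleaned')" >> analysis.py
git status --short
git commit --quiet -am "Add data cleaning step"

# Review the history of the branch.
git log --oneline

# With a remote configured (assumed here to be named "origin"), you would
# push the branch rather than the default branch:
#   git push -u origin feature/clean-data
```

Because it runs in a temporary directory, you can experiment freely and delete the directory afterwards without touching any real project.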