How to Learn Pandas for Data Science?

I was an R guy, so I learned the tidyverse really well. So why the heck did I want to go ahead and learn Pandas?

Well, there are a few good reasons:

  • Pandas is built on top of numpy, and I’d be learning numpy as well.

  • Pandas and numpy come up a lot in deep learning and machine learning tutorials. If you don’t know how to manipulate data in Python, you may not be able to read and understand most of the tutorials out there.

  • Plenty of workplaces are Python shops, so being able to walk in and work with either Python or R is pretty handy.

  • It is a heck of a lot of fun, and it really isn’t such a big step if you know dplyr in R. Even if you only know SQL, being able to think in terms of SQL-like operations will help too.
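To make that last point a bit more concrete, here is a rough sketch of one filter, group and summarise step in Pandas, with the dplyr and SQL equivalents in the comments. The data and column names are made up for illustration, and this is only one of several ways to write it.

```python
import pandas as pd

# Made-up data, purely for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "bird"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 0.5],
})

# dplyr: df %>% filter(weight_kg > 1) %>% group_by(species) %>%
#          summarise(mean_weight = mean(weight_kg))
# SQL:   SELECT species, AVG(weight_kg) AS mean_weight
#        FROM df WHERE weight_kg > 1 GROUP BY species
summary = (
    df[df["weight_kg"] > 1]                  # filter / WHERE
    .groupby("species", as_index=False)      # group_by / GROUP BY
    .agg(mean_weight=("weight_kg", "mean"))  # summarise / AVG
)

# Pandas sits on top of numpy: a column can be handed over as a plain array.
weights = df["weight_kg"].to_numpy()

print(summary)
```

If you already think in verbs like filter, group and summarise, most of the work is learning which Pandas method plays each role.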

So how do you go about it?

Step 1) Buy Python for Data Analysis by Wes McKinney

This book is comprehensive and has sections on everything you will need, including:

  • Fundamentals of Python

  • Numpy

  • Data manipulation in Pandas

  • Working with time series data

  • Plotting in Pandas/ matplotlib

The problem is that the book only shows you what is possible in the ecosystem. You won’t know Pandas just from reading it. Learning is pain, so, my friends, you will have to hurt a little.

If you prefer video tutorials, I am a big fan of Harrison Kinsley.

Step 2) Grab yourself a cheat sheet

Something like this will be a handy reference.

Step 3) Go through some pain doing exercises

Here are some exercises on numpy

and Pandas

This will hurt a bit, but it is good for you. You will start learning here. The problem is that the exercises prompt you on how to solve each problem, so it is like the Diet Coke of learning Pandas.

Step 4) Learn from scripts: try to replicate them on your own, try different things, and understand how the person solved the problem

Ah, now we are starting to venture out. We are seeing how some gun data scientists work with Pandas.

I recently saw a script that someone else wrote, a combination of pure Python, numpy and Pandas, that solved a task similar to one I had tackled many years ago. Except this solution was so beautiful I wanted to cry. I even said so.

Rather than just thinking “Oh wow, that’s nice” and leaving it at that, I grabbed the script and replicated it on my own without peeking too much at the syntax, looking only at the logic steps. I highly recommend you do this.

Check out some Kaggle kernels and try to replicate them. If the syntax is heavy, try to do it more simply, even if it requires more lines of code than the original.

If you still aren’t feeling comfortable, take a look at the Coursera course Introduction to Data Science in Python.
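As a small illustration of what “more simply” can look like, here is a sketch with a dense, chained version next to a longer step-by-step one. The data is invented and neither version comes from a real kernel. The final check just confirms that both give the same answer.

```python
import pandas as pd

# Made-up data standing in for whatever a kernel loads.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 3, 7, 12],
    "price": [2.0, 5.0, 2.0, 5.0],
})

# Dense version, in the chained style you often see in kernels.
dense = (
    sales.assign(revenue=lambda d: d["units"] * d["price"])
    .groupby("region")["revenue"]
    .sum()
    .sort_values(ascending=False)
)

# Same logic, one step per line, with names you can inspect as you go.
step = sales.copy()
step["revenue"] = step["units"] * step["price"]
grouped = step.groupby("region")
revenue_by_region = grouped["revenue"].sum()
simple = revenue_by_region.sort_values(ascending=False)

# Both routes should land on the same Series.
assert dense.equals(simple)
```

The longer version is not better code, but every intermediate result has a name, so you can print it and see what each step did.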

Step 5) Grab a dataset and manipulate it. You will experience uber pain, but this is how you learn!

Have the Pandas documentation open; it will save you some pain.

Once again, jump to our old friend Kaggle and have a look at Datasets, where you can grab something interesting for yourself.
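Once you have a file, a first pass might look something like the sketch below. The CSV text and its columns are invented so the example runs on its own. With a real dataset you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded file; the columns here are invented.
csv_text = """city,year,rainfall_mm
Sydney,2016,1010
Sydney,2017,
Perth,2016,730
Perth,2017,680
"""
raw = pd.read_csv(io.StringIO(csv_text))

# First look: column types, a few rows, and where the gaps are.
raw.info()
print(raw.head())
missing = raw.isna().sum()  # missing values per column

# One small manipulation: fill each gap with that city's own mean,
# then reshape to one row per city and one column per year.
filled = raw.copy()
filled["rainfall_mm"] = filled.groupby("city")["rainfall_mm"].transform(
    lambda s: s.fillna(s.mean())
)
wide = filled.pivot(index="city", columns="year", values="rainfall_mm")
print(wide)
```

Filling with the group mean is just one possible choice here. Half the learning is deciding what is sensible for your own data and then finding the method for it in the documentation.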