Thoughts on Completing the Google Cloud Data Engineering Specialization

Last month I completed all 5 courses in the Google Cloud Data Engineering Specialization on Coursera.

You can have a read of the details of the specialization here.

Why did I do this Specialization?

I’m looking at Google’s AI play towards business. I think a couple of things are coming together nicely for Google right now. TensorFlow 2.0 has greatly simplified the TensorFlow ecosystem and makes deployment friendlier. At the same time, I think the Google Cloud Platform will become for Data Science teams what Google Analytics is for Marketing teams.

The Google Cloud Platform (GCP) abstracts away a lot of the complexity of deploying models and tuning their performance in production. Essentially you set up the bucket where you want to store and read the data, set up the pipelines to ingest the data, set up a job to preprocess the data, set up a job to train or predict as the case may be, and hit “go”. The code runs on Google servers, so it scales up and down depending on how many customers hit your site, what job you are running etc. It really is just a beautiful, beautiful thing.
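To make the shape of that workflow concrete, here is a toy sketch of the bucket, ingest, preprocess, train and predict steps in plain Python. It is only an illustration: nothing in it calls a real GCP API, the dict is a stand-in for a storage bucket, the "model" is a deliberately silly one-parameter fit, and every name (bucket, ingest, preprocess, train, predict, raw/sales.csv) is made up for the example.

```python
# Toy stand-in for the bucket -> ingest -> preprocess -> train -> predict flow.
# Nothing here calls a real GCP API; every name is hypothetical.
from statistics import mean

# "Bucket": a dict standing in for object storage (path -> raw CSV text).
bucket = {"raw/sales.csv": "x,y\n1,2\n2,4\n3,6\n"}


def ingest(store, path):
    """Read raw rows from the storage stand-in and parse them into tuples."""
    lines = store[path].strip().splitlines()[1:]  # skip the header row
    return [tuple(float(v) for v in line.split(",")) for line in lines]


def preprocess(rows):
    """Drop rows with a non-positive x, as an example cleaning step."""
    return [(x, y) for x, y in rows if x > 0]


def train(rows):
    """Fit y = slope * x by averaging y / x; a placeholder for real training."""
    return mean(y / x for x, y in rows)


def predict(slope, x):
    """Apply the fitted slope to a new input."""
    return slope * x


rows = preprocess(ingest(bucket, "raw/sales.csv"))
model = train(rows)
print(predict(model, 10.0))
```

On GCP each of those functions would be a managed service or a submitted job rather than a local call, which is where the scaling comes from.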

It means that data science teams now need only worry about writing and maintaining code to wrangle data and fit models, and not about tuning databases, monitoring the performance of production jobs or any of the other stuff we kinda do but also kinda suck at.

So, cool, no Data Engineers, right? That’s awesome!

No, not at all. In fact there is a fairly steep learning curve to getting this stuff right. There are a huge number of solutions that do different things. You need someone:

  • Who knows and isn’t scared by code, including the command line. Bash experience would be good.

  • Who has set up servers in the past (for me it was Linode servers).

  • Who can write Python and ideally is comfortable around OOP. Some products require you to fall back to Java.

  • Who ideally has exposure to machine learning.

  • Who is familiar with SQL and database concepts.

  • Who has business acumen. If you are dumb in the way you set up a cloud table or run a job, you may end up paying a lot more than you have to. You need to make smart choices from the available options.
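Here is a rough back-of-the-envelope sketch of that last point. It assumes a query engine that bills by the amount of data scanned, which is how BigQuery on-demand queries are charged, and a columnar table where reading fewer columns means scanning fewer bytes. The price per TiB, the table size and the column sizes below are all made-up illustrative numbers, not quoted GCP prices, so check the current pricing page before relying on anything like this.

```python
# Back-of-the-envelope cost of a scan billed by bytes read.
# PRICE_PER_TIB is an assumed, illustrative figure, not a quoted GCP price.
PRICE_PER_TIB = 5.00
TIB = 1024 ** 4


def scan_cost(bytes_scanned, price_per_tib=PRICE_PER_TIB):
    """Return the cost of scanning `bytes_scanned` bytes at a per-TiB price."""
    return bytes_scanned / TIB * price_per_tib


# Hypothetical table: 2 TiB in total, of which the two columns the
# daily report actually needs account for roughly 0.1 TiB.
full_table = 2 * TIB
needed_columns = TIB // 10

daily_select_star = scan_cost(full_table)      # SELECT * reads every column
daily_two_columns = scan_cost(needed_columns)  # reads only the needed columns

print(f"SELECT * daily:      ${daily_select_star:.2f}")
print(f"Two columns daily:   ${daily_two_columns:.2f}")
print(f"Difference per year: ${(daily_select_star - daily_two_columns) * 365:.2f}")
```

The exact figures don't matter. The point is that the same report can cost an order of magnitude more or less depending on how the query and table are set up.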

So, you still need someone who is deeply skilled and is comfortable around complexity.

Wait, but you are a Data Scientist why are you pretending to be a Data Engineer?

The hardest things about developing a model are cutting the data, then deploying, scaling and maintaining the model in production. I think actually building a model is really the easy bit. Knowing the whole pipeline from business idea to exploratory data analysis, to model development, to productionizing and scaling ML solutions puts you at the head of the pack.

So, what I am doing here is picking a line where I can see data science teams heading, and I aim to get there before other people. In plain words, that’s TensorFlow models running on the Google Cloud Platform.

So for me, I see this as my niche. You are right, I can’t compete with a Data Engineer, but I can be a guy who is across running TensorFlow models on GCP.

What’s next for me when I have bandwidth is the following courses:

So, this is my niche. This is what I will be doing in the coming month or so, and I’m pumped to report back to you guys on how it goes!

What are your thoughts on the specialization?

At times the course felt a bit salesy, a bit like this guy selling steak knives back in the mid 1990s. If you were to watch the infomercial you would:

  • get an overview of the products

  • see what they could do hypothetically (saw a shoe in half)

  • be convinced that you need those knives and that they are good value

You wouldn’t watch the infomercial and instantly become a master chef.

This is about the best way I can describe the Google Data Engineering Specialization. Great tools, they look awesome to use. But you never really get the chance to use them in the wild.

Here in Australia, if you ever watched a Demtel infomercial for steak knives back in the day, it may feel like that.

The specialization is a review and summary of the main solutions in GCP for Data Engineers, but it doesn’t give you much hands-on experience.

At the end of the course you have a bit of an idea of what solutions are out there and what is involved, but you sure aren’t an expert. To do this well you will need to learn a lot more.

Talk me through the individual courses in the specialization

1. Google Cloud Platform Big Data and Machine Learning Fundamentals

  • Intro to GCP

  • A bit about GCP Compute and Storage

  • Data analysis on Google Cloud, Hadoop

  • BigQuery

  • A bit on TensorFlow

  • Intro to data processing architectures

2. Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform

  • Intro to Cloud Dataproc

  • Working with unstructured data

  • A bit about Spark

  • A bit about ML, NLP

3. Serverless Data Analysis with Google BigQuery and Cloud Dataflow

  • BigQuery

  • Cloud Dataflow

4. Serverless Machine Learning with Tensorflow on Google Cloud Platform

  • Intro to ML

  • Intro to TensorFlow

  • Cloud ML Engine for deploying models

  • Feature Engineering

5. Building Resilient Streaming Systems on Google Cloud Platform

  • Complications of streaming data

  • Pub/Sub

  • Processing streams

  • Ingesting data in Bigtable

Talk me through the products in the GCP for Data Engineers Specialization

Check out this amazing post by Kayleigh Rix

These are essentially the main products you need to know. So go and read the docs here for the products below:

  • Cloud Storage

  • Cloud SQL

  • Bigtable

  • BigQuery

  • Pub/Sub

  • Cloud Dataflow

  • Cloud Dataproc

  • TensorFlow

  • Cloud Datastore

Should you do the specialization?

I thought it was good exposure to the GCP products, but even the guy doing the course said it isn’t sufficient for passing the certification exam, and presumably not for doing real work either. I think there is nothing like solving real problems out in the wild rather than just running code line by line in the tutorials.

On the tutorials, there were a couple that were a bit janky and required you to magically refresh your environment a couple of times for them to work. So that was a bit painful.

But on the whole you get a good overview of the information all in one place. It serves as a signal to a prospective employer, but it won’t necessarily mean that you are a guru GCP Data Engineer. If all the job ads here in Australia are just asking for exposure to GCP, well, the Specialization is a fair indication that I have had exposure to GCP.

The really great thing is that it is free to run the labs on GCP; you don’t get billed. I know when you sign up for GCP you get $300 in credits, but it may not take very much practice for you to burn through a fair bit of that.

So, I’m going to say yeah it is not a bad badge to have.

Should I do the certification?

I had a look at 10 data engineer job openings here in Australia. It might be different in other markets, but here it seems to be really hard to find people to do this work.

Of the 10, only 1 mentioned GCP Certification, and then only as a desirable, not essential, skill. So, I’m going to say probably not at the moment, at least here in Australia. The real idea is to check out the jobs in your city and see if they are asking for GCP certification. If not, then why do the certification? If all you need is exposure, then do the Specialization and then read the docs and practice on the job.

If your company wants to become a Google Technology Partner things may be a little different, in that you may need at least a few people to become certified. But certification is not something I’m choosing to do right now.

What about the other course on Coursera “Preparing for the Cloud Professional Data Engineer Exam”?

If you are interested in doing the certification I would certainly do the Coursera course “Preparing for the Cloud Professional Data Engineer Exam”. It is a course directed solely at the exam, so it has to be a good thing to do to prepare.

What are some more resources if I want to do the certification, or I want to know more?

I would start by looking at what is required in the Data Engineering Certification. The guide can be found here.

A fairly chunky section of the internet is now devoted to passing the Google Cloud Data Engineering Certification. There are companies out there trying to fill the gap between the Coursera Specialization and the Google Cloud Data Engineer Certification.

If I wanted to do the certification I would actually plunder these courses for common topics. I would be inclined to train myself by looking at the documentation, but you could take one of these Udemy courses if you felt like it:

If I was working in an office I would speak to my boss about shadowing a Data Engineer, or trying this stuff out in earnest on a real project. I think this is probably the best way to learn the technology.

There’s also a practice exam: