ABAW Challenge #2 – Site Reliability Engineering: How Google Runs Production Systems

It is the second week of the ABAW Challenge, and the book I picked for this week is Site Reliability Engineering: How Google Runs Production Systems.

I would like to encourage people working on DevOps, Server Administration, DBA and System/Software Architecture and similar roles to read this book.

Probably nobody knows how to run a production system better than the Google team. I live in a country where most of us type http://www.google.com into the web browser to check whether the internet is working. We trust the availability of the google.com website more than the availability of our internet connection. Even if http://www.google.com is down, we still believe it is an internet problem, because we trust Google to be always up and running.

This book is written by engineers (dozens of them, actually) who run Google's production systems, the team that is responsible for the availability and performance of Google products. Within Google, this team is called SRE (Site Reliability Engineering).

The main attraction of buying this book and spending a week reading it is that it is written by the engineers who run my favorite Google products. Reading it was part of an effort to listen to them and to understand their vision, approach, thinking and way of working. I went through the whole book with full attention to see what I could adopt from what those SRE engineers do within Google.

I bought the Kindle version of the book, and that is what I read this week. I thought it would be a good read for my DevOps team as well, so I also bought a printed copy.

Overall, I liked this book very much. Many of the tools, systems and environments described in it exist only within Google, so they did not help me much directly. However, the book helped me understand how those engineers work: how they create and track SLOs, how they handle outages, and which processes and methodologies they have in place. Moving forward, I am going to encourage my team to put more focus on the postmortem report after every outage, to organize and manage those reports, and to use them as a reference point for training as well as future fixes.

I would like to encourage people working in DevOps, server administration, DBA, system/software architecture and similar roles to read this book. I would also recommend it to the DevOps/SRE engineers at Amazon, because their login page has been down for the last 15 minutes and I can't look at the wish list I compiled there!

Next Week

Now that I am feeling more confident about being able to continue this challenge, I am trying to pick the book for next week. The following are on the top of my list and I will pick one of them.

If any of you have read one or more of the above books, I would love to hear your feedback and will make my choice based on that 🙂
