ABAW Challenge #2 – Site Reliability Engineering: How Google Runs Production Systems

It is the second week of ABAW Challenge and the book I picked for this week was Site Reliability Engineering: How Google Runs Production Systems.

I would like to encourage people working in DevOps, server administration, DBA, system/software architecture and similar roles to read this book.

Probably nobody knows how to run a production system better than the Google team. I live in a country where most of us type http://www.google.com into the web browser to check whether the internet is working or not. We trust the availability of the google.com website more than the availability of our internet connection. Even if http://www.google.com is down, we still believe it is an internet problem, because we trust Google to be always up and running.

This book is written by engineers (actually dozens of them) who run the Google production systems: the team that is responsible for the availability and performance of Google products. Within Google, this team is called SRE (Site Reliability Engineering).

The main attraction for buying this book and spending a week reading it was the fact that it is written by the engineers who run my favorite Google products. Reading it was part of my effort to listen to them and to understand their vision, approach, thinking and way of working. I went through the whole book with full attention and focus to see what I could adopt from what those SRE engineers do within Google.

I bought the Kindle version of the book and that is what I read this week. I thought it would be a good read for my DevOps team as well, so I also bought a printed copy.

Overall, I liked this book very much. A lot of the tools, systems and environments described in it exist only within Google, so they did not help me much directly. However, the book helped me understand how those engineers work: how they create and track SLOs, how they handle outages, and which processes and methodologies they have in place. Moving forward, I am going to encourage my team to put more focus on the postmortem reports after every outage, to organize and manage those reports, and to use them as a reference point for training as well as future fixes.

I would like to encourage people working in DevOps, server administration, DBA, system/software architecture and similar roles to read this book. I would also recommend that the DevOps/SRE engineers at Amazon read it, because their login page has been down for the last 15 minutes and I can’t look at the wish list I compiled there!

Next Week

Now that I am feeling more confident about being able to continue this challenge, I am trying to pick the book for next week. The following are on the top of my list and I will pick one of them.

If any of you have read one or more of the above books, I would love to hear your feedback and will make my choice based on that 🙂

ABAW Challenge – The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies

Last week I started a challenge, which I call the ABAW Challenge (A Book A Week), to motivate myself to read a book every week. It was a serious challenge for me, because my hectic work schedule left me little time to focus on anything else. Interestingly, it was the difficulty level that inspired me to go ahead and attempt this almost impossible mission.

The book I picked last week was The Second Machine Age written by Erik Brynjolfsson and Andrew McAfee.

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies

“The Second Machine Age” is a New York Times, Wall Street Journal and Washington Post Bestseller!

I decided to buy the Audible version of the book so that I could listen to it. This allowed me to use my time efficiently for this project when normal reading was not possible, such as when driving. The audio recording of the book was 8 hours and 50 minutes long and I was able to complete it within a week.

The Audible version of the book was narrated by Jeff Cummings and I must say that I loved the narration.

Even though the whole exercise looked like a challenge when I started, the journey quickly became very enjoyable and exciting. I did not want to write about it until I was sure that I could continue this exercise for quite some time.

I am a great fan of the industrial revolution, which the authors of the book call the ‘First Machine Age’. Years ago, I read several books on the industrial revolution, its history, progress and explosion, and watched many movies, documentaries and videos, including the famous Charlie Chaplin movie Modern Times. Some of the remarks in this book about the first machine age reminded me of all those, and that indeed multiplied the fun!

This book is a very interesting read, not only for technology professionals, but also for anyone having an interest in science, technology and computers.

 

 

The secret performance tuning button

Over the years, I have felt on several occasions that people think there is a secret performance tuning button hidden somewhere in the application/environment, one that a performance tuning expert can locate with a magic wand and turn ON to boost the performance of their application. This is especially evident when they reach out to you indicating how soon they want the problem fixed and how much improvement they expect to see after you turn the button on.

Please note that I do not intend to disrespect their expectations. When someone is responsible for a serious application and he/she is hit with a performance problem, there is certainly an emergency, and the rules of an emergency are completely different. A performance expert can add value only if he/she can address the emergency in a timely and satisfactory manner. So the expectation of a magic button may be quite understandable in this context.

The fact is that, in most cases, there is no such button or shortcut that turns a performance-starved application into a performance-rich one. I have seen such buttons in video games (especially in car racing) but have not yet seen one in real-world applications.

A performance tuning expert usually achieves the desired goals by doing one or both of the following:

  1. Cut down any unnecessary operations that add overhead and slow down the application.
  2. Improve the efficiency of the operations by a combination of hardware, software and architecture rework.
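
The two categories can be illustrated with a minimal Python sketch. Everything in it is hypothetical and exists only for illustration: `load_tax_rate` stands in for any expensive lookup (a query, a remote call), and the order data is made up. The first part removes repeated work by caching; the second keeps the work but does it with a more efficient data structure.

```python
from functools import lru_cache

# Hypothetical stand-in for an expensive lookup (e.g. a query or remote call).
# call_count only exists so we can observe how often the lookup really runs.
call_count = 0

def load_tax_rate(region):
    global call_count
    call_count += 1
    return {"north": 0.10, "south": 0.05}[region]

# Category 1: cut down unnecessary operations.
# With the cache, repeated requests for the same region skip the lookup.
@lru_cache(maxsize=None)
def cached_tax_rate(region):
    return load_tax_rate(region)

def total_tax(orders):
    return sum(amount * cached_tax_rate(region) for region, amount in orders)

# Category 2: improve the efficiency of an operation that has to happen.
# A membership test against a set is O(1) on average; against a list it is O(n).
def count_known(ids, known_ids):
    known = set(known_ids)
    return sum(1 for i in ids if i in known)

orders = [("north", 100.0), ("south", 200.0), ("north", 50.0)]
print(total_tax(orders))   # tax over three orders
print(call_count)          # number of real lookups performed
```

In this sketch three orders trigger only two real lookups, because the second "north" order is served from the cache. Whether caching is safe in a real application depends on how often the underlying data changes.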

Most of the activities involved in a tuning project can usually be classified into one of the two categories above. But there can always be exceptions, which vary from case to case, so I don’t want to draw concrete lines here. The goal of this post is to look at the topic from a very high altitude.

Do-it-yourself options

You don’t need to be a performance tuning expert to get started with some basic troubleshooting. You can start a performance tuning exercise by going through the best practices, checklists and “don’t-do’s”. These are lists that anyone can use to identify possible traps or shortfalls. Each organization or team may have its own checklists, such as database configuration best practices, database development best practices, application programming best practices etc. The best practices may evolve over a period of time, drawing on mistakes made in the past as well as on the gradual learning the team goes through. A new hire may bring a bunch of additional checklists and best practices into the team, as a result of his/her experience with previous projects.

So, in most cases (except for emergencies), before an expert steps in to help you, you can help yourself by quickly checking your code, application, environment and configuration against the known/available best practices and checklists. Most products/platforms publish basic guidelines and checklists for getting the best out of them.

In many cases, you may be able to resolve some of the performance problems by following the best practices. Keep in mind that there is no single ‘best’ approach for all problems or environments. An approach that works for John may not work for Michael. Each environment, use case, architecture and challenge may be different, so the best practices should be treated as a set of basic guidelines.

Discovering tuning opportunities

When a performance tuning expert starts looking into your problem, he/she will usually focus on identifying possible ‘tuning opportunities’. I believe this is half art and half science, so different people can come up with solutions that deliver different levels of performance, primarily because of the ‘art’ part. I believe that to do a good job in performance tuning, one needs to develop an attitude and a way of thinking that is performance oriented. It is important to realize that even milliseconds matter. A tuning opportunity that shaves 1 millisecond off an operation may result in a huge performance improvement if that operation is used heavily. I have rarely been lucky enough to find a single tuning opportunity that saved many seconds with one change. Instead, most of the tuning opportunities that I have dealt with were in the range of milliseconds.
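
A quick back-of-the-envelope calculation shows why a single millisecond can matter. The call volume below (10 million calls per day) is an assumed figure chosen purely for illustration, not a number from any real system.

```python
def daily_time_saved_hours(saved_ms_per_call, calls_per_day):
    """Total processing time saved per day, in hours."""
    saved_seconds = saved_ms_per_call * calls_per_day / 1000
    return saved_seconds / 3600

# Assumed workload: an operation called 10 million times a day,
# made 1 millisecond faster.
print(round(daily_time_saved_hours(1, 10_000_000), 2))
```

Under that assumption, a 1 millisecond saving adds up to roughly 2.8 hours of processing time per day.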

Examining the Hardware and Software Layers

I wrote my first program in the early 90s with DBase. It was single user, single threaded, single tier and ran on a single computer. DBase managed both the data and the user interface of my application, and everything was tightly bound to the DBase engine. When a user performed an application function, the command went directly into the DBase engine, so the route between the user and the core application code/data was pretty short.

This is not the scenario today, because many applications are accessed simultaneously by a very large number of users over the internet. The performance of your application depends on a number of hardware, software and network components within and outside your application. At the far end, outside your reach, are the end user’s computer, its configuration, the capabilities of the browser running your application, the user’s internet speed etc. Within your data centre, a number of hardware components affect the performance of your application, such as the data centre bandwidth, the capabilities of the routers and switches, intranet speed, the capabilities of the load balancers, server hardware etc. Within your application, the various components and subsystems, such as the presentation layer, business layer, caching layer, database layer, external APIs and any additional layers you may have, affect the overall performance of the application.

A tuning expert will usually look at the different layers and subsystems to identify the area with the most tuning opportunity. Based on that, he/she may prepare and execute a tuning plan to achieve the desired performance goals. Sometimes rework or changes to a single layer are enough to reach those goals. Many times it needs rework on multiple layers, the introduction or removal of one or more layers, or even major rework of the overall architecture.
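
One simple way to see where the time goes is to measure each layer separately and rank the results. The sketch below is only an illustration under stated assumptions: the layer names, the `timed` helper, the `rank_layers` function and the placeholder work inside `handle_request` are all hypothetical, and a real application would more likely rely on a profiler or on its platform's monitoring tools.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per layer (layer names are hypothetical).
timings = defaultdict(float)

@contextmanager
def timed(layer):
    """Add the elapsed time of the wrapped block to the given layer."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[layer] += time.perf_counter() - start

def rank_layers(layer_seconds):
    """Return (layer, share_of_total) pairs, slowest layer first."""
    total = sum(layer_seconds.values())
    if total == 0:
        return []
    ranked = sorted(layer_seconds.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, seconds / total) for name, seconds in ranked]

def handle_request():
    with timed("database"):
        time.sleep(0.02)                         # placeholder for a query
    with timed("business"):
        sum(range(10_000))                       # placeholder for business logic
    with timed("presentation"):
        "".join(str(i) for i in range(1_000))    # placeholder for rendering

handle_request()
for name, share in rank_layers(timings):
    print(f"{name}: {share:.0%}")
```

The layer at the top of the ranking is usually the first place to look for tuning opportunities, though a small share of the time in a heavily used layer can still be worth attention.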

Zooming into the layers

In the next several posts, I would like to drill down into each of those layers and share my thoughts and comments on various checklists, best practices, tuning opportunities, common mistakes etc. I do not claim that any of the ideas and approaches I present in this series of posts are the ‘best’ or ‘the only’ ones of their kind. These ideas and thoughts have been shaped by my experience, by what I have seen, heard and done over the years, and by what I have learned from reading, listening or speaking to other experts in the industry.

So that they can serve as ‘Quick Reads’ and avoid boredom, I will try to limit the posts to around 1000 words. Whoops, it is already 1128 words. Bye, see you soon.

Back to Blogging

I wrote my first blog post over 10 years ago and blogged regularly until two years ago. For the last two years, I have stayed away from active blogging, and the last post I wrote was approximately 1.5 years ago at beyondrelational.com.

I have been getting busier and busier with work for the last 5 years. It gradually reduced my writing activity and, over a period of time, brought it to an end. Even during the years of my active blogging/writing, most of the writing happened over weekends or holidays. For the last several years I ended up spending my free time addressing various items at work, and that left me no time to focus on writing.

It was always the love and encouragement of my friends and readers which inspired me and kept me actively writing. In the last two years, many of them asked me why I had stopped writing. I got several follow-up emails from various readers asking questions, as well as requesting additional posts about some of the areas I blogged about in the past or topics covered in my books. I feel that the fire for writing has been ignited again, and I really want to dedicate some time to it moving forward.

Most of my writing in the past was around SQL Server and XML related topics. Over the years, my focus area has changed significantly, and in version 2.0 I will be sharing my thoughts and experiences around performance, scalability, optimization, distributed architecture and various topics related to dealing with large data and workloads. I hope those topics will trigger some interesting discussions and an exchange of ideas, and result in a good amount of mutual learning.

Thank you for all the encouragement over the years. I always love your comments, feedback and constructive criticism.