Site Reliability Engineering : How Google Runs Production Systems (2016)

Call Number 620.00452/SITE

All Copies Checked Out (0 holds on 1 copy)
Location: Adult Nonfiction
Call Number: 620.00452/SITE
Item Status: Due 04-25-19
Published: Sebastopol, CA : O'Reilly, 2016
Edition:  First edition, April 2016
Description:  xxiv, 524 pages : illustrations ; 24 cm
ISBN/ISSN: 9781491929124, 149192912X
Language:  English

"The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient - lessons directly applicable to your organization. This book is divided into four sections: Introduction - Learn what site reliability engineering is and why it differs from conventional IT industry practices; Principles - Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE); Practices - Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems; Management - Explore Google's best practices for training, communication, and meetings that your organization can use."--Publisher's description

Introduction. The production environment at Google, from the viewpoint of an SRE -- Principles. Embracing risk -- Service level objectives -- Eliminating toil -- Monitoring distributed systems -- The evolution of automation at Google -- Release engineering -- Simplicity -- Practices. Practical alerting from time-series data -- Being on-call -- Effective troubleshooting -- Emergency response -- Managing incidents -- Postmortem culture: learning from failure -- Tracking outages -- Testing for reliability -- Software engineering in SRE -- Load balancing at the frontend -- Load balancing in the datacenter -- Handling overload -- Addressing cascading failures -- Managing critical state: distributed consensus for reliability -- Distributed periodic scheduling with Cron -- Data processing pipelines -- Data integrity: what you read is what you wrote -- Reliable product launches at scale -- Management. Accelerating SREs to on-call and beyond -- Dealing with interrupts -- Embedding an SRE to recover from operational overload -- Communication and collaboration in SRE -- The evolving SRE engagement model -- Conclusions. Lessons learned from other industries

Related Searches:
Google (Firm) -- Management
Systems engineering -- Management
Reliability (Engineering)
Internet industry -- Management -- United States
Added--201806 anf

Additional Credits:
Beyer, Betsy, editor
Jones, Chris (Computer engineer), editor
Petoff, Jennifer, editor
Murphy, Niall Richard, editor
