The book Site Reliability Engineering: How Google Runs Production Systems is arguably one of the most widely read and best-known SRE resources out there, but there are many other great options for folks looking to expand their learning. We recently spoke to Google's Reliability Advocate, Steve McGhee, in our Humans of Reliability interview series. In addition to his interesting anecdotes about the early days of SRE at Google and his journey to becoming a Reliability Advocate, he shared a handful of his favorite SRE resources, which we compiled into a list here: https://lnkd.in/gpUVhyRa
Rootly’s Post
More Relevant Posts
-
Hi Friends! You’ve probably heard the term "SRE," but did you know it was born at Google? Site Reliability Engineering (SRE) combines software engineering with systems administration to keep systems stable, efficient, and scalable. Google’s SRE book is a fantastic resource for understanding this approach. It covers core ideas like the 50/50 rule (balancing operations and innovation), error budgets (managing reliability and speed), and automation (reducing repetitive tasks). Plus, it emphasizes simplicity, managing toil (manual, repetitive work), and the value of blameless postmortems for learning without blame. Why does SRE matter? It builds resilient systems, frees engineers for innovation, and enables faster, more confident deployments. For a deeper look, I highly recommend the chapters on the 50/50 rule, SRE principles, eliminating toil, and blameless postmortems. Here’s the link to the book: https://lnkd.in/gBZ49hSS
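To make the error-budget idea a bit more concrete, here is a minimal sketch of the arithmetic, assuming a request-based SLO. The function names, the 99.9% target, and the request counts are illustrative assumptions, not taken from the book.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests that may fail in the period while still meeting the SLO."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all (assumed convention here).
        return 0.0
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))           # allowed failures
    print(budget_remaining(0.999, 1_000_000, 250))  # share of budget left
```

A team might use a number like this to decide whether to keep shipping changes or to pause and focus on reliability work once the remaining budget gets low.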
-
In two weeks, I'll be presenting Google's first synchronous online Site Reliability Engineering course. I invite you all to join me — it offers skill-building for current SREs, and establishes a solid foundation for aspiring SREs. Over the past several years, even as I've applied the skills and experience of being an SRE — going on-call, readying software for production release, debugging live systems — I have had the good fortune to continue learning from many others in the SRE community: at meetups, conferences, and through online discussions. All of this comes together in the curriculum for the course. The first part of the curriculum examines the role of the SRE in the context of the production environment: monitoring, alerting, SLOs, and more. We'll answer questions such as "what do your systems tell you? how do you know that your users are getting what they expect?" The first four sessions, "Introduction to Site Reliability Engineering", are free! After that we'll offer "Fundamentals of SRE", with a focus on socio-technical aspects of Site Reliability Engineering: incident management, addressing toil, and troubleshooting systems. The classes have a pretty traditional structure, with suggested reading, direct instruction, homework (yes!), and office hours. ✨ Sign up with our partner, Uplimit, here: https://lnkd.in/gTCvYQZr ✨ We'll repeat the entire set of classes later this year, as well. #SRE #education #learning
-
🔍 Explore the fundamentals of Site Reliability Engineering (SRE) with Google in our new beginner-friendly course! Whether you're new to IT or experienced, dive into SRE's intent, setting Service Level Objectives (SLOs), mastering monitoring, and troubleshooting services efficiently. Start your journey into SRE today! #SRE #Google #IT #Monitoring
-
Google's SRE books are great resources on operating IT systems at large scale. The concepts they present also apply at smaller scale, as a foundation and as best practices. https://sre.google/books/
Site Reliability Engineering
sre.google
-
Discover how Site Reliability Engineering (SRE) revolutionizes software operations with engineering precision. Originally pioneered at Google, SRE blends software engineering principles with infrastructure management, ensuring software systems are not only scalable but also highly reliable. Imagine a proactive approach to system reliability where operations are treated as a software challenge, leveraging automation to minimize human error and maximize efficiency. Learn how SRE teams set and manage Service Level Objectives (SLOs) to maintain system stability while encouraging innovation through error budgets. Explore how SRE transforms incident response into opportunities for system improvement and growth.
-
Are you curious about the rapidly evolving field of Site Reliability Engineering (SRE)? Whether you're a seasoned IT professional or just starting out, Google's course on SRE is perfect and FREE -- https://lnkd.in/dNdQuYEt

What you'll learn:
-- Dive deep into the role and scope of an SRE.
-- Master the art of setting and measuring Service Level Objectives (SLOs).
-- Gain proficiency in monitoring, alerting, and troubleshooting services.
-- Work on real-world projects and interact with a community of peers.

About the Instructor: Salim Virji, a seasoned Site Reliability Engineer at Google, is your guide on this journey. Salim brings a wealth of experience from developing reliable engineering practices at Google and has contributed to several authoritative books on SRE.

📚 Course Structure:
-- Week 1: Understand the core of SRE, learn about SLOs, and engage in practical exercises.
-- Week 2: Delve into service metrics, monitoring, alerting, and effective troubleshooting techniques.

Who Should Enroll? This course is tailored for anyone with an interest in Site Reliability Engineering. No formal background in IT? No problem! Just bring your enthusiasm and a comfort with high-school-level algebra.

Brought to you by Google, this course is a blend of expert instruction and practical, real-world application. Don't miss this chance to elevate your career or kickstart a new journey in the field of SRE. Apply now to secure your spot! #sitereliabilityengineering #google #uplimit #theravitshow
-
Free to be SRE, with this systems engineering syllabus Learn more about systems engineering and how to get started with these key resources curated by Google’s Site Reliability Engineering
Free to be SRE, with this systems engineering syllabus
cloud.google.com
-
Recently completed an 'Introduction to SRE with Google' course on Uplimit. Learnt the foundations of SRE and how to identify and utilize metrics for monitoring systems.
Introduction to SRE with Google • Ravindra Rao • Uplimit - Live group courses taught by top experts
credential.net