Reliability Engineering: High Availability, Resilience & Observability
Nieuwegein: 29 June 2026 to 30 June 2026 (Day 1: 29 June 2026; Day 2: 30 June 2026)
Nieuwegein: 8 October 2026 to 9 October 2026
Nieuwegein: 17 December 2026 to 18 December 2026
Modern IT systems are complex, distributed and constantly evolving. Reliability does not happen by accident — it must be actively built, monitored and improved.
In this training, you will learn how to build and operate systems that remain stable under pressure, handle failures in a controlled way and provide continuous insight into their behavior. The focus is not only on infrastructure, but especially on applications and microservices: how software behaves in production and what is required to keep it reliable.
You will work with principles from Site Reliability Engineering (SRE) and learn how development and operations come together in a DevOps way of working. You will see how decisions in application behavior, dependencies and integrations directly impact availability, performance and recovery.
You will learn how to handle failures in practice: from retries and backpressure to circuit breakers and graceful degradation — not as isolated patterns, but as part of systems that continue to function under real-world conditions.
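As an illustration of how these patterns fit together, the following is a minimal sketch, not course material. It assumes a hypothetical dependency wrapped in a small circuit breaker, called through a retry helper with exponential backoff. The names `CircuitBreaker`, `retry_with_backoff` and all default values are assumptions chosen for the example, and jitter is left out for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has isolated
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Graceful degradation then becomes a decision at the call site: when `retry_with_backoff(lambda: breaker.call(fetch))` still raises, the caller can return a cached or reduced response instead of propagating the error. Here `breaker` and `fetch` are placeholders for your own objects.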
Observability plays a central role: you will work with metrics, logs and traces, and learn how to use SLIs, SLOs and error budgets to make reliability measurable and to align it with user experience and business impact.
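To make the error budget idea concrete, here is a small sketch of the arithmetic for a request-based SLI. The function name, the dictionary keys and the example figures are assumptions made for illustration and are not taken from the course material.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise a request-based SLI against an SLO target (as a fraction)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }
```

For example, with an assumed 99.9% SLO over 1,000,000 requests, the budget is about 1,000 failed requests. 400 failures would use roughly 40% of it, while 1,500 failures would overspend the budget and breach the SLO. Expressed in time, the same 99.9% target over a 30-day window allows about 43.2 minutes of unavailability.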
You will also gain insight into data reliability and distributed systems behavior, including consistency trade-offs (CAP and PACELC), so systems are not only available, but also correct.
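One common way to reason about such consistency trade-offs is quorum replication, sketched below as an illustration rather than as part of the course material. With `n` replicas, a read quorum `r` and a write quorum `w`, every read is guaranteed to touch at least one replica holding the latest acknowledged write exactly when `r + w > n`. The function names are hypothetical, and the brute-force check is only there to confirm the rule on small clusters.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Rule of thumb: read and write quorums always intersect when r + w > n."""
    return r + w > n


def every_read_sees_latest_write(n, r, w):
    """Brute-force check: does every possible read set share a replica
    with every possible write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


def tolerated_replica_failures(n, r, w):
    """Replicas that may be down while reads / writes can still reach a quorum."""
    return {"reads": n - r, "writes": n - w}
```

With `n=3`, choosing `r=2, w=2` keeps the quorums overlapping and tolerates one unavailable replica for both reads and writes. Choosing `r=1, w=1` lowers latency and raises availability, but allows stale reads. This is the kind of latency versus consistency choice that PACELC describes.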
The training covers the full lifecycle: build, deploy, monitor, validate and improve. You will learn how to test reliability with resilience testing and chaos engineering, and how to continuously improve based on production data.
The course material (slides) is in English and reflects real-world practices in modern IT organizations.
This training is available as classroom training and as e-learning. Classroom sessions can be attended on-site or virtually (via Microsoft Teams or Zoom). The e-learning is fully in English and includes English subtitles.
Our training is also delivered through selected international training partners, allowing participation outside the Netherlands. Contact us for current availability and locations.
Who should attend:
This training is designed for technical professionals involved in building, operating and improving modern IT systems.
Typical participants include:
* DevOps and platform engineers
* Software engineers
* Solution and cloud architects
* IT managers and technical leads
What you will learn:
* How systems and microservices behave under failures and peak load
* High availability and failover in practice (zones, regions and dependencies)
* Resilience strategies such as retries, backpressure, circuit breakers and graceful degradation
* How to use SLIs, SLOs and error budgets to manage reliability
* Observability with metrics, logs and traces, and the move toward system intelligence
* Trade-offs in distributed systems such as CAP, PACELC and consistency vs availability
* How to ensure data reliability (replication, recovery, integrity and consistency)
* How to validate reliability with testing and chaos engineering
Results:
After this training, you will be able to:
* Build and operate more reliable systems and microservices in production
* Detect, understand and resolve issues faster using observability
* Make better decisions on how systems handle failures and dependencies
* Align reliability with user experience and business impact
* Collaborate more effectively within DevOps teams
* Continuously improve system reliability instead of only reacting to incidents
Course Agenda
- Architecture in Practice: Understand how modern systems evolve in real-world environments, driven by trade-offs, simplicity (KISS) and continuous change. Learn why reliability is influenced by decisions across the entire lifecycle — not just infrastructure.
- Scope, Mindset & Shared Language: Build a solid foundation in reliability engineering, including SLIs, SLOs and error budgets. Learn how SRE principles and reliability economics guide both development and operational decisions.
- Software Resilience & Designing for Failure: Build applications and microservices that handle failure gracefully using retries, backoff, circuit breakers and bulkheads. Prevent cascading failures with loose coupling, backpressure and isolation patterns.
- High Availability Architecture: Understand how systems stay available in practice using redundancy, failover and multi-zone or multi-region setups. Learn how to control blast radius and design for predictable recovery.
- Safe Change & Delivery Reliability: Deliver changes safely using CI/CD, GitOps, Infrastructure as Code and progressive delivery strategies such as canary and blue/green deployments. Use guardrails, feature flags and automated policies to reduce risk.
- Data Reliability & State Management: Manage data consistency, replication and recovery in distributed systems. Understand CAP and PACELC trade-offs, eventual consistency and how to prevent data loss or corruption.
- Resilience Validation & Chaos Engineering: Validate system behavior under stress using resilience testing and chaos engineering. Define steady state, test realistic failure scenarios and safely experiment in production environments.
- System Intelligence & Observability: Go beyond monitoring with observability, tracing and system intelligence. Apply concepts such as the four golden signals and use data-driven insights to detect, understand and prevent issues.
- Adoption, Governance & Reliability Maturity: Scale reliability across teams using platform engineering, policy-as-code and governance models. Implement guardrails, maturity models and continuous improvement loops.
This reliability engineering training helps you build, operate and continuously improve systems that remain stable under real-world conditions.