Your company experiences bugs, outages, and slowness in its production systems. Developers use the production
environment for new feature development and bug fixes. Configuration and experiments are done in the production
environment, causing outages for users. Testers use the production environment for load testing, which often slows the
production systems. You need to redesign the environment to reduce the number of bugs and outages in production and to
enable testers to load test new features. What should you do?
A
You support a trading application written in Python and hosted on App Engine flexible environment. You want to customize
the error information being sent to Stackdriver Error Reporting. What should you do?
C
Explanation:
Reference: https://cloud.google.com/error-reporting/docs/setup/app-engine-flexible-environment
You need to define Service Level Objectives (SLOs) for a high-traffic multi-region web application. Customers expect the
application to always be available and have fast response times. Customers are currently happy with the application
performance and availability. Based on current measurement, you observe that the 90th percentile of latency is 120ms and
the 95th percentile of latency is 275ms over a 28-day window. What latency SLO would you recommend to the team to
publish?
B
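As a rough illustration of how percentile figures like the ones in this question are obtained, the sketch below computes nearest-rank percentiles from a list of latency samples. The function name `percentile` and the sample values are hypothetical and chosen only so the output matches the numbers quoted in the question; a real measurement would cover every request in the 28-day window.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latency samples in milliseconds (not real measurements).
samples_ms = [35, 40, 42, 48, 51, 55, 58, 60, 63, 67,
              70, 74, 80, 85, 90, 97, 110, 120, 275, 410]

p90 = percentile(samples_ms, 90)
p95 = percentile(samples_ms, 95)
print(f"p90={p90} ms, p95={p95} ms")
```

With these illustrative samples the script reports a 90th percentile of 120 ms and a 95th percentile of 275 ms.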
You support a high-traffic web application that runs on Google Cloud Platform (GCP). You need to measure application
reliability from a user perspective without making any engineering changes to it.
What should you do? (Choose two.)
B D
You are managing an application that exposes an HTTP endpoint without using a load balancer. The latency of the HTTP
responses is important for the user experience. You want to understand what HTTP latencies all of your users are
experiencing. You use Stackdriver Monitoring. What should you do?
A
You are performing a semi-annual capacity planning exercise for your flagship service. You expect a service user growth
rate of 10% month-over-month over the next six months. Your service is fully containerized and runs on Google Cloud
Platform (GCP), using a Google Kubernetes Engine (GKE) Standard regional cluster on three zones with cluster autoscaler
enabled. You currently consume about 30% of your total deployed CPU capacity, and you require resilience against the
failure of a zone. You want to ensure that your users experience minimal negative impact as a result of this growth or as a
result of zone failure, while avoiding unnecessary costs. How should you prepare to handle the predicted growth?
B
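The arithmetic behind this scenario can be sketched as follows. The sketch assumes that CPU consumption scales linearly with user count, that load is spread evenly across the three zones, and that no nodes are added during the period; the function names are hypothetical.

```python
def projected_utilization(current_utilization, monthly_growth, months):
    """Utilization after compound growth, as a fraction of today's deployed capacity."""
    return current_utilization * (1 + monthly_growth) ** months


def utilization_after_zone_loss(utilization, zones):
    """Utilization when the same load must fit on the surviving zones (assumes an even spread)."""
    return utilization * zones / (zones - 1)


# Figures from the question: 30% CPU use today, 10% monthly growth, six months, three zones.
growth = projected_utilization(0.30, 0.10, 6)
worst_case = utilization_after_zone_loss(growth, 3)

print(f"after 6 months: {growth:.1%}")
print(f"after 6 months with one zone down: {worst_case:.1%}")
```

Under these assumptions, utilization grows to roughly 53% of currently deployed capacity, and to roughly 80% of the capacity that remains if one zone fails at the end of the period.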
You support a web application that is hosted on Compute Engine. The application provides a booking service for thousands
of users. Shortly after the release of a new feature, your monitoring dashboard shows that all users are experiencing latency
at login. You want to mitigate the impact of the incident on the users of your service. What should you do first?
C
You support a large service with a well-defined Service Level Objective (SLO). The development team deploys new releases
of the service multiple times a week. If a major incident causes the service to miss its SLO, you want the development team
to shift its focus from working on features to improving service reliability. What should you do before a major incident occurs?
B
You support a high-traffic web application and want to ensure that the home page loads in a timely manner. As a first step,
you decide to implement a Service Level Indicator (SLI) to represent home page request latency with an acceptable page
load time set to 100 ms. What is the Google-recommended way of calculating this SLI?
C
Explanation:
Reference: https://sre.google/workbook/implementing-slos/
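The SRE Workbook chapter referenced above expresses SLIs as the ratio of good events to total events. A minimal sketch of that calculation for the 100 ms home page threshold follows; the function name `latency_sli`, the sample latencies, and the choice to count a request at exactly 100 ms as good are assumptions made for illustration.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """Fraction of requests served within the threshold (good events / total events)."""
    if not latencies_ms:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


# Hypothetical home page request latencies in milliseconds.
home_page_latencies_ms = [42, 87, 95, 101, 230, 64, 99, 150]

sli = latency_sli(home_page_latencies_ms)
print(f"latency SLI: {sli:.1%}")
```

Five of the eight hypothetical requests finish within 100 ms, so the SLI for this sample is 62.5%.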
You encountered a major service outage that affected all users of the service for multiple hours. After several hours of
incident management, the service returned to normal, and user access was restored.
You need to provide an incident summary to relevant stakeholders following the Site Reliability Engineering recommended
practices. What should you do first?
A
Your product is currently deployed in three Google Cloud Platform (GCP) zones with your users divided between the zones.
You can fail over from one zone to another, but it causes a 10-minute service disruption for the affected users. You typically
experience a database failure once per quarter and can detect it within five minutes. You are cataloging the reliability risks of
a new real-time chat feature for your product. You catalog the following information for each risk:
Mean Time to Detect (MTTD) in minutes
Mean Time to Repair (MTTR) in minutes
Mean Time Between Failures (MTBF) in days
User Impact Percentage
The chat feature requires a new database system that takes twice as long to successfully fail over between zones. You want
to account for the risk of the new database failing in one zone. What would be the values for the risk of database failover with
the new system?
C
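The four cataloged values can be derived from the scenario with a little arithmetic, sketched below. The sketch assumes a quarter is about 90 days, that users are divided evenly across the three zones, and that "expected bad minutes per year" is estimated as (MTTD + MTTR) x incidents per year x impact; the `Risk` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    mttd_minutes: float     # Mean Time to Detect
    mttr_minutes: float     # Mean Time to Repair
    mtbf_days: float        # Mean Time Between Failures
    impact_fraction: float  # share of users affected, 0.0 to 1.0

    def bad_minutes_per_year(self):
        """Expected user-weighted downtime per year under the stated assumptions."""
        incidents_per_year = 365 / self.mtbf_days
        outage_minutes = self.mttd_minutes + self.mttr_minutes
        return outage_minutes * incidents_per_year * self.impact_fraction


new_db_failover = Risk(
    name="new database fails in one zone",
    mttd_minutes=5,         # failures are detected within five minutes
    mttr_minutes=2 * 10,    # failover takes twice the current 10 minutes
    mtbf_days=90,           # once per quarter (assumed to be 90 days)
    impact_fraction=1 / 3,  # one of three zones (assumes an even user split)
)

print(new_db_failover)
print(f"expected bad minutes per year: {new_db_failover.bad_minutes_per_year():.1f}")
```

Under these assumptions the risk is cataloged as MTTD 5 minutes, MTTR 20 minutes, MTBF 90 days, and about 33% user impact, which works out to roughly 34 user-weighted bad minutes per year.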
You are responsible for the reliability of a high-volume enterprise application. A large number of users report that an
important subset of the application's functionality, a data-intensive reporting feature, is consistently failing with an HTTP
500 error. When you investigate your application's dashboards, you notice a strong correlation between the failures and a
metric that represents the size of an internal queue used for generating reports. You trace the failures to a reporting backend
that is experiencing high I/O wait times. You quickly fix the issue by resizing the backend's persistent disk (PD). Now you
need to create an availability Service Level Indicator (SLI) for the report generation feature. How would you define it?
C
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some
personally identifiable information (PII) is leaking into certain log entry fields. All PII entries begin with the text userinfo. You
want to capture these log entries in a secure location for later review and prevent them from leaking to Stackdriver Logging.
What should you do?
A
You are part of an organization that follows SRE practices and principles. You are taking over the management of a new
service from the Development Team, and you conduct a Production Readiness Review (PRR). After the PRR analysis
phase, you determine that the service cannot currently meet its Service Level Objectives (SLOs). You want to ensure that
the service can meet its SLOs in production. What should you do next?
B
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that
wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to
set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
A