Our Approach to Incident Response and Communications

You've entrusted Box as a valued service provider and partner. You should also be able to trust us to let you know when something is impacting the service. With this in mind, we aim to keep you informed of what's happening with and within the Box Services, especially planned maintenance as well as unexpected service interruptions and outages.

 
Responding to Events impacting the Customer Experience
When our Network Operations Center (NOC) recognizes a potential service issue, an investigation promptly begins in earnest to identify the root cause and remediate. Every time this process commences, one of the earliest tasks is to identify the estimated or known impact to the customer experience. This activity directly supports a quick turnaround to update the Box Status Site.
 
To do this in the most efficient manner, we leverage the following guide for when and what to post:
 
[Image: guide for when and what to post on the Box Status Site]
 
During an active investigation, the customer impact is continually reviewed and evaluated against the aforementioned guidelines. As new information is discovered, the impact/severity is adjusted accordingly.
 
On the Box Status Site, our objective is to provide regular updates, at least every 30-60 minutes, or at the next significant change in status. These updates will include any known impact to the customer experience, including the affected service(s) and approximate times of impact. Where feasible, we will share additional information about potential workarounds, remediation progress/actions, and estimated time to recovery.
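The update objective above can be read as a simple rule: post when the status changes, or when the interval since the last post has elapsed. The following is a minimal sketch of that rule, not a description of Box's actual tooling. The function name, the parameter names, and the choice of 60 minutes as the upper bound of the "30-60 minutes" range are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: use the upper end of the "every 30-60 minutes" objective.
MAX_UPDATE_INTERVAL = timedelta(minutes=60)


def update_is_due(last_update: datetime, now: datetime, status_changed: bool) -> bool:
    """Return True when a new status post is warranted.

    A post is due at the next significant change in status, or once the
    maximum interval since the previous post has elapsed.
    """
    return status_changed or (now - last_update) >= MAX_UPDATE_INTERVAL
```

For example, ten minutes after the last post, an update is due only if the status has changed; after a full hour, it is due either way.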
 
Stages of an Incident
With a few exceptions, most events on the Box Status Site follow a four-stage process:
 
  • Investigating - Most events begin here when we receive the initial notice of a disruption to the customer experience. The status may remain here as we determine what led to the current state and can identify an action plan to return availability/stability of the affected service(s).
  • Identified - As soon as a proximal cause is understood, we quickly move to remediate the problem and to ensure that the remediation prevents a recurrence.
  • Monitoring - When remediation has been completed and an analysis returns positive signs of the service(s) being restored to expected levels, we move into this stage. For some events, this period may be extended to ensure the impact of the remediation is observed across multiple timelines and other criteria.
  • Resolved - A designation that the customer experience has returned to expected levels based on the results observed during the monitoring period.
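One way to picture the four stages is as a small state machine. The sketch below is illustrative only: the stage names come from the list above, while the `Stage` enum, the `NEXT_STAGE` table, and the `advance` function are hypothetical. It models only the common linear path and does not cover the exceptions mentioned above.

```python
from enum import Enum


class Stage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Assumption: the typical event moves through the stages strictly in order.
NEXT_STAGE = {
    Stage.INVESTIGATING: Stage.IDENTIFIED,
    Stage.IDENTIFIED: Stage.MONITORING,
    Stage.MONITORING: Stage.RESOLVED,
}


def advance(current: Stage) -> Stage:
    """Return the stage that follows `current` on the typical path."""
    if current not in NEXT_STAGE:
        raise ValueError(f"{current.value} is a final stage")
    return NEXT_STAGE[current]
```

Calling `advance` three times starting from `Stage.INVESTIGATING` reaches `Stage.RESOLVED`; a further call raises an error because a resolved event has no next stage in this model.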
 
Status of our Service(s) and Subcomponents
In combination with the aforementioned Incident stages, we will also do our best to indicate the impact to a specific service or subcomponent. These states include:
 
  • Operational (green dot) - Our services are online and functioning within expected norms. We may indicate that a service is operational while an investigation is still active or nearing its final stages.
  • Degraded (orange/yellow dot) - Our services are not performing at the levels we expect. Examples include higher than expected latency loading the Box Web Application or elevated intervals to receive new events/changes in our desktop clients (Sync and Drive). 
  • Outage (red dot) - Our services may be experiencing a full or partial outage, preventing customers from completing their journey or accessing critical components within Box.
 
Monitoring the Customer Experience
As a core part of our operations, we continuously measure ourselves against two primary outcomes - the availability of our service and a more holistic customer experience measurement. Our Premier customers (learn more here) are most familiar with the former objective, also known as site uptime. The latter is something for which we hold ourselves accountable to ensure we meet the most critical, end-to-end needs of our customers. Simply said, our tracking extends beyond SLA commitments.
 
To achieve these results, we monitor and protect the customer experience by using a variety of continuous monitoring and alerting tools. We have a number of checks that run on each individual Box server, a system of synthetic monitoring* agents, and a collection mechanism that examines real-time user transactions. These checks give us leading indicators of issues or degradation that may be impending but have not yet impacted customers interacting with our services. They measure the health and performance of the subcomponents on which Box runs its services, and allow us to prevent events from becoming incidents. The output from each check also feeds into a time-series database, which allows us to see trends over time.
 
We also collect data that tracks real user transactions. This time-series data continually captures actual Box users' interactions with all of our services, on all of our hosts, and is an indicator of what customers are really seeing when they interact with Box. It tells us the total number of errors being encountered with our services at any given point in time and is a more accurate measure of how many users are completing their journeys successfully.
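As a rough illustration of how real user transactions translate into an error measurement, the sketch below computes the share of failed transactions for one service. The `Transaction` record and the `error_rate` function are hypothetical and say nothing about how Box actually stores or aggregates this data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Transaction:
    service: str  # hypothetical service label, e.g. "web-app"
    ok: bool      # True if the user's request completed successfully


def error_rate(transactions: Iterable[Transaction], service: str) -> float:
    """Fraction of failed transactions for `service` (0.0 if none were seen)."""
    relevant = [t for t in transactions if t.service == service]
    if not relevant:
        return 0.0
    failures = sum(1 for t in relevant if not t.ok)
    return failures / len(relevant)
```

In practice such a rate would be computed per time window so that it can be plotted as a time series and compared against expected levels.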
 
*Synthetic monitoring is a form of website monitoring that uses scripted actions in an emulated web browser to imitate key customer journeys such as logins or shared link previews. Our synthetic clients execute a wide variety of checks at one-minute intervals from various internal and external points. This allows us to identify whether the Box service is responding appropriately to a given set of inputs, from a variety of geographic locations.
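The idea of a scripted journey can be sketched as an ordered list of steps that either all succeed or stop at the first failure. This is a simplified illustration, not Box's synthetic client: `run_journey` and the step callables are hypothetical, and a real check would drive an emulated browser and be scheduled by an external runner at one-minute intervals.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_journey(name: str, steps: List[Step]) -> Dict[str, object]:
    """Run each scripted step in order and stop at the first failure.

    A step fails if it returns False or raises an exception.
    """
    start = time.monotonic()
    failed_step = None
    for step_name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            failed_step = step_name
            break
    return {
        "journey": name,
        "ok": failed_step is None,
        "failed_step": failed_step,
        "seconds": time.monotonic() - start,
    }
```

Each step here is a placeholder callable. In a real check, steps such as "load login page" or "open shared link preview" would issue requests and validate the responses, and the result would be recorded alongside the location the check ran from.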
 
Scheduled/Planned Maintenance
On occasion, we need to execute changes in our data center(s) or to specific services. While we never anticipate adverse user impact or downtime from these activities, we want to be transparent that something of this nature is taking place. Throughout the maintenance window and immediately afterward, we will closely monitor the status of the Box Services. Updates on the maintenance as well as any changes in status of our service(s) will be shared through the Box Status Site.