Rubrik Troubleshooting Platform

Crafted an enterprise platform from 0 to 1 for Rubrik engineers to troubleshoot backend jobs more efficiently
Sole designer | Collaborated with PM, Engineer, Design System | 2 months

🎯 Background

Customer support engineers at Rubrik struggled to troubleshoot customer jobs efficiently, leading to delayed responses.

🛠️ Action

Leading this platform’s design, I first resolved complexity and ambiguity through user research.

Then I reduced repetitive actions with a streamlined workflow, organized scattered information with a clear information architecture, and made unreadable data legible with decision-driven visualization.

📈 Impact

The design is estimated to reduce troubleshooting time by 74.2% for 300–400 engineers, and it achieved 87.7% user satisfaction.

Contribution

Sole designer, mentored by a senior

Length

June - July, 2024 (2 months)

Collaboration

PM, Engineer, Design System

Highlighted skills

Tackle complexity & ambiguity
Rational design decisions
Visual craftsmanship


Project Content

I. Background

Customer support engineers troubleshoot problematic jobs for Rubrik customers

6,000 customers around the globe rely on Rubrik's product to back up and recover their data
However, sometimes, these backup and recovery jobs fail for various reasons
When that happens, it’s the responsibility of customer support engineers to diagnose and fix these problems

The Rubrik Troubleshooting Platform aims to help engineers handle a high volume of requests more efficiently

There are too many customer requests coming in every day, and engineers can’t handle them efficiently
That's why Rubrik decided to build a platform to improve the efficiency of the troubleshooting process

II. Research

At the start of the project, I led stakeholder discussions to narrow the ambiguous scope down to the "happy path" for the hardware system

🤔
Different stakeholders had different ideas about the features and the systems we needed to design
🤝🏻
To solve this, I organized 5+ stakeholder meetings to facilitate discussion
🚀
We agreed to prioritize the design for the hardware system first, particularly a ‘happy path’: the ideal workflow that satisfies the major user needs

At the same time, I quickly gained domain expertise in backend troubleshooting and discovered insights about "Clusters"

🤯
The system was technically dense, and I knew nothing about backend troubleshooting at the start
📚
To overcome this, I quickly conducted desk research and expert interviews to learn the basics
💡
I discovered that engineers don’t only diagnose jobs; they also diagnose clusters, the workers that run the jobs. If a cluster fails, all jobs might fail

With a defined scope and solid domain knowledge, I gathered user insights through research to inform design decisions

Through 1-on-1 interviews with users, I further differentiated the personas
Troubleshoot 90% of the cases
Less domain knowledge
Investigate 10% of the escalated cases
More domain knowledge
From contextual inquiries, I identified 3 use cases for the platform
Among them, troubleshooting a job is the key use case that determines engineers' efficiency

So why can't engineers troubleshoot efficiently? I found 3 major pain points

Currently, engineers are manually entering commands into the Terminal, leading to 3 problems

Repetitive actions

Tasks that could be done with 1 button require typing multiple commands, which is time-consuming

Scattered info

Engineers can’t see relevant info they need at once because each command only returns part of the data

Unreadable data

The terminal’s text is tiny and cluttered, making it difficult to scan quickly

III. Design

To address repetitive actions, I designed a streamlined workflow that reduces unnecessary steps

Why repetitive actions?

The root cause of the repetitive actions was the step-by-step nature of the terminal process, where engineers can only proceed in a fixed order
To solve this, I researched and summarized 3 major user flows that show how engineers troubleshoot differently depending on the type of customer request

Unclear problems

Customer's input: "I got some problematic jobs."
Engineer's goal: "Where is the problem?"
Engineer's user flow: Locate the problem by selecting Job Types

Multiple problems

Customer's input: "I got 10 problematic jobs, their IDs are …"
Engineer's goal: "Why did multiple jobs fail?"
Engineer's user flow: Find the reason by filtering Job States

Single problem

Customer's input: "1 job is stuck, the id is ..."
Engineer's goal: "What’s wrong with this job?"
Engineer's user flow: Directly search for the problem and drill down to fix it

The design cuts unnecessary steps and helps engineers find useful information faster and more easily

To organize scattered info, I crafted a clear information architecture that lets engineers easily view relevant info

Why scattered info?

Information was scattered because there was too much info and too little organization. Therefore, I set out to find the most suitable info architecture for engineers
In order to gather valid user feedback, I used low-fidelity wireframes to visualize different design alternatives

One key design decision I made was to add a data dashboard page

Before | Only Job and Cluster

✅  Fulfills 2 major use cases
❌  Misses the info on failed nodes
❌  Cannot hold all types of information
Decision: User feedback showed that checking node health is always engineers' first step, but it was not shown on the first page. Besides, 2 tabs are insufficient to hold all the info they need.

After | Add a data dashboard

✅  General diagnosis & holds more info
✅  1 more step, but acceptable
Decision: Users said that they do need a data dashboard for general diagnosis, because it gives an overview for quickly identifying problems and supports future feature expansion

To make data readable, I adopted decision-driven visualization that emphasizes key info to help engineers make decisions efficiently

Why unreadable data?

The terminal's data clearly lacks visualization. So my strategy was to emphasize only the key info that engineers need to make decisions
Iteration details

How to let users drill down quickly?

Content

A job is made up of a group of instances. If a job failed, it's because some of its instances failed.

After locating a problematic job, how can engineers drill down to view all the instances of this job?

01  |  Tree view

✅  Show a clear hierarchy
❌  1 job can have 1000 instances to expand
Decision: After learning how many instances a job might need to expand, I realized that such a long list would make the page unreadable

02  |  Interactive select

✅  Support quick switch between job & instances
❌  Quick switching is not a frequent use case
Decision: From later feedback, I learned that quick switching is not a frequent use case, so this view might be too crowded

03  |  Linear drill-down (final choice ⭐)

✅  Can hold more information on 1 page
✅  Intuitive, even though it takes a few more clicks
Decision: After I presented the options to users, the feedback was that this pattern is the most intuitive, even though it takes a few more clicks

Design 2: After 3 stages of iteration, I crafted a data visualization component that helps engineers reduce distraction and speed up decision-making

After research, categorization, color coding, and iteration, this new UI better guides engineers' attention and helps them find 1 problematic job among 1000 jobs

Before: 18 messy states

❌  Need to read the names carefully
❌  Messy & easy to make mistakes

After:  Clear visualization

✅  Can scan quickly
✅  Clear & more accurate
Iteration details

How to find 1 problematic job from 1000?

Content

When looking at 1000 instances of 1 job, how do engineers know which instance to pick?

The answer is to look at the metric "recent instance states". When many recent instances show abnormal behavior, the job is likely to have problems
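
The "recent instance states" heuristic can be sketched roughly as follows. This is only an illustration of the reasoning, not the platform's implementation: the state names, the `window` size, and the `threshold` are all assumptions, since the case study does not specify them.

```python
from dataclasses import dataclass

# Hypothetical state names; the real platform's states are not listed here.
PROBLEMATIC_STATES = {"FAILED", "TIMED_OUT", "STUCK"}


@dataclass
class Job:
    job_id: str
    recent_instance_states: list[str]  # oldest first, newest last


def abnormal_ratio(job: Job, window: int = 10) -> float:
    """Share of the most recent `window` instances that are in a problematic state."""
    recent = job.recent_instance_states[-window:]
    if not recent:
        return 0.0
    abnormal = sum(1 for state in recent if state in PROBLEMATIC_STATES)
    return abnormal / len(recent)


def likely_problematic(jobs: list[Job], threshold: float = 0.5, window: int = 10) -> list[Job]:
    """Return jobs whose recent abnormal ratio meets the threshold, worst first."""
    flagged = [job for job in jobs if abnormal_ratio(job, window) >= threshold]
    return sorted(flagged, key=lambda job: abnormal_ratio(job, window), reverse=True)
```

Sorting the flagged jobs by ratio mirrors the design goal: the jobs with the most abnormal recent instances are the ones an engineer should see first.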

So how can I make this decision quicker when the page is full of information?

3 design stages

Level 1  |  What info is important?

There are 19 states in total. Should we show all of them?

After comparing alternatives, I decided to display only recent problematic states, directing users' attention to what is relevant to troubleshooting

Level 2  |  Categorizing & color coding

Among the 19 states, there are 7 problematic states. How do users know what they should pay attention to?

I categorized states in terms of severity, and used color coding to represent only the states that need attention
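
To illustrate the idea, here is a minimal sketch of a severity mapping in which only the states that need attention get a color. The state names, the severity levels, and the colors are hypothetical examples, not the actual 19 states or the design system's palette.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    NORMAL = "normal"


# Hypothetical states and groupings, for illustration only.
STATE_SEVERITY = {
    "FAILED": Severity.CRITICAL,
    "STUCK": Severity.CRITICAL,
    "RETRYING": Severity.WARNING,
    "CANCELED": Severity.WARNING,
    "SUCCEEDED": Severity.NORMAL,
    "RUNNING": Severity.NORMAL,
}

# Only severities that need attention are assigned a color.
SEVERITY_COLOR = {
    Severity.CRITICAL: "red",
    Severity.WARNING: "orange",
}


def badge_color(state: str) -> Optional[str]:
    """Return a highlight color for a state, or None if it needs no attention.

    Unknown states are treated as NORMAL here, which is an assumption.
    """
    severity = STATE_SEVERITY.get(state, Severity.NORMAL)
    return SEVERITY_COLOR.get(severity)
```

Leaving normal states uncolored keeps the page quiet, so the few colored states stand out when scanning a long list.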

Level 3  |  Compare components & iterate

Lastly, this internal project had no development resources for building new components.

So apart from brainstorming new components, I focused more on comparing the existing components and iterating on them

Design outcome

After 100+ alternatives and 3 rounds of critique, I delivered the final prototype. The platform will be developed in Oct 2024

IV. Impact

According to tests with 31 users, the design enables 300+ Rubrik engineers to troubleshoot more efficiently

It also received positive feedback from end-users

I also learned a lot about design logic and reflected on what I could have done better

B2B user-centered thinking

- Ask the right questions
- Cosplay the users

Think more & scope down

- Ask more “whys”
- Be clear about the constraints

Rational design decision

- Explore alternatives in early stages
- Break down things into pieces