Rubrik v5

I. Background

Customer support engineers troubleshoot problematic jobs for Rubrik customers

6,000 customers around the globe rely on Rubrik's product to back up and recover their data

However, sometimes, these backup and recovery jobs fail for various reasons

When that happens, it’s the responsibility of customer support engineers to diagnose and fix these problems

The Rubrik troubleshooting platform aims to help engineers handle a high volume of requests more efficiently

There are too many customer requests coming in every day, and engineers can’t handle them efficiently

That's why Rubrik decided to build a platform to improve the efficiency of the troubleshooting process

II. Research

At the start of the project, I led stakeholder discussions to narrow the ambiguous scope down to the "happy path" for the hardware system

🤔

Different stakeholders had different ideas about the features and the systems we need to design

🤝🏻

To solve this, I organized 5+ stakeholder meetings to facilitate discussion

🚀

We agreed to prioritize the design for the hardware system first, particularly the ‘happy path’: the ideal workflow that satisfies the major user needs

At the same time, I quickly built domain expertise in backend troubleshooting and discovered insights about "Cluster"

🤯

The system was technically dense, and I knew nothing about backend troubleshooting at the start

📚

To overcome this, I quickly conducted desk research and expert interviews to learn the basics

💡

I discovered that engineers don’t only diagnose jobs; they also diagnose clusters, the workers that run the jobs. If a cluster fails, all of its jobs might fail

With a defined scope and solid domain knowledge, I conducted user research to gather insights that informed design decisions

Through 1-on-1 interviews with users, I further differentiated the personas

Troubleshoot 90% of the cases
Less domain knowledge

Investigate 10% of the escalated cases
More domain knowledge

From contextual inquiries, I identified 3 use cases for the platform

Among them, troubleshooting a job is the key use case that determines engineers' efficiency

So why can't engineers troubleshoot efficiently? I found 3 major pain points

Currently, engineers are manually entering commands into the Terminal, leading to 3 problems

Repetitive actions

Tasks that could be done with 1 button require typing multiple commands, which is time-consuming

Scattered info

Engineers can’t see relevant info they need at once because each command only returns part of the data

Unreadable data

The terminal’s text is tiny and cluttered, making it difficult to scan quickly

III. Design

To address repetitive actions, I designed a streamlined workflow that reduces unnecessary steps

Why repetitive actions?

The root cause of the repetitive actions was the step-by-step nature of the terminal process, where engineers can only proceed in a fixed order

To solve this, I researched and summarized 3 major user flows that engineers follow, depending on the type of customer request

Unclear problems
Customer's input: "I got some problematic jobs."
Engineer's goal: "Where is the problem?"
Engineer's user flow: Locate the problem by selecting Job Types

Multiple problems
Customer's input: "I got 10 problematic jobs, their IDs are …"
Engineer's goal: "Why did multiple jobs fail?"
Engineer's user flow: Find the reason by filtering Job States

Single problems
Customer's input: "1 job is stuck, the id is ..."
Engineer's goal: "What’s wrong with this job?"
Engineer's user flow: Directly search the problem and drill down to fix it

Based on these, I designed troubleshooting page flows that are tailored to the user flows and easy to switch between

The design cuts unnecessary steps and helps engineers find useful information faster and more easily
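The 3 user flows can be read as a simple routing rule based on what the customer provides. The sketch below is only an illustration: the function name, the flow names, and the idea of routing by the number of job IDs are assumptions, not the platform's actual logic, and the design still lets engineers switch between flows at any time.

```python
from typing import Sequence


def entry_flow(job_ids: Sequence[str]) -> str:
    """Pick a starting flow from how many job IDs the customer provided.

    The returned flow names are hypothetical labels for this example.
    """
    if len(job_ids) == 0:
        # Unclear problems: no IDs given, so start by selecting Job Types.
        return "select_job_types"
    if len(job_ids) == 1:
        # Single problems: search the job directly and drill down.
        return "search_job"
    # Multiple problems: filter by Job States to find a shared reason.
    return "filter_job_states"
```

For example, a request that names a single stuck job would start in the search flow, while a request with no IDs would start from the Job Types selection.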

To organize scattered info, I crafted a clear information architecture (IA) that lets engineers easily view relevant info

Why scattered info?

Information was scattered because there was too much info and too little organization. Therefore, I set out to find the most suitable info architecture for engineers

In order to gather valid user feedback, I used low-fidelity wireframes to visualize different design alternatives

One key design decision I made was to add a data dashboard page

Before | Only Job and Cluster

Pros & Cons

✅ Fulfills 2 major use cases

❌ Misses the info about failed nodes

❌ Cannot hold all types of information

Decision: User feedback showed that checking node health is always engineers' first step, but it is not shown on the first page. Besides, 2 tabs are insufficient to hold all the info they need

After | Add a data dashboard

✅ General diagnosis & hold more info

✅ Adds 1 step, but acceptable

Decision: Users said that they do need a data dashboard for general diagnosis, because it can give an overview to quickly identify problems and supports future feature expansions

The final IA makes it intuitive for users to find info across pages

It also empowers engineers to view related information side-by-side on the same page, making it much easier to understand the problem

To make data readable, I adopted decision-driven visualization that emphasized key info to help engineers make decisions efficiently

Why unreadable data?

The terminal's data clearly lacks visualization. So my strategy was to emphasize only the key info that engineers need to make decisions

Design 1: After weighing different alternatives, I designed a linear drill-down layout that helps engineers navigate to details intuitively and quickly

This layout gives engineers adequate information to perform the next step

It forms a simple and straightforward flow that engineers can easily understand

Iteration details

How to let users drill down quickly?

Content

A job is made up of a group of instances. If a job failed, it's because some of its instances failed.

After locating a problematic job, how can engineers drill down to view all the instances of this job?

01 | Tree view

✅ Show a clear hierarchy

❌ 1 job can have 1000 instances to expand

Decision: After learning how many instances a job can expand to, I realized that the very long list would make the page unreadable

02 | Interactive select

✅ Support quick switch between job & instances

❌ Quick switching is not a frequent use case

Decision: In later feedback, I realized that quick switching is not a frequent use case so this view might be too crowded

03 | Linear drill-down (final choice ⭐)

✅ Can hold more information on 1 page

✅ Intuitive even though it takes a few more clicks

Decision: After I presented the options to users, the feedback was that this pattern is the most intuitive even though it takes a few more clicks

The final design

Design 2: After 3 stages of iteration, I crafted a data visualization component that helps engineers reduce distraction and speed up decision-making

After research, categorization, color-coding, and iteration, this new UI better guides engineers' attention and helps them find 1 problematic job among 1000 jobs

Before: 18 messy states

❌ Need to read the names carefully

❌ Messy & easy to make mistakes

After: Clear visualization

✅ Can scan quickly

✅ Clear & more accurate

Iteration details

How to find 1 problematic job from 1000?

Content

When looking at 1000 instances of 1 job, how do engineers know which instance to pick?

The answer is to look at the "recent instance states" metric. When many recent instances behave abnormally, the job is likely to have problems
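The heuristic above can be sketched in a few lines. This is only an illustration, not the platform's actual logic: the state names, the window of 10 recent instances, and the 30% threshold are all assumptions made up for the example.

```python
from typing import Sequence

# Hypothetical state names; the platform's real states are not listed here.
ABNORMAL_STATES = {"FAILED", "TIMED_OUT", "STUCK"}


def abnormal_ratio(recent_states: Sequence[str], window: int = 10) -> float:
    """Share of abnormal states among the most recent `window` instances."""
    recent = list(recent_states)[-window:]
    if not recent:
        return 0.0
    abnormal = sum(1 for state in recent if state in ABNORMAL_STATES)
    return abnormal / len(recent)


def needs_attention(recent_states: Sequence[str], threshold: float = 0.3) -> bool:
    """Flag a job when enough of its recent instances behaved abnormally."""
    return abnormal_ratio(recent_states) >= threshold
```

A job whose last 10 instances include 2 failures would stay below the assumed threshold, while a job with 2 abnormal instances out of its last 3 would be flagged.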

So how can I make this decision quicker when the page is full of information?

3 design stages

Level 1 | What info is important?

There are 19 states in total. Should we show all of them?

After comparing alternatives, I decided on only displaying recent problematic states to direct users' attention to the things that are relevant to troubleshooting

Level 2 | Categorizing & color coding

Among the 19 states, there are 7 problematic states. How do users know what they should pay attention to?

I categorized states in terms of severity, and used color coding to represent only the states that need attention
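As a rough sketch of this idea, states can be mapped to a severity level, and only the states that have a severity receive a color. The state names, severity levels, and colors below are hypothetical placeholders, not the 7 problematic states used in the actual design.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Assumed severity levels; each value is the color used to highlight it."""
    CRITICAL = "red"
    WARNING = "orange"


# Hypothetical mapping: only states that need attention get a severity.
STATE_SEVERITY = {
    "FAILED": Severity.CRITICAL,
    "STUCK": Severity.CRITICAL,
    "RETRYING": Severity.WARNING,
}


def badge_color(state: str) -> Optional[str]:
    """Return a highlight color for states that need attention, else None."""
    severity = STATE_SEVERITY.get(state)
    return severity.value if severity is not None else None
```

States without a severity return no color, which mirrors the decision to color-code only what needs attention and leave everything else visually quiet.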

Level 3 | Compare components & iterate

Lastly, there were no development resources for building new components for this internal project.

So apart from brainstorming new components, I focused more on comparing the existing components and iterating them

Design outcome

After 100+ alternatives and 3 rounds of critique, I delivered the final prototype. The platform will be developed in Oct 2024

IV. Impact

According to tests with 31 users, the design enabled 300+ Rubrik engineers to troubleshoot more efficiently

It also received positive feedback from end-users

I also learned a lot about design logic and reflected on what I could have done better

ToB user-centered thinking

- Ask the right questions
- Cosplay the users

Think more & scope down

- Ask more “whys”
- Be clear about the constraints

Rational design decision

- Explore alternatives in early stages
- Break down things into pieces

Rubrik Troubleshooting Platform
