Usability testing should always be part of product development. Ideally, it is conducted before release, but in practice it can be useful at different stages. It’s worth pointing out that “testing” isn’t a goal in itself. Without knowing exactly what you want to measure, you risk ending up with feedback you can’t interpret or act on.
Every usability study can focus on a different aspect of the experience, and the right metrics depend on your research goals. This article outlines some of the most common usability testing metrics and how they can be applied in practice.

Scenario
To make the following metrics easier to follow, let’s use a single scenario throughout:
Suppose your team has just released a new “Add to Wishlist” feature in an e-commerce app. You want to understand whether users can find and use it without confusion.
For this, you recruit 5 participants and give them realistic tasks, such as “Find a product you like and add it to your wishlist.” As we go through the metrics, we’ll refer back to this example to see how each measurement works in practice.
1. Task Success Rate
Task success rate is the most widely used and arguably the most important usability testing metric. It’s a simple yes-or-no measure: were participants able to complete the task or not?
Many product teams set a benchmark for this metric, such as requiring at least 80% of participants to succeed before a feature is considered ready for launch. The exact threshold depends on the product, its complexity, and how critical the task is.
In our example, the task is: “Find a product you like and add it to your wishlist.” If 4 out of 5 participants complete it successfully, your task success rate is 80%. If only 2 succeed, you immediately know the design needs significant improvement before release.
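If you log results digitally, the calculation is easy to automate. Below is a minimal sketch in Python; the per-participant results and the 80% benchmark are hypothetical, mirroring the example above.

```python
# Hypothetical results for the wishlist task: True = completed, False = failed.
results = [True, True, False, True, True]

success_rate = sum(results) / len(results)  # 4 of 5 participants succeeded
benchmark = 0.80  # example launch threshold; tune it to your product and task

print(f"Task success rate: {success_rate:.0%}")  # -> 80%
print("Meets benchmark" if success_rate >= benchmark else "Needs iteration before release")
```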
2. Time on Task
Time on task measures how long it takes participants to complete the task. Unlike task success rate, which is binary, this metric shows efficiency. Even if users eventually succeed, taking too long can be just as problematic as failing altogether.
Teams often set benchmarks for how much time a task should reasonably take. For simple interactions, the expectation might be less than a minute. If participants consistently take 5 or more minutes, you can assume that real users, who are far less patient in the wild, would likely give up before finishing.
In our wishlist example, you might expect users to spot the “Add to Wishlist” button within 5 seconds of opening a product page. If a participant spends a full minute scanning the screen before finding it, that’s a clear signal the feature isn’t obvious enough.
When documenting this metric, simply run a timer during each attempt. For example:
- Participant 1: 10 seconds
- Participant 2: 70 seconds
- Participant 3: 18 seconds
Time on task is also useful for tracking learning curves. For example, a participant might need 70 seconds on their first attempt, but only 30 seconds the second time. That difference shows how quickly users adapt, which is valuable for features that require repeat usage.
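As a rough sketch of how those timings could be summarized (the second-attempt numbers below are hypothetical, added only to illustrate the learning-curve comparison):

```python
# Hypothetical timings in seconds: first and second attempt per participant.
first_attempt = [10, 70, 18]
second_attempt = [8, 30, 12]

avg_first = sum(first_attempt) / len(first_attempt)
avg_second = sum(second_attempt) / len(second_attempt)

print(f"Average time, first attempt:  {avg_first:.1f}s")   # ~32.7s
print(f"Average time, second attempt: {avg_second:.1f}s")  # ~16.7s
# A steep drop between attempts suggests the feature is learnable,
# but the first-run experience may still need work.
```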
3. Error Rate
Error rate tracks how often participants make mistakes while completing a task. A task can still be completed successfully, but if users stumble through errors along the way, that’s a sign the design isn’t intuitive.
Errors can include things like:
- Clicking the wrong button
- Misinterpreting an icon or label
- Navigating to the wrong screen before correcting themselves
- Repeatedly trying an action that doesn’t work
In our wishlist example, the task is: “Find a product you like and add it to your wishlist.” Suppose one participant tries tapping the product image itself instead of the “Add to Wishlist” icon, or another keeps adding the item to favorites instead of the wishlist. Even if they eventually manage to add an item to the wishlist, those detours count as errors.
To document this metric, you simply note down the mistakes for each participant. For example:
- Participant 1: 2 errors (tapped product image, opened cart)
- Participant 2: 0 errors
- Participant 3: 1 error (confused by unlabeled heart icon)
High error rates suggest that users are guessing rather than confidently interacting. Even if task success and time on task look acceptable, frequent errors can lead to frustration and reduce long-term satisfaction with the product.
4. Number of Assists
The number of assists measures how often a facilitator or moderator needs to step in and help a participant complete a task. In an ideal world, participants should be able to figure things out on their own. Every time you have to explain, clarify, or point something out, it signals a usability issue.
In our wishlist example, the task is: “Find a product you like and add it to your wishlist.” If a participant stares at the screen for over a minute, finally asks, “Where is the wishlist button?”, and you have to show them, that counts as an assist.
Documenting this is simple:
- Participant 1: 0 assists
- Participant 2: 1 assist (needed help locating the icon)
- Participant 3: 2 assists (asked what “wishlist” means, then couldn’t find the button)
Tracking assists is valuable because it shows where users would likely get stuck in real-world conditions where no one is there to guide them. Even if task success looks high, a design that depends heavily on assistance isn’t usable in practice.
5. Path Deviation
Path deviation measures how closely participants’ actions match the intended or optimal path you designed for completing a task. In other words, it tracks the detours users take.
A task can still be completed successfully, but if participants wander through unnecessary screens, click unrelated elements, or backtrack multiple times, it suggests your design isn’t guiding them clearly enough.
In our wishlist example, the optimal path might be:
- Open a product page
- Click the “Add to Wishlist” button
If a participant instead goes:
Home page → Cart → Back to Home → Product page → Add to Wishlist, that’s a path deviation. They got there in the end, but the extra steps reveal friction.
You can document this by mapping each participant’s actual path and comparing it to the intended one. For instance:
- Participant 1: Followed optimal path (0 deviations)
- Participant 2: Went to Cart first (1 deviation)
- Participant 3: Opened Search → Home → Product page → Wishlist (2 deviations)
High deviation rates don’t always mean failure, but they do show where your design is misleading or cluttered. Reducing unnecessary detours makes the product faster, smoother, and less frustrating.
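If you want to quantify deviations rather than just describe them, one option is to compare each recorded path against the optimal one. The sketch below is only an illustration; the screen names and the counting rules (off-path screens plus repeated screens) are assumptions, not a standard formula.

```python
# Hypothetical intended path and recorded paths for the wishlist task.
optimal_path = ["Home", "Product page", "Add to Wishlist"]

recorded_paths = {
    "Participant 1": ["Home", "Product page", "Add to Wishlist"],
    "Participant 2": ["Home", "Cart", "Home", "Product page", "Add to Wishlist"],
    "Participant 3": ["Home", "Search", "Home", "Product page", "Add to Wishlist"],
}

for name, path in recorded_paths.items():
    # Screens visited that are not part of the intended path at all.
    off_path = sum(1 for screen in path if screen not in optimal_path)
    # Repeated visits to on-path screens (e.g. bouncing back to Home) also add friction.
    revisits = len(path) - len(set(path))
    print(f"{name}: {off_path} off-path screens, {revisits} repeated screens")
```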
6. Perceived Ease of Use
Perceived ease of use is a simple attitudinal metric: after completing a task, you ask participants, “On a scale of 1 to 10, how easy or difficult was this task?” The score reflects their subjective impression, not just whether they managed to finish.
For example, after asking participants to add an item to their wishlist, you might collect ratings like:
- Participant 1: 9
- Participant 2: 7
- Participant 3: 5
You can then calculate the average across all participants. Many teams set a minimum threshold (for example, an average of 6.5 or higher) as a goal.
This metric matters because people’s perception of difficulty often shapes their willingness to return. Even if they succeed quickly, if they feel the task was confusing, they’re less likely to trust or enjoy using the feature.
7. Error Recovery Rate
Error recovery rate measures how often participants are able to recover from mistakes on their own during a task. While error rate tells you how many mistakes happen, this metric shows whether users can recognize the error and fix it without external help.
In our wishlist example, a participant might first tap the shopping cart instead of the “Add to Wishlist” button. If they realize the mistake, go back, and then successfully add the product to the wishlist, that counts as an error recovery. If they get stuck or abandon the task, that’s a failed recovery.
You can document this by tracking errors alongside recovery attempts:
- Participant 1: 2 errors, 2 recoveries (100% recovery rate)
- Participant 2: 1 error, 0 recoveries (0% recovery rate)
- Participant 3: 3 errors, 2 recoveries (67% recovery rate)
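A quick sketch of that arithmetic, using the hypothetical counts above; it also covers the edge case where a participant makes no errors at all, in which case a recovery rate simply doesn’t apply.

```python
# Hypothetical (errors, recoveries) pairs per participant.
observations = {
    "Participant 1": (2, 2),
    "Participant 2": (1, 0),
    "Participant 3": (3, 2),
}

for name, (errors, recoveries) in observations.items():
    if errors == 0:
        print(f"{name}: no errors, recovery rate not applicable")
    else:
        print(f"{name}: {recoveries / errors:.0%} recovery rate")

# Overall rate across all observed errors: 4 of 6 recovered.
total_errors = sum(e for e, _ in observations.values())
total_recoveries = sum(r for _, r in observations.values())
print(f"Overall recovery rate: {total_recoveries / total_errors:.0%}")
```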
An important detail is whether users even notice they’ve made an error. If they don’t realize it, that can be a serious problem. For example, if someone taps “Add to Wishlist” but nothing happens and they assume it worked, you need to make the interaction clearer through confirmation messages, animations, or visual state changes. A lack of awareness means errors go uncorrected, and the system silently fails the user.
It’s also worth noting that error recovery isn’t equally critical for every product. For some apps, a missed wishlist item isn’t the end of the world. But in domains like financial services, healthcare, or insurance, making sure users immediately recognize and recover from errors is essential. The consequences of unnoticed or unrecoverable errors in these contexts can be far more serious than simple frustration.
8. Confidence Level
Confidence level measures how confident participants feel about the actions they took during a task. Even if they completed it successfully, a low confidence score suggests they weren’t sure they did the right thing, which often translates into hesitation or second-guessing in real use.
This is typically measured by asking a simple post-task question like:
“On a scale of 1 to 5 (or 1 to 10), how confident are you that you completed the task correctly?”
In the wishlist example, a participant might add an item but not see a confirmation message or clear animation. As a result, they think, “I’m not sure if it worked.” That’s a low confidence score, even though the item was technically added. Another participant might click “Add to Favorites” instead of “Add to Wishlist” and believe they succeeded. In that case, confidence might be high, but the task is actually a failure because they didn’t complete the correct action.
Documenting this can look like:
- Participant 1: Confident (9/10)
- Participant 2: Unsure (4/10, didn’t notice confirmation)
- Participant 3: Confident but incorrect (8/10, used Favorites instead of Wishlist)
Confidence level is valuable because it reveals gaps between what users think happened and what actually happened. Low confidence often points to poor feedback or unclear system status. False confidence exposes mislabeling or misleading design choices. Both are signals to refine clarity and trust in the interface.
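One practical way to surface these gaps is to cross-check each confidence rating against whether the participant actually completed the correct action. The sketch below uses hypothetical scores and an arbitrary cutoff of 7 for “felt confident”.

```python
# Hypothetical post-task data: (confidence rating 1-10, correct action completed?)
sessions = {
    "Participant 1": (9, True),
    "Participant 2": (4, True),   # succeeded but never noticed a confirmation
    "Participant 3": (8, False),  # used Favorites instead of Wishlist
}

CONFIDENT = 7  # arbitrary cutoff for "felt confident"

for name, (confidence, succeeded) in sessions.items():
    if confidence >= CONFIDENT and not succeeded:
        print(f"{name}: false confidence -> review labels and affordances")
    elif confidence < CONFIDENT and succeeded:
        print(f"{name}: low confidence despite success -> review feedback and system status")
    else:
        print(f"{name}: confidence matches the outcome")
```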
Documenting Usability Metrics
Collecting the data is only half the job; documenting it in a consistent, readable way is just as important. You don’t need specialized software. Simple spreadsheets, Notion tables, or even pen and paper all work. The key is to be consistent and leave space for additional notes or observations.
Here’s an example of how a simple table might look, using the figures from the wishlist scenario:

| Participant | Task Success | Time on Task | Errors | Assists | Ease (1–10) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Yes | 10 s | 2 | 0 | 9 | Tapped product image first |
| 2 | Yes | 70 s | 0 | 1 | 7 | Needed help locating the icon |
| 3 | Yes | 18 s | 1 | 2 | 5 | Confused by unlabeled heart icon |
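If you also keep the raw data in a machine-readable form, the same table translates directly into consistent records. The field names in this sketch are just one possible convention, not a required schema.

```python
# One record per participant per task, using the same fields every time.
records = [
    {"participant": 1, "task": "Add to Wishlist", "success": True,
     "time_s": 10, "errors": 2, "assists": 0, "ease": 9,
     "notes": "Tapped product image first"},
    {"participant": 2, "task": "Add to Wishlist", "success": True,
     "time_s": 70, "errors": 0, "assists": 1, "ease": 7,
     "notes": "Needed help locating the icon"},
    {"participant": 3, "task": "Add to Wishlist", "success": True,
     "time_s": 18, "errors": 1, "assists": 2, "ease": 5,
     "notes": "Confused by unlabeled heart icon"},
]

# A consistent schema makes quick summaries trivial to produce.
success_rate = sum(r["success"] for r in records) / len(records)
avg_ease = sum(r["ease"] for r in records) / len(records)
print(f"Success rate: {success_rate:.0%}, average ease: {avg_ease:.1f}/10")
```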
For usability testing with a larger number of participants, it can also be useful to record demographics such as age, gender, or level of digital experience. This helps you see whether certain usability issues correlate with specific groups. For example, adding an item to a wishlist may feel completely natural for Gen Z participants but less intuitive for Gen X, revealing design assumptions that might exclude part of your audience.
Usability testing is more than just watching people interact with your product; it’s about measuring, though careful observation is still a critical part of the process. Metrics turn those observations into actionable insights. The key is to focus on the ones that align with your goals, document them consistently, and look for patterns across participants.
If you’d like to dive deeper into strategies for running stronger usability studies, check out my article: 5 Ways to Improve Your Usability Studies.