Developer Tools

Study shows LLMs can extract usability insights from app reviews

arXiv cs.SE May 14, 2026

⚡ New research finds LLMs can identify usability issues in a set of 300 app reviews with F-scores comparable to human raters

Deep Dive

A new study provides a dataset of 300 user reviews labeled by two human raters and an LLM. Judged by F-score, LLMs can generally recognize usability as a non-functional requirement, though performance and reliability depend strongly on the prompt. With prompts derived from Nielsen’s heuristics, the workflow offers a quicker, cheaper alternative to traditional ML approaches for processing user requirements.
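For readers unfamiliar with the metric, here is a minimal sketch of how an F-score could be computed when LLM labels are compared against a human rater's labels. The function name `f1_score` and the six example labels are made up for illustration; they are not the study's data or code.

```python
def f1_score(gold, predicted):
    """F1 for the positive class (True = review reports a usability issue)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # both say "usability"
    fp = sum(1 for g, p in pairs if not g and p)    # only the LLM says so
    fn = sum(1 for g, p in pairs if g and not p)    # only the human says so
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six reviews (not taken from the study's dataset).
human_labels = [True, True, False, True, False, False]
llm_labels = [True, False, False, True, True, False]

score = f1_score(human_labels, llm_labels)
print(f"F1 = {score:.3f}")
```

Because F1 balances precision and recall, it penalizes both a model that flags every review as a usability issue and one that misses most of them.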

Key Points

LLMs achieved comparable F-scores to human raters in identifying usability issues from 300 app reviews
Researchers built a dataset using Nielsen's 10 Usability Heuristics for prompt engineering
Prompt design significantly impacts LLM performance in this task
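
To make the prompt-engineering point concrete, the sketch below shows one way a classification prompt could be assembled from Nielsen's 10 Usability Heuristics. The prompt wording, `build_prompt`, and `parse_answer` are assumptions for illustration only; the study's actual prompts are not reproduced here, and no particular LLM API is assumed.

```python
# Nielsen's 10 Usability Heuristics, by their commonly cited names.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]


def build_prompt(review: str) -> str:
    """Assemble a hypothetical yes/no classification prompt for one review."""
    numbered = "\n".join(
        f"{i}. {name}" for i, name in enumerate(NIELSEN_HEURISTICS, start=1)
    )
    return (
        "You are classifying app reviews.\n"
        "A review is usability-related if it touches any of these heuristics:\n"
        f"{numbered}\n\n"
        f"Review: {review}\n"
        "Answer with exactly one word: yes or no."
    )


def parse_answer(raw: str) -> bool:
    """Map the model's free-text reply to a boolean label."""
    return raw.strip().lower().startswith("yes")


prompt = build_prompt("I can never tell whether my upload finished.")
print(prompt)
```

Constraining the reply to a single word keeps parsing simple, which matters when small changes in prompt wording can shift the resulting labels.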

Why It Matters

LLMs could let product teams automate usability feedback analysis at scale, provided the prompts are designed with care
