Benchmarking Performance: A Human-led, AI-Augmented Protocol
This case study demonstrates how intentional data structure, purpose-built analysis environments, and grounded theory methodology can dramatically deepen and accelerate research without delegating analytical judgment to artificial intelligence.
The Problem Space & Diagnostic
Survey designed without program architecture
When the pilot survey results arrived, the Habanero project team had quantitative metrics but no framework to interpret them. They lacked a defined program architecture, established standards, and a coherent longitudinal plan to guide the employee experience strategy of the client, BC Hydro.
The survey itself was adequately designed: a 5-point Likert scale with a neutral midpoint, paired with open-response questions. But these were best-practice choices, not architectural ones. The survey existed in isolation. Foundational questions remained: What's important to measure? What key measures do we need? What data model must the survey produce to enable longitudinal benchmarking? The team had designed an adequate instrument without first designing the program it would serve.
Analysis without protocol or appropriate environment
I sat with the client's intended analyst—an HR leader with no formal research training—to understand her process. She described scrubbing the data for names and profanities, then analyzing raw data row-by-row, column-by-column, expressing how "tedious" this work was. Her analysis was confined to each column (each question), and any cross-analysis was certainly off the table.
Two problems were clear:
First, the process had no protocol. She was given no framework for systematic coding, standardized interpretation, or documented methodology. Analysis was intuitive, incomplete, and unrepeatable.
Second, the data structure was the root cause of inefficiency. The analysis file was formatted as respondent-based rows in submission order, the most arbitrary organizational scheme possible. To analyze all responses to Question 7, the analyst had to scan every row. To find patterns across respondents, she had to traverse the entire width of the spreadsheet, row by row, over a thousand times.
Fixing the first problem required defining a rigorous analysis protocol. Fixing the second required rethinking the data architecture entirely — a structural intervention that would only become possible once the survey instrument itself was redesigned.
Defining the Measurement Framework, Data Model, and Survey Redesign
The client's motivation to launch a benchmarking program wasn't about receiving an annual score or a pass/fail assessment—it was entirely about a need for nuanced, actionable insights.
Back-casting from study purpose to design the measurement framework
Meaningful insights require clarity on four questions: What change do we want to see? How can we measure progress and outcomes? What type of data is at play for each measure we define? What instruments or mechanisms will reliably produce these metrics?
Back-casting from what BC Hydro actually needs to know to achieve a specified outcome will shape the metrics, models, and studies of this program:
Study objective: Each annual benchmark survey is to gather employee opinion to a) accurately and consistently measure organizational alignment to company values and b) generate a grounded, nuanced understanding of explanatory themes.
Program objective: To consistently and reliably track progress and compare performance longitudinally, year over year, to generate actionable strategic insights.
The study objective signals quantitative ratings data when it states accurate and consistent measurement. Qualitative open-response data is entailed by generating a grounded, nuanced understanding of themes. A survey that intelligently mixes scale ratings with rich, descriptive contextual detail will deliver full value to both objectives. An analysis protocol that rigorously synthesizes both quantitative and qualitative insights into a coherent narrative is the explicit outcome of each study.
On a program level, "consistent" and "reliable" do a lot of heavy lifting. The data structure, the rating system, and the survey mechanism are each designed and orchestrated to deliver durable, scalable results.
Being crisp on the study and program objectives, and on the types of data and measures we would need, informed the definition of a three-tier reporting structure:
Engagement metrics track program health: participation rates (with demographic breakdowns), survey completion rates, and the completion funnel to understand where respondents drop off.
Overall response metrics show macro-level patterns: response distribution across all cultural values by layer (Company, Team—and later, People Leaders), and comment participation by topic.
Topic-level results provide the specificity leaders need: response distribution for each individual cultural value by layer, comment participation per topic, and critically, the explanatory themes extracted from qualitative responses.
This structure serves multiple audiences. Executives see high-level trends and engagement health. Department leaders see which specific values need attention in their teams. HR sees where the survey itself needs refinement based on completion patterns.
But reporting structure alone doesn't work without intentional data collection. To capture engagement metrics reliably, the data model and survey design needed to address the thousands of ambiguous empty cells. To enable topic-level analysis, we needed a data model that produced one response per row, tied to respondent metadata.
Defining the necessary data model
To enable the measurement framework, I defined a data model with two structural changes: adding a People Leaders rating layer, and grouping each value topic as a complete unit.
Adding a People Leaders data layer: Project documentation and client notes made it apparent that people leaders and managers were a salient "cultural layer" that would add valuable nuance to ratings signals. Qualitative insights from the pilot survey would later confirm this salience. So a third rating layer was added alongside the Company and Team ratings. Each company value topic now had a ratings triad (Company | Team | People Leaders) plus an open-response question.
Grouping each value topic's ratings triad and open response: This critical change ensures the downstream survey design outputs the data structure we need when we reformat the raw results from a wide, respondent-based format to a long, response-based one. This way, the ratings triad and open response appear as a response group tied to their topic and the corresponding respondent metadata.
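The response group described above can be pictured as a single record type. This is a minimal sketch, not the project's actual schema: the class name, the field names, and the example topic "Safety" are all hypothetical, chosen only to illustrate one ratings triad plus one optional open response per respondent and topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponseGroup:
    """One row of the long, response-based table: one respondent x one value topic.

    All field names are illustrative assumptions, not the project's real columns.
    """
    respondent_id: str                    # key linking back to respondent metadata
    topic: str                            # the company value being rated
    company_rating: Optional[str]         # rating for the Company layer
    team_rating: Optional[str]            # rating for the Team layer
    people_leaders_rating: Optional[str]  # rating for the People Leaders layer
    open_response: Optional[str]          # optional free-text comment

    def triad(self) -> dict:
        """Return the three layer ratings keyed by layer name."""
        return {
            "Company": self.company_rating,
            "Team": self.team_rating,
            "People Leaders": self.people_leaders_rating,
        }


# Hypothetical example record (topic name and ratings are made up).
example = ResponseGroup(
    respondent_id="R0001",
    topic="Safety",
    company_rating="Agree",
    team_rating="Disagree",
    people_leaders_rating="Agree",
    open_response=None,
)
```

Keeping the triad and the comment in one record is what lets a later reshape emit exactly one row per respondent and topic.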
Identifying flaws in the pilot scale
With the data model requirements defined, designing the survey became an exercise in reverse-engineering pre-defined outputs. The first obstacle was the pilot's existing rating scale — it couldn't produce the clean signals the data model required.
The pilot survey used a standard 5-point Likert scale with a neutral midpoint and strong agreement/disagreement at each pole. This is a reasonable choice in isolation, but it had structural flaws for this specific program.
The neutral midpoint conflated two distinct types of responses: opinions that were genuinely mixed or balanced viewpoints, and respondents who were disengaged or hadn't formed an opinion. There was no way to distinguish between them.
The scale also created another problem: blank cells appeared throughout the pilot results with no decipherable meaning. Were they skipped questions? Intentional non-responses? Survey abandonment? The data couldn't tell us.
The ratings system redesign: Why 4-point plus explicit non-response
Given the program's purpose—accurate results with actionable insights, not statistical rigour—I designed a 4-point scale of: Agree | Disagree | Mixed opinion | No comment/opinion.
At a surface level, the scale simplifies a respondent's decision-making because it forces commitment where it matters. Agree/Disagree/Mixed creates directional signal. There is no neutral fence-sitting option: respondents who have a view must articulate it, or select Mixed if they genuinely hold a nuanced or conditional opinion or see both sides. This produces cleaner data and stronger signals.
Mixed opinion captures organizational reality. Most workplace statements about values alignment aren't strictly binary. This category legitimizes complexity without requiring respondents to pick a side they don't fully own.
No comment makes non-response explicit. In the pilot survey results, blank cells were ambiguous. From Year 2 onward, respondents who have no opinion or choose not to answer have a logical category. This distinction is critical for longitudinal tracking: an intentional "no comment" in Year 2 can now be compared to a "no comment" in Year 3.
Removing strongly agree/disagree simplifies the scale and allows the open-response question to carry the load of strong opinions. If opinions are strong enough to warrant emphasis, respondents will provide reasons why in their comments.
The result: four response categories that are accessible, logical, and manageable, and easy to visualize, report, and cross-tabulate with themes.
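As a rough illustration of why four explicit categories are easy to report, the sketch below tallies a response distribution. The `Rating` enum mirrors the four categories named in the text; the `distribution` helper and the sample data are assumptions made for this example only.

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    """The four response categories named in the redesigned scale."""
    AGREE = "Agree"
    DISAGREE = "Disagree"
    MIXED = "Mixed opinion"
    NO_COMMENT = "No comment/opinion"


def distribution(ratings):
    """Return each category's share of responses, including categories with zero.

    Hypothetical helper: every category always appears in the output, so a
    report never has a silently missing bar.
    """
    counts = Counter(ratings)
    total = len(ratings)
    return {r.value: (counts[r] / total if total else 0.0) for r in Rating}


# Made-up sample: four responses to one value topic at one layer.
sample = [Rating.AGREE, Rating.AGREE, Rating.MIXED, Rating.NO_COMMENT]
shares = distribution(sample)
```

Because "No comment/opinion" is a category in its own right, it is counted like any other answer instead of disappearing as a blank cell.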
One final survey mechanism change addressed the same data-cleanliness principle at the structural level: I gated the survey so each page contained exactly one value topic — a ratings triad (required fields) plus the open-response question (optional).
Previously, questions weren't sectioned this way and all responses were optional. This created ambiguity: unclear metrics on participation, completion, abandonment, and whether empty cells represented intentionally withheld opinions or incomplete submissions. Mandatory rating fields meant that blank cells now explicitly entailed survey abandonment — a clean signal. This is thinking at the program level, not just the survey level.
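With required ratings on every gated page, a blank triad can be read as abandonment, which makes a completion funnel straightforward to compute. The following is a minimal sketch under that assumption; the function names, the data layout (one triad per page, `None` for a page never submitted), and the sample respondents are hypothetical.

```python
def last_completed_page(pages):
    """Count consecutive completed pages from the start of the survey.

    `pages` is a list in survey order. Each item is a tuple of three ratings,
    or None if the page was never submitted (assumed layout for this sketch).
    """
    completed = 0
    for triad in pages:
        if triad is None or any(rating is None for rating in triad):
            break  # first blank page marks the abandonment point
        completed += 1
    return completed


def completion_funnel(respondents, n_pages):
    """Return how many respondents completed at least page 1, 2, ..., n_pages."""
    reached = [last_completed_page(pages) for pages in respondents.values()]
    return [sum(1 for r in reached if r >= page) for page in range(1, n_pages + 1)]


# Made-up data: three respondents, three gated pages (one value topic each).
respondents = {
    "R0001": [("Agree", "Agree", "Mixed opinion"),
              ("Disagree", "Agree", "Agree"),
              ("Agree", "Agree", "Agree")],
    "R0002": [("Agree", "Mixed opinion", "Agree"), None, None],
    "R0003": [None, None, None],
}
funnel = completion_funnel(respondents, n_pages=3)
```

In this made-up sample, two respondents finish page 1 and one continues through pages 2 and 3, so the funnel shows the drop-off occurring after the first topic.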
Impact: Meaningful longitudinal measurement
These changes—the 4-point scale with explicit non-response, the three-layer rating structure, and the gating system—created a measurement instrument built for longitudinal integrity. The survey could now be run identically year after year, with comparable data and interpretable patterns.
More importantly, this structure centred the open-response questions as the primary insight generator. Year after year, thematic analysis of qualitative responses would reveal the texture and nuance behind the quantitative ratings. The program was designed so that qual analysis—not quant rigour—carried the interpretive load, enabling actionable insights that a score alone could never provide.
The AI-Augmented Protocol — The Analytical Architecture
Establishing the protocol: a Shu-Ha-Ri approach
The "tedious" process the client described stemmed from an undefined study protocol. My first move was to invoke the Shu-Ha-Ri learning framework: I offered to lead the full analysis myself and demonstrate what systematic coding, standardized interpretation, and transparent documentation look like, and how she could replicate them the following year.
Her enthusiastic acceptance gave me clear scope to define an AI-augmented workflow and technology stack that served both focused analysis/synthesis and knowledge-management needs. Open access to real-time working files, weekly check-ins to review work in progress and demonstrate analysis tasks, and a comprehensive end-to-end protocol guide would ensure she was empowered to learn continuously and hands-on — not through a one-time end-of-project download.
The root cause of inefficiency: respondent-based data structure
However well-intentioned the study methodology, no meaningful analysis is possible if the data we're working with isn't intentionally structured.
The pilot data was organized in wide format: respondent-based rows in submission order. This meant an analyst looking at Row 47 sees one respondent's answers to all questions scattered across columns. To analyze all responses to Question 1, the analyst must scan every row. To find patterns across respondents, she must traverse the entire width of the spreadsheet. The arbitrary ordering (by submission date) meant there was no natural focus—no way to cluster related responses together.
The data restructuring solution: wide format to long format
Because the survey was designed with an intentional data structure—every field deliberately included, nothing left to habit or one-off requests—we had a clean foundation to restructure the raw data.
The reshaping moved us from respondent-centric to response-centric organization. Each response now gets its own row, tied to a unique respondent ID that links it to all relevant participant metadata. This single structural change (indexing by respondent ID instead of submission date) transformed the data into a queryable format.
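The wide-to-long reshape can be sketched in a few lines of plain Python. This is an illustration only, not the workflow used on the project: the column naming pattern (`<topic>_<layer>`, `<topic>_Comment`), the `department` metadata field, the ID format, and the topic names are all assumptions.

```python
LAYERS = ("Company", "Team", "PeopleLeaders")  # assumed column suffixes
META = ("department",)                         # assumed respondent metadata


def wide_to_long(wide_rows, topics):
    """Turn one-row-per-respondent data into one-row-per-response-group data."""
    long_rows = []
    for n, row in enumerate(wide_rows, start=1):
        respondent_id = f"R{n:04d}"  # hypothetical ID format, assigned in row order
        for topic in topics:
            record = {"respondent_id": respondent_id, "topic": topic}
            record.update({field: row.get(field) for field in META})
            for layer in LAYERS:
                record[layer] = row.get(f"{topic}_{layer}")
            record["comment"] = row.get(f"{topic}_Comment")
            long_rows.append(record)
    return long_rows


# Made-up wide data: two respondents, two value topics.
wide = [
    {"department": "Operations",
     "Safety_Company": "Agree", "Safety_Team": "Agree",
     "Safety_PeopleLeaders": "Mixed opinion", "Safety_Comment": "Clear procedures.",
     "Teamwork_Company": "Disagree", "Teamwork_Team": "Agree",
     "Teamwork_PeopleLeaders": "Agree", "Teamwork_Comment": ""},
    {"department": "Finance",
     "Safety_Company": "Agree", "Safety_Team": "No comment/opinion",
     "Safety_PeopleLeaders": "Agree", "Safety_Comment": "",
     "Teamwork_Company": "Agree", "Teamwork_Team": "Agree",
     "Teamwork_PeopleLeaders": "Disagree", "Teamwork_Comment": "Silos persist."},
]
long_rows = wide_to_long(wide, topics=["Safety", "Teamwork"])
```

Two respondents and two topics yield four rows, each carrying its own respondent ID and metadata, so rows can be regrouped freely without losing who said what.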
Now analysis became possible: Group all responses to a specific value together. Filter by respondent demographic or rating level. Sort by question or theme instead of submission order. Cross-analyze patterns without scanning the entire spreadsheet width. This wasn't a convenience—it was the foundation that made systematic analysis possible.
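The operations listed above (group, filter, sort, count) become one-liners once each response is its own row. The sketch below assumes a long-format list of dictionaries; the field names, departments, and topics are hypothetical sample data.

```python
from collections import defaultdict

# Made-up long-format rows: one row per respondent and value topic.
rows = [
    {"respondent_id": "R0002", "department": "Finance", "topic": "Teamwork", "Team": "Agree"},
    {"respondent_id": "R0001", "department": "Operations", "topic": "Safety", "Team": "Agree"},
    {"respondent_id": "R0001", "department": "Operations", "topic": "Teamwork", "Team": "Disagree"},
    {"respondent_id": "R0002", "department": "Finance", "topic": "Safety", "Team": "Mixed opinion"},
]


def group_by(records, key):
    """Group records into lists keyed by the value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Group all responses to a specific value together, with exact counts per group.
by_topic = group_by(rows, "topic")
counts = {topic: len(group) for topic, group in by_topic.items()}

# Filter by respondent demographic and rating level.
ops_disagree = [r for r in rows
                if r["department"] == "Operations" and r["Team"] == "Disagree"]

# Sort by topic, then respondent, instead of submission order.
ordered = sorted(rows, key=lambda r: (r["topic"], r["respondent_id"]))
```

The per-group counts are also what make it possible to estimate analysis time before starting a response group.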
Generating the pre-analytically structured data: AI as a tool for data preparation
The survey design and data reshaping solved the structural problem. But the HR analyst — working with a much larger project portfolio alongside this one — still faced a practical constraint: time.
I had previously developed a generative AI task flow using Copilot's Analyst Agent that could produce this pre-analytically structured data in under 2 minutes. The workflow assigned a unique ID_Key to each respondent, then reshaped the data from wide to long format, creating a purpose-made import file ready for analysis.
This was AI augmenting expertise, not replacing it. The human still defined what mattered, what questions to ask, what the data should look like. AI handled the mechanical task of reshaping at scale.
Condens.io as purpose-built analysis environment
With pre-analytically structured data in hand, we needed an environment designed for systematic analysis — not Excel.
Condens is primarily known for analyzing video or audio transcripts, but its architecture proves powerful for survey analysis. The platform's strength lies in how it handles the intentional data structure we first defined. We can index and sort the response table to surface the most urgent or salient topics and demographics, understand the exact count of responses per group, and accurately plan and estimate task time — a crucial capability for agile teams and executives needing to prioritize insights by impact.
My workflow in Condens: isolate the response group to analyze, open each response as a session, read and parse the verbatim, add notes, code for theme, and assign to clusters or directly to a Miro whiteboard built for synthesis. This session-level view is a game-changer compared to parsing and coding directly in Excel. Two key synthesis artefacts emerged from this workflow:
- Stats artefact — report card benchmarking metrics with qualitative insights explaining the why
- Clusters artefact — themes and sub-groupings organized under each parent cluster
In this way, analysis and synthesis happened together — systematic building of results as we worked through each response.
Result: comprehensive analysis of 1,000+ responses in roughly a third of the time
I kept detailed logs documenting each day's task, the specific response groups sorted and filtered, their counts, and their start and end times. Based on these logs, I completed end-to-end analysis of 1,000+ responses in 19 hours versus an estimated 60–70 hours using manual Excel colour-coding and pivot tables.
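The arithmetic behind the comparison is simple enough to check directly. The figures below come from the logs and estimate cited above; the variable names are illustrative.

```python
actual_hours = 19                       # logged end-to-end analysis time
estimate_low, estimate_high = 60, 70    # estimated manual Excel effort, in hours

# Speed-up factor: how many times longer the manual estimate is.
speedup_low = estimate_low / actual_hours    # about 3.2
speedup_high = estimate_high / actual_hours  # about 3.7

# Share of the estimated manual time actually spent.
share_best = actual_hours / estimate_high    # about 0.27
share_worst = actual_hours / estimate_low    # about 0.32

print(f"Speed-up: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"Time used: {share_best:.0%} to {share_worst:.0%} of the manual estimate")
```

So the rounded "3x" figure sits at the conservative end of the range implied by the 60–70 hour estimate.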
Two important caveats. First, the 3x efficiency gain says nothing about the much greater depth and quality of analysis this approach affords, an improvement that no time metric can capture. Second, the 3x figure may understate the gain for the HR analyst. I have 12 years of experience analyzing data, which makes me more efficient from the start. Applied to her skill set, the efficiency gains could be significantly larger.
What this demonstrates is that the combination of intentional data structure, a purpose-built analysis environment, and a systematic protocol enabled comprehensive analysis that would have been impractical in Excel. The AI tool (Copilot) prepared the data. The architecture (long format, respondent ID, queryable structure) made analysis possible. Condens provided the environment for systematic, transparent work.
The Lexicon Flywheel — Building for Iteration
After delivering the pilot analysis, I built a lexicon file system designed to accelerate future analysis cycles. Grounded entirely in human-verified, fully coded results from the previous cycle, the lexicon guides generative AI to pre-assign themes with increasing accuracy in Year 2 and beyond — allowing researchers to spend less time on mechanical coding and more on deep insight generation.
The mechanism is straightforward: Year 1's analyzed, themed, and verified results are exported to a Lexicon.md file—a structured repository of keywords, phrasing patterns, and thematic clusters that emerged from human analysis. In Year 2, this lexicon informs the AI's initial theme assignment on new survey responses. A researcher then reviews and refines these pre-assignments rather than coding from scratch. Each cycle, the lexicon grows richer, the pre-assignments get faster and more accurate, and the researcher's path to deep analysis shortens.
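To make the review loop concrete, here is a deliberately simplified sketch. It is not the generative AI step described above: a plain keyword matcher stands in for the AI pre-assignment, and the lexicon contents, theme names, and function names are all hypothetical. What it illustrates is the shape of the cycle: suggest themes from the lexicon, flag every suggestion for human review, then fold verified phrases back in.

```python
import re

# Hypothetical lexicon: theme -> phrases verified by a human in the prior cycle.
LEXICON = {
    "Workload": ["overtime", "understaffed", "burnout"],
    "Leadership communication": ["transparent", "town hall", "listen"],
}


def pre_assign(response, lexicon):
    """Suggest themes for one response; a keyword stand-in for the AI step."""
    text = response.lower()
    hits = {}
    for theme, phrases in lexicon.items():
        matched = [p for p in phrases
                   if re.search(r"\b" + re.escape(p.lower()) + r"\b", text)]
        if matched:
            hits[theme] = matched
    # Every suggestion is provisional until a researcher confirms or corrects it.
    return {"response": response, "suggested_themes": hits,
            "status": "needs_human_review"}


def enrich(lexicon, theme, new_phrases):
    """Return a new lexicon with human-verified phrases added under a theme."""
    updated = {t: list(p) for t, p in lexicon.items()}
    existing = updated.setdefault(theme, [])
    for phrase in new_phrases:
        if phrase not in existing:
            existing.append(phrase)
    return updated


result = pre_assign("We are understaffed and nobody will listen.", LEXICON)
next_year = enrich(LEXICON, "Workload", ["short-staffed"])
```

The design point is that the lexicon only ever grows from verified human coding, so pre-assignment stays a suggestion and never a final code.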
By Year 3+, the system becomes self-accelerating: human researchers validate and enrich the lexicon, which then guides faster AI pre-analysis, which surfaces deeper patterns for human interpretation. This isn't about replacing researchers with AI — it's about applying theoretical expertise and practical experience to workflows that amplify research capability and expand the ground a researcher can cover in a given timeframe.