Microsoft study finds AI mindset training doubled the odds of top-quality output at Gap Inc.

A field experiment run by Microsoft Research with 388 Gap Inc. employees found that reframing AI as a thought partner produced better individual work than standard tool training, and that mandating structured collaborative protocols may do more harm than good.

Gap Inc. partnered with Microsoft Research and AI Mindset on a field experiment testing how different AI training approaches affect the quality of employee output when using Microsoft Copilot. Image credit: Gap Inc.

A Microsoft Research study has found that teaching employees to think differently about AI, rather than training them on its features, was associated with significantly higher quality individual output when using Microsoft Copilot.

The research, conducted with Gap Inc. and training company AI Mindset, has direct implications for how organizations design AI training and roll out tools across their workforces.

The study tested two approaches to supporting human-AI collaboration among 388 full-time Gap Inc. employees. All participants had identical access to Copilot. The only variable was the structure surrounding how they used it.

Structured protocols backfired — reframing worked

The experiment ran two sequential tasks, each testing a different kind of support.

The first paired employees together and required them to follow a structured "Create-Out-Loud" protocol: meet via Microsoft Teams, discuss a strategic plan verbally, generate a transcript, then prompt Copilot to draft a document from it. The second was an individual task in which some participants received standard Copilot feature training (here are the features, here are some prompts), while others received partnership training developed by AI Mindset, focused entirely on behavioral change rather than technical know-how.

The results were starkly different.

Pairs assigned to the structured protocol produced significantly lower-scoring documents than those who worked without any enforced workflow, averaging 10.68 out of 22 compared to 15.63 for the control group. Treatment pairs were also over eight times more likely to fail to produce a document at all within the allotted time. The protocol imposed real coordination costs (synchronous meetings, sequential prompting steps, compliance overhead) that consumed time and attention control pairs could direct toward content instead.

The individual partnership training told a different story. Participants who received it had more than twice the odds of producing a top-quality document compared with those who got standard feature training: 77 percent hit the maximum score versus 61.8 percent of controls. The continuous score difference was not statistically significant, which the researchers attribute to a ceiling effect: 68 percent of all documents scored the maximum 20 out of 20, leaving too little variance for a linear model to detect differences.
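The ceiling-effect argument can be made concrete with a toy simulation. The data below is entirely synthetic, not the study's: the latent means, noise level, and sample size are arbitrary assumptions chosen only to show the mechanism. When observed scores are capped, a genuine quality gap compresses in the group means, while the share of documents hitting the ceiling still separates the groups.

```python
import random

# Toy ceiling-effect illustration (synthetic data, NOT the study's data).
# Latent quality differs between groups, but observed scores are capped,
# so mean differences compress while the ceiling-share still separates.
random.seed(0)

def observed_scores(latent_mean, n=500, cap=20):
    # Latent quality = latent_mean + Gaussian noise; observed score is
    # truncated at the cap, mimicking a rubric with a maximum score.
    return [min(cap, latent_mean + random.gauss(0, 3)) for _ in range(n)]

control = observed_scores(20.0)  # hypothetical control latent mean
treated = observed_scores(22.0)  # hypothetical treatment latent mean (+2 latent gap)

mean_gap = sum(treated) / len(treated) - sum(control) / len(control)
top_control = sum(s == 20 for s in control) / len(control)
top_treated = sum(s == 20 for s in treated) / len(treated)

print(f"observed mean gap: {mean_gap:.2f} (latent gap was 2.00)")
print(f"share at ceiling: {top_control:.0%} control vs {top_treated:.0%} treated")
```

The observed mean gap comes out smaller than the true latent gap, while the binary "hit the ceiling" comparison remains clearly different between groups, which is why a binary threshold can detect an effect a linear model on the compressed scores cannot.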

The tool is not the bottleneck

The AI Mindset training was developed by Conor Grennan, former Chief AI Architect at NYU Stern and CEO of AI Mindset, a company he describes as focused on transforming how professionals think about and use AI, not through feature knowledge, but through behavioral change.

The training did not include feature demos, prompt tips, or use cases. Instead it focused on three things: reframing AI as a collaborative partner rather than a search engine, replacing a single-query interaction habit with a multi-turn conversational one, and giving participants guided practice in iterative prompting. Its flagship course, Generative AI for Professionals, covers practical frameworks across 67 lessons and is designed for corporations looking to drive AI adoption and develop AI champions within their teams.

Grennan posted about the results on LinkedIn. He described the standard training approach most organizations use — "This is Copilot. Here are the best features. Here are some killer use cases. Here are some great prompts" — and contrasted it with AI Mindset's approach, which he said showed participants "how their brains were fundamentally wired wrong for AI" and demonstrated what it looks like to think with AI rather than query it. He wrote that the shift happened in just 30 minutes.

Crucially, he pointed out that the study was not his company's own research. Microsoft designed it, Microsoft ran it, and Microsoft analyzed the data and published the full paper. He wrote: "The company that makes Copilot proved that the tool isn't the bottleneck. How people think about it is."

He credited Alexia Cambon as "the star of the show," alongside researcher Alex Farach, and Microsoft team members Rebecca Janssen, Lev Tankelevitch, and Connie Hsueh. He also named Gap Inc. CTO Sven Gerjets as a key partner in supporting the research investment.

Participants who received partnership training also showed greater positive belief change over the course of the session, particularly on a measure of exploration and experimentation with AI. The researchers note, however, that because beliefs were first measured after the collaborative task, not before the study began, the observed shifts likely reflect recovery from the friction of the structured protocol rather than durable training-induced change.

What the study can and cannot tell us

The researchers are transparent about the study's limitations, and they matter for anyone drawing conclusions from the findings.

The most significant is a time-of-day confound: control participants completed their tasks in the morning, treatment participants in the afternoon. The researchers calibrated this against circadian performance data from the cognitive psychology literature and found that afternoon fatigue would need to be roughly three times larger than any effect documented in that research to fully explain the treatment effect, but they cannot rule out timing entirely.

Differential attrition is a second concern. Treatment pairs were far less likely to complete and submit documents for both tasks, meaning quality comparisons are made on a selected sample rather than the full group. Statistical bounds testing confirms the collaborative task quality result holds under worst-case attrition scenarios. The individual task binary result does not pass the same robustness test.

Documents were graded using GPT-4o-mini rather than human raters, which showed strong rank-order agreement with human scores for the collaborative task but only moderate agreement for the individual task. The AI grader also showed a systematic positive bias of 4.9 points relative to human raters on average, though this was consistent across both conditions, meaning it does not distort the treatment effect estimates. The study was not pre-registered, and several analytical decisions, including the binary quality threshold and compliance classifications, were made after observing the data.
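The claim that a consistent bias does not distort treatment effect estimates follows from simple arithmetic: adding the same constant to both conditions shifts each group's mean but cancels in the difference. A minimal sketch, in which only the 4.9-point bias figure comes from the study and the human-rated means are hypothetical placeholders:

```python
# A constant grader bias shifts both groups equally, so it cancels in the
# treatment-minus-control difference the study reports.
BIAS = 4.9  # average AI-grader inflation relative to human raters (from the paper)

human_control, human_treatment = 12.0, 14.0  # hypothetical human-rated means
ai_control = human_control + BIAS            # what the biased AI grader reports
ai_treatment = human_treatment + BIAS

ai_effect = ai_treatment - ai_control        # effect estimated from AI grades
human_effect = human_treatment - human_control  # effect from human grades
print(f"{ai_effect:.1f} vs {human_effect:.1f}")  # → 2.0 vs 2.0
```

A bias that differed between conditions (for example, inflating treatment documents more than control documents) would not cancel this way, which is why the consistency of the bias across conditions is the load-bearing part of the researchers' argument.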

The study involved 388 employees across six functional areas at Gap Inc., with job levels ranging from individual contributors to directors. For EdTech providers and workplace learning platforms building AI skills programs, it offers a rare piece of peer-reviewed, real-world evidence that cognitive framing, not feature training, may be the variable that most shapes how effectively employees use AI tools.
