Synthetic respondents are rapidly gaining traction and their appeal is undeniable. They promise speed, cost-effectiveness, and massive scalability, freeing researchers from the challenges of uncooperative participants, sensitive questions, and persistent privacy concerns. For organizations under pressure to deliver more, faster, and integrate AI, this buzz is impossible to ignore.
There is a big “but” though. Much of the information surrounding synthetic data comes from sources with a vested interest: companies selling synthetic audiences or those tied to standard data generation. While they offer opinions, and sometimes data, a bias the size of Texas is often present and simply cannot be ignored.
This is precisely why a rigorous new study from the Columbia Digital Twin program is so critical. A team of researchers at Columbia University, spanning the social sciences and led by faculty at the Business School, recognized the urgent need for substantive, unbiased research. They set out to evaluate the new practice of using LLMs to create stand-in audiences, employing rigorous methods that meet the exacting standards of elite academic institutions. Having met some of the researchers involved, I am particularly impressed with their credentials, their reputation, and the scope of this undertaking.
The results of this research cannot be easily dismissed. They reinforce where synthetic audiences can be truly valuable, but also, crucially, where they reach their limits. To help practitioners navigate this evolving landscape, this article will:
- Briefly summarize the key findings of the Columbia Digital Twin studies.
- Identify where digital twins appear to work well and where they struggle.
- Offer guidance on how synthetic data can be used responsibly and effectively in modern market research.
Why This Study Stands Out
The Columbia research program stands out because it evaluates digital twins at a level many real-world applications skip: the individual. These digital twins weren’t trained to represent aggregated similarities but actual people, using each person’s real responses to hundreds of questions spanning demographics, sociological information, personality traits, attitudes, and behavioral tendencies. Instead of relying on stereotypes, the researchers drew on a rich, expansive body of personal information.
Then they posed brand-new questions to both the real individuals and their digital twins and compared the responses. Importantly, the studies were “preregistered,” with evaluation metrics specified in advance, limiting the risk of cherry-picked success stories.
Much of the enthusiasm around synthetic respondents assumes that sufficiently detailed inputs set the stage for models to reason like humans. But did the research bear this out?
What Digital Twins Do Well
The Columbia University researchers found that digital twins excel when it comes to moving quickly. They’re well-suited for rapidly exploring large design spaces, testing alternative framings, and running early “what if?” analyses before investing in fieldwork. When teams are still figuring out what matters, speedy iteration—even with data that’s less than ideal—can quickly narrow a wide-open aperture and provide real value. In other words, sometimes an imperfect filter is better than no filter at all.
Synthetic respondents built on rich information, like these digital twins, capture some broad regularities in human data. Across many tasks, the models preserve general population patterns and rough rank-order relationships. The average correlation between twin responses and human responses at the individual level is low (less than 0.20), but it is not zero, which suggests that the models are not responding entirely arbitrarily. They are reflecting shared structure in the data, even if they fail to reproduce specific judgments reliably.
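To make the individual-versus-aggregate distinction concrete, here is a minimal sketch in Python. The data is invented for illustration and has nothing to do with the Columbia team’s actual datasets or code; it simply shows how per-person correlation can stay low even while population-level patterns line up well:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 200, 10

# Invented data: each survey item has its own population-level mean, and
# each person deviates idiosyncratically from it (7-point scale).
item_means = rng.uniform(2.5, 5.5, size=n_items)
human = np.clip(item_means + rng.normal(0, 1.5, size=(n_people, n_items)), 1, 7)

# A "twin" that tracks population patterns but not individual quirks:
# it reproduces the item means (plus its own noise) and ignores the person.
twin = np.clip(item_means + rng.normal(0, 1.5, size=(n_people, n_items)), 1, 7)

# Individual-level agreement: correlate each person's answers with their twin's.
per_person_r = np.array([np.corrcoef(human[i], twin[i])[0, 1]
                         for i in range(n_people)])
print(f"mean individual-level r: {per_person_r.mean():.2f}")  # low

# Aggregate-level agreement: correlate item means across the whole sample.
agg_r = np.corrcoef(human.mean(axis=0), twin.mean(axis=0))[0, 1]
print(f"aggregate-level r:       {agg_r:.2f}")                # much higher
```

Because the simulated twin only reproduces shared structure, the individual-level correlation is weak while the aggregate correlation is strong, which is the same qualitative gap the studies report.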
Digital twins also do better when questions stay close to well-represented knowledge. Familiar constructs, standard framings, and topics that appear frequently in training data yield more stable and coherent outputs. This can be useful for surfacing plausible reactions, identifying obvious issues, or clarifying assumptions that might otherwise go unexamined. Think of this like a regression equation: if you are only extrapolating a little, you may be fine, but go too far out and you are largely just guessing, as the sketch below illustrates.
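A purely illustrative sketch of that regression intuition (the numbers and the curve are made up, not drawn from the study): a linear model fit on a narrow range predicts well just outside it and badly far outside it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: the true relationship is gently curved, but we fit a
# straight line on a narrow training range (x between 0 and 5).
x_train = rng.uniform(0, 5, 100)
y_train = 2 + 0.8 * x_train + 0.15 * x_train**2 + rng.normal(0, 0.5, 100)
slope, intercept = np.polyfit(x_train, y_train, 1)

def true_y(x):
    return 2 + 0.8 * x + 0.15 * x**2

# Predictions just past the data hold up; far past it, they fall apart.
for x in [5.5, 8.0, 15.0]:
    pred = intercept + slope * x
    print(f"x={x:5.1f}  predicted={pred:6.2f}  true={true_y(x):6.2f}")
```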
Seen this way, digital twins function less like synthetic respondents and more like structured hypothesis generators. They help researchers think faster, not necessarily think better. They also serve as a way to summarize or test knowledge close to the actual data underlying the synthetic audience. The closer in, the better.
Where the Limitations Start to Matter
At the same time, the studies highlight important, systemic limitations. The most consequential: digital twins do not reliably reproduce how individuals respond to specific questions. In many cases, they perform no better than much simpler demographic personas, calling into question the assumption that “more data about a person” naturally leads to accurate individual simulation. People are not easily reduced to a set of dimensions, however complex.
The digital twins also smooth away inconsistency, contradiction, and noise. The resulting data looks cleaner, but that cleanliness comes at the cost of real human inconsistency, sensitivity to context, and contradiction. Synthetic respondents are more coherent and more rational than the humans they are meant to represent.
The distortions are not random. In a companion paper, the researchers describe digital twins as “funhouse mirrors” that systematically reshape human behavior. They document recurring patterns such as stereotyping over individuation, social desirability bias, ideological smoothing, and hyper-rational responses. These biases point in specific directions, which means they can mislead rather than merely add noise. For example, if someone is classified as high-income, urban, and college-educated, the twin leans heavily on what “people like that” say. The unique combination of attributes is flattened as the model errs toward stereotype.
Novelty Is the Hard Case
Another subtler yet important implication of the research: these models aren’t actually discovering anything new. They excel at preserving rank order, but that’s not what you want when it comes to surfacing new ideas. If everything is coming out of the models in the same order it went in, they’re not being creative. They’re just restating existing knowledge.
New ideas, by definition, lack strong anchors in existing data. When digital twins are asked to evaluate unfamiliar concepts, they tend to regress toward familiar patterns, producing conservative or muted responses. As a result, the ideas most in need of exploration are often the ones least well served by synthetic respondents. Now we’re looking not just at issues with accuracy, but at the de-prioritization of novelty.
Because of this, synthetic respondents are better at interpolating within a known space than at extrapolating far into an unknown one. They explore the known efficiently but are poorly suited to identifying what might break from it.
But here’s the deeper tension. Research is primarily about discovery, not confirmation. We seek unexpected insights and counterintuitive results, the things that make us pause and rethink our assumptions. Surprise is the very signal that we are learning something real.
And yet, if a synthetic audience produces a result that is dramatically different from our expectation, we would rightly question it. We would assume the model was hallucinating, or mis-calibrated, or extrapolating a bit too far. That instinct is reasonable because the models are designed to give us what is expected.
That’s the structural problem that synthetic proponents have to deal with. Synthetic audiences are most trusted when they stay close to what is plausible. They are least trusted when they deviate meaningfully. Since the architecture of large language models encourages reverting to known patterns, the system itself resists surprise.
So How Should Digital Twins Be Used?
Synthetic respondents are strongest where confirmation is sufficient and weakest where discovery matters most. That does not make them useless; it simply clarifies their role: not as a substitute for authentic audiences, but as a precursor to them.
The best role for synthetic data is as a pretest and comparison layer within the research process. Used thoughtfully, this layer helps us stress-test survey designs, identify problematic questions, and surface unspoken assumptions before going to field. We can then directly compare synthetic and real respondent data, identifying similarities and differences without sacrificing discovery.
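As one hedged sketch of what that comparison layer could look like (the chi-square test and the data here are my illustrative choices, not anything prescribed by the Columbia studies), a simple item-by-item distribution check can flag where synthetic and fielded responses diverge:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)

# Invented data: real and synthetic answers to the same 5-point item,
# coded 0-4, drawn from slightly different response distributions.
real = rng.choice(5, size=300, p=[0.10, 0.15, 0.20, 0.30, 0.25])
synthetic = rng.choice(5, size=300, p=[0.05, 0.10, 0.30, 0.35, 0.20])

# Tabulate the responses and test whether the two distributions differ.
# A low p-value flags an item where the synthetic pretest diverges from
# the field data and deserves a closer human look.
counts = np.array([np.bincount(real, minlength=5),
                   np.bincount(synthetic, minlength=5)])
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

The point is not the particular test; it is that the synthetic layer produces explicit, checkable predictions that the human data can then confirm or overturn.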
The key is understanding the preliminary role of the synthetic audience. Synthetic data is most valuable when it generates comparisons or hypotheses and uncovers assumptions rather than drawing conclusions. It tells us what we might see, not what people will actually choose or how strongly they feel.
In this limited role, synthetic respondents can improve efficiency without undermining research validity. They help researchers better understand the authentically human data. Synthetic data is not a replacement; it is a tool for understanding, and used properly it extends that understanding rather than limiting it.
Practical Guardrails for Synthetic Data Usage
A few simple principles go a long way:
- Synthetic data outputs should never be treated as truth.
- Synthetic data should be considered a part of desk research.
- When dealing with massive problems, synthetic audiences might help in narrowing the field.
- When looking at the new or novel, rely on genuine humans.
- For decisions of notable consequence, authentic human data remains the standard.
None of this requires fear or resistance to new tools. It reflects good research hygiene. Used properly, these tools give frontline researchers a baseline of experience that leads to better long-term use.
A Balanced Bottom Line
Synthetic respondents like digital twins are useful and limited tools. They can accelerate thinking, improve early-stage design, and help teams work more efficiently. But they are not people, and they do not reason the way people do.
The results of the Columbia University work do not argue against using digital twins. They argue against pretending digital twins solve problems they do not. The right question is not whether digital twins belong in market research, but what their proper, and currently limited, role should be.
Want to join the conversation? Get in touch with me—robkphd@psbinsights.com