Why Human Evaluation is the Missing Piece in World Model Development

TL;DR
We're releasing OWL Eval, the first open-source evaluation platform built specifically for studying how humans perceive AI-generated videos. After running studies with hundreds of participants, we've learned that human evaluation reveals critical model failures that automated metrics completely miss. Our platform makes it dead simple to run these studies at scale.
OWL Eval is easy to get up and running: a moderately experienced engineer should need at most 30 minutes. You can also try it at https://eval.wayfarerlabs.ai, where the only cost is the Prolific API credits used to pay participants. In the spirit of openness, we will not be monetizing user-guided usage of our evaluation harness.
The Evaluation Problem (That Everyone Ignores)
Here's the uncomfortable truth about world model development: individuals, and often whole teams, are flying blind when it comes to how humans perceive quality, enjoyment, controls, or intended audiences. If you don't know what people will enjoy or intuitively be willing to interact with, how can you create an enjoyable, immersive experience?
You can optimize quantitative, computational scores all day, but these metrics don't tell you what actually matters to consumers. Do the physical elements behave in a believable way? Would a person want to interact with this environment? When your model generates a "beach scene," does it actually *feel* like a beach to a human viewer? What is "plausibly deniable" in this kind of environment, and what breaks immersion or context?
At Wayfarer, we learned this the hard way. Our early models scored great on automated benchmarks but looked completely wrong to anyone who watched them. The temporal consistency was off. Object interactions felt fake. The overall "vibe" was just... not right.
That's when we realized: If you're building models for humans to use, you need humans to evaluate them.
What Humans See That Metrics Miss
After running evaluation studies on thousands of generated videos, we've identified the blind spots in automated evaluation.
One of the most striking discoveries is how instantly humans spot violations of basic physics intuition. A ball that bounces too high, water that flows upward, or shadows that don't match the lighting: these anomalies jump out to human viewers immediately, yet automated metrics often miss them entirely. This extends beyond simple physics to temporal coherence, where the question isn't just "does the object stay consistent frame-to-frame?" but "does the motion feel natural over time?" Humans are incredibly sensitive to unnatural acceleration patterns and motion artifacts, and they hold onto these expectations even in the most fantastical scenarios, where temporal coherence remains crucial for immersion.
Contextual appropriateness presents another dimension where human judgment excels. Humans intuitively know what a cozy living room feels like, and they each bring individual preferences about what "cozy" means to them. They can tell if the lighting, furniture arrangement, and atmosphere match the intended mood or align with their personal expectations. Metrics, by contrast, merely check if objects are present without understanding the gestalt that creates a believable environment.
Perhaps most impressively, humans possess an innate understanding of interaction plausibility—they judge whether character movements, animal behaviors, or environmental responses make sense within the context of the scene. This kind of holistic understanding proves nearly impossible to capture with traditional metrics. Even runtime measurements of effects can't provide the kind of valuable feedback you need before making an experience available to users.
The Five Dimensions That Actually Matter
Through extensive testing, we've found that human video evaluation consistently breaks down into five key dimensions (a minimal schema sketch follows the list):
- Overall Quality: The holistic "does this look and feel real?" assessment. Is this the “vibe” I wanted to immerse myself in?
- Controllability: How well does the output match what was requested? Did I get the experience I wanted?
- Visual & Audio Quality: Frame-by-frame clarity, artifacts, and aesthetic appeal. Are the visuals and audio enjoyable to me?
- Temporal Consistency: Motion smoothness and coherence over time. Is the world behaving how I expect it would?
- Fun/Entertainment!: Is the experience actually fun for the user? Something we hope to evaluate going forward.
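To make these dimensions concrete in an analysis pipeline, it helps to pin them down as an explicit schema. The sketch below is not OWL Eval's actual data model; it's a minimal, hypothetical Python representation of a single pairwise judgment along the five dimensions, with names we made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five evaluation dimensions described above."""
    OVERALL_QUALITY = "overall_quality"
    CONTROLLABILITY = "controllability"
    VISUAL_AUDIO_QUALITY = "visual_audio_quality"
    TEMPORAL_CONSISTENCY = "temporal_consistency"
    FUN_ENTERTAINMENT = "fun_entertainment"


@dataclass
class PairwiseJudgment:
    """One participant's "which is better?" answer for one dimension.

    Hypothetical record format, not the platform's actual schema.
    """
    participant_id: str
    video_a: str          # identifier of the first clip shown
    video_b: str          # identifier of the second clip shown
    dimension: Dimension
    preferred: str        # equal to video_a, video_b, or "tie"


# Example: a rater preferred clip "beach_v2" over "beach_v1" on temporal consistency.
judgment = PairwiseJudgment(
    participant_id="p_001",
    video_a="beach_v1",
    video_b="beach_v2",
    dimension=Dimension.TEMPORAL_CONSISTENCY,
    preferred="beach_v2",
)
```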

Last But Not Least: Fun/Entertainment
As world models evolve, one aspect that should not be overlooked is the tension between making an experience fun and making it challenging enough to keep people engaged and entertained. A key part of this is letting users experience the unexpected. We've seen that when the environment in view breaks and abruptly shifts to another level, it can be entertaining, but it isn't immersive. We hope to create experiences that aren't fun merely because of their flaws, but fun because they hit the sweet spot between the expected and the unexpected. Evaluating this will be important to keep developers honest about what they're creating: is it building the kind of connection that many nostalgic games still hold for us, or will the experience end up another forgettable, throwaway novelty? Who doesn't remember the startup sequence of their favorite console or game? The opening menu of Mario 64, where you could pinch Mario's cheeks and nose, was more fun than 95% of the interactive content online today.
Running Studies That Actually Work
The biggest barrier to human evaluation isn't finding participants—it's designing studies that produce reliable, actionable results.
- Screening is Everything: We learned that 20-30% of participants will give random responses if you let them. Our screening process asks participants to compare a quality experience with an intentionally subpar, incoherent one. If a participant prefers the degraded experience, they're automatically filtered out (a minimal filtering sketch follows this list).
- Side-by-Side Comparisons Work: Instead of asking "rate this video 1-10," we show two videos and ask "which is better?" Humans are much more reliable at relative judgments than at absolute ratings.
- Context Matters: The same video will be rated differently if it's presented as "a beach scene" versus "a winter landscape." Make sure your evaluation matches your intended use case.
- Speed vs Quality: Through empirical studies, we found the sweet spot is 1-1.5 minutes per video evaluation. Longer than that and attention drops. Shorter and people don't have time to notice important details.
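To give a rough sense of how the screening check and the side-by-side design translate into analysis, here is a small, hypothetical Python sketch: it drops participants who preferred the deliberately degraded clip in a screening pair, then computes per-model win rates from the remaining pairwise judgments. The function and field names are ours for illustration, not the platform's API.

```python
from collections import defaultdict

# Each judgment is a tuple: (participant_id, model_a, model_b, preferred_model).
# Screening pairs use the sentinel model name "degraded" (our convention here).

def passed_screening(judgments, degraded="degraded"):
    """Return the participants who never preferred the degraded clip."""
    failed = {p for p, a, b, pref in judgments if pref == degraded}
    return {p for p, a, b, pref in judgments} - failed

def win_rates(judgments, keep_participants, degraded="degraded"):
    """Per-model fraction of non-screening comparisons won, screened raters only."""
    wins, totals = defaultdict(int), defaultdict(int)
    for p, a, b, pref in judgments:
        if p not in keep_participants or degraded in (a, b):
            continue  # skip failed raters and the screening pairs themselves
        totals[a] += 1
        totals[b] += 1
        wins[pref] += 1
    return {m: wins[m] / totals[m] for m in totals}

# Toy usage with made-up data:
data = [
    ("p1", "model_v1", "degraded", "model_v1"),   # p1 passes screening
    ("p2", "model_v1", "degraded", "degraded"),   # p2 fails screening
    ("p1", "model_v1", "model_v2", "model_v2"),
]
keep = passed_screening(data)
print(win_rates(data, keep))  # {'model_v1': 0.0, 'model_v2': 1.0}
```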
What We've Learned About Our Own Models
Running these evaluations on our own models has been humbling and incredibly valuable. Our controllability scores revealed systematic issues we never would have caught with automated metrics: when users asked for specific objects or actions, our models often generated something similar but not exactly what was requested. Controllability is now a major focus for us going forward, and we encourage others to learn from these results. We've also discovered that temporal consistency varies significantly by scene type: our models handle static scenes much better than dynamic ones. Beach waves and forest scenes work great, but anything with complex object interactions is an area we hope to improve in future work.
Perhaps most encouraging is that humans agree more than expected; with proper screening, we see strong inter-rater reliability. When someone says a video looks wrong, others usually agree, which gives us confidence that the feedback reflects real quality differences, not random noise.
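"Strong inter-rater reliability" can be quantified in several ways; one simple proxy is the average pairwise agreement among raters who judged the same comparison. The sketch below is our own illustrative calculation, not a statistic the platform necessarily reports (for a formal analysis you would more likely use Fleiss' kappa or Krippendorff's alpha).

```python
from itertools import combinations

def pairwise_agreement(votes_by_item):
    """Average agreement over all pairs of raters who judged the same comparison.

    votes_by_item maps a comparison id to the list of choices its raters made,
    e.g. {"beach_v1_vs_v2": ["A", "A", "B", "A"]}. Purely illustrative.
    """
    agree = total = 0
    for votes in votes_by_item.values():
        for v1, v2 in combinations(votes, 2):
            agree += (v1 == v2)
            total += 1
    return agree / total if total else float("nan")

# Toy example: 3 of 4 raters agree on one comparison, all 3 agree on another.
print(pairwise_agreement({
    "item_1": ["A", "A", "B", "A"],
    "item_2": ["B", "B", "B"],
}))  # (3 + 3) / (6 + 3) ≈ 0.67
```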
The 2-Hour Study That Changed Everything
Our largest study to date evaluated 39 videos with 10 participants each through our preferred human data partner Prolific.
Total turnaround time: under 2 hours.
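For a rough sense of why that turnaround is possible, here is a back-of-envelope calculation using the numbers above (39 videos, 10 ratings each) and the 1-1.5 minute-per-evaluation guideline from the previous section. The concurrency figure is an assumption we picked for illustration, not a measurement from the study.

```python
videos = 39
ratings_per_video = 10
minutes_per_evaluation = 1.25   # midpoint of the 1-1.5 minute sweet spot

total_evaluations = videos * ratings_per_video                      # 390
total_rater_minutes = total_evaluations * minutes_per_evaluation    # 487.5

# Crowdsourcing pools rate in parallel. Assuming ~50 concurrent participants
# (an illustrative figure, not one reported in the study):
concurrent_raters = 50
wall_clock_rating_minutes = total_rater_minutes / concurrent_raters  # ~10

print(total_evaluations, total_rater_minutes, wall_clock_rating_minutes)
# 390 487.5 9.75 -> the rating itself is quick; recruitment, screening, and
# review account for most of the ~2-hour turnaround.
```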
The speed with which we can complete an evaluation and feed the results back into our development process keeps improving how we think about deploying models. Instead of waiting weeks for feedback, we can now test new model variants and get human evaluation results the same day. It's become part of our regular development cycle, not a special project we do occasionally.
What's Next
We're adding richer feedback mechanisms, including free-form comments so participants can explain why they prefer one video over another; early tests show people are surprisingly articulate about specific issues. Beyond the five core dimensions, we're developing domain-specific evaluation criteria tailored to different content types: game environments, product demos, and educational content. We're also exploring longitudinal studies to understand how human preferences change as people see more AI-generated content and whether standards shift over time. Cross-cultural evaluation is another frontier we're investigating, as results may differ across cultures and demographics, which could be crucial for localization and global deployment.
We also want to hear from the community! What else could we be evaluating to improve our development cycle? Comment or send us a message, and we'd love to include it in future evaluations.
Why This Matters Beyond World Models
Human evaluation isn't just useful for world models—it's essential for any AI system that humans will interact with. Whether you're building recommendation systems, content generators, or decision-support tools, understanding human perception should be part of your evaluation pipeline.
We're open-sourcing OWL Eval because we believe every team building AI for humans should have access to reliable human evaluation. The platform works with any video content, not just world models. If your content isn't video, talk to us! If you need assistance extending this work, we'd be happy to collaborate.
If you're tired of optimizing metrics that don't correlate with what humans actually care about, give human evaluation a try! The insights will both shock and delight you.
We're actively looking for researchers and practitioners who want to push the boundaries of human-AI evaluation. Join our Discord to discuss methodologies, share results, or contribute to the platform.