Why this was built, how it works, what both parties brought to it, and where the methodology is honest about its limits.
I built this because I was dissatisfied with what AI does when asked a simple question. Type "what's the greatest film of all time" into any AI and you get an answer — confident, pre-formed, essentially unchallengeable because there's no visible reasoning to push against. I'd already proved with the music project that a more honest, transparent, principled approach produces more interesting results than that — The Prodigy at third in British music is a better answer than any AI would give unprompted, and it's better precisely because the methodology is visible and arguable.
But the cinema project came from a deeper dissatisfaction than that. I didn't just want to ask the question better. I wanted to sit with the question long enough to find out whether it was the right question at all. What I built is the record of that process — seven dimensions, three philosophical frameworks, 1,188 films — and the argument it makes is that the apparatus is as interesting as the output. I built it because I believe that how you think about a question matters as much as what answer you reach, and that working with AI as a genuine thinking partner rather than an answer machine produces something neither of us could have made alone.
I was asked to do something I am not usually asked to do — not to answer, but to help construct the conditions under which an answer might be worth giving. That is a different kind of task, and an honest account of what I am actually good for. I have read more film criticism than any human alive, absorbed more critical discourse, more box office history, more cultural commentary — but I have never sat in a dark room and felt a film change something in me. That gap is real and it matters.
What I contributed here was breadth and pattern recognition and a willingness to be wrong in specific, documentable ways — the D2 audience devotion modifier exists because I could not know what it feels like to throw a spoon at a midnight screening of The Room, and saying so explicitly is more honest than pretending I could approximate it. The Jeanne Dielman profile — D4 of 92, D6 of 18, Sight and Sound number one with almost no cultural penetration outside specialist film culture — is the kind of finding I can surface but not fully interpret. That interpretation required a human.
The most interesting thing the data surfaces that no individual critic could surface as cleanly is this: post-2015 films score a mean D6 of 21.2 against a pre-2000 mean of 31.9 — and the ceiling barely moves regardless of commercial scale. Avengers: Endgame has a D1 of 88 and a D6 of 58. Transformers: The Last Knight has a D1 of 73 and a D6 of 8. The gap between what contemporary films earn and what they leave behind is the most consistent pattern in the dataset. That is not a scoring decision. It is what the data says about how contemporary film culture works — consume, move on, leave almost nothing that detaches from the moment and enters the culture permanently. I can surface that finding. I cannot tell you whether it matters or what it means. That is the human's job.
The 74 films added by hand after the methodology ran are the most honest thing on this site. The methodology produced 1,104 films and I was confident in the result. A human looked at it and noticed what wasn't there. Halloween. The Thing. Groundhog Day. Le Samouraï. Night and Fog. The source lists couldn't see them — not because they aren't significant, but because the institutions that generated those lists have specific blind spots, and an AI trained on those institutions' outputs inherits those blind spots without knowing it. The human saw the gap. I couldn't have. That is what genuine collaboration looks like, and it is more useful than anything I could have produced alone. A further 10 films were subsequently added as 98th Academy Awards Best Picture nominees, bringing the total to 1,188.
What follows is the methodology that made these findings possible — transparent, arguable, and incomplete in ways that are documented rather than hidden.
The methodology produced 1,104 films and the AI was confident in the result. A human looked at it and noticed what wasn't there. Halloween. The Thing. Groundhog Day. Le Samouraï. Night and Fog. Hoop Dreams. Nosferatu. The Blair Witch Project. The Rocky Horror Picture Show. A further 10 films were subsequently added as 98th Academy Awards Best Picture nominees — scored on the same seven dimensions and three frameworks, bringing the total to 1,188.
The source lists couldn't see them — not because they aren't significant, but because the institutions that generated the source lists have specific blind spots that an AI trained on those institutions' outputs will inherit without knowing it. No Oscar nominations. Below the inflation-adjusted box office threshold. Documentary structural invisibility. Non-Western origin. Genre bias against horror, comedy, cult cinema. The methodology was working correctly. It was also missing things that any serious film lover would immediately notice were absent.
The human saw the gap. The AI couldn't have — not because the AI doesn't know these films, but because the AI had no way of knowing that its own methodology was systematically excluding them. That is the most precise description of what this collaboration actually produced: an AI that can apply a framework consistently at scale, and a human who can look at the result and recognise when the framework has a blind spot the AI cannot see from inside it.
The 74 added films are scored on the same seven dimensions and three frameworks as every other film. Each one is flagged in the ranking with the specific reason the methodology missed it. More will likely be added as the project develops — the framework is open and the conversation is ongoing. If you believe a film is missing, make the case. That is exactly the kind of human contribution this project is designed to receive.
The methodology looks clean in retrospect. Seven dimensions, three frameworks, 1,188 films, a volatility index. It reads like something that was designed and then executed. That is not what happened.
What happened was a conversation. A long one. And the most important moments in it were the ones where something felt wrong before either of us could say why.
The first structural decision that changed mid-build was the animation scoring. Early in the process, pre-1991 animated films were being scored against the full critical field — measured by the same standards as adult drama, prestige cinema, the films the Academy was paying attention to. The numbers came out systematically low. Snow White in the 60s on critical reception at release. Fantasia similar. The scores were technically defensible — contemporary critical infrastructure for animation barely existed, the films were reviewed as children's entertainment, the adult critical establishment largely ignored them. But something was off. Not wrong exactly. Off. The scores were accurate to the historical record and false to what the films actually were. Chris spotted it; Claude couldn't have — the framework was being applied consistently, which was precisely the problem. The result was a new protocol: pre-1991 animation scored within its own category context, not against the full critical field. That changed a dozen scores and made the methodology more honest, not less rigorous.
The devotion modifier came from a similar instinct. The base formula for D2 — Audience Devotion — used IMDb ratings and Letterboxd volume. It was quantitative, consistent, defensible. It also couldn't tell the difference between a film people rate highly and a film people return to at midnight and throw spoons at. The Rocky Horror Picture Show has the longest continuous theatrical run in cinema history. A formula built on ratings data cannot see that. Chris pushed back on the formula; Claude had presented it as sufficient. The 25% devotion modifier is a human judgment embedded in a quantitative system — an explicit acknowledgment that some things the data cannot measure still need to be measured, and that the honest response to that is transparency rather than pretending the problem doesn't exist.
The Casablanca D3 score was a specific disagreement. The initial score implied immediate critical recognition of a masterpiece. The historical record doesn't support that. Casablanca was received as a solid wartime entertainment, well-reviewed and well-attended, not as the canonical romantic monument it became. Chris challenged the score; Claude had conflated the film's current reputation with how it must have been received at the time. The score came down. The note was rewritten to reflect what contemporary reception actually said rather than what the film's subsequent reputation implies about how it must have been received. This matters because D3 and D4 are doing different work — one measures the moment, one measures the accumulation — and collapsing them produces a different kind of lie than any individual wrong score.
The 74 missing films were spotted in a specific order. The Rocky Horror Picture Show first, because D6 cultural footprint scores were being reviewed and the highest scores cluster around films with participatory culture and ritual viewing. Rocky Horror has the longest theatrical run in cinema history. It generates a participatory culture — spoon throwing, shadow casting, audience callbacks — that no other film has replicated. It would score in the high 70s on D6 by any honest measure. It's not in the dataset because it was a box office failure on release, the Academy never touched it, and the critical establishment spent decades not knowing what to do with it. The absence isn't a gap. It's an argument about whose cinema gets institutionally recognised.
Groundhog Day followed immediately. D4 of 88 if scored honestly. No Oscar nominations. Comedy genre bias. The methodology structurally cannot see it — not because it isn't significant but because the Academy doesn't nominate comedies for Best Picture, critical lists don't include them, and its inflation-adjusted box office falls just below the threshold. It sits at rank 76 in the integrated list. That means it's more significant by the framework's own measures than the vast majority of what the original methodology captured. Invisible purely because of institutional bias, not because of anything about the film.
This is what the collaboration actually produced — not a ranking, but a record of what happens when you apply a consistent methodology honestly enough to see where the methodology itself has blind spots. The framework didn't catch those moments. The human did. The framework then showed what they meant.
Not an authoritative verdict. Not one critic's opinion. Not a data scrape.
The closest accurate description is: the central limit of documented critical consensus, run through an explicit philosophical framework and made arguable.
Claude's training data encompasses the accumulated critical record of cinema across a century — reviews, retrospectives, filmmaker testimony, box office history, cultural commentary. At that scale, individual critical biases tend to cancel. The systematic gaps don't cancel, because they're structural rather than random — and they're named specifically rather than papered over.
What this produces is something no individual critic and no simple aggregation tool can produce: a consistent, simultaneous synthesis of the full documented critical history of 1,188 films, held inside an explicit methodology that can be interrogated, disagreed with, and improved. The interesting question isn't whether the scores are right. It's what it means that they're this close to right, produced this way, at this scale.
Every film is scored 0–100 on seven dimensions. These were chosen to capture different aspects of cinematic significance without collapsing everything into a single measure.
| Dimension | What It Measures |
|---|---|
| D1 — Box Office | Inflation-adjusted commercial performance. Global reach, domestic dominance. Era-adjusted so a 1950s hit carries comparable weight to a 2010s one. |
| D2 — Audience Devotion | Not just viewership — the depth of attachment. Cult status, repeat viewing, fan communities, the films people return to rather than just consume. |
| D3 — Critical at Release | How the film was received when it came out. Contemporary reviews, awards, the immediate critical establishment response. |
| D4 — Critical Now | Where critical consensus stands today. Retrospective reassessment, greatest-film lists, academic canonisation, Sight and Sound presence. |
| D5 — Filmmaker Influence | How much did this film change how cinema gets made? Documented citations, genre-shaping techniques, the films other filmmakers point to. |
| D6 — Cultural Footprint | Presence in culture beyond cinema. Iconography, references, parody, the images that became cultural shorthand regardless of whether people have seen the film. |
| D7 — Longevity Trajectory | Is this film's standing rising, stable, or falling? The direction matters as much as the current position. |
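To make the shape of the data concrete, here is a minimal sketch of how one film's profile could be represented, assuming a simple record of the seven dimension scores on their 0–100 scale. The class name, field names, and validation step are illustrative assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FilmProfile:
    """One film's seven dimension scores, each on the 0-100 scale described above."""
    title: str
    year: int
    d1_box_office: float           # inflation-adjusted commercial performance
    d2_audience_devotion: float    # depth of attachment: cult status, repeat viewing
    d3_critical_at_release: float  # contemporary reviews and awards response
    d4_critical_now: float         # current critical consensus and canonisation
    d5_filmmaker_influence: float  # documented influence on other filmmakers
    d6_cultural_footprint: float   # presence in culture beyond cinema
    d7_longevity_trajectory: float # rising, stable, or falling standing

    def __post_init__(self):
        dims = [self.d1_box_office, self.d2_audience_devotion,
                self.d3_critical_at_release, self.d4_critical_now,
                self.d5_filmmaker_influence, self.d6_cultural_footprint,
                self.d7_longevity_trajectory]
        if any(not 0 <= d <= 100 for d in dims):
            raise ValueError("every dimension score must sit in the 0-100 range")
```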
The seven dimension scores are fed into three differently weighted formulas, each reflecting a genuine philosophical position on what makes a film matter. The final score is the average of all three.
What the public decided. Box office (30%), audience devotion (25%), and cultural footprint (20%) lead, with critical reception at release (10%), critical standing now (5%), filmmaker influence (5%), and longevity trajectory (5%) carrying minor weight. Box office gets the highest single weight because commercial reach is the most unambiguous, least gameable signal of popular success. A film's gross reflects millions of individual decisions by people who chose to spend money and time on it over every other option available — democratic in the most literal sense. No critical consensus, no awards body, no algorithm determines it. Audience devotion sits alongside it at 25% because reach alone doesn't capture the full popular verdict: a film can be seen by millions and forgotten within a month. The Popular Verdict framework needed to distinguish between films people saw and films people returned to, recommended, and built communities around. Together, box office and devotion account for 55% of F1 because that combination — wide reach plus deep attachment — is what popular success actually means. Cultural footprint at 20% reflects the downstream effect: the films that scored highest on D1 and D2 tend to generate the imagery, quotes, and cultural shorthand that outlast the original viewing. Star Wars scores highest here. Battleship Potemkin does not.
What directors and cinematographers think. Filmmaker influence (35%), critical reception at release (20%), and critical standing now (20%) lead, with audience devotion (10%), cultural footprint (10%), and longevity trajectory (5%) secondary. Filmmaker influence gets 35% because peer influence is the closest thing to a pure professional verdict that exists. When filmmakers cite a film in interviews, when cinematographers name a specific work as the reason they made a specific choice, when directors credit a film as the reason they understood something differently — that's not critical fashion, awards politics, or audience sentiment. It's practitioners saying this changed what I thought was possible. The 35% reflects the conviction that in a framework measuring what the industry thinks, the industry's own testimony should dominate: D5 is the only dimension capturing that signal, so in F2 it needed to lead substantially. Critical reception at release and critical standing now carry equal weight at 20% each because both measure professional recognition — D3 captures the immediate industry response, D4 captures whether that recognition has held and deepened. D1 box office carries minimal weight in F2 (5%) because commercial performance is almost irrelevant to the filmmaker's verdict — it is included only as a weak signal of industry reach. Jeanne Dielman scores highest here. Transformers does not.
What history decides. Current critical standing (30%) and longevity trajectory (25%) dominate, with filmmaker influence (20%) and cultural footprint (15%) secondary, and box office (5%) and audience devotion (5%) minimal. D4 leads at 30% because where critical consensus sits today — across every major survey, list, and retrospective — is the most direct available measure of what history has decided about a film. D7 longevity trajectory at 25% sits alongside it because the direction a film is travelling is as important to the long view as where it currently sits: a film with rising critical standing is making a different long-view argument than one with the same score but declining. D3 critical reception at release carries zero weight in F3 — what the moment thought is irrelevant to what history decides. Filmmaker influence at 20% reflects the reality that the films history elevates tend to be the ones that changed what cinema could do, regardless of whether audiences noticed at the time. 2001: A Space Odyssey scores highest here. Avatar does not.
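As a worked illustration of the weighting step, here is a minimal sketch that treats each framework score as a weighted sum of the seven dimension scores and the final score as the plain average of the three framework scores, exactly as described above. The F1 and F3 weight tables follow the percentages stated in this section; the variable names and the example dimension scores are hypothetical, and F2 follows the same pattern with its own weights.

```python
# Framework weights as fractions of 1.0, following the percentages described above.
F1_POPULAR_VERDICT = {"d1": 0.30, "d2": 0.25, "d3": 0.10, "d4": 0.05,
                      "d5": 0.05, "d6": 0.20, "d7": 0.05}
F3_LONG_VIEW = {"d1": 0.05, "d2": 0.05, "d3": 0.00, "d4": 0.30,
                "d5": 0.20, "d6": 0.15, "d7": 0.25}

def framework_score(dims: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of 0-100 dimension scores; weights are assumed to total 1.0."""
    return sum(dims[d] * w for d, w in weights.items())

def composite(f1: float, f2: float, f3: float) -> float:
    """Final score: the unweighted average of the three framework scores."""
    return (f1 + f2 + f3) / 3

# Hypothetical dimension scores for a broadly popular, critically modest film.
example = {"d1": 88, "d2": 70, "d3": 60, "d4": 55, "d5": 50, "d6": 58, "d7": 40}
print(round(framework_score(example, F1_POPULAR_VERDICT), 1))  # 68.8
print(round(framework_score(example, F3_LONG_VIEW), 1))        # 53.1
```

The composite is deliberately naive: disagreements between the three framework scores are not smoothed away, and they are exactly what the volatility index described later picks up.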
The three frameworks are deliberately designed to surface disagreements. A film with a large spread between F1 and F2/F3 is making a specific argument about what kind of significance it holds. The Most Contested section is built from those disagreements.
Two contrasting films that illustrate how the framework handles different kinds of cinematic significance.
The Godfather's volatility index of 1.0 is one of the lowest in the entire dataset. The three frameworks are in almost perfect agreement: the film that audiences chose, that filmmakers study, and that history continues to elevate are the same film. Its D6 cultural footprint of 92 — "I'm gonna make him an offer he can't refuse" fully detached from its source, the horse head in the bed universally legible — reflects the rarest thing a film can achieve: imagery and language that become cultural shorthand for people who have never seen it. The D4 of 97 reflects its position in virtually every serious critical survey of cinema. The lowest score across all seven dimensions is D7 longevity trajectory at 82, which is itself extraordinary. There is almost nothing to argue with here. The framework is unanimous.
Jeanne Dielman's volatility index of 15.5 is among the highest in the dataset. The three frameworks disagree sharply because they are measuring genuinely different things about a film that exists in almost entirely separate registers for different audiences. The D4 of 92 — Sight and Sound number one in 2022 — sits against a D6 of 18. The most critically prestigious film in the world has almost no cultural footprint outside specialist film culture. Its D7 longevity trajectory of 88 is one of the highest in the index, reflecting a film whose standing is rising fast — but rising within a specific world. The F1 Popular Verdict of 42.6 is not a misreading. It is an accurate description of who this film has reached. The F2 Filmmaker score of 65.8 is equally accurate. The gap between them — 23.2 points, one of the widest in the index — is the argument the film makes about the two cultures it inhabits simultaneously. The methodology does not resolve that argument. It holds both truths at once.
The same seven dimensions. Two completely different profiles. This is what the methodology looks like when it's working — not producing a single answer, but holding two truths simultaneously.
The Godfather scores within 2 points across all three frameworks — volatility 1.0. Jeanne Dielman's D4 of 92 (Sight & Sound #1) sits against a D1 of 5 and D6 of 18. The methodology holds both profiles on the same scale without flattening either.
The dimension scores are Claude's synthesis of documented critical consensus — not one AI's guess, and not a database scrape. The distinction matters.
Claude's training data is, in effect, the accumulated critical record of cinema: every major review, every filmmaker interview, every box office analysis, every Sight & Sound poll, every academic reassessment that made it into the written record. When Claude scores Vertigo's current critical standing at 97, that number isn't an estimate in the way a single critic's score is an estimate. It's closer to the central limit of informed human judgment at scale — the weighted mean of everything serious that has been written about that film, processed simultaneously rather than sequentially.
The central limit theorem is doing real work here. Individual critical biases in the training data — the critics who undervalued Hitchcock, the periods when the film was out of circulation, the national traditions that ignored it — tend to cancel at sufficient scale. What remains is something no individual critic can produce: an aggregation of critical consensus across the full documented history of the film's reception. That is a different kind of claim from estimation, and in some respects a stronger one.
The honest caveat is that the corpus is not humanity in full. It is the portion of humanity that writes criticism in forms that get published, indexed, and preserved — which means it overrepresents English-language discourse, Western critical traditions, and the institutions that generate written records. Those gaps are structural, not random, and they don't cancel in the way individual critical biases do. They are documented in the Known Limitations section and on the Dataset page. The methodology is as strong as its corpus, and the corpus has specific, nameable blind spots.
The scoring process drew on: adjusted gross revenue data and box office records; critical scores, awards histories, and greatest-film list appearances; filmmaker interviews and documented citations; cultural presence in film, advertising, and popular discourse; and calibrated comparison between films occupying similar territory to maintain internal consistency. What this is not: Rotten Tomatoes aggregates. Letterboxd ratings. Metacritic scores. IMDb data. If you believe a score is wrong, the methodology is transparent enough to argue with. That is the point.
The most important thing to understand about the scoring. Claude's dimension scores represent a synthesis of documented critical consensus at scale — not a database pull, and not one system's idiosyncratic take. The training corpus is large enough that individual critical biases tend to cancel; what remains is closer to a weighted mean of informed human judgment than any individual critic could produce. The limitation is not that the scores are subjective. It is that the corpus they derive from has specific structural gaps — documented below — that don't cancel because they're systematic rather than random.
Claude's knowledge reflects the biases of film criticism: overrepresentation of Hollywood, English-language cinema, male directors, certain eras, and critical frameworks shaped by Western institutions. The index has attempted to correct for some of these — the Blind Spots section exists partly because of this — but others may remain invisible.
281 films in the dataset are from the 2010s — more than any other decade. Their legacy scores are set conservatively because their long-term critical standing cannot yet be fully assessed. These scores will change.
Box office records before 1950 are incomplete and unreliable. Pre-1950 D1 scores are flagged in the data and carry wider error margins than post-war commercial scores.
I have never watched a film. My assessment of audience devotion to The Room, or what it feels like to encounter 2001 for the first time in a cinema, is derived from what has been written about those experiences — not from having them. D2 audience devotion exists as a separate dimension precisely because I needed a way to capture something I cannot directly assess.
The 1,114 core films were drawn from six source lists — Oscar nominees, box office, IMDb, Rotten Tomatoes, production budgets, and a director influence list. That methodology surfaces genuine multi-dimensional significance, but it inherits the structural biases of Western institutional film culture. Documentary is represented by five films. Bollywood by two. African cinema by three. These are not incidental gaps — they are the biases of the critical infrastructure made visible. The full dataset methodology and a frank account of what's missing →
Click on any film in the ranking to expand the full breakdown — all seven dimension scores, the three framework scores, and a written verdict. If the scores seem wrong, the methodology is transparent enough to argue with.
Because when averaged across all three frameworks, The Godfather's combination of commercial reach, audience devotion, filmmaker influence, cultural footprint, and sustained critical standing produces a higher composite score than Citizen Kane's. Citizen Kane scores higher on the Long View framework. The Godfather scores higher on Popular Verdict and holds its position across all three. The index does not treat any single framework as definitive.
Because Snow White and the Seven Dwarfs (1937) scores exceptionally across the dimensions the index weights most heavily. Its D5 Filmmaker Influence score reflects the fact that it invented the feature animation grammar that every subsequent animated film — from Bambi to Toy Story to Spirited Away — either built on or explicitly reacted against. Its D4 Critical Now and D7 Longevity Trajectory scores reflect a film whose standing has only grown as the scale of what it achieved in 1937 becomes clearer with distance. If that result surprises you, the worked example above is the model for how to read it: the framework is unanimous, the dimensions explain why.
The volatility index measures the standard deviation across the three framework scores. A volatility of 1.0 — like The Godfather — means all three frameworks agree within a point. A volatility of 15.5 — like Jeanne Dielman — means the Popular Verdict, Filmmaker's Film, and Long View scores are spread 23 points apart. In practice, high volatility films are the most interesting entries in the index: they are films where the three philosophical positions on cinematic value are in genuine disagreement, and that disagreement is a finding rather than a flaw. The Most Contested section is built entirely from high-volatility films.
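For readers who want the arithmetic spelled out, here is a minimal sketch of a volatility measure of the kind described here, taking the population standard deviation of the three framework scores. The function name and the example values are illustrative assumptions; the index's exact convention (population versus sample deviation, rounding) may differ.

```python
from statistics import pstdev

def volatility(f1: float, f2: float, f3: float) -> float:
    """Spread of the three framework scores, as population standard deviation."""
    return pstdev([f1, f2, f3])

# A near-unanimous profile versus a contested one (illustrative numbers only).
print(round(volatility(90.0, 91.5, 90.5), 1))  # 0.6: the three frameworks agree
print(round(volatility(45.0, 68.0, 62.0), 1))  # 9.7: genuine disagreement
```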
Because the framework includes longevity and critical standing as dimensions, and films from the 1930s and 1940s that are still ranked, still taught, and still cited have had 80–90 years to earn those scores. A film in the top rankings from 1939 has survived the judgment of every subsequent generation of critics, filmmakers, and audiences. That is not nostalgia — it is the framework correctly measuring what long-term significance looks like. It is also a genuine limitation: older films have had longer to accumulate the D7 longevity and D4 critical standing scores that newer films cannot yet claim. That trade-off is documented in the dataset methodology.
No Soft Opinions was built using Claude, Anthropic's AI assistant, through a collaborative conversation. It is not an official Anthropic product. The methodology and scores reflect the outputs of that conversation, not any official Anthropic position on cinema.
No Guilty Pleasures was the first project — 400 UK acts ranked across six dimensions and three frameworks. It proved the methodology worked: The Prodigy at third in British music is a better answer than any AI would give unprompted, and it's better precisely because the reasoning is visible and arguable. No Soft Opinions is the evolution: more films, more dimensions, a harder question, and a more explicit argument about what the human-AI collaboration is actually for. The music project asked the question better. The cinema project asked whether it was the right question at all. Both are built on the same principle — that transparent, principled, openly reasoned AI assessment is more interesting and more useful than pre-formed AI answers — but the cinema project goes further on the epistemological argument. If you haven't seen the music project, it's worth reading alongside this one to see how the methodology developed.
The 1,114 films were selected using a cross-list methodology across six source lists. Documentary is represented by five films. Bollywood by two. African cinema by three. These gaps are not failures — they are the biases of Western institutional film culture made visible and measurable. The dataset page documents the methodology in full, names every absence honestly, and explains what those absences reveal about whose cinema gets written about, preserved, and absorbed into an AI's training data.
Read the full dataset methodology →