Methodology -- How We Test and Benchmark Web Performance
Every score, ranking, and recommendation on WebVitals.tools traces back to a documented, reproducible process. This page explains exactly how we collect lab data, source field measurements, construct framework benchmark matrices, apply statistics, and schedule updates. If you ever wonder why a number looks different from what you see in your own tooling, the answer is here.
Why methodology matters
Web performance data is notoriously easy to misrepresent. A single Lighthouse run on a fast laptop with a warm cache can show an LCP of 0.9 s for a page that real users experience at 4.2 s on a mid-range Android device. Synthetic lab scores diverge from field data when throttling is absent, when the test machine is on a direct fibre connection, or when the test skips the cold-cache first visit that most new visitors experience. Without a consistent, published methodology, benchmarks cannot be compared across sites, reproduced by other teams, or trusted as the basis for engineering decisions.
Our goal is to serve developers who are making real trade-offs: choosing a framework, justifying a refactor, or deciding whether a third-party script is worth its performance cost. Those decisions deserve data that is as close to what real users experience as a controlled environment can produce. That means always combining lab tools with field data, always running multiple iterations, and always being explicit about what the numbers cannot tell you.
This methodology applies to everything published on WebVitals.tools: fix guides, framework comparisons, benchmark matrices, and blog posts. For details on who reviews content before it goes live, see our editorial standards.
Lab tools we use
Lab tools run in a controlled, reproducible environment. They give us consistent numbers across reruns and make it possible to isolate the effect of a single change. We use three primary lab tools, each chosen because it measures a different layer of the performance stack.
Lighthouse
Lighthouse is our first-pass tool for every benchmark run. We run it in headless Chrome via the Node CLI rather than the DevTools panel to avoid interference from user extensions and to enable scripted automation. Each Lighthouse run uses the simulated throttling preset that approximates a Moto G4 on a slow 4G connection (round-trip latency 150 ms, 1.6 Mbps download). We collect the six lab metrics -- FCP, LCP, TBT, CLS, Speed Index, and TTI -- along with all Opportunities and Diagnostics. Because Lighthouse's simulated throttling introduces run-to-run variance, we never report a single-run score; we take the median across five sequential runs, clearing the cache between runs so each one measures a cold-cache first visit. For a step-by-step guide to interpreting Lighthouse output, see our Lighthouse audit walkthrough tutorial.
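A minimal sketch of that run loop, using the Lighthouse Node API and chrome-launcher; the script structure and URL are illustrative rather than our internal tooling, and the throttling values mirror the preset described above:

```js
// Sketch: five headless Lighthouse runs with simulated Slow 4G throttling,
// reporting the median LCP. Requires `npm i lighthouse chrome-launcher`.
import lighthouse from 'lighthouse';
import * as chromeLauncher from 'chrome-launcher';

const URL_UNDER_TEST = 'https://example.com/'; // placeholder
const RUNS = 5;

async function runOnce() {
  // A fresh headless Chrome profile per run keeps every run cold-cache.
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
  const result = await lighthouse(URL_UNDER_TEST, {
    port: chrome.port,
    onlyCategories: ['performance'],
    throttlingMethod: 'simulate',
    throttling: { rttMs: 150, throughputKbps: 1638.4, cpuSlowdownMultiplier: 4 },
  });
  await chrome.kill();
  return result.lhr.audits['largest-contentful-paint'].numericValue; // ms
}

const lcps = [];
for (let i = 0; i < RUNS; i++) lcps.push(await runOnce());
lcps.sort((a, b) => a - b);
console.log('median LCP (ms):', lcps[Math.floor(RUNS / 2)]);
```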
WebPageTest
WebPageTest provides a real browser on real hardware at a remote test agent, which eliminates the simulation gap that Lighthouse's CPU/network throttling can introduce. We use the public WebPageTest API with the Dulles, Virginia agent (EC2 instance) for North American baselines and a London agent for European comparisons. Tests are configured for Chrome on a Moto G4 emulation profile with 3G Slow throttling (1.6 Mbps down, 768 Kbps up, 300 ms RTT) to approximate the slower devices and connections represented in Chrome UX Report field data. We run five First View tests in sequence and record the filmstrip, waterfall, and individual metric readings. WebPageTest's per-request waterfall is particularly useful when diagnosing render-blocking resources, third-party chain delays, and font loading patterns. Our full setup guide is in the WebPageTest tutorial.
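A sketch of the API call we script for these runs; the location and connectivity strings are assumptions to be checked against the current WebPageTest agent list, device-emulation parameters are omitted, and the API key is read from the environment:

```js
// Sketch: submit a 5-run, First View-only WebPageTest test via the public API.
// Location and connectivity names are assumptions -- verify before use.
const params = new URLSearchParams({
  url: 'https://example.com/',   // page under test (placeholder)
  k: process.env.WPT_API_KEY,    // WebPageTest API key
  location: 'Dulles:Chrome',     // Dulles, Virginia agent
  connectivity: '3GSlow',        // throttling profile name (assumed)
  runs: '5',                     // five sequential tests
  fvonly: '1',                   // First View only
  f: 'json',                     // JSON response
});

const res = await fetch(`https://www.webpagetest.org/runtest.php?${params}`);
const { data } = await res.json();
console.log('test submitted, poll:', data.jsonUrl);
```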
Chrome DevTools Performance panel
Chrome DevTools Performance recordings serve as our deep-trace layer when we need to attribute a score to a specific cause: a long task on the main thread, a layout-shift event tied to a late-loading image, or an interaction delay caused by a hydration burst. We record the Performance panel with CPU 4x slowdown and the Slow 3G network preset enabled simultaneously. This combination approximates the conditions of a user on a low-end device with a congested mobile connection. Every trace we reference in fix guides and blog posts is captured this way, ensuring the flame chart and INP attribution data reflect realistic bottleneck sizes rather than the sub-millisecond timings visible on a high-end development machine. Traces are exported as JSON so the exact run can be re-imported by any contributor to verify findings.
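The panel recordings themselves are captured by hand, but an equivalent throttled trace can be scripted with Puppeteer and the Chrome DevTools Protocol. This is a sketch of that automated equivalent, not our documented workflow, and the network numbers only approximate the Slow 3G preset:

```js
// Sketch: capture a trace with 4x CPU slowdown and Slow-3G-like network
// conditions via Puppeteer + CDP. Throttling values are approximations.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
const cdp = await page.createCDPSession();

await cdp.send('Emulation.setCPUThrottlingRate', { rate: 4 });
await cdp.send('Network.enable');
await cdp.send('Network.emulateNetworkConditions', {
  offline: false,
  latency: 400,                          // ms (approximate)
  downloadThroughput: (400 * 1024) / 8,  // ~400 Kbps, expressed in bytes/sec
  uploadThroughput: (400 * 1024) / 8,
});

await page.tracing.start({ path: 'trace.json', screenshots: true });
await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
await page.tracing.stop(); // trace.json re-imports into the Performance panel
await browser.close();
```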
Field tools we use
Field data captures how real users -- on their own devices, networks, and browser versions -- actually experience a page. Lab tools tell you what could happen under a fixed set of conditions; field tools tell you what is happening across the entire distribution of conditions your users encounter. We draw on three field sources:
CrUX dataset
The Chrome UX Report (CrUX) aggregates anonymized, opt-in Core Web Vitals measurements from real Chrome users across the web. Google publishes CrUX data monthly via BigQuery and as daily rolling 28-day aggregates via the CrUX API. We use both: the monthly BigQuery snapshots for long-term trend analysis and the daily API for near-real-time origin lookups in our CWV Checker tool. CrUX segments data by effective connection type (for example 4G, 3G, offline) and form factor (desktop, phone, tablet), which allows us to report p75 values for the phone segment separately -- the segment most relevant to Google Search's page experience ranking signal.
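This is the kind of lookup our CWV Checker performs; a sketch against the documented queryRecord endpoint, with a placeholder origin and an API key read from the environment:

```js
// Sketch: fetch rolling 28-day phone-segment field data for an origin
// from the CrUX API and read the p75 LCP.
const res = await fetch(
  `https://chromeuxreport.googleapis.com/v1/records:queryRecord?key=${process.env.CRUX_API_KEY}`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      origin: 'https://example.com',
      formFactor: 'PHONE',
      metrics: ['largest_contentful_paint', 'cumulative_layout_shift', 'interaction_to_next_paint'],
    }),
  },
);
const { record } = await res.json();
console.log('phone p75 LCP (ms):', record.metrics.largest_contentful_paint.percentiles.p75);
```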
web-vitals JavaScript library
For pages we control directly, we instrument collection with the official web-vitals JavaScript library. The library exposes callbacks for LCP, CLS, INP, FCP, and TTFB using the same PerformanceObserver APIs that Chrome uses internally to compute CrUX values, which means our library measurements and CrUX readings are directly comparable. Metrics are flushed with navigator.sendBeacon on page hide to capture the final LCP candidate and the final CLS score after all layout shifts have settled. The full instrumentation approach is documented in our RUM setup tutorial.
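A condensed version of that instrumentation (the beacon endpoint is a placeholder):

```js
// Sketch: collect final metric values with the web-vitals library and flush
// them with sendBeacon when the page is hidden. '/rum' is a placeholder.
import { onCLS, onFCP, onINP, onLCP, onTTFB } from 'web-vitals';

const queue = new Set();
const addToQueue = (metric) => queue.add(metric);

onCLS(addToQueue);
onFCP(addToQueue);
onINP(addToQueue);
onLCP(addToQueue);
onTTFB(addToQueue);

// Flush on page hide so the final LCP candidate and the settled CLS score
// are included.
addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden' && queue.size > 0) {
    navigator.sendBeacon('/rum', JSON.stringify([...queue]));
    queue.clear();
  }
});
```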
Custom RUM pipeline
Our custom RUM pipeline aggregates web-vitals library readings into a time-series store, segmented by URL path, device class, and connection type. We compute p75 values per segment rather than averages because the 75th percentile is where Google evaluates the Core Web Vitals thresholds: a site passes the LCP threshold if its p75 is at or below 2.5 s. P75 is a more meaningful production target than the mean because the mean is skewed by extreme outliers at both ends of the distribution, whereas hitting a p75 target requires three-quarters of visits to be at or below it, forcing you to account for your slowest quartile of users -- exactly the population most likely to abandon the page.
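A simplified sketch of that aggregation step; the record field names (path, deviceClass, connectionType, lcp) are hypothetical stand-ins for our beacon payload:

```js
// Sketch: group beacon records into segments and compute per-segment p75 LCP.
// Field names are hypothetical; nearest-rank percentile for simplicity.
function p75(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil(0.75 * sorted.length) - 1];
}

function aggregate(records) {
  const segments = new Map();
  for (const r of records) {
    const key = `${r.path}|${r.deviceClass}|${r.connectionType}`;
    if (!segments.has(key)) segments.set(key, []);
    segments.get(key).push(r.lcp);
  }
  return [...segments].map(([segment, lcps]) => ({ segment, p75Lcp: p75(lcps) }));
}

console.log(aggregate([
  { path: '/', deviceClass: 'phone', connectionType: '4G', lcp: 2450 },
  { path: '/', deviceClass: 'phone', connectionType: '4G', lcp: 1980 },
]));
```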
Framework benchmark matrix
Our framework matrix measures the out-of-the-box Core Web Vitals of a canonical starter application for each framework under identical, controlled conditions. The goal is to show what a developer gets before any performance optimization, so that teams can understand the baseline they are starting from when they pick a framework for a new project.
Each starter app is scaffolded using the framework's official CLI or recommended template with no modifications beyond what is necessary to deploy to a static host. The app renders a representative page: a heading, three paragraphs of body text, one above-the-fold image served from the same origin, and a navigation bar. This template is deliberately minimal so that the measured scores reflect framework overhead -- hydration cost, runtime bundle size, CSS injection strategy -- rather than application-level choices.
Tests run against the deployed production build (not a dev server) using WebPageTest with the 3G Slow throttling profile (1.6 Mbps down, 768 Kbps up, 300 ms RTT) to match real-world slow-device conditions. We run 25 First View tests per framework per snapshot and report the p75 for each Core Web Vitals metric across those 25 runs. Using 25 runs rather than the more common 3-5 runs significantly reduces the effect of network jitter and agent-side variance on the reported p75. Frameworks included in the current matrix: Next.js, Remix, Astro, SvelteKit, Nuxt, Gatsby, Create React App, and Angular. For the specific p75 numbers and trend charts from the most recent benchmark run, see the performance tools and resources page.
Statistical handling
Raw performance data from any automated testing tool contains outliers caused by test-agent resource contention, temporary network congestion, and browser warm-up effects. Reporting a single run or a naive mean without outlier treatment will overstate or understate performance by a margin that matters when thresholds are tight. Our statistical pipeline follows these steps:
- Collect N runs. For Lighthouse we use N=5; for WebPageTest framework benchmarks we use N=25. The larger sample for framework benchmarks is necessary because we are comparing systems that may differ by only tens of milliseconds at p75.
- Discard outliers. We apply the interquartile range (IQR) method: any observation beyond 1.5 x IQR above Q3 or below Q1 is flagged as an outlier and excluded from the reported distribution. In practice, this removes 1-3 runs per 25 in most benchmark cycles, usually corresponding to runs where the test agent was under load from another process.
- Report median and p75. We publish both. The median (p50) describes central tendency; the p75 is the value assessed against the Core Web Vitals thresholds. For CrUX-derived comparisons we report only p75 because the CrUX rating (good / needs improvement / poor) is evaluated at p75.
- Sample size disclosure. Every benchmark table includes the number of retained runs after outlier removal and the date range of the test run. This allows readers to assess statistical confidence and to flag results that look anomalous.
For the framework matrix, we additionally report the interquartile range alongside the p75 to give a sense of spread. A framework with a p75 of 2.4 s and an IQR of 0.1 s is behaving very consistently; the same p75 with an IQR of 0.6 s suggests high variance that warrants investigation before drawing strong conclusions.
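A sketch of this pipeline applied to a single metric series; the quantile method (nearest-rank here, rather than interpolation) is an implementation detail we do not pin down:

```js
// Sketch: IQR-based outlier removal plus median / p75 / IQR reporting for one
// metric series, e.g. 25 LCP readings in milliseconds.
function quantile(sorted, q) {
  return sorted[Math.max(0, Math.ceil(q * sorted.length) - 1)];
}

function summarize(runs) {
  const sorted = [...runs].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const fence = 1.5 * (q3 - q1); // 1.5 x IQR fence
  const kept = sorted.filter((v) => v >= q1 - fence && v <= q3 + fence);
  return {
    retainedRuns: kept.length,
    median: quantile(kept, 0.5),
    p75: quantile(kept, 0.75),
    iqr: quantile(kept, 0.75) - quantile(kept, 0.25),
  };
}

// One slow outlier (4100 ms) is excluded before the median and p75 are taken.
console.log(summarize([2350, 2380, 2400, 2410, 2425, 2440, 2455, 2470, 2500, 4100]));
```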
CrUX data sourcing
All population-level Web Vitals statistics published on this site -- passing-rate percentages, metric averages across frameworks, and year-over-year trend lines -- are derived from the official Chrome UX Report BigQuery dataset (chrome-ux-report.all.YYYYMM). Google releases a new monthly table on or around the 9th of the following month.
Our process is: (1) run the BigQuery SQL queries against the latest monthly snapshot at the time of publication; (2) export the aggregated results to a versioned JSON file checked into the site repository; (3) render the numbers into the page at build time so there is no client-side API call on page load. This approach means the numbers are stable and auditable -- you can inspect the raw JSON in the repository to verify any figure we publish.
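A sketch of steps (1) and (2) using the BigQuery Node client; the column references follow the public CrUX schema as we understand it and should be verified against the dataset documentation, and the snapshot month, origin, and output path are illustrative:

```js
// Sketch: compute the share of phone traffic with "good" LCP for one origin
// from a monthly CrUX snapshot and write it to a versioned JSON file.
import { BigQuery } from '@google-cloud/bigquery';
import { writeFileSync } from 'node:fs';

const bigquery = new BigQuery();
const query = `
  SELECT
    SUM(IF(bin.start < 2500, bin.density, 0)) / SUM(bin.density) AS good_lcp_share
  FROM \`chrome-ux-report.all.202604\`,
    UNNEST(largest_contentful_paint.histogram.bin) AS bin
  WHERE origin = 'https://example.com'
    AND form_factor.name = 'phone'
`;

const [rows] = await bigquery.query({ query });
writeFileSync('data/crux-202604.json', JSON.stringify(rows[0], null, 2));
```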
Snapshot versions are referenced by month throughout the site. The most recent comprehensive data analysis is based on the April 2026 snapshot (released in early May 2026), which covers real-user CrUX data collected throughout April 2026. The full analysis of that snapshot, including passing rates by metric, device class, and industry vertical, is published in our April 2026 Core Web Vitals data blog post.
When a new monthly snapshot is released, we update the CrUX-derived numbers in the framework benchmarks, the fix guides, and the homepage summary cards within two business days. Each updated page receives a new dateModified value in its JSON-LD schema and a corresponding entry in the site changelog.
Editorial cadence
Keeping performance data current is as important as the initial accuracy. Lighthouse, Chrome DevTools, WebPageTest, and the web-vitals library all ship breaking changes at least annually, and CrUX thresholds have historically shifted as the browser population changes. Our editorial process has three layers:
Daily refresh
CrUX API data for the origins tracked in our CWV Checker tool is refreshed daily from Google's CrUX History API, which provides the rolling 28-day aggregates for each origin. The daily refresh runs at 10 AM EDT via a scheduled task and the results are committed to the data store without manual intervention. If the API returns incomplete or malformed data for an origin, the previous day's values are retained and a warning is surfaced in the tool UI.
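A sketch of that refresh for a single origin, using the CrUX History API endpoint and an in-memory stand-in for the data store:

```js
// Sketch: daily CrUX History API refresh for one tracked origin, retaining
// the previous values when the response is unusable. The Map is a stand-in
// for our real time-series store.
const store = new Map();

async function refreshOrigin(origin) {
  const res = await fetch(
    `https://chromeuxreport.googleapis.com/v1/records:queryHistoryRecord?key=${process.env.CRUX_API_KEY}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ origin, formFactor: 'PHONE' }),
    },
  );
  const body = await res.json();

  if (!res.ok || !body.record) {
    console.warn(`CrUX refresh failed for ${origin}; keeping previous values`);
    return store.get(origin); // retain the prior day's data
  }
  store.set(origin, body.record);
  return body.record;
}
```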
Monthly benchmark reruns
The full framework benchmark matrix is re-run each month within one week of a new CrUX snapshot release. Reruns use the same N=25 WebPageTest methodology described above. If a framework ships a major version between snapshot cycles, we run an interim benchmark for that framework only and note the version change in the published table. Monthly reruns are the primary driver of content updates across the fix guides, since a framework update that changes the default output (e.g., a new image optimizer in Next.js, a new streaming strategy in Remix) will move the benchmark numbers and may require new recommendations.
Content review SLA
Any page that contains a metric threshold, a benchmark number, or a tool version reference is flagged for review within 90 days of its last dateModified stamp. The review checks whether the referenced tool has shipped a new major version, whether the cited thresholds are still current, and whether newer CrUX data changes any of the stated conclusions. Pages that pass review without changes still receive an updated dateModified to reflect that they were checked. Pages that require changes are updated and logged in the changelog.
Reproducibility checklist
Any developer should be able to reproduce our published numbers by following these steps. We publish enough configuration detail for every benchmark that this checklist is complete:
- Use the same tool version listed in the benchmark table (Lighthouse, WebPageTest agent, Chrome version).
- Deploy the starter app to the same deployment platform (Vercel edge network) and disable all cache warmup before running the first test.
- Configure WebPageTest with the Dulles, Virginia agent, Moto G4 device emulation, and 3G Slow throttling (1.6 Mbps / 768 Kbps / 300 ms RTT).
- Clear cookies and storage between each run (First View only; repeat view is not measured).
- Run exactly 25 iterations and collect the raw result JSON from the WebPageTest API response.
- Apply IQR-based outlier removal (1.5 x IQR fence) and compute p75 from the retained observations.
- Compare against the CrUX snapshot month listed in the benchmark; do not mix monthly CrUX data with the current benchmark run if they are from different calendar months.
- Record the Chrome version reported by the test agent, since Lighthouse scores can shift by 2-5 points across major Chrome releases due to internal scoring weight changes.
If you run this checklist and your results differ materially from ours, open an issue on our GitHub repository with the raw result JSON attached. We investigate every reproducibility report and either update the published data or document why the numbers legitimately differ (for example, if the framework shipped a patch between our run and yours).
Limitations
No methodology is without constraints. Being transparent about the limits of our data is part of our editorial standards. Readers should keep the following limitations in mind when applying our findings to their own projects:
Lab data does not equal field data
Synthetic benchmarks use a fixed network and device profile. Real users arrive with wildly varying hardware, network conditions, browser extensions, operating system schedulers, and battery states. A framework that scores well in a controlled Moto G4 simulation may perform differently for a user on a three-year-old budget phone with a weak LTE signal. We never draw conclusions about real-world user experience from synthetic data alone. Where CrUX field data is available, it always takes precedence over lab numbers when characterizing actual user impact.
Starter-app benchmarks are not production benchmarks
Our framework matrix measures minimal starter apps, not production applications. A Next.js app with server components, edge caching, and an optimized image pipeline will behave very differently from the default scaffold we benchmark. The matrix tells you about framework defaults and baseline overhead; it does not predict what a mature, well-optimized production site on that framework will achieve. Use the matrix as a starting point for framework selection, then measure your own production application with the same methodology before drawing architecture conclusions.
CrUX coverage is not universal
CrUX data is only available for origins that meet minimum traffic thresholds in Chrome. Low-traffic origins, intranet sites, and sites with predominantly non-Chrome traffic will not appear in CrUX. For those origins, lab tooling is the only available signal, and the limitations above apply with full force.
Metric definitions evolve
Google has changed Core Web Vitals metric definitions and thresholds in the past -- INP replaced FID in March 2024, and LCP element eligibility rules have been refined across multiple Chrome releases. Any historical comparison of CrUX data across years must account for these definition changes, which can produce apparent trends that are artefacts of measurement changes rather than real performance shifts. We annotate benchmark charts with the metric version in effect at the time of each snapshot.
Geographic and demographic coverage
Our WebPageTest agents are located in Dulles, Virginia and London, United Kingdom. Published lab benchmarks reflect North American and Western European network conditions. Users in Southeast Asia, Sub-Saharan Africa, and Latin America often experience significantly higher latencies and operate on lower-bandwidth connections, which would produce materially worse metric readings than our published benchmarks. For a comprehensive view of your own site's geographic distribution, combine CrUX BigQuery data with country-level segmentation, or set up additional WebPageTest agent locations. Our performance tools resource page lists agent location options and regional benchmarking strategies.
Questions about our methodology or disagreements with a specific data point? Open an issue on our GitHub repository or review our editorial standards for more detail on how we handle corrections. We review every methodology question and respond publicly so the discussion benefits other readers.