ci(e2e): gate staging e2e on critical staging-instance config drift #8757
jacekradko wants to merge 2 commits into
The staging e2e "generic" leg was red on ~100% of runs because a few independent failures sat behind an all-or-nothing gate:

- whatsapp-phone-code: the WhatsApp channel is not enabled on the staging instance, so the button never renders and the suite times out every run. It also bypasses the isStagingReady graceful-skip, so skip it explicitly on staging until the channel is provisioned.
- custom-pages "survives a parent rerender": validates an unreleased @clerk/react fix (#8604), but the staging leg installs published @latest, so it is deterministically red until release. Skip when E2E_SDK_SOURCE=latest; PR CI (ref builds) still covers it.
- concurrency was keyed on ref (effectively always "main") with cancel-in-progress, so each new staging deploy cancelled the in-flight run and no commit could report a status. Key on the clerk_go commit instead.
- raise the job timeout above the 25-minute test step so the job cap no longer kills runs mid-suite.
- emit and upload a JSON Playwright report in CI so the report job can classify failures (flaky vs failed, infra vs regression) later.
Force-pushed from fc18bdf to 7b59e11.
validate-staging-instances.mjs already diffs prod vs staging /v1/environment but every exit path returned 0, so detected drift blocked nothing and the job was not a dependency of the test matrix. A drifted staging mirror (e.g. a missing phone_number WhatsApp channel) therefore surfaced only as opaque test timeouts 200 tests deep.

Add a tight CRITICAL_PATHS allowlist (attribute enabled toggles, phone_number.channels, auth factors/strategies, social enable/disable, password settings) and an ACCEPTED_DRIFT escape hatch so known gaps don't block while new drift does. In strict mode the script exits non-zero on a blocking mismatch; fetch failures and cosmetic drift never fail the build.

Wire integration-tests to need validate-instances, and drive strictness from the STAGING_VALIDATE_STRICT repo variable (default report-only). So this is a no-op until the team opts in: it logs blocking drift and the proposed gate without failing anything. Flip the variable to make it enforce.
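For readers who want a concrete picture of the allowlist-plus-escape-hatch split described in this commit, here is a minimal sketch. It is not the code from the PR: the path patterns, the dotted-path format, and the `classifyDrift` helper are all assumptions made for illustration; only the names `CRITICAL_PATHS` and `ACCEPTED_DRIFT` come from the commit message.

```javascript
// Hypothetical patterns for dotted paths into the /v1/environment payload.
// The real CRITICAL_PATHS entries in the PR may look quite different.
const CRITICAL_PATHS = [
  /^user_settings\.attributes\.[^.]+\.enabled$/,
  /^user_settings\.attributes\.phone_number\.channels$/,
  /^user_settings\.social\.[^.]+\.enabled$/,
  /^user_settings\.password_settings\./,
];

// Hypothetical entry: a known, tracked gap that is reported but never blocks.
const ACCEPTED_DRIFT = new Set([
  'user_settings.attributes.phone_number.channels',
]);

// Split the paths that differ between prod and staging into three buckets:
// blocking (critical and not accepted), accepted (critical but tolerated),
// and cosmetic (not on the allowlist at all).
function classifyDrift(mismatchedPaths) {
  const result = { blocking: [], accepted: [], cosmetic: [] };
  for (const path of mismatchedPaths) {
    const critical = CRITICAL_PATHS.some((pattern) => pattern.test(path));
    if (!critical) {
      result.cosmetic.push(path);
    } else if (ACCEPTED_DRIFT.has(path)) {
      result.accepted.push(path);
    } else {
      result.blocking.push(path);
    }
  }
  return result;
}
```

Under these assumptions, only the `blocking` bucket would ever influence the exit code; `accepted` and `cosmetic` entries are logged and ignored.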
Force-pushed from 07c335c to 0eb5396.
Follow-up to #8756. The `validate-staging-instances` script already compares prod vs staging `/v1/environment` and prints a diff, but it always exited 0, so a drifted staging mirror (like the missing WhatsApp channel that makes `whatsapp-phone-code` time out) blocked nothing and stayed invisible until tests failed 200-deep.

This gives the script teeth without flipping any behavior yet. It gains a tight `CRITICAL_PATHS` allowlist (attribute `enabled` toggles, `phone_number.channels`, auth factors, social enable/disable, password policy) plus an `ACCEPTED_DRIFT` escape hatch, so a known and tracked gap doesn't block while new drift does. In strict mode it exits non-zero on a blocking mismatch; fetch failures and cosmetic drift never fail the build.

Strictness is driven by the `STAGING_VALIDATE_STRICT` repo variable and defaults to report-only, and `integration-tests` now depends on `validate-instances`. So nothing changes until someone sets the variable: today it just logs the blocking drift and the gate it would apply. The piece worth a look is the `CRITICAL_PATHS` set: that is the policy of what is worth blocking a run over.

Before enabling strict, run the validator against current staging to confirm the only blocking drift is expected, and add `ACCEPTED_DRIFT` entries for anything intentionally tolerated. Stacked on #8756.
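The exit-code rules described above (report-only by default, non-zero only for blocking drift in strict mode, never for fetch failures) can be sketched roughly as follows. This is an illustration, not the script's actual code: the `isStrict` and `exitCodeFor` helpers, the shape of the outcome object, and the assumption that the repo variable reaches the script as an environment variable with the value `"true"` are all hypothetical.

```javascript
// Assumption: the workflow forwards the STAGING_VALIDATE_STRICT repo variable
// to the script as an environment variable, and "true" means enforce.
function isStrict(env) {
  return env.STAGING_VALIDATE_STRICT === 'true';
}

// Map a validation outcome to a process exit code.
// `outcome.fetchFailed` and `outcome.blocking` are hypothetical field names.
function exitCodeFor(outcome, env) {
  if (outcome.fetchFailed) return 0;           // fetch failures never fail the build
  if (outcome.blocking.length === 0) return 0; // only cosmetic or accepted drift
  return isStrict(env) ? 1 : 0;                // report-only unless strict is set
}
```

A script built this way might end with `process.exitCode = exitCodeFor(outcome, process.env)`, so that with the variable unset every path still exits 0, matching the no-op default described in the PR.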