Fix env image build failures and liveness probe timeouts in share-models notebook CI #4000

Draft
Copilot wants to merge 5 commits into main from copilot/fix-failing-github-actions-job-yet-again

Contributor

Copilot AI commented Jun 5, 2026

CI was failing in two distinct ways: the SKLearnEnv Docker build was breaking due to EOL Python 3.8 and overly restrictive legacy package pins, and the managed online deployment was hitting liveness probe 502s before the MLflow serving container finished loading.

env_train/Dockerfile

  • python:3.8.13 → python:3.10 (Python 3.8 reached EOL in Oct 2024)
  • Dropped packages unused by the training script (matplotlib, psutil, tqdm, ipykernel)
  • Relaxed version pins to ranges that actually resolve today (pandas>=1.3,<3.0, scipy>=1.7,<2.0, numpy>=1.21,<2.0)
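
To see what the relaxed ranges admit, the pins can be checked with a few lines of Python. This is a minimal sketch using only the standard library. `RANGES`, `parse_version`, and `satisfies` are hypothetical names written for illustration, and the comparison only handles plain dotted numeric versions (real resolvers such as pip follow PEP 440, which also covers pre-releases and other cases ignored here).

```python
# Ranges copied from the Dockerfile bullet: lower bound inclusive, upper bound exclusive.
RANGES = {
    "pandas": ("1.3", "3.0"),
    "scipy": ("1.7", "2.0"),
    "numpy": ("1.21", "2.0"),
}


def parse_version(text):
    """Turn a plain dotted version such as '1.26.4' into (1, 26, 4)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(package, version):
    """Return True if `version` falls inside the relaxed range for `package`."""
    lower, upper = RANGES[package]
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


print(satisfies("numpy", "1.26.4"))  # True
print(satisfies("numpy", "2.0.0"))   # False
print(satisfies("pandas", "2.2.2"))  # True
```

The upper bounds keep the build off the next major release of each package while still letting the resolver pick any compatible release for Python 3.10.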

artifacts/model/conda.yaml

  • Removed pip<=22.0.4 (breaks modern dependency resolution) and deprecated azureml-ai-monitoring/azureml-contrib-services
  • Updated to python=3.10.*, mlflow>=2.0, relaxed remaining pins

Notebook deployment cell

  • Replaced the try/except HttpResponseError retry loop with ProbeSettings(initial_delay=300). This gives a slow-starting MLflow container time to load before probing begins, instead of catching the failure after the deployment has already been declared dead:
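from azure.ai.ml.entities import ManagedOnlineDeployment, ProbeSettings

# timeout, period, and initial_delay are in seconds; initial_delay holds off
# the first probe so the MLflow serving container can finish loading.
demo_deployment = ManagedOnlineDeployment(
    ...
    liveness_probe=ProbeSettings(failure_threshold=30, timeout=30, period=10, initial_delay=300),
    readiness_probe=ProbeSettings(failure_threshold=10, timeout=30, period=10, initial_delay=300),
)
ml_client_workspace.online_deployments.begin_create_or_update(demo_deployment).result()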
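
As a rough worked example of what these values buy, the sketch below estimates how long the container may take to start before each probe gives up. It assumes a simplified model that the PR does not state: the first probe fires after `initial_delay`, and the probe is declared failed after `failure_threshold` consecutive failures spaced `period` seconds apart. The `Probe` class is a hypothetical stand-in, not the SDK's `ProbeSettings`, and the per-probe `timeout` is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical stand-in mirroring the ProbeSettings fields used in the PR."""
    failure_threshold: int  # consecutive failures tolerated
    timeout: int            # seconds per probe attempt (not used in the estimate)
    period: int             # seconds between probe attempts
    initial_delay: int      # seconds before the first probe attempt

    def startup_budget(self):
        # Assumed model: initial wait, then one failed attempt per period
        # until the failure threshold is reached.
        return self.initial_delay + self.failure_threshold * self.period


liveness = Probe(failure_threshold=30, timeout=30, period=10, initial_delay=300)
readiness = Probe(failure_threshold=10, timeout=30, period=10, initial_delay=300)

print(liveness.startup_budget())   # 600
print(readiness.startup_budget())  # 400
```

Under that assumption the liveness probe allows roughly ten minutes of startup time, of which the first five pass with no probing at all. Actual behavior may differ, since the timeout here is longer than the period.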

Copilot AI changed the title from "[WIP] Fix failing GitHub Actions job build" to "Harden share-models notebook against transient online deployment probe failures in CI" Jun 5, 2026
Copilot AI requested a review from Chakradhar886 June 5, 2026 06:38
Copilot AI added 2 commits June 5, 2026 06:49
- Update env_train/Dockerfile: Python 3.8.13→3.10, remove EOL packages,
  relax version constraints that blocked builds
- Update artifacts/model/conda.yaml: drop old pip<=22.0.4 constraint,
  azureml-ai-monitoring/contrib packages, update to Python 3.10
- Notebook: swap HttpResponseError try/except retry for ProbeSettings
  (initial_delay=300s) which fixes the root cause of liveness probe
  timeouts instead of masking them
Copilot AI changed the title from "Harden share-models notebook against transient online deployment probe failures in CI" to "Fix env image build failures and liveness probe timeouts in share-models notebook CI" Jun 5, 2026