fix: reduce hosts plugin refresh interval #8620

Open: saewoni wants to merge 2 commits into main from reduce-hosts-plugin-refresh-10s

saewoni (Contributor) commented Jun 2, 2026

Summary

  • reduce the AKS LocalDNS hosts setup timer refresh interval from 15 minutes to 10 seconds
  • tighten timer accuracy from 1 minute to 1 second so the shorter cadence is honored
  • update the timer comments to describe the new refresh behavior

Validation

  • git diff --check
  • systemd-analyze verify parts/linux/cloud-init/artifacts/aks-localdns-hosts-setup.timer
  • docker run --rm -v "/home/sakwa/agentbaker:/src" shellspec-docker --shell bash --format d spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh

Copilot AI (Contributor) left a comment

Pull request overview

This PR changes the systemd timer that periodically runs the AKS LocalDNS hosts setup job, reducing the refresh cadence so /etc/localdns/hosts is updated much more frequently.

Changes:

  • Reduce aks-localdns-hosts-setup.timer refresh interval from 15 minutes to 10 seconds.
  • Tighten timer scheduling accuracy from 1 minute to 1 second.
  • Update inline timer comments to match the new intended behavior.

Comment thread parts/linux/cloud-init/artifacts/aks-localdns-hosts-setup.timer Outdated
saewoni changed the title from "Reduce hosts plugin refresh interval" to "fix: reduce hosts plugin refresh interval" Jun 2, 2026
yewmsft (Member) commented Jun 2, 2026

why 10s? we went from 15min to 10s — that's ~90x more frequent. some concerns before this lands:

  1. load math. on a 1000-node cluster, 1000 nodes × N critical FQDNs ÷ ~10s period = ~100N dig qps against upstream. RandomizedDelaySec=5s only spreads across 5s of a 10s window, so peak smoothing is ~50%. what's the typical critical-FQDN count, and have we measured this against VNet DNS / 168.63.129.16 headroom? agentbaker #7797 (fix: run aptmarkwalinuxagent hold operation on foreground) is the precedent: provisioning-timing changes that affect upstream load need a production metric + canary before fleet rollout.

  2. OnUnitInactiveSec vs OnUnitActiveSec. you also changed the semantics. OnUnitInactiveSec=10s fires 10s after the script completes, not every 10s. If aks-localdns-hosts-setup.sh takes 3–5s (multiple dig calls with timeout 3, can retry across upstream servers), the effective period is ~13–15s. Either is fine, but the inline comment still says "10 seconds after each run completes" — OnUnitActiveSec would actually deliver the "every ~10s" reading more people expect. Pick one and make the comment match the semantics.

  3. AccuracySec=1s. default is 1min, which lets systemd coalesce wake-ups across timers. dropping to 1s defeats that batching — small per-node cost but real at fleet scale and not free on battery-constrained / low-end SKUs. is 1s accuracy actually load-bearing here when the wall-clock target is ~10s and the script itself takes longer than that?

  4. why 10s specifically? the prior comment said "15 minutes balances freshness against unnecessary DNS traffic" — that's gone now. what data drove 10s vs 30s vs 1min? if there was a specific stale-IP incident, link the ICM so the next person reading this .timer knows what 10s is calibrated against and when it can be relaxed again.

  5. e2e / perf signal. plan to canary on a small region and watch (a) upstream DNS qps, (b) node CPU/journald write rate from the per-run script logs, (c) any drift in aks-localdns-hosts-setup.service failures before rolling everywhere?
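
The arithmetic behind points 1 and 2 can be sketched as follows. This is an illustration only: the function names are hypothetical, it assumes one dig query per critical FQDN per run, and it ignores the extra spread that RandomizedDelaySec adds.

```python
# Back-of-the-envelope math for the review concerns above.
# All names are hypothetical; inputs are the figures quoted in the comment.


def frequency_ratio(old_interval_s: float, new_interval_s: float) -> float:
    """How many times more often the timer fires after the change."""
    return old_interval_s / new_interval_s


def fleet_dig_qps(nodes: int, fqdns_per_node: int, period_s: float) -> float:
    """Average fleet-wide queries per second, assuming one dig per FQDN per run."""
    return nodes * fqdns_per_node / period_s


def effective_period_s(inactive_interval_s: float, script_runtime_s: float) -> float:
    """Approximate time between run starts when the timer re-arms only
    after the previous run has finished (OnUnitInactiveSec-style)."""
    return inactive_interval_s + script_runtime_s


if __name__ == "__main__":
    print(frequency_ratio(15 * 60, 10))   # 90.0
    print(fleet_dig_qps(1000, 1, 10))     # 100.0 per critical FQDN
    print(effective_period_s(10, 3))      # 13
    print(effective_period_s(10, 5))      # 15
```

With a script runtime of 3 to 5 seconds, the sketch reproduces the ~13 to 15 second effective period mentioned in point 2.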

yewmsft (Member) commented Jun 2, 2026

  1. the DNS server load calculation is off. the DNS throttling limit applies per VM (Azure DNS allows 1000 qps per VM), so the number of nodes does not matter in this calculation.
  2. do you run the dig calls in parallel or in batches? OnUnitInactiveSec=10s is fine.
  3. 1s is fine, since you already randomize the delay by 5s.
  4. by design
  5. n/a
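
If the throttle really is per VM, as the reply states, the relevant number is each node's own query rate rather than the fleet total. A minimal sketch under that assumption follows; the 1000 qps figure is taken from the reply and not independently verified, and the function names are hypothetical.

```python
# Per-node view of the DNS load, assuming the throttle is applied per VM.
# The limit below is the figure quoted in the reply, not a verified value.
ASSUMED_PER_VM_QPS_LIMIT = 1000.0


def per_node_dig_qps(fqdns_per_node: int, period_s: float) -> float:
    """Average queries per second one node sends, one dig per FQDN per run."""
    return fqdns_per_node / period_s


def limit_utilization(
    fqdns_per_node: int,
    period_s: float,
    limit_qps: float = ASSUMED_PER_VM_QPS_LIMIT,
) -> float:
    """Fraction of the assumed per-VM limit consumed by the refresh job."""
    return per_node_dig_qps(fqdns_per_node, period_s) / limit_qps


if __name__ == "__main__":
    # Hypothetical example: 20 critical FQDNs refreshed on a ~10s period.
    print(per_node_dig_qps(20, 10))   # 2.0
    print(limit_utilization(20, 10))
```

This averages over the whole period. If the script issues all dig calls in parallel, the short burst is closer to the FQDN count itself, which is why question 2 in the reply matters.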

yewmsft (Member) commented Jun 2, 2026

stop re-review

Ubuntu and others added 2 commits June 3, 2026 20:55
Update the AKS LocalDNS hosts setup systemd timer to refresh every 10 seconds instead of every 15 minutes. Tighten timer accuracy to 1 second so the shorter cadence is honored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use OnUnitInactiveSec so the hosts setup service waits 10 seconds after each run completes before scheduling the next run. Add RandomizedDelaySec to de-synchronize nodes and reduce fleet-wide DNS bursts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 3, 2026 20:57
@saewoni saewoni force-pushed the reduce-hosts-plugin-refresh-10s branch from 30b6517 to b3d5d92 Compare June 3, 2026 20:57
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines 5 to +9
  # Run immediately on boot
  OnBootSec=0
- # Refresh every 15 minutes. AKS critical FQDN IPs can change due to load balancer
- # rotation, Traffic Manager failover, or regional DNS updates. 15 minutes balances
- # freshness against unnecessary DNS traffic — stale IPs would cause the hosts plugin
- # to serve unreachable addresses until the next refresh.
- OnUnitActiveSec=15min
+ # Refresh 10 seconds after each run completes. AKS critical FQDN IPs can change
+ # due to load balancer rotation, Traffic Manager failover, or regional DNS updates.
+ # Frequent refreshes keep stale IPs from causing the hosts plugin to serve