fix: reduce hosts plugin refresh interval by saewoni · Pull Request #8620 · Azure/AgentBaker

saewoni · 2026-06-02T00:27:05Z

Summary

reduce the AKS LocalDNS hosts setup timer refresh interval from 15 minutes to 10 seconds
tighten timer accuracy from 1 minute to 1 second so the shorter cadence is honored
update the timer comments to describe the new refresh behavior

Validation

git diff --check
systemd-analyze verify parts/linux/cloud-init/artifacts/aks-localdns-hosts-setup.timer
docker run --rm -v "/home/sakwa/agentbaker:/src" shellspec-docker --shell bash --format d spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh

Copilot

Pull request overview

This PR changes the systemd timer that periodically runs the AKS LocalDNS hosts setup job, reducing the refresh cadence so /etc/localdns/hosts is updated much more frequently.

Changes:

Reduce aks-localdns-hosts-setup.timer refresh interval from 15 minutes to 10 seconds.
Tighten timer scheduling accuracy from 1 minute to 1 second.
Update inline timer comments to match the new intended behavior.

yewmsft · 2026-06-02T04:03:50Z

why 10s? we went from 15min to 10s — that's ~90x more frequent. some concerns before this lands:

1. load math. on a 1000-node cluster, 1000 nodes × N critical FQDNs / ~10s period = ~100N dig qps against upstream. RandomizedDelaySec=5s only spreads across 5s of a 10s window, so peak smoothing is ~50%. what's the typical critical-FQDN count, and have we measured this against VNet DNS / 168.63.129.16 headroom? AgentBaker #7797 ("fix: run aptmarkwalinuxagent hold operation on foreground") is the precedent: provisioning-timing changes that affect upstream load need a production metric + canary before fleet rollout.
2. OnUnitInactiveSec vs OnUnitActiveSec. you also changed the semantics. OnUnitInactiveSec=10s fires 10s after the script completes, not every 10s. If aks-localdns-hosts-setup.sh takes 3–5s (multiple dig calls with timeout 3, can retry across upstream servers), the effective period is ~13–15s. Either is fine, but the inline comment still says "10 seconds after each run completes"; OnUnitActiveSec would actually deliver the "every ~10s" reading more people expect. Pick one and make the comment match the semantics.
3. AccuracySec=1s. default is 1min, which lets systemd coalesce wake-ups across timers. dropping to 1s defeats that batching: small per-node cost but real at fleet scale and not free on battery-constrained / low-end SKUs. is 1s accuracy actually load-bearing here when the wall-clock target is ~10s and the script itself takes longer than that?
4. why 10s specifically? the prior comment said "15 minutes balances freshness against unnecessary DNS traffic", and that's gone now. what data drove 10s vs 30s vs 1min? if there was a specific stale-IP incident, link the ICM so the next person reading this .timer knows what 10s is calibrated against and when it can be relaxed again.
5. e2e / perf signal. plan to canary on a small region and watch (a) upstream DNS qps, (b) node CPU/journald write rate from the per-run script logs, (c) any drift in aks-localdns-hosts-setup.service failures before rolling everywhere?
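
To make the OnUnitInactiveSec vs OnUnitActiveSec point concrete, here is a minimal sketch of the effective period under each directive. It is a simplified model, not systemd's scheduler: it ignores AccuracySec and RandomizedDelaySec, and the function name `effective_period_s` is made up for illustration. The 3s and 5s run times are the estimates quoted in the comment above, not measurements.

```python
def effective_period_s(interval_s: float, run_s: float, directive: str) -> float:
    """Seconds between consecutive service starts in a simplified model.

    OnUnitInactiveSec: the interval is counted from when the run finishes.
    OnUnitActiveSec:   the interval is counted from when the run starts
                       (modelled only for runs shorter than the interval).
    """
    if directive == "OnUnitInactiveSec":
        return run_s + interval_s
    if directive == "OnUnitActiveSec":
        if run_s >= interval_s:
            raise ValueError("model only covers runs shorter than the interval")
        return interval_s
    raise ValueError(f"unknown directive: {directive}")


if __name__ == "__main__":
    for run_s in (3, 5):  # script run times estimated in the review comment
        inactive = effective_period_s(10, run_s, "OnUnitInactiveSec")
        active = effective_period_s(10, run_s, "OnUnitActiveSec")
        print(f"run={run_s}s  OnUnitInactiveSec={inactive}s  OnUnitActiveSec={active}s")
```

With a 10s interval, the model gives 13s and 15s for OnUnitInactiveSec and 10s for OnUnitActiveSec, which matches the ~13–15s figure in point 2.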

yewmsft · 2026-06-02T16:02:07Z

1. dns server load calculation is off. the dns server throttling limit is per VM (azure dns allows 1000 qps per VM), so the number of nodes does not matter in this calculation.
2. do you dig in parallel, or in batches? OnUnitInactiveSec=10s is fine.
3. 1s is fine, since you already randomize the delay by 5s.
4. by design
5. n/a
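
The two ways of counting load in this thread can be put side by side. The sketch below is only arithmetic on numbers quoted above (1000 nodes, ~10s period, 1000 qps per VM as stated in the reply); the FQDN count of 20 is a hypothetical placeholder, since the thread never gives the real count, and the function names are made up for illustration. It computes average rates only; if the script issues all its dig calls back to back, the short-term per-VM burst will be higher than the average.

```python
PER_VM_LIMIT_QPS = 1000  # per-VM Azure DNS limit as stated in the reply above


def fleet_qps(nodes: int, fqdns: int, period_s: float) -> float:
    """Average queries per second summed over every node in the cluster."""
    return nodes * fqdns / period_s


def per_vm_qps(fqdns: int, period_s: float) -> float:
    """Average queries per second issued by a single node."""
    return fqdns / period_s


if __name__ == "__main__":
    hypothetical_fqdns = 20  # placeholder; the real critical-FQDN count is not given
    print("refresh speed-up:", (15 * 60) / 10)  # 15 min vs 10 s
    print("fleet qps:", fleet_qps(1000, hypothetical_fqdns, 10))
    vm = per_vm_qps(hypothetical_fqdns, 10)
    print("per-VM qps:", vm, "fraction of limit:", vm / PER_VM_LIMIT_QPS)
```

If throttling is applied per VM, as the reply says, the per-VM figure is the one to compare against the limit, and the fleet-wide figure matters only where the upstream is a shared resolver.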

yewmsft · 2026-06-02T16:13:36Z

stop re-review

Update the AKS LocalDNS hosts setup systemd timer to refresh every 10 seconds instead of every 15 minutes. Tighten timer accuracy to 1 second so the shorter cadence is honored.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use OnUnitInactiveSec so the hosts setup service waits 10 seconds after each run completes before scheduling the next run. Add RandomizedDelaySec to de-synchronize nodes and reduce fleet-wide DNS bursts.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

 # Run immediately on boot
 OnBootSec=0
-# Refresh every 15 minutes. AKS critical FQDN IPs can change due to load balancer
-# rotation, Traffic Manager failover, or regional DNS updates. 15 minutes balances
-# freshness against unnecessary DNS traffic — stale IPs would cause the hosts plugin
-# to serve unreachable addresses until the next refresh.
-OnUnitActiveSec=15min
+# Refresh 10 seconds after each run completes. AKS critical FQDN IPs can change
+# due to load balancer rotation, Traffic Manager failover, or regional DNS updates.
+# Frequent refreshes keep stale IPs from causing the hosts plugin to serve


Copilot AI review requested due to automatic review settings June 2, 2026 00:27

saewoni requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 2, 2026 00:27

Copilot started reviewing on behalf of saewoni June 2, 2026 00:27 View session

Copilot AI reviewed Jun 2, 2026

Comment thread parts/linux/cloud-init/artifacts/aks-localdns-hosts-setup.timer Outdated

saewoni changed the title ~~Reduce hosts plugin refresh interval~~ fix: reduce hosts plugin refresh interval Jun 2, 2026

awesomenix approved these changes Jun 3, 2026

Ubuntu and others added 2 commits June 3, 2026 20:55

Reduce hosts plugin refresh interval

baf6c08

Copilot AI review requested due to automatic review settings June 3, 2026 20:57

saewoni force-pushed the reduce-hosts-plugin-refresh-10s branch from 30b6517 to b3d5d92 Compare June 3, 2026 20:57

Copilot started reviewing on behalf of saewoni June 3, 2026 20:57 View session

Copilot AI reviewed Jun 3, 2026

saewoni wants to merge 2 commits into main from reduce-hosts-plugin-refresh-10s
