Skip to content

feat(agent): use context.WithCancel to handle graceful shutdowns#613

Open
dajneem23 wants to merge 2 commits intouber:masterfrom
dajneem23:fix/510-graceful-shutdown
Open

feat(agent): use context.WithCancel to handle graceful shutdowns#613
dajneem23 wants to merge 2 commits intouber:masterfrom
dajneem23:fix/510-graceful-shutdown

Conversation

@dajneem23
Copy link
Copy Markdown

What?

Replace the use of log.Fatal throughout the command runner with a context‑based cancellation mechanism, enabling the Run function to implement a proper “stopper” and perform graceful cleanup before exiting.
Fixes #510

Why?

Currently, any fatal error in cmd.Run or its sub‑functions calls log.Fatal, which immediately terminates the process with os.Exit(1). This prevents deferred cleanup (closing files, flushing logs, draining connections) and makes it impossible to implement a graceful shutdown routine.

The proposed pattern uses context.WithCancel to signal that the application should stop. The Run function can:

Pass the cancellable context to all downstream operations.
Cancel the context upon any fatal error (instead of calling log.Fatal).
Listen for context cancellation (<-ctx.Done()) and execute a controlled teardown sequence before allowing the process to exit normally.

How?

  • Create a root context with context.WithCancel at the beginning of the command’s Run method.
  • Propagate this context to all long‑running goroutines, servers, and API calls.
  • Where a fatal error is detected, call cancel() instead of log.Fatal.
  • In the main goroutine, block on <-ctx.Done() (or an errgroup) to wait for all workers to stop.
  • Perform any necessary cleanup (closing resources, stopping servers, etc.) after the context is cancelled.
  • Rely on a deferred cancel() to guarantee the context is released even if the function returns early.
  • This approach does not change the program’s exit code behavior—top‑level error handling still returns 1—but now all deferred cleanup has a chance to run.

Checklist?

  • Replace all log.Fatal calls inside cmd/ with cancel() + error return
  • Add context propagation to sub‑commands and shared utilities
  • Add signal handling (e.g., SIGINT, SIGTERM) that also triggers cancel()
  • Ensure any fire‑and‑forget goroutines respect ctx.Done()
  • Test graceful shutdown on both manual signals and simulated fatal errors

Test coverage

go test -coverprofile=coverage.out ./agent/cmd/...
go tool cover -func=coverage.out

github.com/uber/kraken/agent/cmd/cmd.go:59:     ParseFlags              100.0%
github.com/uber/kraken/agent/cmd/cmd.go:96:     WithConfig              100.0%
github.com/uber/kraken/agent/cmd/cmd.go:101:    WithMetrics             100.0%
github.com/uber/kraken/agent/cmd/cmd.go:106:    WithLogger              100.0%
github.com/uber/kraken/agent/cmd/cmd.go:111:    WithEffect              100.0%
github.com/uber/kraken/agent/cmd/cmd.go:116:    Run                     19.8%
github.com/uber/kraken/agent/cmd/cmd.go:305:    waitForShutdown         100.0%
github.com/uber/kraken/agent/cmd/cmd.go:321:    validateRequiredPorts   100.0%
github.com/uber/kraken/agent/cmd/cmd.go:343:    Chan                    0.0%
github.com/uber/kraken/agent/cmd/cmd.go:347:    Stop                    0.0%
github.com/uber/kraken/agent/cmd/cmd.go:353:    heartbeat               100.0%
total:                                          (statements)            38.9%

Copilot AI review requested due to automatic review settings May 4, 2026 07:26
@dajneem23 dajneem23 changed the title Fix/510 graceful shutdown Fix(Agent): graceful shutdown May 4, 2026
Comment thread agent/cmd/cmd.go Outdated
}()

go func() {
if err := nginx.Run(config.Nginx, map[string]interface{}{
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sambhav-jain-16 Should we add a nginx.RunContext variant (accepting a context.Context) to support graceful shutdown ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes plz

@dajneem23 dajneem23 changed the title Fix(Agent): graceful shutdown Feat(agent): graceful shutdown May 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to make the agent daemon's startup/runtime path context-aware so fatal conditions and OS signals can trigger shutdown through agent/cmd.Run instead of calling log.Fatal from deep inside the command stack. It fits into the agent entrypoint and command runner, which are responsible for wiring up the agent HTTP server, registry, nginx, metrics, and supporting background work.

Changes:

  • Add signal-aware root-context handling in agent/main.go and switch cmd.Run to return an error.
  • Refactor agent/cmd.Run to use context.WithCancel, replace many internal log.Fatal calls with returned errors, and introduce waitForShutdown.
  • Add unit tests around shutdown coordination helpers and the HTTP server shutdown pattern.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
agent/main.go Wraps agent startup in a signal-aware context and handles cmd.Run errors at the process entrypoint.
agent/cmd/cmd.go Refactors the agent runner to use cancelable context flow, background goroutines, and shutdown/error coordination.
agent/cmd/cmd_test.go Adds tests for waitForShutdown behavior and the HTTP server shutdown guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread agent/cmd/cmd.go Outdated
Comment on lines +265 to +268
go func() {
<-ctx.Done()
if err := httpServer.Shutdown(context.Background()); err != nil {
log.Errorf("agent server shutdown: %s", err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Also, wouldn't it be better to use golang.org/x/sync/errgroup to run the goroutines this way you don't need to write a different function to wait for shutdown?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense, but I'm concerned about the registry.ListenAndServe() comes from github.com/docker/distribution and the Registry.server field is unexported with no public Shutdown() method. The only shutdown hook in that package is an internal SIGTERM listener. So errgroup.Wait() would block forever on the registry goroutine during a clean shutdown.

For Nginx, I added nginx.RunContext which sends SIGQUIT on context cancellation, so nginx drains gracefully. httpServer is also covered via the synchronous httpServer.Shutdown() call after waitForShutdown returns.

The registry is the last piece. I'd suggest we track that as a follow-up: either add a Shutdown wrapper to our docker/distribution fork or expose the *http.Server so we can call Shutdown on it externally. Once that's in place, we can replace all of this with a clean errgroup.

Would you be happy if I opened a follow-up issue for the registry shutdown?

Comment thread agent/cmd/cmd.go
Comment on lines +282 to +286
go func() {
if err := nginx.Run(config.Nginx, map[string]interface{}{
"allowed_cidrs": config.AllowedCidrs,
"port": flags.AgentRegistryPort,
"registry_server": nginx.GetServer(
Comment thread agent/cmd/cmd.go Outdated
Comment on lines +273 to +278
go func() {
if err := registry.ListenAndServe(); err != nil {
stopHeartbeat()
log.Fatal(err)
log.Errorf("registry exited: %s", err)
errCh <- err
cancel()
Comment thread agent/main.go Outdated
Comment on lines +36 to +37
if err := ctx.Err(); err != nil && err != context.Canceled {
log.Fatalf("context error: %s", err)
Comment thread agent/cmd/cmd.go
Comment on lines +282 to +286
go func() {
if err := nginx.Run(config.Nginx, map[string]interface{}{
"allowed_cidrs": config.AllowedCidrs,
"port": flags.AgentRegistryPort,
"registry_server": nginx.GetServer(
@dajneem23 dajneem23 changed the title Feat(agent): graceful shutdown feat(agent): use context.WithCancel to handle graceful shutdowns May 4, 2026
@dajneem23
Copy link
Copy Markdown
Author

@sambhav-jain-16 to fix TestRingLocationsDistribution CI, should we increase sampleSize?

Error:      	Max difference between 0.5 and 0.555 allowed is 0.05, but difference was -0.05500000000000005

Copy link
Copy Markdown
Collaborator

@sambhav-jain-16 sambhav-jain-16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start.
Some comments

Comment thread agent/cmd/cmd.go
Comment on lines 257 to +258
go func() {
if err := http.ListenAndServe(addr, agentServer.Handler()); err != nil {
if err := httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should cancel on all errors, right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sambhav-jain-16

From https://pkg.go.dev/net/http#Server.Shutdown

" Shutdown does not attempt to close nor wait for hijacked connections such as WebSockets. [...] When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return."

http.ErrServerClosed is returned by ListenAndServe exactly when Shutdown() or Close() is called on the server. Here's the flow in this code:

  • OS sends SIGTERM/SIGINT → signal.NotifyContext in main.go cancels the top-level ctx
  • ctx.Done() fires → the shutdown goroutine calls httpServer.Shutdown(context.Background())
  • Shutdown() drains in-flight requests, then causes ListenAndServe to return http.ErrServerClosed

Comment thread agent/cmd/cmd.go Outdated
}()

go func() {
if err := nginx.Run(config.Nginx, map[string]interface{}{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes plz

Comment thread agent/main.go
func main() {
cmd.Run(cmd.ParseFlags(), cmd.WithEffect(func() {
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't change the exit code mechanism.

Copy link
Copy Markdown
Author

@dajneem23 dajneem23 May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't change the exit code mechanism.

Agreed — and we didn't. The exit code mechanism is unchanged from master. log.Fatal(err) is still the last line in main, so any fatal error still terminates with exit code 1 via os.Exit(1) exactly as before.
cmd.Run needs a context. Context that is cancelled on SIGINT/SIGTERM.

cmd.Run now takes ctx—the context is passed down so internal goroutines (HTTP server, nginx) can shut down gracefully when a signal arrives, instead of being killed abruptly.

Is this acceptable?

Comment thread agent/cmd/cmd.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be an error now, right rather than panic?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be an error now, right rather than panic?

make sense

Comment thread agent/cmd/cmd.go Outdated
Comment on lines +265 to +268
go func() {
<-ctx.Done()
if err := httpServer.Shutdown(context.Background()); err != nil {
log.Errorf("agent server shutdown: %s", err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Also, wouldn't it be better to use golang.org/x/sync/errgroup to run the goroutines this way you don't need to write a different function to wait for shutdown?

Comment thread agent/cmd/cmd.go Outdated
addr := fmt.Sprintf(":%d", flags.AgentServerPort)
log.Infof("Starting agent server on %s", addr)
log.InfoS("starting agent server", "addr", addr)
errCh := make(chan error, 2)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of buffered channel is not recommended, especially in case of errors, I recommend we use errgoup

Comment thread utils/log/klog.go Outdated
@dajneem23 dajneem23 requested a review from sambhav-jain-16 May 6, 2026 01:40
@dajneem23 dajneem23 force-pushed the fix/510-graceful-shutdown branch from 58508b9 to 6f60d13 Compare May 7, 2026 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(cmd): use context.WithCancel to handle graceful shudowns

3 participants