Platform investigation and evidence handoff Risk: Medium January 12, 2026

OpenShift Egress Investigation

A sanitized example of tracing intermittent outbound connectivity through workload, node, cluster policy, and perimeter control points before recommending a narrow change.

OpenShiftKubernetesLinuxnetwork policyegress controls #egress#networking#incident-review

Summary

A sanitized example of tracing intermittent outbound connectivity through workload, node, cluster policy, and perimeter control points before recommending a narrow change.

Environment

Platform	Sanitized OpenShift cluster on Linux worker nodes
Scope	Selected workloads with intermittent outbound connectivity failures
Observability	Workload tests, node placement, policy review, logs, and change records
Data policy	No destination names, internal routes, hostnames, or provider-specific details included

Problem

Application teams reported intermittent outbound failures from selected workloads to an external dependency. The symptom initially looked like a general cluster issue, but the failures did not affect every namespace, every node, or every test path.

Risk / Control

Production traffic could not be disrupted for broad packet capture.
Broad firewall or cluster-policy changes were not acceptable without a narrow hypothesis.
Evidence had to be useful to both platform and network teams.
Public notes could not include destinations, internal routes, hostnames, or sensitive topology.
Rollback criteria and post-change tests had to be defined before any egress change.

Rollback criteria

The change would be rolled back if the scoped workloads still failed post-change checks, unrelated namespaces showed new egress failures, or platform/network telemetry showed unexpected behavior outside the affected path.

Timeline / Investigation

The investigation compared successful and failing workloads, checked pod placement, reviewed route and network policy intent, and correlated node-level symptoms with recent change records. A short test matrix separated DNS resolution, TCP reachability, proxy behavior, application retry behavior, and node-specific variance.

Evidence collected

Successful and failing workload comparisons.
Pod placement and node-level variance.
Namespace routing and network policy review.
DNS, TCP reachability, proxy behavior, and retry behavior checks.
Recent change records and application-owner observations.

Observability data was used to avoid treating a single failed request as root cause. The working notes separated confirmed facts, unproven hypotheses, and items that required another team to validate.

The strongest evidence pointed to a narrow egress control path rather than a general OpenShift failure. The proposed change was limited to the affected destination class and paired with pre-change checks, post-change checks, and an explicit fallback path.

Decision record

The decision was to make a narrow egress-control update instead of a broad cluster or firewall change. The record tied the recommendation to scoped workload tests and documented which assumptions required network-team confirmation.

Validation criteria

Validation required successful pre-defined egress checks from the affected workloads, no new failures from unaffected namespaces, and application-owner confirmation that the original symptom was no longer reproducible.

Result

The affected traffic path was corrected without making a cluster-wide change. The incident review produced a reusable egress investigation checklist that starts with scope, baseline signals, rollback trigger, and ownership boundaries.

Lessons Learned

Egress issues are easier to resolve when ownership assumptions are delayed until evidence supports them. A compact test matrix, clear observability notes, and a rollback-aware change record reduced operational risk.