Infrastructure investigation and change planning | Risk: Medium | January 16, 2026

Linux I/O and Memory Pressure Investigation

Linux, systemd, block storage, vmstat, iostat, journal logs
#performance #memory #io #incident-investigation

Summary

A sanitized investigation of Linux latency symptoms where disk I/O, memory reclaim, and service behavior had to be separated before any host-level change was made.

Environment

Platform	Linux service hosts with block storage
Scope	Selected service instances showing recurring latency symptoms
Observability	Host metrics, service logs, process state, disk latency, reclaim activity, and change records
Data policy	No customer payloads, hostnames, device names, or internal addresses included

Problem

A Linux service showed recurring latency spikes. The first report pointed toward storage because disk latency was visible during the incident windows, but memory pressure signals appeared at the same time. Treating the issue as a storage-only problem would have risked changing the wrong layer or masking the symptom with a restart.

The investigation goal was to identify the safest next action, not to force a single root cause too early. All notes were sanitized to remove hostnames, device names, customer context, internal addresses, and service-specific identifiers.

Risk / Control

Host-level changes required a rollback note and a post-change observation window.
Restarting the service without preserving evidence could hide the symptom pattern.
Storage changes were deferred until memory reclaim behavior was understood.
Validation signals were defined before change: latency trend, reclaim activity, major faults, disk queue behavior, and service log timing.
Stop conditions were documented so the change could be paused if latency, error rate, or host pressure moved in the wrong direction.

Rollback criteria

The change would be rolled back if latency increased, host pressure moved outside the baseline envelope, service logs showed new error patterns, or memory reclaim behavior did not improve during the observation window.
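
The rollback criteria above can be written down as an explicit check before the change starts. The following is a minimal sketch, not the procedure used in this case: the `Observation` fields, the `rollback_reasons` name, and the `pressure_tolerance` factor used to approximate the "baseline envelope" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical summary of one observation window.
    latency_p95_ms: float      # service latency summary
    host_pressure: float       # any consistent host pressure measure
    new_error_patterns: int    # count of previously unseen log error patterns
    reclaim_rate: float        # e.g. pages scanned per second


def rollback_reasons(baseline, current, pressure_tolerance=1.2):
    """Return the rollback criteria that the current window violates."""
    reasons = []
    if current.latency_p95_ms > baseline.latency_p95_ms:
        reasons.append("latency increased")
    # Assumption: the baseline envelope is modeled as baseline * tolerance.
    if current.host_pressure > baseline.host_pressure * pressure_tolerance:
        reasons.append("host pressure outside baseline envelope")
    if current.new_error_patterns > 0:
        reasons.append("new error patterns in service logs")
    if current.reclaim_rate >= baseline.reclaim_rate:
        reasons.append("memory reclaim did not improve")
    return reasons
```

An empty list means no criterion was triggered. Any non-empty list names the reasons, which keeps the rollback decision auditable.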

Timeline / Investigation

The investigation started by aligning the incident timeline with service logs, host metrics, and change history. The working notes separated confirmed observations from hypotheses so the team could avoid treating correlation as root cause.
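
One way to keep such a timeline honest is to merge every source into a single ordered list and tag each entry as an observation or a hypothesis. This is an illustrative sketch only. The `Event` fields and the source labels are assumptions, not the format of the original working notes.

```python
from datetime import datetime
from itertools import chain
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str   # hypothetical labels: "service-log", "host-metric", "change-record"
    kind: str     # "observation" or "hypothesis"
    note: str


def build_timeline(*sources):
    """Merge events from all sources into one list ordered by timestamp."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.when)


def confirmed_only(timeline):
    """Keep confirmed observations and drop hypotheses."""
    return [e for e in timeline if e.kind == "observation"]
```

Reviewing `confirmed_only(...)` next to the full timeline shows quickly which parts of the story rest on evidence and which are still guesses.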

Evidence collected

Service log timestamps around the latency windows.
Host memory pressure indicators, including reclaim activity and major faults.
Disk latency and queue behavior during and after the incident windows.
Process growth and service state observations.
Change records around the affected period.
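
For the memory pressure indicators, reclaim activity and major faults can be read as counter deltas from `/proc/vmstat`. The sketch below assumes the counter names `pgscan_kswapd`, `pgscan_direct`, `pgsteal_kswapd`, `pgsteal_direct`, and `pgmajfault`. The available names vary by kernel version, so treat the list as an assumption to check on the target host. The helper names are hypothetical.

```python
# Assumed counter names; verify against /proc/vmstat on the target kernel.
RECLAIM_KEYS = (
    "pgscan_kswapd",
    "pgscan_direct",
    "pgsteal_kswapd",
    "pgsteal_direct",
    "pgmajfault",
)


def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def reclaim_delta(before, after, keys=RECLAIM_KEYS):
    """Counter growth between two snapshots; missing counters count as 0."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def read_vmstat(path="/proc/vmstat"):
    """Take one snapshot on a Linux host."""
    with open(path) as fh:
        return parse_vmstat(fh.read())
```

Taking a snapshot with `read_vmstat()` at the start and end of a latency window and passing both to `reclaim_delta` gives per-window reclaim activity that can be lined up with the service log timestamps.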

The first pass compared disk latency, queue depth, CPU pressure, memory usage, major faults, reclaim activity, and process growth. The pattern showed that latency spikes aligned more closely with memory reclaim bursts than with sustained I/O saturation.
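
A simple way to express "aligned more closely" is to measure what fraction of latency spikes fall inside windows where each candidate signal was active. The sketch below is a hypothetical illustration of that comparison, not the analysis actually run. The function name and the sample numbers in the usage note are invented.

```python
def overlap_fraction(spike_times, signal_windows):
    """Fraction of spike timestamps inside any (start, end) window."""
    if not spike_times:
        return 0.0
    hits = sum(
        any(start <= t <= end for start, end in signal_windows)
        for t in spike_times
    )
    return hits / len(spike_times)
```

With invented spike times `[10, 25, 40, 70]`, reclaim windows `[(8, 12), (24, 27), (38, 45)]`, and a single I/O saturation window `[(24, 30)]`, the reclaim overlap is 0.75 and the I/O overlap is 0.25. A gap like that suggests which layer to examine first, but it is still correlation and does not prove the cause.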

The next pass checked whether the storage layer remained unhealthy after memory pressure subsided. This guarded against a common failure mode in performance work: changing storage settings because storage looked noisy during a host pressure event.
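
That check can be sketched as splitting disk latency samples by whether the host was under memory pressure at the time. The sample layout, the threshold, and the function name below are assumptions for illustration only.

```python
from statistics import median


def disk_latency_by_pressure(samples, pressure_threshold):
    """Median disk latency under memory pressure versus calm periods.

    samples: iterable of (memory_pressure, disk_await_ms) pairs taken at
    the same instants. Both units are whatever the collector reports.
    """
    samples = list(samples)
    under = [lat for p, lat in samples if p >= pressure_threshold]
    calm = [lat for p, lat in samples if p < pressure_threshold]
    return {
        "under_pressure": median(under) if under else None,
        "calm": median(calm) if calm else None,
    }
```

If the calm-period median stays near baseline while the under-pressure median is elevated, the disk noise is more likely a side effect of host pressure than an independent storage fault.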

Decision record

The decision was to address memory pressure first and defer storage-level changes until the reclaim signal was under control. This kept the action aligned with the strongest evidence and avoided tuning a layer that was only showing a downstream symptom.

Validation criteria

Validation required lower reclaim pressure, no new service errors, latency behavior returning toward the observed baseline, and disk signals remaining stable after the memory pressure source was addressed.

Result

The recommended action focused on reducing the memory pressure source first, then validating disk behavior afterward. The post-change observation window showed improved latency behavior and supported the conclusion that disk symptoms were secondary to memory reclaim pressure.

This write-up deliberately omits metrics that the sanitized notes cannot support. The important result for the case study is the operating method: baseline first, change narrowly, validate afterward, and preserve a rollback path.

Lessons Learned

Performance investigations need timelines, not isolated screenshots. Logs, metrics, and change records should be aligned before a host-level change is proposed.

Disk latency during an incident window does not automatically mean disk is the root cause. Memory reclaim, process behavior, and service timing can make downstream layers appear unhealthy.

Validation criteria should be written before the change. That keeps the investigation honest and gives operators a clear way to decide whether to continue, pause, or roll back.