AWS CloudWatch Fundamentals
CloudWatch is AWS’s observability service — metrics, logs, alarms, dashboards, and traces. It’s not one product but a family of loosely related services that share the CloudWatch brand. Nearly every AWS service publishes metrics and logs to it by default, so CloudWatch is where you look first when something’s wrong.
What CloudWatch is (the big picture)
| Sub-service | What it does |
|---|---|
| CloudWatch Metrics | Numeric time-series (CPU, requests, latency, custom values) |
| CloudWatch Logs | Log aggregation from apps, services, agents |
| CloudWatch Alarms | Threshold-based notifications on metrics |
| CloudWatch Dashboards | Visualisations combining metrics & logs |
| CloudWatch Events / EventBridge | Event bus for AWS and app events (EventBridge is the modern name) |
| CloudWatch Logs Insights | SQL-ish query language over logs |
| CloudWatch Synthetics | Scripted “canary” checks probing endpoints |
| CloudWatch RUM | Real-user monitoring for browser apps |
| CloudWatch Application Signals / ServiceLens | Service-level overview |
| X-Ray | Distributed tracing (related but technically separate service) |
| Container Insights / Lambda Insights | Pre-built dashboards for ECS/EKS/Lambda |
Pricing is per sub-service and can add up — logs ingestion and high-resolution custom metrics are the common bill surprises.
Metrics — the numeric spine
A metric is a time series identified by three things:
- Namespace — e.g. `AWS/EC2`, `AWS/Lambda`, `Custom/myapp`
- Metric name — `CPUUtilization`, `Invocations`, `OrderCount`
- Dimensions — key/value pairs that further scope the metric: `{InstanceId: i-123}`, `{FunctionName: my-fn}`
Every unique combination of namespace × name × dimension set is a distinct metric (and a distinct bill line for custom metrics).
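As a concrete sketch (namespace, metric, and dimension names here are illustrative), the identity triple maps directly onto the `PutMetricData` request shape:

```python
import json

# One data point for the metric identified by the triple:
# namespace "Custom/myapp" x name "OrderCount" x dimensions {Service: checkout}.
# These are the kwargs you would pass to boto3's
# cloudwatch.put_metric_data(**request); all names are illustrative.
request = {
    "Namespace": "Custom/myapp",
    "MetricData": [
        {
            "MetricName": "OrderCount",
            "Dimensions": [{"Name": "Service", "Value": "checkout"}],
            "Value": 1.0,
            "Unit": "Count",
            "StorageResolution": 60,  # 1 = high-resolution (1-second) data
        }
    ],
}

# Changing any part of the triple (namespace, name, or the dimension set)
# creates -- and bills for -- a brand-new metric.
print(json.dumps(request, indent=2))
```

Note that dropping or renaming a dimension gives you a different metric, not a different view of the same one.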
Resolution and retention
- Standard resolution: 1-minute granularity
- High resolution: 1-second granularity (high-resolution alarms cost ~3× standard ones; used for latency-sensitive alarms)
- Retention: 15 months automatic, aggregated at coarser intervals over time:
  - Sub-minute (high-resolution) data: 3 hours
  - 60-second data: 15 days
  - 5-minute aggregates: 63 days
  - 1-hour aggregates: 455 days (15 months)
AWS-native vs custom metrics
- Native — EC2, RDS, ALB, Lambda, S3, etc. publish metrics by default. Basic monitoring = 5-min; detailed = 1-min (paid).
- Custom — your app publishes via `PutMetricData` or Embedded Metric Format (EMF) — JSON in logs that CloudWatch auto-parses into metrics, the efficient path for Lambda/containers.
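A minimal EMF event, as a sketch (namespace and field names are assumed): the `_aws` envelope tells CloudWatch which log fields to lift into metrics, and in Lambda simply printing the JSON line is enough.

```python
import json
import time

# Embedded Metric Format: a structured log line that CloudWatch Logs
# auto-parses into a custom metric, with no PutMetricData call.
emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "CloudWatchMetrics": [
            {
                "Namespace": "Custom/myapp",   # illustrative namespace
                "Dimensions": [["Service"]],   # one dimension set
                "Metrics": [{"Name": "OrderCount", "Unit": "Count"}],
            }
        ],
    },
    "Service": "checkout",  # dimension value, referenced above
    "OrderCount": 1,        # metric value, referenced above
}

# stdout -> CloudWatch Logs -> metric extraction, asynchronously.
print(json.dumps(emf_event))
```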
Statistics and math
You query metrics by statistic: Sum, Average, Minimum, Maximum, SampleCount, percentiles (p50, p95, p99).
Metric Math lets you compose expressions: `m1 / m2 * 100`, `ANOMALY_DETECTION_BAND(m1, 2)`, etc. Useful for derived KPIs in dashboards and alarms.
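Sketching how such an expression rides along in a `GetMetricData` call (boto3 assumed; load-balancer metrics chosen for illustration): the raw series get `Id`s, and the expression refers to them by those `Id`s.

```python
# A MetricDataQueries list for boto3's cloudwatch.get_metric_data:
# m1 and m2 are raw metrics, e1 derives an error rate from them.
# Identifiers and labels are illustrative.
queries = [
    {
        "Id": "m1",
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "HTTPCode_Target_5XX_Count"},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,  # intermediate term, don't return it
    },
    {
        "Id": "m2",
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount"},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,
    },
    {
        "Id": "e1",
        "Expression": "m1 / m2 * 100",  # derived KPI: 5xx error rate in %
        "Label": "5xx error rate (%)",
    },
]
```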
Logs — the string-shaped half
CloudWatch Logs organises log data into:
Log Group → a logical container (e.g. /aws/lambda/my-fn, /var/log/app)
Log Stream → a source within the group (e.g. per container, per file)
Events → timestamp + message
How logs get in:
- AWS services publish directly (Lambda, API Gateway, VPC Flow Logs, CloudTrail, etc.)
- CloudWatch Agent on EC2/on-prem — sends OS and app logs
- Container log drivers — `awslogs` driver in Docker/ECS; sidecar/Fluent Bit in EKS
- Direct API — `PutLogEvents`
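For the direct-API path, a sketch of the `PutLogEvents` request shape (boto3's `logs.put_log_events`; group and stream names are made up):

```python
import time

now_ms = int(time.time() * 1000)

# Events target one stream inside one group, each with a millisecond
# timestamp, and must be in chronological order within the batch.
request = {
    "logGroupName": "/aws/lambda/my-fn",            # illustrative
    "logStreamName": "2024/01/01/[$LATEST]abc123",  # illustrative
    "logEvents": [
        {"timestamp": now_ms, "message": "ERROR payment failed"},
        {"timestamp": now_ms + 1, "message": "INFO retrying"},
    ],
}
```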
Retention and storage classes
- Default retention: “never expire” — a classic cost trap. Set a retention policy on every log group you care about.
- Infrequent Access (IA) class — cheaper storage for logs you might query rarely
- Log Group subscriptions — stream logs to Lambda, Kinesis, Firehose, OpenSearch for downstream processing
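Setting a retention policy is a one-call fix; a sketch of the request shape (boto3's `logs.put_retention_policy`; the group name is illustrative). Note that `retentionInDays` only accepts a fixed menu of values (1, 3, 5, 7, 14, 30, 60, 90, … up to 3653):

```python
# Kill the "never expire" default on a log group.
request = {
    "logGroupName": "/aws/lambda/my-fn",  # illustrative
    "retentionInDays": 30,                # must be one of the allowed values
}
```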
Logs Insights — the query language
A purpose-built query language for CloudWatch Logs:
```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
| sort errors desc
| limit 100
```
Fast for ad-hoc investigation. Not as powerful as OpenSearch / Splunk for complex correlation, but good enough for most incidents.
Metric filters
A metric filter scans log events and emits a metric when matched. Classic pattern: “count of ERROR messages → custom metric → alarm.”
```
Filter pattern: ERROR
Value:          1
Namespace:      MyApp
Name:           ErrorCount
```
This is how you turn unstructured logs into alertable data.
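As a sketch, the same pattern expressed as a `put_metric_filter` request (boto3 logs client; group, filter, and metric names are illustrative):

```python
# Metric filter: each log event matching the term "ERROR" in this group
# emits a value of 1 to the custom metric MyApp/ErrorCount.
request = {
    "logGroupName": "/aws/lambda/my-fn",  # illustrative
    "filterName": "error-count",
    "filterPattern": "ERROR",             # simple term match
    "metricTransformations": [
        {
            "metricName": "ErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",           # emit 1 per matching event
            "defaultValue": 0.0,          # report 0 when nothing matches
        }
    ],
}
```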
Alarms
An alarm watches one metric (or one metric-math expression) and transitions through states:
- `INSUFFICIENT_DATA` — not enough data points yet
- `OK` — within threshold
- `ALARM` — threshold breached
Triggers: SNS topic, Auto Scaling action, EC2 action (reboot/stop/terminate/recover), Systems Manager OpsItem, Lambda via EventBridge.
Key knobs
- Threshold + comparison — `> 80`, `<= 5`, etc.
- Evaluation periods / Datapoints to alarm — “3 out of 5 periods > 80” reduces flappiness
- Treat missing data — `notBreaching` / `breaching` / `ignore` / `missing` — critical for alarms that can go quiet intentionally
- Anomaly detection — ML-based baseline; a breach means “outside the expected band” rather than crossing a fixed threshold
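These knobs show up as fields of a `put_metric_alarm` call; a sketch with assumed names and an assumed SNS topic ARN:

```python
# "3 out of 5 one-minute periods above 80" with explicit missing-data
# handling -- kwargs for boto3's cloudwatch.put_metric_alarm.
alarm = {
    "AlarmName": "cpu-high",                              # illustrative
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Maximum",              # Maximum, not Average: catch tail spikes
    "Period": 60,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 5,              # look at the last 5 periods...
    "DatapointsToAlarm": 3,              # ...alarm if 3 of them breach
    "TreatMissingData": "notBreaching",  # or "breaching" for liveness-style alarms
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:oncall"],  # illustrative ARN
}
```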
Composite alarms
Combine multiple alarms with boolean logic:
`(CPUHigh AND MemHigh) OR DiskFull`
Useful to reduce noise — only page if multiple symptoms coincide.
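In the API, composite rules reference child alarms by name with `ALARM(...)` predicates; a sketch of a `put_composite_alarm` request (child alarm names and the topic ARN are assumed to exist):

```python
# A composite alarm that only fires when multiple symptoms coincide.
# Kwargs for boto3's cloudwatch.put_composite_alarm.
composite = {
    "AlarmName": "host-degraded",  # illustrative
    "AlarmRule": (
        '(ALARM("CPUHigh") AND ALARM("MemHigh")) '
        'OR ALARM("DiskFull")'
    ),
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:oncall"],  # illustrative
}
```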
CloudWatch Agent — the universal collector
The CloudWatch Agent is a cross-platform daemon for EC2 / on-prem that collects:
- System metrics not in the default EC2 set (memory, disk, swap, custom procstat)
- Log files
- StatsD / collectd metrics
Configured via JSON, installed via SSM or user-data. On modern instances it replaces the older CloudWatch Logs Agent + custom memory-scripts dance.
Why it matters: EC2’s default metrics don’t include memory or disk-space-used (surprising, because those are among the most common things to alarm on). You install the agent or you live with the blind spots.
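A minimal agent config covering exactly those blind spots, as a sketch (file paths and log group names are illustrative; this JSON goes in the agent’s config file, typically distributed via SSM Parameter Store):

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          { "file_path": "/var/log/app/app.log", "log_group_name": "/var/log/app" }
        ]
      }
    }
  }
}
```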
Dashboards
JSON-defined widgets combining metrics, logs, text, and alarms. Share across accounts via cross-account observability. Also programmable via IaC.
Dashboard tip: one dashboard per service/team, not one mega-dashboard. People don’t scroll; they skim.
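Dashboards being JSON makes them easy to generate from code; a sketch of a one-widget body (the serialized string is what `put_dashboard` expects as `DashboardBody`; metric, region, and title are assumptions):

```python
import json

# A minimal dashboard body: one metric widget showing p99 latency.
body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "p99 latency",          # illustrative
                "region": "eu-west-1",           # illustrative
                "stat": "p99",
                "period": 60,
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime"]
                ],
            },
        }
    ]
}

# put_dashboard takes the body as a JSON string, not a dict.
dashboard_body = json.dumps(body)
```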
EventBridge (the CloudWatch Events evolution)
EventBridge is technically a separate service now, but the lineage is “CloudWatch Events + schema registry + SaaS partner buses.” Pattern:
Source → event bus → rule (pattern matching) → target
Typical AWS sources: any AWS service state change, CloudTrail API calls, scheduled rules (cron). Targets: Lambda, SQS, SNS, Step Functions, another event bus.
Canonical uses:
- “When an EC2 instance stops, do X” — without writing a poller
- Scheduled jobs (cron replacements)
- Cross-account event routing via bus-to-bus
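The first canonical use as a sketch: an event pattern for `put_rule` (top-level fields are AND-ed together; values inside a list are OR-ed):

```python
import json

# EventBridge event pattern: "when an EC2 instance stops or terminates".
# This matches the events EC2 emits on state change -- no poller needed.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},  # OR-ed alternatives
}

# put_rule takes the pattern as a JSON string (EventPattern parameter).
event_pattern = json.dumps(pattern)
```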
Cost traps
- Custom metrics bill. Each unique namespace × name × dimension set is ~$0.30/month baseline, plus per-million `PutMetricData` calls. Apps emitting metrics with high-cardinality dimensions (request ID, user ID) generate thousands of distinct metrics. Use EMF + aggregation, and keep dimensions coarse.
- Log retention “never expire” — set retention on every log group. Old logs that nobody reads still cost storage.
- High-resolution metrics cost more — don’t use 1-sec resolution unless your alarm needs sub-minute reaction.
- Container Insights enhanced mode charges per container — reasonable for prod, expensive across dev clusters.
- Cross-region dashboards / metric queries — pull data from each region; noisier billing line.
How everything fits together
APP / LAMBDA / EC2
│ │ │ │
│ │ │ └──→ CloudWatch Agent ──→ Logs + System Metrics
│ │ └──────→ Service Native ────→ Namespace metrics
│ └──────────── EMF JSON ──────────→ Auto-parsed custom metrics
└───────────────── PutMetricData ─────→ Custom metrics
Logs ──→ Metric Filter ──→ Alarm ──→ SNS ──→ PagerDuty
Metrics ─→ Alarm ──→ Auto Scaling / SNS / EventBridge
Metrics ─→ Dashboard
Events ──→ EventBridge ──→ Lambda / Target
Alarms ──→ Composite ──→ Single noise-reduced page
Common pitfalls
- Missing the blind spots. EC2 doesn’t emit memory metrics — install CW Agent.
- Alarms on `Average` when you want `Maximum` or `p99`. `Average` hides tail spikes.
- Alarms going silent during outages. If your Lambda crashes, invocations = 0, and “Errors > threshold” never fires because there’s no data. Use “treat missing data as breaching”.
- Infinite log retention. Set it or pay forever.
- High-cardinality custom dimensions exploding the metric count. Use EMF with aggregation.
- Alarms on the wrong metric. E.g. “CPU > 80%” on an auto-scaling web tier — irrelevant; what matters is request latency or 5xx rate. Align alarms with SLOs.
- Dashboards nobody reads. Focus on the handful of metrics that tell the SLO story.
Mental model
- Metrics = numbers in time. Alarms watch them. Dashboards display them.
- Logs = strings. Logs Insights queries them; metric filters convert them to numbers.
- Alarms = eventing layer bridging metrics to humans / actions.
- EventBridge = AWS + SaaS event bus that uses CloudWatch’s rule engine.
- X-Ray = traces — spans joining across services.
- Everything flows here by default — which is why the first place to investigate an AWS incident is CloudWatch, and the first ops lesson in AWS is “set alarms, retention, and the agent.”