AWS CloudWatch Fundamentals
CloudWatch is AWS’s observability service — metrics, logs, alarms, dashboards, and traces. It’s not one product but a family of loosely related services that share the CloudWatch brand. Nearly every AWS service publishes metrics and logs to it by default, so CloudWatch is where you look first when something’s wrong.
What CloudWatch is (the big picture)
| Sub-service | What it does |
|---|---|
| CloudWatch Metrics | Numeric time-series (CPU, requests, latency, custom values) |
| CloudWatch Logs | Log aggregation from apps, services, agents |
| CloudWatch Alarms | Threshold-based notifications on metrics |
| CloudWatch Dashboards | Visualisations combining metrics & logs |
| CloudWatch Events / EventBridge | Event bus for AWS and app events (EventBridge is the modern name) |
| CloudWatch Logs Insights | SQL-ish query language over logs |
| CloudWatch Synthetics | Scripted “canary” checks probing endpoints |
| CloudWatch RUM | Real-user monitoring for browser apps |
| CloudWatch Application Signals / ServiceLens | Service-level overview |
| X-Ray | Distributed tracing (related but technically separate service) |
| Container Insights / Lambda Insights | Pre-built dashboards for ECS/EKS/Lambda |
Pricing is per sub-service and can add up — logs ingestion and high-resolution custom metrics are the common bill surprises.
Metrics — the numeric spine
A metric is a time series identified by three things:
- Namespace — e.g. `AWS/EC2`, `AWS/Lambda`, `Custom/myapp`
- Metric name — `CPUUtilization`, `Invocations`, `OrderCount`
- Dimensions — key/value pairs that further scope the metric: `{InstanceId: i-123}`, `{FunctionName: my-fn}`
Every unique combination of namespace × name × dimension set is a distinct metric (and a distinct bill line for custom metrics).
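As a concrete sketch (namespace, metric, and dimension names here are illustrative), the identity triple maps directly onto the `PutMetricData` request shape:

```python
import json

# One data point for the metric identified by the triple:
# namespace "Custom/myapp" x name "OrderCount" x dimensions {Service: checkout}.
# These are the kwargs you would pass to boto3's
# cloudwatch.put_metric_data(**request); all names are illustrative.
request = {
    "Namespace": "Custom/myapp",
    "MetricData": [
        {
            "MetricName": "OrderCount",
            "Dimensions": [{"Name": "Service", "Value": "checkout"}],
            "Value": 1.0,
            "Unit": "Count",
            "StorageResolution": 60,  # 1 = high-resolution (1-second) data
        }
    ],
}

# Changing any part of the triple (namespace, name, or the dimension set)
# creates -- and bills for -- a brand-new metric.
print(json.dumps(request, indent=2))
```

Note that dropping or renaming a dimension gives you a different metric, not a different view of the same one.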
Resolution and retention
- Standard resolution: 1-minute granularity
- High resolution: 1-second granularity (high-resolution alarms cost ~3× standard ones; used for latency-sensitive alarms)
- Retention: 15 months automatic, aggregated at coarser intervals over time:
  - Sub-minute (high-resolution) data: 3 hours
  - 60-second data: 15 days
  - 5-minute aggregates: 63 days
  - 1-hour aggregates: 455 days (15 months)
AWS-native vs custom metrics
- Native — EC2, RDS, ALB, Lambda, S3, etc. publish metrics by default. Basic monitoring = 5-min; detailed = 1-min (paid).
- Custom — your app publishes via `PutMetricData` or Embedded Metric Format (EMF) — JSON in logs that CloudWatch auto-parses into metrics, the efficient path for Lambda/containers.
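A minimal EMF event, as a sketch (namespace and field names are assumed): the `_aws` envelope tells CloudWatch which log fields to lift into metrics, and in Lambda simply printing the JSON line is enough.

```python
import json
import time

# Embedded Metric Format: a structured log line that CloudWatch Logs
# auto-parses into a custom metric, with no PutMetricData call.
emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "CloudWatchMetrics": [
            {
                "Namespace": "Custom/myapp",   # illustrative namespace
                "Dimensions": [["Service"]],   # one dimension set
                "Metrics": [{"Name": "OrderCount", "Unit": "Count"}],
            }
        ],
    },
    "Service": "checkout",  # dimension value, referenced above
    "OrderCount": 1,        # metric value, referenced above
}

# stdout -> CloudWatch Logs -> metric extraction, asynchronously.
print(json.dumps(emf_event))
```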
Statistics and math
You query metrics by statistic: Sum, Average, Minimum, Maximum, SampleCount, percentiles (p50, p95, p99).
Metric Math lets you compose expressions: `m1 / m2 * 100`, `ANOMALY_DETECTION_BAND(m1, 2)`, etc. Useful for derived KPIs in dashboards and alarms.
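Sketching how such an expression rides along in a `GetMetricData` call (boto3 assumed; load-balancer metrics chosen for illustration): the raw series get `Id`s, and the expression refers to them by those `Id`s.

```python
# A MetricDataQueries list for boto3's cloudwatch.get_metric_data:
# m1 and m2 are raw metrics, e1 derives an error rate from them.
# Identifiers and labels are illustrative.
queries = [
    {
        "Id": "m1",
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "HTTPCode_Target_5XX_Count"},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,  # intermediate term, don't return it
    },
    {
        "Id": "m2",
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApplicationELB",
                       "MetricName": "RequestCount"},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,
    },
    {
        "Id": "e1",
        "Expression": "m1 / m2 * 100",  # derived KPI: 5xx error rate in %
        "Label": "5xx error rate (%)",
    },
]
```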
Logs — the string-shaped half
CloudWatch Logs organises log data into:
Log Group → a logical container (e.g. /aws/lambda/my-fn, /var/log/app)
Log Stream → a source within the group (e.g. per container, per file)
Events → timestamp + message
How logs get in:
- AWS services publish directly (Lambda, API Gateway, VPC Flow Logs, CloudTrail, etc.)
- CloudWatch Agent on EC2/on-prem — sends OS and app logs
- Container log drivers — `awslogs` driver in Docker/ECS; sidecar/Fluent Bit in EKS
- Direct API — `PutLogEvents`
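For the direct-API path, a sketch of the `PutLogEvents` request shape (boto3's `logs.put_log_events`; group and stream names are made up):

```python
import time

now_ms = int(time.time() * 1000)

# Events target one stream inside one group, each with a millisecond
# timestamp, and must be in chronological order within the batch.
request = {
    "logGroupName": "/aws/lambda/my-fn",            # illustrative
    "logStreamName": "2024/01/01/[$LATEST]abc123",  # illustrative
    "logEvents": [
        {"timestamp": now_ms, "message": "ERROR payment failed"},
        {"timestamp": now_ms + 1, "message": "INFO retrying"},
    ],
}
```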
Retention and storage classes
- Default retention: “never expire” — a classic cost trap. Set a retention policy on every log group you care about.
- Infrequent Access (IA) class — cheaper storage for logs you might query rarely
- Log Group subscriptions — stream logs to Lambda, Kinesis, Firehose, OpenSearch for downstream processing
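Setting a retention policy is a one-call fix; a sketch of the request shape (boto3's `logs.put_retention_policy`; the group name is illustrative). Note that `retentionInDays` only accepts a fixed menu of values (1, 3, 5, 7, 14, 30, 60, 90, … up to 3653):

```python
# Kill the "never expire" default on a log group.
request = {
    "logGroupName": "/aws/lambda/my-fn",  # illustrative
    "retentionInDays": 30,                # must be one of the allowed values
}
```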
Logs Insights — the query language
A purpose-built query language for CloudWatch Logs:
```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
| sort errors desc
| limit 100
```
Fast for ad-hoc investigation. Not as powerful as OpenSearch / Splunk for complex correlation, but good enough for most incidents.
Metric filters
A metric filter scans log events and emits a metric when matched. Classic pattern: “count of ERROR messages → custom metric → alarm.”
```
Filter pattern: ERROR
Value:          1
Namespace:      MyApp
Name:           ErrorCount
```
This is how you turn unstructured logs into alertable data.
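As a sketch, the same pattern expressed as a `put_metric_filter` request (boto3 logs client; group, filter, and metric names are illustrative):

```python
# Metric filter: each log event matching the term "ERROR" in this group
# emits a value of 1 to the custom metric MyApp/ErrorCount.
request = {
    "logGroupName": "/aws/lambda/my-fn",  # illustrative
    "filterName": "error-count",
    "filterPattern": "ERROR",             # simple term match
    "metricTransformations": [
        {
            "metricName": "ErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",           # emit 1 per matching event
            "defaultValue": 0.0,          # report 0 when nothing matches
        }
    ],
}
```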
Alarms
An alarm watches one metric (or one metric-math expression) and transitions through states:
- `INSUFFICIENT_DATA` — not enough data points yet
- `OK` — within threshold
- `ALARM` — threshold breached
Triggers: SNS topic, Auto Scaling action, EC2 action (reboot/stop/terminate/recover), Systems Manager OpsItem, Lambda via EventBridge.
Key knobs
- Threshold + comparison — `> 80`, `<= 5`, etc.
- Evaluation periods / Datapoints to alarm — “3 out of 5 periods > 80” reduces flappiness
- Treat missing data — `notBreaching` / `breaching` / `ignore` / `missing` — critical for alarms that can go quiet intentionally
- Anomaly detection — ML-based baseline; a breach means “outside the expected band” rather than crossing a fixed threshold
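These knobs show up as fields of a `put_metric_alarm` call; a sketch with assumed names and an assumed SNS topic ARN:

```python
# "3 out of 5 one-minute periods above 80" with explicit missing-data
# handling -- kwargs for boto3's cloudwatch.put_metric_alarm.
alarm = {
    "AlarmName": "cpu-high",                              # illustrative
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Maximum",              # Maximum, not Average: catch tail spikes
    "Period": 60,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 5,              # look at the last 5 periods...
    "DatapointsToAlarm": 3,              # ...alarm if 3 of them breach
    "TreatMissingData": "notBreaching",  # or "breaching" for liveness-style alarms
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:oncall"],  # illustrative ARN
}
```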
Composite alarms
Combine multiple alarms with boolean logic:
`(CPUHigh AND MemHigh) OR DiskFull`
Useful to reduce noise — only page if multiple symptoms coincide.
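In the API, composite rules reference child alarms by name with `ALARM(...)` predicates; a sketch of a `put_composite_alarm` request (child alarm names and the topic ARN are assumed to exist):

```python
# A composite alarm that only fires when multiple symptoms coincide.
# Kwargs for boto3's cloudwatch.put_composite_alarm.
composite = {
    "AlarmName": "host-degraded",  # illustrative
    "AlarmRule": (
        '(ALARM("CPUHigh") AND ALARM("MemHigh")) '
        'OR ALARM("DiskFull")'
    ),
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:oncall"],  # illustrative
}
```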
CloudWatch Agent — the universal collector
The CloudWatch Agent is a cross-platform daemon for EC2 / on-prem that collects:
- System metrics not in the default EC2 set (memory, disk, swap, custom procstat)
- Log files
- StatsD / collectd metrics
Configured via JSON, installed via SSM or user-data. On modern instances it replaces the older CloudWatch Logs Agent + custom memory-scripts dance.
Why it matters: EC2’s default metrics don’t include memory or disk-space-used (surprising, because those are among the most common things to alarm on). You install the agent or you live with the blind spots.
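A minimal agent config covering exactly those blind spots, as a sketch (file paths and log group names are illustrative; this JSON goes in the agent’s config file, typically distributed via SSM Parameter Store):

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          { "file_path": "/var/log/app/app.log", "log_group_name": "/var/log/app" }
        ]
      }
    }
  }
}
```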
Dashboards
JSON-defined widgets combining metrics, logs, text, and alarms. Share across accounts via cross-account observability. Also programmable via IaC.
Dashboard tip: one dashboard per service/team, not one mega-dashboard. People don’t scroll; they skim.
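Dashboards being JSON makes them easy to generate from code; a sketch of a one-widget body (the serialized string is what `put_dashboard` expects as `DashboardBody`; metric, region, and title are assumptions):

```python
import json

# A minimal dashboard body: one metric widget showing p99 latency.
body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "p99 latency",          # illustrative
                "region": "eu-west-1",           # illustrative
                "stat": "p99",
                "period": 60,
                "metrics": [
                    ["AWS/ApplicationELB", "TargetResponseTime"]
                ],
            },
        }
    ]
}

# put_dashboard takes the body as a JSON string, not a dict.
dashboard_body = json.dumps(body)
```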
EventBridge (the CloudWatch Events evolution)
EventBridge is technically a separate service now, but the lineage is “CloudWatch Events + schema registry + SaaS partner buses.” Pattern:
Source → event bus → rule (pattern matching) → target
Typical AWS sources: any AWS service state change, CloudTrail API calls, scheduled rules (cron). Targets: Lambda, SQS, SNS, Step Functions, another event bus.
Canonical uses:
- “When an EC2 instance stops, do X” — without writing a poller
- Scheduled jobs (cron replacements)
- Cross-account event routing via bus-to-bus
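The first canonical use as a sketch: an event pattern for `put_rule` (top-level fields are AND-ed together; values inside a list are OR-ed):

```python
import json

# EventBridge event pattern: "when an EC2 instance stops or terminates".
# This matches the events EC2 emits on state change -- no poller needed.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},  # OR-ed alternatives
}

# put_rule takes the pattern as a JSON string (EventPattern parameter).
event_pattern = json.dumps(pattern)
```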
Cost traps
- Custom metrics bill. Each unique namespace × name × dimension set is ~$0.30/month baseline, plus per-million `PutMetricData` calls. Apps emitting metrics with high-cardinality dimensions (request ID, user ID) generate thousands of distinct metrics. Use EMF + aggregation, and keep dimensions coarse.
- Log retention “never expire” — set retention on every log group. Old logs that nobody reads still cost storage.
- High-resolution metrics cost more — don’t use 1-sec resolution unless your alarm needs sub-minute reaction.
- Container Insights enhanced mode charges per container — reasonable for prod, expensive across dev clusters.
- Cross-region dashboards / metric queries — pull data from each region; noisier billing line.
How everything fits together
APP / LAMBDA / EC2
│ │ │ │
│ │ │ └──→ CloudWatch Agent ──→ Logs + System Metrics
│ │ └──────→ Service Native ────→ Namespace metrics
│ └──────────── EMF JSON ──────────→ Auto-parsed custom metrics
└───────────────── PutMetricData ─────→ Custom metrics
Logs ──→ Metric Filter ──→ Alarm ──→ SNS ──→ PagerDuty
Metrics ─→ Alarm ──→ Auto Scaling / SNS / EventBridge
Metrics ─→ Dashboard
Events ──→ EventBridge ──→ Lambda / Target
Alarms ──→ Composite ──→ Single noise-reduced page
Common pitfalls
- Missing the blind spots. EC2 doesn’t emit memory metrics — install CW Agent.
- Alarms on `Average` when you want `Maximum` or `p99`. `Average` hides tail spikes.
- Alarms going silent during outages. If your Lambda crashes, invocations = 0, and “Errors > threshold” never fires because there’s no data. Use “treat missing data as breaching”.
- Infinite log retention. Set it or pay forever.
- High-cardinality custom dimensions exploding the metric count. Use EMF with aggregation.
- Alarms on the wrong metric. E.g. “CPU > 80%” on an auto-scaling web tier — irrelevant; what matters is request latency or 5xx rate. Align alarms with SLOs.
- Dashboards nobody reads. Focus on the handful of metrics that tell the SLO story.
Mental model
- Metrics = numbers in time. Alarms watch them. Dashboards display them.
- Logs = strings. Logs Insights queries them; metric filters convert them to numbers.
- Alarms = eventing layer bridging metrics to humans / actions.
- EventBridge = AWS + SaaS event bus that uses CloudWatch’s rule engine.
- X-Ray = traces — spans joining across services.
- Everything flows here by default — which is why the first place to investigate an AWS incident is CloudWatch, and the first ops lesson in AWS is “set alarms, retention, and the agent.”