What is AWS Systems Manager?
A collection of tools for managing and operating your infrastructure (EC2 instances, on-premises servers, containers) at scale. Core concept: SSM Agent runs on your instances and communicates with the SSM service over outbound connections, so no inbound ports are needed.
Key Capabilities
| Category | Capability | What it does |
|---|---|---|
| Node Management | Fleet Manager | View and manage all servers from one console |
| | Session Manager | SSH/RDP without opening ports or managing keys |
| | Run Command | Execute scripts on multiple instances at once |
| | Patch Manager | Automate OS and application patching |
| | State Manager | Keep instances in a defined configuration state |
| Operations | OpsCenter | Central place to view and resolve operational issues |
| | Incident Manager | Manage and respond to incidents |
| | Explorer | Dashboard showing operational data across accounts |
| Application Management | Parameter Store | Store config values and secrets (free tier available) |
| | AppConfig | Deploy application configuration safely with rollback |
| Change Management | Automation | Run multi-step runbooks |
| | Change Manager | Approve and track operational changes |
| | Maintenance Windows | Schedule operations during defined time windows |
Run Command
Execute commands on multiple instances without SSH.
Configuration
| Setting | What to specify | Details |
|---|---|---|
| Document | Which command/script to run | AWS-provided (e.g., AWS-RunShellScript) or custom |
| Targets | Which instances | By instance ID, tag, resource group |
| Parameters | Input to the document | Commands to run, timeout, working directory |
| Rate Control | How many at once | Concurrency (10 or 50%) and error threshold |
| Output | Where to store results | S3 bucket, CloudWatch Logs |
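The settings in this table map fairly directly onto a `SendCommand` request. The sketch below only builds the keyword arguments that boto3's `ssm.send_command` accepts; the helper name `build_send_command_request`, the tag `Role=WebServer`, the working directory, and the bucket name are placeholders chosen for illustration, not values from this guide.

```python
def build_send_command_request(commands, tag_key, tag_value,
                               max_concurrency="10", max_errors="5",
                               s3_bucket=None, log_group=None):
    """Assemble keyword arguments for ssm.send_command (no AWS call is made here)."""
    request = {
        "DocumentName": "AWS-RunShellScript",                       # Document
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],  # Targets
        "Parameters": {                                             # Parameters
            "commands": list(commands),
            "workingDirectory": ["/tmp"],
            "executionTimeout": ["600"],
        },
        "MaxConcurrency": max_concurrency,                          # Rate control
        "MaxErrors": max_errors,
    }
    if s3_bucket:                                                   # Output
        request["OutputS3BucketName"] = s3_bucket
    if log_group:
        request["CloudWatchOutputConfig"] = {
            "CloudWatchLogGroupName": log_group,
            "CloudWatchOutputEnabled": True,
        }
    return request


request = build_send_command_request(
    ["yum update -y", "systemctl restart nginx"],
    "Role", "WebServer",
    max_concurrency="50%",
    s3_bucket="example-ssm-output",   # hypothetical bucket name
)
# With credentials configured, this could be sent as:
#   boto3.client("ssm").send_command(**request)
```

Note that concurrency accepts either an absolute count or a percentage, both passed as strings.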
Common Documents
- AWS-RunShellScript: Run bash on Linux
- AWS-RunPowerShellScript: Run PowerShell on Windows
- AWS-InstallApplication: Install software
Execution Flow
You send Run Command
│
▼
┌─────────────────────────────────────────────────────┐
│ SSM Service receives request │
│ Creates Command ID: cmd-0a1b2c3d4e5f │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ SSM Agent Polling (on each instance) │
│ │
│ Agent polls SSM service every 5 sec via HTTPS (443) │
│ "Any commands for me?" │
│ │
│ No inbound ports needed - agent initiates outbound │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Agent Executes Commands Locally │
│ │
│ $ yum update -y │
│ $ systemctl restart nginx │
│ │
│ Captures: stdout, stderr, exit code │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Agent Reports Results Back │
│ │
│ i-abc123: SUCCESS (exit code 0) │
│ i-def456: FAILED (exit code 1) │
│ │
│ Output stored in SSM console, S3, CloudWatch Logs │
└─────────────────────────────────────────────────────┘
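The pull-based flow above can be mimicked with a small in-memory simulation. This is a toy model, not the real agent protocol: `FakeSsmService` and `agent_poll_once` are hypothetical names, and the commands run on the local machine through the shell purely to show how stdout, stderr, and the exit code are captured and reported back.

```python
import subprocess
from collections import defaultdict, deque


class FakeSsmService:
    """In-memory stand-in for the SSM service: queues commands and stores results."""

    def __init__(self):
        self.pending = defaultdict(deque)   # instance_id -> queued (command_id, script)
        self.results = {}                   # (command_id, instance_id) -> result dict

    def send_command(self, command_id, instance_ids, script):
        for instance_id in instance_ids:
            self.pending[instance_id].append((command_id, script))

    def poll(self, instance_id):
        """Called by the agent; returns the next queued command or None."""
        queue = self.pending[instance_id]
        return queue.popleft() if queue else None

    def report(self, command_id, instance_id, result):
        self.results[(command_id, instance_id)] = result


def agent_poll_once(service, instance_id):
    """One agent cycle: ask for work, run it locally, report the outcome."""
    job = service.poll(instance_id)
    if job is None:
        return False
    command_id, script = job
    proc = subprocess.run(script, shell=True, capture_output=True, text=True)
    service.report(command_id, instance_id, {
        "status": "SUCCESS" if proc.returncode == 0 else "FAILED",
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    })
    return True


service = FakeSsmService()
service.send_command("cmd-1", ["i-abc123"], "echo patched")
service.send_command("cmd-2", ["i-def456"], "exit 1")
for instance in ("i-abc123", "i-def456"):
    agent_poll_once(service, instance)
```

The service never opens a connection to the instance; every exchange starts with the agent calling `poll`, which is why no inbound port is required.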
Patch Manager
Automate patching of OS and applications.
Configuration
| Setting | What to specify | Details |
|---|---|---|
| Patch Baseline | Which patches to apply | Rules for auto-approval (severity, classification, days after release) |
| Patch Group | Which instances | Tag-based grouping (Patch Group = production) |
| Maintenance Window | When to patch | Schedule (cron), duration, cutoff time |
| Operation | Scan or Install | Scan = report only, Install = apply patches |
| Reboot Option | After patching | RebootIfNeeded or NoReboot |
Patch Baseline Rules Example
Product: AmazonLinux2
Classification: Security, Bugfix
Severity: Critical, Important
Auto-approve: 7 days after release
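To make the auto-approval rule concrete, here is a simplified model of how such a rule might be evaluated. It assumes a patch must match both the classification and the severity filter and must have been released at least the configured number of days ago; the `Patch` class, the `RULE` dictionary, and the sample patches are illustrative and do not reflect Patch Manager's internal data structures.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Patch:
    name: str
    classification: str
    severity: str
    released: date


# Mirrors the example rule; the dictionary layout is a simplification.
RULE = {
    "classifications": {"Security", "Bugfix"},
    "severities": {"Critical", "Important"},
    "approve_after_days": 7,
}


def is_auto_approved(patch, rule, today):
    """True when the patch matches the filters and the waiting period has elapsed."""
    matches = (patch.classification in rule["classifications"]
               and patch.severity in rule["severities"])
    waited = today - patch.released >= timedelta(days=rule["approve_after_days"])
    return matches and waited


today = date(2024, 3, 15)
old_fix = Patch("example-ssl", "Security", "Important", date(2024, 3, 1))
new_fix = Patch("example-kernel", "Security", "Critical", date(2024, 3, 12))
low_fix = Patch("example-lib", "Bugfix", "Low", date(2024, 1, 1))
```

Here `old_fix` is approved, `new_fix` is still inside the 7-day waiting period, and `low_fix` is filtered out by severity.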
Scan vs Install
- Scan: Only checks what’s missing, reports compliance status, no changes made
- Install: Checks + installs patches, reports what was installed, may reboot instance
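The two operations differ only in the parameters passed to the `AWS-RunPatchBaseline` document. The sketch below builds `ssm.send_command` keyword arguments for each case; `build_patch_request` is a hypothetical helper, and targeting by the `Patch Group` tag is one option among several.

```python
VALID_OPERATIONS = {"Scan", "Install"}


def build_patch_request(operation, patch_group, reboot_option="RebootIfNeeded"):
    """Keyword arguments for ssm.send_command using AWS-RunPatchBaseline."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {sorted(VALID_OPERATIONS)}")
    parameters = {"Operation": [operation]}
    if operation == "Install":
        # The reboot setting is only relevant when patches are actually installed.
        parameters["RebootOption"] = [reboot_option]
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        "Targets": [{"Key": "tag:Patch Group", "Values": [patch_group]}],
        "Parameters": parameters,
    }


scan_request = build_patch_request("Scan", "production")
install_request = build_patch_request("Install", "production", reboot_option="NoReboot")
```

Running the scan request first gives a compliance report without touching the instances, which is a reasonable dry run before an install.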
Execution Flow
Maintenance Window triggers (e.g., Sunday 2:00 AM)
│
▼
┌─────────────────────────────────────────────────────┐
│ Identify Target Instances │
│ Patch Group tag: "Patch Group" = "production" │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Get Applicable Patches from Baseline │
│ │
│ Query patch repository: │
│ - kernel-5.10.102 (Security, Critical) │
│ - openssl-1.1.1k (Security, Important) │
│ - curl-7.79.1 (Bugfix) │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Run AWS-RunPatchBaseline on Each Instance │
│ │
│ 1. Scan installed packages │
│ 2. Compare with baseline rules │
│ 3. Download missing patches │
│ 4. Install patches │
│ 5. Reboot if required │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Report Compliance │
│ │
│ i-abc123: COMPLIANT (3 patches installed) │
│ i-def456: NON_COMPLIANT (1 patch failed) │
└─────────────────────────────────────────────────────┘
State Manager
Keep instances in a desired configuration state continuously.
Configuration
| Setting | What to specify | Details |
|---|---|---|
| Association | Document + targets + schedule | The core unit of State Manager |
| Document | What configuration to enforce | AWS-GatherSoftwareInventory, AWS-ConfigureAWSPackage, custom |
| Targets | Which instances | By tag, instance ID, all instances |
| Schedule | How often to apply | Rate (every 30 min) or cron expression |
| Compliance Severity | How to report drift | Critical, High, Medium, Low |
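An association bundles these settings into one request. As a rough sketch, the keyword arguments for boto3's `ssm.create_association` could be assembled as follows; the helper name and the generated `AssociationName` are made up for illustration, and the document parameters assume the CloudWatch agent scenario.

```python
def build_association_request(role_tag_value, minutes=30):
    """Keyword arguments for ssm.create_association (nothing is sent here)."""
    return {
        "Name": "AWS-ConfigureAWSPackage",                        # Document to apply
        "AssociationName": f"cloudwatch-agent-{role_tag_value.lower()}",  # hypothetical
        "Targets": [{"Key": "tag:Role", "Values": [role_tag_value]}],
        "ScheduleExpression": f"rate({minutes} minutes)",          # Schedule
        "Parameters": {"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
        "ComplianceSeverity": "HIGH",                              # Reported on drift
    }


association = build_association_request("WebServer")
# With credentials configured:
#   boto3.client("ssm").create_association(**association)
```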
Use Cases
- Ensure antivirus is always installed
- Keep CloudWatch agent configured
- Collect inventory on schedule
- Join instances to Active Directory
Execution Flow
Association created:
Document: AWS-ConfigureAWSPackage
Targets: tag:Role = WebServer
Schedule: rate(30 minutes)
Parameters: action=Install, name=AmazonCloudWatchAgent
│
▼ (every 30 minutes, or when new instance matches)
┌─────────────────────────────────────────────────────┐
│ Evaluate Targets │
│ │
│ Found: i-abc123, i-def456 │
│ New instance: i-ghi789 → Automatically included! │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Execute on Each Instance │
│ │
│ i-abc123 (agent already installed): │
│ Check → Already compliant → No action │
│ │
│ i-ghi789 (new, agent not installed): │
│ Check → Not installed → Install → Now compliant │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Track Compliance │
│ │
│ If someone uninstalls agent manually: │
│ → Next run (within 30 min) reinstalls it │
│ → "Desired state" continuously enforced │
└─────────────────────────────────────────────────────┘
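The check-then-act behaviour in this flow is what makes an association idempotent. A minimal simulation, using an in-memory fleet rather than real instances (the `reconcile` function and the fleet dictionary are illustrative only):

```python
def reconcile(instances, package):
    """Apply one association run: install the package only where it is missing."""
    actions = {}
    for instance_id, installed in instances.items():
        if package in installed:
            actions[instance_id] = "no action"      # already compliant
        else:
            installed.add(package)                  # bring back to desired state
            actions[instance_id] = "installed"
    return actions


fleet = {"i-abc123": {"AmazonCloudWatchAgent"}, "i-ghi789": set()}

first_run = reconcile(fleet, "AmazonCloudWatchAgent")
second_run = reconcile(fleet, "AmazonCloudWatchAgent")   # nothing left to do

fleet["i-abc123"].discard("AmazonCloudWatchAgent")        # simulate a manual uninstall
third_run = reconcile(fleet, "AmazonCloudWatchAgent")     # drift is corrected
```

Running the same association repeatedly changes nothing once the fleet is compliant, and only the drifted instance is touched on the third run.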
Automation
Run multi-step workflows (runbooks) for complex operations.
Configuration
| Setting | What to specify | Details |
|---|---|---|
| Runbook | Which automation document | AWS-provided or custom (YAML/JSON) |
| Execution Mode | How to run | Simple, Rate control, Multi-account |
| Parameters | Input values | Instance IDs, AMI IDs, custom inputs |
| IAM Role | Permissions | AssumeRole for cross-account or elevated permissions |
| Targets | Which resources | Parameter values, tags, resource groups |
Execution Modes
| Mode | Behavior |
|---|---|
| Simple | Run once with specified parameters |
| Rate Control | Run on multiple targets with concurrency/error limits |
| Multi-account/Region | Run across multiple AWS accounts or regions |
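The difference between simple and rate-controlled execution shows up in the request itself. The sketch below builds keyword arguments for boto3's `ssm.start_automation_execution`; the helper name is hypothetical, and it assumes a runbook whose instance parameter is called `InstanceId`, as is the case for `AWS-RestartEC2Instance`.

```python
def build_automation_request(runbook, instance_ids=None, tag=None,
                             max_concurrency="10", max_errors="5"):
    """Keyword arguments for ssm.start_automation_execution."""
    if instance_ids:
        # Simple execution: run once with explicit parameter values.
        return {
            "DocumentName": runbook,
            "Parameters": {"InstanceId": list(instance_ids)},
        }
    if tag:
        # Rate control: SSM resolves the targets and feeds each into InstanceId.
        key, value = tag
        return {
            "DocumentName": runbook,
            "TargetParameterName": "InstanceId",
            "Targets": [{"Key": f"tag:{key}", "Values": [value]}],
            "MaxConcurrency": max_concurrency,
            "MaxErrors": max_errors,
        }
    raise ValueError("provide instance_ids or tag")


simple = build_automation_request("AWS-RestartEC2Instance", instance_ids=["i-abc123"])
rate_controlled = build_automation_request(
    "AWS-RestartEC2Instance", tag=("Environment", "Production"))
```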
Runbook Step Types
- aws:executeAwsApi: Call any AWS API
- aws:runCommand: Run command on instances
- aws:executeScript: Run Python/PowerShell script
- aws:approve: Wait for manual approval
- aws:branch: Conditional logic
- aws:sleep: Wait for duration
- aws:changeInstanceState: Start/stop/terminate
Common AWS Runbooks
- AWS-StopEC2Instance: Stop instance
- AWS-CreateImage: Create AMI
- AWS-PatchInstanceWithRollback: Patch with automatic rollback on failure
- AWS-RestartEC2Instance: Stop → wait → start
Execution Flow with Approval
Trigger automation
│
▼
┌─────────────────────────────────────────────────────┐
│ Step 1: aws:approve │
│ │
│ Status: WAITING │
│ Sends SNS notification to approvers │
│ Execution PAUSES here │
└─────────────────────────────────────────────────────┘
│
▼ (Approver clicks "Approve")
┌─────────────────────────────────────────────────────┐
│ Step 2: aws:changeInstanceState (stop) │
│ │
│ Calls ec2:StopInstances │
│ Status: SUCCESS │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Step 3: aws:waitForAwsResourceProperty │
│ │
│ Polls ec2:DescribeInstances │
│ Waiting for State = "stopped" │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Step 4: aws:changeInstanceState (start) │
│ │
│ Calls ec2:StartInstances │
│ Status: SUCCESS │
└─────────────────────────────────────────────────────┘
Error Handling
mainSteps:
  - name: stopInstance
    action: aws:changeInstanceState
    onFailure: step:rollback      # Go to rollback if fails
    inputs:
      DesiredState: stopped
  - name: doMaintenance
    action: aws:runCommand
    onFailure: step:rollback
  - name: startInstance
    action: aws:changeInstanceState
    onFailure: Abort              # Stop execution entirely
    isEnd: true
  - name: rollback
    action: aws:changeInstanceState
    inputs:
      DesiredState: running       # Restore instance if earlier step failed
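To see how `onFailure` and `isEnd` steer control flow, the following toy interpreter walks a step list shaped like the runbook above. It is a simplified model, not the Automation engine: steps are reduced to names plus routing fields, and each step's outcome comes from a caller-supplied function. A step without `onFailure` is treated as `Abort`.

```python
def run_steps(steps, handlers):
    """Walk runbook-like steps, honouring onFailure and isEnd (simplified model)."""
    by_name = {step["name"]: i for i, step in enumerate(steps)}
    index, trace = 0, []
    while index < len(steps):
        step = steps[index]
        ok = handlers[step["name"]]()
        trace.append((step["name"], "ok" if ok else "failed"))
        if not ok:
            on_failure = step.get("onFailure", "Abort")
            if on_failure.startswith("step:"):
                index = by_name[on_failure.split(":", 1)[1]]   # jump to named step
                continue
            break                      # Abort: stop the whole execution
        if step.get("isEnd"):
            break                      # isEnd: do not fall through to later steps
        index += 1
    return trace


STEPS = [
    {"name": "stopInstance", "onFailure": "step:rollback"},
    {"name": "doMaintenance", "onFailure": "step:rollback"},
    {"name": "startInstance", "onFailure": "Abort", "isEnd": True},
    {"name": "rollback"},
]

always_ok = {step["name"]: (lambda: True) for step in STEPS}
happy = run_steps(STEPS, always_ok)
failing = run_steps(STEPS, {**always_ok, "doMaintenance": lambda: False})
```

On the happy path `isEnd` prevents the rollback step from running after `startInstance`; when maintenance fails, execution jumps straight to `rollback` and `startInstance` is skipped.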
Rate Control Execution
Targets: 50 instances tagged Environment=Production
Concurrency: 10
Error Threshold: 5
Batch 1: i-001 to i-010 (parallel)
→ 1 failed, error count: 1
Batch 2: i-011 to i-020 (parallel)
→ 2 failed, error count: 3
Batch 3: i-021 to i-030 (parallel)
→ 3 failed, error count: 6 (exceeds threshold of 5)
→ STOP EXECUTION
→ Remaining 20 instances NOT processed
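A short simulation makes the arithmetic of rate control easy to check. Two assumptions are built in: targets are processed in fixed batches (the real service uses a sliding window of concurrent executions), and execution halts once the error count exceeds the threshold, with the in-flight batch allowed to finish. The failing instance IDs are chosen arbitrarily for illustration.

```python
def run_with_rate_control(targets, concurrency, max_errors, fails):
    """Process targets in batches; stop once the error count exceeds max_errors."""
    errors, processed = 0, []
    for start in range(0, len(targets), concurrency):
        batch = targets[start:start + concurrency]
        processed.extend(batch)                      # in-flight targets finish
        errors += sum(1 for target in batch if target in fails)
        if errors > max_errors:
            break                                    # no further batches are started
    return {
        "errors": errors,
        "processed": len(processed),
        "skipped": len(targets) - len(processed),
    }


targets = [f"i-{n:03d}" for n in range(1, 51)]       # i-001 .. i-050
fails = {"i-003", "i-012", "i-017", "i-021", "i-025", "i-029"}
outcome = run_with_rate_control(targets, concurrency=10, max_errors=5, fails=fails)
```

With these inputs the run stops after the third batch: 30 instances are processed and the remaining 20 are never attempted.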
Inventory
Stores and queries metadata about your instances (not the instances themselves).
What AWS-GatherSoftwareInventory Collects
- Installed applications (name, version, publisher)
- AWS components (SSM Agent version, etc.)
- Network configuration (IP, MAC)
- Windows updates, services, roles
- Custom files you specify
Use Case
“Find all instances running Python 3.8” or “Which servers have outdated nginx?”
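Questions like these boil down to filtering application records by name and version. The sketch below runs against hand-written sample records loosely shaped like application inventory entries; the data, the `parse_version` helper, and the `1.20.0` cutoff are all hypothetical, and a real query would go through the Inventory API or console instead.

```python
def parse_version(text):
    """Turn '1.18.0' into (1, 18, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def instances_with_outdated(inventory, app_name, minimum_version):
    """Instance IDs running app_name at a version below minimum_version."""
    minimum = parse_version(minimum_version)
    return sorted(
        instance_id
        for instance_id, apps in inventory.items()
        for app in apps
        if app["Name"] == app_name and parse_version(app["Version"]) < minimum
    )


# Hypothetical sample records, for illustration only.
INVENTORY = {
    "i-abc123": [{"Name": "nginx", "Version": "1.18.0"},
                 {"Name": "python3", "Version": "3.8.10"}],
    "i-def456": [{"Name": "nginx", "Version": "1.24.0"}],
    "i-ghi789": [{"Name": "python3", "Version": "3.11.4"}],
}

outdated = instances_with_outdated(INVENTORY, "nginx", "1.20.0")
```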
Comparison Summary
| Aspect | Run Command | Patch Manager | State Manager | Automation |
|---|---|---|---|---|
| Purpose | One-time command | Patch OS/apps | Maintain desired state | Multi-step workflows |
| Trigger | Manual, EventBridge | Maintenance Window | Schedule (continuous) | Manual, EventBridge |
| Frequency | Once | Periodic | Continuous | Once or scheduled |
| Idempotent | No | Yes | Yes | Depends on steps |
| Use case | Ad-hoc tasks | Security compliance | Drift prevention | Complex operations |
Notes
- SSM Agent: Must be installed and running on instances. Pre-installed on Amazon Linux and Windows AMIs
- IAM permissions: Instance needs an IAM role with the AmazonSSMManagedInstanceCore policy
- No inbound ports: Agent initiates outbound HTTPS to SSM endpoints
- Hybrid: Can manage on-premises servers with SSM Agent + activation
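For the IAM requirement, an instance role needs a trust policy that lets EC2 assume it plus the managed policy attached. The sketch below prepares keyword arguments for boto3's `iam.create_role` and `iam.attach_role_policy`; the role name is a placeholder, and an instance profile would still have to be created and attached to the instance separately.

```python
import json

# Trust policy that lets the EC2 service assume the role for the instance.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def build_role_requests(role_name):
    """Keyword arguments for iam.create_role and iam.attach_role_policy."""
    create = {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(TRUST_POLICY),
    }
    attach = {"RoleName": role_name, "PolicyArn": MANAGED_POLICY_ARN}
    return create, attach


# "ExampleSsmInstanceRole" is a hypothetical role name.
create_role, attach_policy = build_role_requests("ExampleSsmInstanceRole")
```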