Aurora
AWS-managed relational database (MySQL/PostgreSQL compatible) with cloud-native architecture. Storage and compute are separated.
Aurora Cluster (Single Region)
One primary instance (read/write) + optional read replicas sharing the same storage.
Writer Endpoint Reader Endpoint
│ │
▼ ▼
┌──────────────┐ ┌──────────────┬──────────────┐
│ Primary │ │ Replica 1 │ Replica 2 │
│ (Writer) │ │ (Reader) │ (Reader) │
└──────┬───────┘ └──────┬───────┴──────┬───────┘
│ │ │
└─────────────┬───────────────┴──────────────┘
▼
┌────────────────────────────────────────┐
│ Shared Cluster Storage │
│ (6 copies across 3 AZs) │
│ Auto-grows up to 128 TB │
└────────────────────────────────────────┘
- All instances share the same storage (no storage-level replication lag)
- Replicas can be promoted to primary if primary fails (~30 seconds failover)
- Up to 15 read replicas
- Single region only
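A minimal boto3 sketch for looking up the writer and reader endpoints of a cluster (the cluster identifier is a placeholder):
import boto3

rds = boto3.client('rds')

# Look up the cluster and print both endpoints
cluster = rds.describe_db_clusters(
    DBClusterIdentifier='my-aurora-cluster'   # placeholder cluster name
)['DBClusters'][0]

print("Writer endpoint:", cluster['Endpoint'])        # read/write traffic
print("Reader endpoint:", cluster['ReaderEndpoint'])  # load-balances across replicas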
Aurora Storage
One logical storage automatically replicated across 3 AZs (6 copies total, 2 per AZ).
- Write: Need 4 of 6 copies to acknowledge (can lose 2)
- Read: Need 3 of 6 copies to respond (can lose 3)
- Even if entire AZ fails (2 copies gone), writes still work
Aurora Global Database (Multi-Region)
Multiple Aurora clusters across different AWS regions with replication between them.
Primary Region (us-east-1) Secondary Region (eu-west-1)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Primary Cluster │ │ Secondary Cluster │
│ ┌────────┐ ┌────────┐ │ │ ┌────────┐ ┌────────┐ │
│ │Primary │ │Replica │ │ │ │Replica │ │Replica │ │
│ │(R/W) │ │(R) │ │ │ │(R only)│ │(R only)│ │
│ └────┬───┘ └────┬───┘ │ │ └────┬───┘ └────┬───┘ │
│ └─────┬────┘ │ │ └─────┬────┘ │
│ ▼ │ Async │ ▼ │
│ ┌──────────────────┐ │ <1 sec │ ┌──────────────────┐ │
│ │ Cluster Storage │───┼───────────►│ │ Cluster Storage │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
└─────────────────────────┘ └─────────────────────────┘
- Cross-region disaster recovery
- Replication lag typically < 1 second
- Secondary region is read-only until promoted
- Up to 5 secondary regions
Comparison
| Aspect | Aurora Cluster | Aurora Global Database |
|---|---|---|
| Scope | Single region | Multiple regions |
| Write location | Primary instance | Primary region only |
| Replication | Shared storage (instant) | Cross-region async (<1 sec) |
| Failover | ~30 seconds (within region) | Minutes (cross-region) |
| Use case | HA within region | DR + global reads |
See AWS RDS, Aurora, and EBS Storage Basics for details.
Auto Scaling Group (ASG)
Maintains a fleet of EC2 instances: launches when needed, terminates when not, replaces unhealthy ones.
Core Concept
Capacity Settings:
Minimum: 2 (never go below)
Desired: 4 (try to maintain)
Maximum: 10 (never exceed)
Components Relationship
ALB ──► Target Group ◄─── ASG registers/deregisters instances automatically
│ │
▼ │
┌─────────┐ │
│ EC2-1 │ ◄─────────────┤ ASG launches
│ EC2-2 │ ◄─────────────┤
│ EC2-3 │ ◄─────────────┘
└─────────┘
- Launch Template: Defines instance config (AMI, instance type, SG, user data)
- Target Group: List of instances ALB sends traffic to
- ASG: Creates/terminates instances, registers them to Target Group
Scaling Types
| Type | How It Works |
|---|---|
| Manual | You change desired capacity |
| Dynamic | CloudWatch alarm triggers scaling policy |
| Scheduled | Time-based (e.g., scale up at 9 AM) |
| Predictive | ML-based, scales proactively based on patterns |
Dynamic Scaling Policies
| Policy | Description |
|---|---|
| Target Tracking | “Keep CPU at 50%” - ASG figures out instance count |
| Step Scaling | Different actions at different thresholds |
| Simple Scaling | Single action when alarm triggers |
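For example, a Target Tracking policy can be attached with boto3 (a sketch; the ASG name is a placeholder):
import boto3

autoscaling = boto3.client('autoscaling')

# "Keep average CPU at 50%" - ASG adds/removes instances to stay near the target
autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-asg',          # placeholder ASG name
    PolicyName='keep-cpu-at-50',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        },
        'TargetValue': 50.0
    }
)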
Useful Metrics for Scaling
| Workload | Recommended Metric |
|---|---|
| Web app behind ALB | RequestCountPerTarget (ALB) |
| API servers | CPUUtilization (EC2) |
| Queue workers | ApproximateNumberOfMessages (SQS) |
| Memory-intensive | mem_used_percent (requires CloudWatch Agent) |
Note: Memory and disk space metrics require CloudWatch Agent because hypervisor cannot see inside VM. See EC2 CloudWatch Metrics - Why Some Require Agent for details.
Health Checks
ASG determines instance health from EC2 status checks and, optionally, ELB health checks (the ALB marks instances Unhealthy).
| Type | Source | Use Case |
|---|---|---|
| EC2 | EC2 status checks | Basic - is instance running? |
| ELB | ALB health check | App-level - is app responding? |
Unhealthy instance → ASG terminates → launches replacement
Grace Period: Time after launch before health checks start (default 300s)
Key Features
| Feature | Purpose |
|---|---|
| AZ Balancing | Distributes instances evenly across AZs |
| Termination Policies | Controls which instance to remove when scaling in |
| Lifecycle Hooks | Run custom actions during launch/terminate |
| Instance Refresh | Rolling update all instances (e.g., new AMI) |
| Warm Pools | Pre-initialized instances for faster scaling |
| Mixed Instances | Multiple instance types + Spot/On-Demand mix |
| Cooldown | Prevents rapid scale in/out oscillation |
Mixed Instances Policy
Configured on ASG (not Launch Template). Allows multiple instance types and purchase options.
Instance Types: [t3.medium, t3.large, t3a.medium]
Purchase Options:
On-Demand base: 2 instances
Spot percentage: 80%
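A boto3 sketch of that configuration (names, subnets, and the launch template ID are placeholders; "80% Spot above the base" maps to OnDemandPercentageAboveBaseCapacity=20):
import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='my-mixed-asg',            # placeholder name
    MinSize=2, MaxSize=10, DesiredCapacity=4,
    VPCZoneIdentifier='subnet-aaa,subnet-bbb',      # placeholder subnets
    MixedInstancesPolicy={
        'LaunchTemplate': {
            'LaunchTemplateSpecification': {
                'LaunchTemplateId': 'lt-0123456789abcdef0',  # placeholder
                'Version': '$Latest'
            },
            # Instance type overrides
            'Overrides': [
                {'InstanceType': 't3.medium'},
                {'InstanceType': 't3.large'},
                {'InstanceType': 't3a.medium'}
            ]
        },
        'InstancesDistribution': {
            'OnDemandBaseCapacity': 2,                  # first 2 instances On-Demand
            'OnDemandPercentageAboveBaseCapacity': 20,  # rest: 80% Spot / 20% On-Demand
            'SpotAllocationStrategy': 'capacity-optimized'
        }
    }
)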
Spot vs On-Demand
| Aspect | On-Demand | Spot |
|---|---|---|
| Price | Full price | 60-90% discount |
| Availability | Always | When spare capacity exists |
| Interruption | Never | Can be interrupted (2-min warning) |
| Use case | Critical workloads | Batch jobs, fault-tolerant apps |
ECS Task
A Task is a running instance of your containers - the actual process running based on a Task Definition.
Task vs Task Definition
Task Definition (blueprint): Task (running instance):
┌─────────────────────────┐ ┌─────────────────────────┐
│ "Use nginx image" │ │ nginx container running │
│ "Give it 512MB RAM" │ ──run──► │ Using 512MB RAM │
│ "Open port 80" │ │ Listening on port 80 │
│ "Set ENV=production" │ │ ENV=production set │
└─────────────────────────┘ └─────────────────────────┘
(JSON config) (actual process)
Two Ways to Run Tasks
| Method | Behavior | Use Case |
|---|---|---|
| Service | Keeps desired count always running | Web servers, APIs |
| Standalone Task | Run once, then stop | Batch jobs, migrations |
ECS Service (desired: 3 tasks):
┌─────────────────────────────────────────────┐
│ Task 1 (running) ✓ │
│ Task 2 (running) ✓ │
│ Task 3 (running) ✓ │
│ │
│ If Task 2 crashes → Service starts new one │
└─────────────────────────────────────────────┘
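A boto3 sketch of both methods (cluster and task definition names are placeholders):
import boto3

ecs = boto3.client('ecs')

# Service: ECS keeps 3 copies of the task running and replaces crashed ones
ecs.create_service(
    cluster='my-cluster',                 # placeholder cluster
    serviceName='web',
    taskDefinition='web-app:1',           # family:revision
    desiredCount=3,
    launchType='EC2'
)

# Standalone task: runs once (e.g. a migration), then stops
ecs.run_task(
    cluster='my-cluster',
    taskDefinition='db-migration:1',
    launchType='EC2',
    count=1
)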
What’s Inside a Task
A task can have multiple containers that share network, storage, and lifecycle.
Task
┌─────────────────────────────────────────────┐
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Container 1 │ │ Container 2 │ │
│ │ (nginx) │◄──►│ (php-fpm) │ │
│ │ port 80 │ │ port 9000 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └──── localhost ───┘ │
│ │
│ Shared: IP address, volumes, lifecycle │
│ Task IP: 10.0.1.50 │
└─────────────────────────────────────────────┘
Task Placement: One Task = One Instance
A task runs on exactly one EC2 instance. Cannot span multiple instances.
✓ Correct:
┌─────────────────┐ ┌─────────────────┐
│ EC2 Instance A │ │ EC2 Instance B │
│ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Task 1 │ │ │ │ Task 2 │ │
│ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘
✗ Not possible (task cannot span instances):
┌─────────────────┐ ┌─────────────────┐
│ EC2 Instance A │ │ EC2 Instance B │
│ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Task 1 │◄┼────┼►│ Task 1 │ │
│ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘
To scale, run multiple tasks across instances with a load balancer.
ECS on EC2 vs Fargate
| | ECS on EC2 | Fargate |
|---|---|---|
| Infrastructure | You manage EC2 instances | AWS manages |
| Kernel sharing | Tasks share EC2’s OS kernel | Each task has own micro-VM |
| Isolation | Process-level (namespaces) | Hardware-level (hypervisor) |
ECS on EC2:
┌─────────────────────────────────────────────────────────┐
│ EC2 Instance (Guest OS) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Docker Engine │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Task 1 │ │ Task 2 │ ← Share OS │ │
│ │ │ (container) │ │ (container) │ kernel │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Fargate (using Firecracker micro-VMs):
┌─────────────────────────────────────────────────────────┐
│ AWS-managed infrastructure │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ micro-VM 1 │ │ micro-VM 2 │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ Minimal Linux │ │ │ │ Minimal Linux │ │ │
│ │ │ Kernel │ │ │ │ Kernel │ │ │
│ │ ├───────────────┤ │ │ ├───────────────┤ │ │
│ │ │ Container │ │ │ │ Container │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └───────────────────┘ └───────────────────┘ │
│ ↑ ↑ │
│ └── Separate kernels, fully isolated ─────────┘
└─────────────────────────────────────────────────────────┘
Fargate uses micro-VMs for multi-tenant security - your task can’t access other customers’ tasks.
Task Lifecycle
PROVISIONING → PENDING → RUNNING → STOPPED
│ │ │ │
│ │ │ └─ Container exited or stopped
│ │ └─ Containers running
│ └─ Waiting for resources
└─ Preparing to launch
EventBridge Task State Detection
ECS sends task state change events to EventBridge.
{
"source": "aws.ecs",
"detail-type": "ECS Task State Change",
"detail": {
"lastStatus": "STOPPED",
"stoppedReason": "Essential container in task exited",
"containers": [{ "name": "web", "exitCode": 1 }]
}
}
EventBridge rule pattern:
{
"source": ["aws.ecs"],
"detail-type": ["ECS Task State Change"],
"detail": {
"lastStatus": ["STOPPED"]
}
}
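A boto3 sketch that creates the rule and points it at an SNS topic for alerting (the topic ARN is a placeholder):
import boto3, json

events = boto3.client('events')

# Rule matching stopped ECS tasks
events.put_rule(
    Name='ecs-task-stopped',
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"lastStatus": ["STOPPED"]}
    })
)

# Send matching events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule='ecs-task-stopped',
    Targets=[{'Id': 'notify', 'Arn': 'arn:aws:sns:us-east-1:111111111111:ecs-alerts'}]
)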
Kinesis Data Streams
Collect and process large amounts of real-time data (logs, events, clicks, IoT data).
Producers Kinesis Data Stream Consumers
┌─────────┐ ┌─────────────────────┐ ┌─────────┐
│ App 1 │────► │ │ ────►│ Lambda │
│ App 2 │────► records │ Stream │ records ─►│ EC2 App │
│ IoT │────► │ │ ────►│ Firehose│
└─────────┘ └─────────────────────┘ └─────────┘
Data is retained in the stream for 24 hours by default, configurable up to 365 days. Multiple consumers can read the same data.
Shard
A shard is a unit of capacity. More shards = more throughput.
Kinesis Data Stream (3 shards)
┌─────────────────────────────────────────────────────┐
│ ┌─────────────────┐ Shard 1: 1 MB/s in, 2 MB/s out│
│ │ Shard 1 │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ Shard 2: 1 MB/s in, 2 MB/s out│
│ │ Shard 2 │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ Shard 3: 1 MB/s in, 2 MB/s out│
│ │ Shard 3 │ │
│ └─────────────────┘ │
│ Total: 3 MB/s in, 6 MB/s out │
└─────────────────────────────────────────────────────┘
Per shard limits:
| Direction | Limit |
|---|---|
| Write (in) | 1 MB/sec or 1,000 records/sec |
| Read (out) | 2 MB/sec |
Partition key determines which shard receives each record (hash-based).
Record with partition_key="user123"
↓
hash("user123") → Falls into Shard 2's range
↓
Record stored in Shard 2
Same partition key → same shard → ordered processing for that key.
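A put_record sketch showing the partition key (stream name is a placeholder):
import boto3, json

kinesis = boto3.client('kinesis')

# Records with the same PartitionKey land on the same shard, preserving order per key
kinesis.put_record(
    StreamName='my-stream',                          # placeholder stream
    Data=json.dumps({"event": "click", "page": "/home"}).encode('utf-8'),
    PartitionKey='user123'
)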
Enhanced Fan-Out
Gives each consumer dedicated throughput instead of sharing.
Standard (shared):
Shard ──────────────────────────────────────────────
2 MB/sec total shared
┌──────────┼──────────┐
▼ ▼ ▼
Consumer A Consumer B Consumer C
~0.67 MB/s ~0.67 MB/s ~0.67 MB/s
Enhanced Fan-Out (dedicated):
Shard ──────────────────────────────────────────────
2 MB/sec each dedicated
┌──────────┼──────────┐
▼ ▼ ▼
Consumer A Consumer B Consumer C
2 MB/sec 2 MB/sec 2 MB/sec
| | Standard | Enhanced Fan-Out |
|---|---|---|
| Throughput per shard | 2 MB/sec shared | 2 MB/sec per consumer |
| Delivery | Pull (GetRecords) | Push (SubscribeToShard) |
| Latency | ~200ms | ~70ms |
| Consumer registration | Not needed | Required |
| ARN used | Stream ARN | Consumer ARN |
Standard mode: No consumer registration needed. GetRecords API and Lambda use stream ARN directly.
Enhanced Fan-Out: Must register consumer first, then use consumer ARN.
# Standard - no registration, use stream ARN
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream \
--starting-position LATEST
# Enhanced Fan-Out - register first, then use consumer ARN
aws kinesis register-stream-consumer \
--stream-arn arn:aws:kinesis:us-east-1:123456789:stream/my-stream \
--consumer-name my-consumer
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream/consumer/my-consumer:123 \
--starting-position LATEST
Batch Size and Batching Window
Control how records are delivered to Lambda.
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream \
--batch-size 100 \
--maximum-batching-window-in-seconds 30 \
--starting-position LATEST
Lambda invokes when EITHER condition is met:
- batch-size records collected (default: 100, max: 10,000)
- maximum-batching-window-in-seconds elapsed (default: 0, max: 300)
| Records in 30 sec | What happens |
|---|---|
| 150 records | Invokes at 100 records (batch size hit first) |
| 50 records | Invokes at 30 seconds with 50 records (timeout hit first) |
| 0 records | No invocation |
Lambda Concurrency and Processing Settings
Key concept: 1 invocation = 1 Lambda instance. Multiple concurrent invocations = multiple instances.
Concurrency Quota: 1000 per region (default), which means 1000 Lambda instances at the same time.
Reserved Concurrency
Guarantee and limit concurrency for a specific function.
Without reserved concurrency:
Function A spike could starve other functions
With reserved concurrency:
Function A: reserved 100 (guaranteed, max 100)
Function B: reserved 200 (guaranteed, max 200)
Function C: unreserved (uses remaining 700)
Set to 0 = function disabled.
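Setting reserved concurrency with boto3 (function name is a placeholder):
import boto3

lambda_client = boto3.client('lambda')

# Guarantee (and cap) 100 concurrent executions for this function
lambda_client.put_function_concurrency(
    FunctionName='function-a',                 # placeholder
    ReservedConcurrentExecutions=100
)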
ParallelizationFactor
Process one Kinesis/DynamoDB shard with multiple Lambda instances in parallel.
ParallelizationFactor = 1 (default):
Shard 1 ──► Instance 1
Shard 2 ──► Instance 2
Total instances = 2
ParallelizationFactor = 3:
Shard 1 ──► Instance 1, Instance 2, Instance 3
Shard 2 ──► Instance 4, Instance 5, Instance 6
Total instances = shards × factor = 2 × 3 = 6
Max: 10
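Changing the factor on an existing event source mapping (the UUID is a placeholder):
import boto3

lambda_client = boto3.client('lambda')

# Process each shard with up to 3 concurrent Lambda instances
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',   # placeholder mapping UUID
    ParallelizationFactor=3
)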
ReportBatchItemFailures
Retry only failed records, not entire batch.
Without ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry ALL [1,2,3,4,5]
With ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry from 3: [3,4,5]
How it works:
┌─────────────────────────────────────────────────────────────────┐
│ Lambda Service (AWS managed) │
│ │
│ 1. Pulls records from Kinesis shard │
│ 2. Invokes your function with batch of records │
│ 3. Reads your function's return value │
│ 4. Retries only failed records based on your response │
└─────────────────────────────────────────────────────────────────┘
│ ▲
│ event.Records │ return {"batchItemFailures": [...]}
▼ │
┌─────────────────────────────────────────────────────────────────┐
│ Your Lambda Function Code │
│ - Receives records (doesn't pull from Kinesis) │
│ - Processes them │
│ - Returns which ones failed │
└─────────────────────────────────────────────────────────────────┘
Enable:
aws lambda update-event-source-mapping \
--uuid <mapping-uuid> \
--function-response-types "ReportBatchItemFailures"
Lambda response:
def handler(event, context):
    failures = []
    for record in event['Records']:  # records delivered by the event source mapping
        try:
            process(record)  # your processing logic (placeholder)
        except Exception:
            failures.append({"itemIdentifier": record['kinesis']['sequenceNumber']})
    return {"batchItemFailures": failures}  # tell Lambda which records failed
Kinesis Data Firehose
Fully managed delivery service. No consumer code needed.
Producers ──► Firehose ──► S3 / Redshift / OpenSearch / Splunk / HTTP
When to Use Firehose vs Data Streams
| | Data Streams | Firehose |
|---|---|---|
| Purpose | Real-time processing | Delivery to storage |
| You write | Consumer code | Nothing |
| Latency | Milliseconds | 60+ seconds (buffered) |
| Retention | 24h - 365 days | None (buffers briefly, then delivers) |
Batching
Firehose buffers records and delivers as batched files, not individual records.
Without Firehose: With Firehose:
Record 1 → file1.json Record 1 ─┐
Record 2 → file2.json Record 2 ─┼─► Buffer ──► one-big-file.json
Record 3 → file3.json Record 3 ─┘
(millions of tiny files) (fewer, larger files)
Buffer Settings
| Setting | Range | Behavior |
|---|---|---|
| Buffer size | 1-128 MB | Flush when size reached |
| Buffer interval | 60-900 seconds | Flush when time elapsed |
Whichever comes first triggers delivery.
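A create_delivery_stream sketch with explicit buffering hints (stream name, role, and bucket are placeholders):
import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='clickstream-to-s3',          # placeholder
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::111111111111:role/firehose-role',   # placeholder
        'BucketARN': 'arn:aws:s3:::my-data-lake',                    # placeholder
        'Prefix': 'raw/',
        # Flush whichever comes first: 64 MB or 5 minutes
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300}
    }
)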
Format Conversion
Firehose can convert JSON to columnar formats automatically:
JSON records ──► Firehose ──► Parquet/ORC files in S3
- Better for Athena/Redshift queries (faster, cheaper)
- Requires schema (from AWS Glue Data Catalog)
Optional Lambda Transform
Transform records before delivery:
Producers ──► Firehose ──► Lambda (transform) ──► S3
│
└── Add fields, filter, convert format
import base64

def handler(event, context):
output = []
for record in event['records']:
payload = base64.b64decode(record['data']).decode('utf-8')
# Transform the data
transformed = payload.upper()
output.append({
'recordId': record['recordId'],
'result': 'Ok',
'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8')
})
return {'records': output}
ECR Image Scanning
ECR scanning analyzes container images for security vulnerabilities (CVEs - Common Vulnerabilities and Exposures).
Two Scanning Options
| | Basic Scanning | Enhanced Scanning |
|---|---|---|
| Engine | Clair (open source) | Amazon Inspector |
| Scope | OS packages only | OS packages + application dependencies |
| When | On-push or manual | Continuous (auto re-scan on new CVEs) |
| Cost | Free | Pay per image scanned |
Basic Scanning
Uses Clair scanner. Only scans OS-level packages (apt, yum).
Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip) │ ← NOT scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...) │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04) │ ← Scanned
└─────────────────────────────────────┘
- Triggered on image push or manual API call
- Results are static until next scan
- New CVE discovered tomorrow → won’t know until re-scan
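Triggering a manual scan and reading the results with boto3 (repository and tag are placeholders):
import boto3

ecr = boto3.client('ecr')

# Kick off a basic scan of one image
ecr.start_image_scan(
    repositoryName='my-app',                 # placeholder
    imageId={'imageTag': 'latest'}
)

# Later: read the findings summary
findings = ecr.describe_image_scan_findings(
    repositoryName='my-app',
    imageId={'imageTag': 'latest'}
)
print(findings['imageScanFindings']['findingSeverityCounts'])   # e.g. {'HIGH': 2, 'MEDIUM': 5}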
Enhanced Scanning
Uses Amazon Inspector. Scans OS packages AND application dependencies.
Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip) │ ← Scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...) │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04) │ ← Scanned
└─────────────────────────────────────┘
- Continuous monitoring - auto re-scans when new CVEs published
- Integrates with EventBridge for alerts
- Supports: Java (Maven), JavaScript (npm), Python (pip), Go, .NET
Key Terms
- CVE: Publicly known vulnerability with unique ID (e.g., CVE-2021-44228 = Log4Shell)
- Clair: Open-source container vulnerability scanner
- Amazon Inspector: AWS service for automated vulnerability management
Building Container Images
Two main AWS services for building container images.
CodeBuild
General-purpose build service. Most common for container CI/CD.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ CodeCommit │────►│ CodeBuild │────►│ ECR │
│ (source) │ │ docker build │ │ (registry) │
│ │ │ docker push │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
buildspec.yml example:
version: 0.2
phases:
pre_build:
commands:
- aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
build:
commands:
- docker build -t $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
post_build:
commands:
- docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
- Full control over build process
- Integrates with CodePipeline
- Can run tests, multi-stage builds, any custom logic
EC2 Image Builder
Automated image creation service. Can build AMIs or container images.
┌─────────────────────────────────────────────────────────────┐
│ EC2 Image Builder Pipeline │
│ │
│ Recipe ──► Build ──► Test ──► Distribute to ECR │
└─────────────────────────────────────────────────────────────┘
Container Recipe options:
- Use components (no Dockerfile) - Image Builder applies changes to base image
- Provide Dockerfile from S3
Key terms:
- Recipe: Base image + components or Dockerfile
- Component: Reusable build/test action (install packages, configure, etc.)
- Pipeline: Automated workflow with schedule
Console steps for container image:
- Create Container Recipe - base image + components or Dockerfile S3 path + target ECR repo
- Create Infrastructure Configuration - instance type, IAM role, VPC/subnet for build
- Create Distribution Settings - target ECR repositories (can be cross-region/cross-account)
- Create Pipeline - link recipe + infrastructure + distribution + schedule
- Run Pipeline - builds and pushes to ECR
When to Use Which
| Use Case | Better Choice |
|---|---|
| CI/CD triggered by code commits | CodeBuild |
| Scheduled golden image builds | EC2 Image Builder |
| Need component library (CIS benchmarks, etc.) | EC2 Image Builder |
| Custom build logic, tests, multi-stage | CodeBuild |
| Part of CodePipeline | CodeBuild |
AWS App Runner
Fully managed service to run web apps/APIs. You provide code or container → App Runner handles everything.
You provide: App Runner handles:
┌─────────────────┐ ┌─────────────────────────────┐
│ Source code │ │ Build │
│ (GitHub repo) │───────────►│ Deploy │
│ OR │ │ Scale (auto, including to 0)│
│ Container image │ │ Load balancing │
│ (ECR) │ │ HTTPS/TLS certificate │
└─────────────────┘ │ Health checks │
└─────────────────────────────┘
│
▼
https://abc123.awsapprunner.com
Two Source Types
| Source | How It Works |
|---|---|
| Source code (GitHub) | App Runner builds container automatically |
| Container image (ECR) | App Runner pulls and runs directly |
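A hedged create_service sketch for the ECR image case (service name, image, access role, and port are assumptions):
import boto3

apprunner = boto3.client('apprunner')

apprunner.create_service(
    ServiceName='my-api',                                            # placeholder
    SourceConfiguration={
        'ImageRepository': {
            'ImageIdentifier': '111111111111.dkr.ecr.us-east-1.amazonaws.com/my-api:latest',
            'ImageRepositoryType': 'ECR',
            'ImageConfiguration': {'Port': '8080'}                   # app listens on 8080
        },
        'AutoDeploymentsEnabled': True,                              # redeploy on ECR push
        'AuthenticationConfiguration': {
            'AccessRoleArn': 'arn:aws:iam::111111111111:role/apprunner-ecr-access'  # placeholder
        }
    }
)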
Comparison with Other Compute
| | App Runner | ECS Fargate | Lambda |
|---|---|---|---|
| You manage | Almost nothing | Task definitions, services, ALB | Function code |
| Scaling | Automatic | You configure | Automatic |
| Min instances | Can scale to 0 | Min 1 task | N/A (event-driven) |
| Use case | Simple web apps | Complex container workloads | Event processing |
| Pricing | Per vCPU/memory hour | Per vCPU/memory hour | Per request + duration |
Key Features
- Auto scaling: Based on concurrent requests, can scale to zero
- Auto deployments: Trigger on ECR push or GitHub commit
- VPC Connector: Access private resources (RDS, ElastiCache) in VPC
- Custom domain: Bring your own domain with automatic TLS
When to Use App Runner
- Simple web apps, APIs, microservices
- Want zero infrastructure management
- Don’t need ECS features (service mesh, complex networking)
- Acceptable to use App Runner’s opinionated defaults
AWS Backup
Centralized service to manage backups across multiple AWS services from one place.
Without AWS Backup: With AWS Backup:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────┐
│ EC2 │ │ RDS │ │ EFS │ │ AWS Backup │
│ snapshot│ │ snapshot│ │ backup │ │ One backup plan for all │
│ config │ │ config │ │ config │ │ ┌─────┬─────┬─────┐ │
└─────────┘ └─────────┘ └─────────┘ │ │ EC2 │ RDS │ EFS │ │
↓ ↓ ↓ │ └─────┴─────┴─────┘ │
Manage each separately └─────────────────────────────┘
Supported: EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3, etc.
Core Concepts
| Concept | What It Is |
|---|---|
| Backup Plan | When and how to backup (schedule, retention, copy rules) |
| Resource Assignment | What to backup (by resource ID or tags) |
| Backup Vault | Where backups are stored (container for recovery points) |
| Recovery Point | The actual backup data (snapshot, AMI, etc.) |
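A sketch of a daily backup plan plus a tag-based resource assignment with boto3 (plan, vault, role, and tag values are placeholders):
import boto3

backup = boto3.client('backup')

# Backup Plan: daily at 05:00 UTC, keep 30 days
plan = backup.create_backup_plan(BackupPlan={
    'BackupPlanName': 'daily-backups',                       # placeholder
    'Rules': [{
        'RuleName': 'daily',
        'TargetBackupVaultName': 'my-vault',                 # placeholder vault
        'ScheduleExpression': 'cron(0 5 * * ? *)',
        'Lifecycle': {'DeleteAfterDays': 30}
    }]
})

# Resource Assignment: back up everything tagged backup=daily
backup.create_backup_selection(
    BackupPlanId=plan['BackupPlanId'],
    BackupSelection={
        'SelectionName': 'tagged-resources',
        'IamRoleArn': 'arn:aws:iam::111111111111:role/aws-backup-role',   # placeholder
        'ListOfTags': [{
            'ConditionType': 'STRINGEQUALS',
            'ConditionKey': 'backup',
            'ConditionValue': 'daily'
        }]
    }
)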
Backup Vault Features
| Feature | Purpose |
|---|---|
| Encryption | All backups encrypted with KMS key |
| Access Policy | Control who can backup/restore/delete |
| Vault Lock | WORM - prevent deletion even by root (compliance) |
Cross-Account Backup Copy
Copy recovery points to another AWS account for disaster recovery.
Source Account (111) Destination Account (222)
┌─────────────────────┐ ┌─────────────────────┐
│ Backup Plan │ │ Backup Vault │
│ ┌───────────────┐ │ │ │
│ │ Copy Rule: │ │ copy │ Access Policy: │
│ │ Dest Vault ARN│──┼───────────────►│ Allow 111 to │
│ └───────────────┘ │ │ CopyIntoBackupVault│
│ │ │ │
│ Source Vault │ │ Recovery Point │
│ (30 days retention)│ │ (90 days retention)│
└─────────────────────┘ └─────────────────────┘
Setup required:
- Source account: Backup plan with copy rule pointing to destination vault ARN
- Destination account: Vault access policy allowing backup:CopyIntoBackupVault
Cross-Account KMS Encryption
Behavior depends on whether AWS Backup can encrypt the service's backups independently ("independent encryption").
Services WITH independent encryption (DynamoDB advanced, EFS):
- AWS Backup handles encryption at vault level
- No KMS key sharing needed
Services WITHOUT independent encryption (RDS, EC2/EBS):
- Backup encrypted with data source’s KMS key (not vault key)
- Destination account’s AWSServiceRoleForBackup performs the copy
- Source KMS key must grant kms:Decrypt to the destination account’s service-linked role
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::222222222222:role/aws-service-role/backup.amazonaws.com/AWSServiceRoleForBackup"
},
"Action": ["kms:Decrypt", "kms:CreateGrant"],
"Resource": "*"
}
Destination vault re-encrypts with its own KMS key → each account controls its own copy independently.
IAM Roles Anywhere
Lets workloads outside AWS (on-premises, other clouds) get temporary AWS credentials using X.509 certificates.
Problem It Solves
| Method | Issue |
|---|---|
| IAM User access keys | Long-term, can leak, manual rotation |
| EC2 Instance Profile | Only works on EC2 |
IAM Roles Anywhere = temporary credentials for external workloads.
How It Works
On-Premises Server AWS
┌─────────────────────────┐ ┌─────────────────────────────┐
│ │ │ IAM Roles Anywhere │
│ X.509 Certificate │ 1. Present │ │
│ (issued by your CA) │─────cert───────►│ 2. Validate cert against │
│ │ │ Trust Anchor (your CA) │
│ │◄──temp creds────│ 3. Return temporary │
│ AWS CLI / SDK │ │ credentials for Role │
└─────────────────────────┘ └─────────────────────────────┘
Key Components
| Component | What It Is |
|---|---|
| Trust Anchor | Your CA that AWS trusts (own CA or AWS Private CA) |
| Profile | Links Trust Anchor to IAM Role(s) |
| Role | IAM role with trust policy for rolesanywhere.amazonaws.com |
| X.509 Certificate | Installed on server, issued by your CA |
Credential Helper Usage
# Direct command
aws_signing_helper credential-process \
--certificate /path/to/cert.pem \
--private-key /path/to/key.pem \
--trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
--profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
--role-arn arn:aws:iam::111111111111:role/MyRole
# ~/.aws/config
[profile onprem]
credential_process = aws_signing_helper credential-process \
--certificate /path/to/cert.pem \
--private-key /path/to/key.pem \
--trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
--profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
--role-arn arn:aws:iam::111111111111:role/MyRole
Then use: aws s3 ls --profile onprem
EFS (Elastic File System)
Managed NFS file system that multiple EC2 instances can access simultaneously.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ EC2 (AZ-a) │ │ EC2 (AZ-b) │ │ EC2 (AZ-c) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└───────────────────┼───────────────────┘
│ NFS protocol (port 2049)
▼
┌─────────────────────────┐
│ EFS │
│ /shared-files/ │
└─────────────────────────┘
- Shared storage: Multiple instances read/write same files
- Auto-scaling: Grows/shrinks automatically
- Protocol: NFS v4 (Linux only)
- Mount:
sudo mount -t nfs4 fs-xxx.efs.region.amazonaws.com:/ /mnt/efs
On-Premises Access
On-prem servers can mount EFS over Direct Connect or VPN.
On-Premises ──── Direct Connect/VPN ──── VPC ──── EFS
FSx (Managed File Systems)
Managed file systems for specific use cases.
| FSx Type | Protocol | Use Case |
|---|---|---|
| FSx for Windows File Server | SMB | Windows workloads, Active Directory |
| FSx for Lustre | Lustre | High-performance computing, ML |
| FSx for NetApp ONTAP | NFS, SMB, iSCSI | Enterprise, multi-protocol |
| FSx for OpenZFS | NFS | Linux workloads needing ZFS features |
EFS vs FSx:
- EFS = Simple NFS for Linux
- FSx = Specialized file systems (Windows, HPC, enterprise)
Site-to-Site VPN
Encrypted tunnel over public internet connecting on-premises to AWS VPC.
On-Premises AWS
┌─────────────────┐ ┌─────────────────┐
│ Your Router │ Public Internet │ Virtual Private│
│ (Customer GW) │───── Encrypted Tunnel ────│ Gateway (VGW) │
│ 10.0.0.0/16 │ │ 172.31.0.0/16 │
└─────────────────┘ └─────────────────┘
Components
| Component | What It Is |
|---|---|
| Customer Gateway (CGW) | AWS resource representing your on-prem router |
| Virtual Private Gateway (VGW) | VPN endpoint attached to one VPC |
| VPN Connection | Links CGW ↔ VGW, creates two tunnels for redundancy |
How VPN Works (Encapsulation)
VPN wraps original packet inside encrypted outer packet. Original private IPs preserved.
Original: src=10.0.1.50 dst=172.31.1.100
After VPN encapsulation:
┌─────────────────────────────────────────────────────────────┐
│ Outer: src=203.0.113.50 dst=52.x.x.x (AWS VPN endpoint) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ENCRYPTED: src=10.0.1.50 dst=172.31.1.100 (preserved) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Routing Required
Both sides need routes pointing to VPN:
On-prem router: 172.31.0.0/16 → VPN tunnel
VPC route table: 10.0.0.0/16 → vgw-xxxxx
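On the AWS side, the VPC route can be added with boto3 (route table and gateway IDs are placeholders):
import boto3

ec2 = boto3.client('ec2')

# Send traffic destined for the on-prem network through the Virtual Private Gateway
ec2.create_route(
    RouteTableId='rtb-0123456789abcdef0',    # placeholder VPC route table
    DestinationCidrBlock='10.0.0.0/16',      # on-premises CIDR
    GatewayId='vgw-0123456789abcdef0'        # placeholder VGW
)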
VPN vs Direct Connect
| Aspect | VPN | Direct Connect |
|---|---|---|
| Connection | Over public internet | Dedicated physical cable |
| Setup time | Minutes | Weeks to months |
| Cost | Low | High |
| Bandwidth | Up to ~1.25 Gbps | 1-100 Gbps |
| Latency | Variable | Consistent |
| Encryption | Built-in (IPsec) | Not by default |
Transit Gateway (TGW)
Hub connecting multiple VPCs and on-premises networks.
┌─────┐ ┌─────┐ ┌─────┐
│VPC-A│ │VPC-B│ │VPC-C│
└──┬──┘ └──┬──┘ └──┬──┘
└──────┼───────┘
│
┌─────────▼─────────┐
│ Transit Gateway │
└─────────┬─────────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
VPN to On-Prem Direct Connect
- Central hub - add new VPCs easily
- VPN/Direct Connect connects once to TGW, reaches all VPCs
- Route tables control which networks can communicate
S3 Event Notifications
Triggers actions when events happen in S3 bucket.
S3 Bucket ──► Event Notification ──► Lambda / SQS / SNS / EventBridge
Event Types
| Category | Examples |
|---|---|
| Object created | s3:ObjectCreated:Put, s3:ObjectCreated:Copy |
| Object removed | s3:ObjectRemoved:Delete |
| Replication | s3:Replication:OperationFailedReplication |
| Lifecycle | s3:LifecycleExpiration:*, s3:LifecycleTransition |
| Restore | s3:ObjectRestore:Completed |
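Wiring a bucket to a Lambda function with boto3 (bucket, function ARN, and filters are placeholders; the function's resource policy must already allow S3 to invoke it):
import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='my-bucket',                                   # placeholder
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111111111111:function:process-upload',
            'Events': ['s3:ObjectCreated:*'],
            # Only objects under uploads/ ending in .jpg
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'uploads/'},
                {'Name': 'suffix', 'Value': '.jpg'}
            ]}}
        }]
    }
)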
S3 Notifications vs EventBridge
| | S3 Event Notifications | S3 → EventBridge |
|---|---|---|
| Destinations | Lambda, SQS, SNS only | 20+ AWS services |
| Filtering | Prefix/suffix only | Advanced (metadata, size) |
S3 Batch Operations
Run operations on billions of objects at once.
Manifest (list of objects) ──► S3 Batch Job ──► Operation on all objects
Operations
| Operation | Use Case |
|---|---|
| Copy | Migrate objects to another bucket |
| Invoke Lambda | Custom processing per object |
| Replace tags | Bulk update tags |
| Restore from Glacier | Bulk restore archived objects |
| Delete | Bulk delete |
DAX (DynamoDB Accelerator)
In-memory cache for DynamoDB. Microsecond latency for reads.
Application
│
│ Same DynamoDB API
▼
┌─────────────┐
│ DAX │ ← Microsecond (cache hit)
│ Cluster │
└──────┬──────┘
│ Cache miss
▼
┌─────────────┐
│ DynamoDB │ ← Millisecond
└─────────────┘
- API-compatible with DynamoDB (just change endpoint)
- Use case: Read-heavy workloads needing microsecond latency
RDS Proxy
Connection pooler for RDS/Aurora. Solves connection exhaustion.
Lambda (100s concurrent)
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────┐
│ RDS Proxy │ ← Pools connections
└────────┬────────┘
│ Few persistent connections
▼
┌─────────────────┐
│ RDS / Aurora │
└─────────────────┘
- Problem: Lambda spawns many connections, DB has limits
- Solution: Proxy reuses connections from pool
- Bonus: Faster failover for Aurora
DAX vs RDS Proxy
| | DAX | RDS Proxy |
|---|---|---|
| For | DynamoDB | RDS / Aurora |
| Purpose | Caching (latency) | Connection pooling |
AWS Service Catalog
Catalog of approved, pre-configured AWS resources for users to deploy.
Admin creates Products ──► Users see approved products only ──► Launch
(CloudFormation templates) (from shared Portfolios)
Key Concepts
| Term | What It Is |
|---|---|
| Product | CloudFormation template packaged for deployment |
| Portfolio | Collection of products, shared with users/accounts |
| Constraint | Rules (allowed parameters, launch role) |
Restrictions
| What | How |
|---|---|
| Allowed regions | Portfolio exists only in allowed regions |
| Allowed parameters | Template Constraint or AllowedValues in template |
| Permissions | Launch Constraint (IAM role used to deploy) |
Template Constraint Example
{
"Rules": {
"InstanceTypeRule": {
"Assertions": [{
"Assert": {
"Fn::Contains": [["t3.micro", "t3.small"], {"Ref": "InstanceType"}]
},
"AssertDescription": "Only t3.micro or t3.small allowed"
}]
}
}
}
CloudFormation Custom Resource
Run your own Lambda code during stack operations. For things CloudFormation doesn’t natively support.
CloudFormation ──► Your Lambda ──► Does custom work ──► Reports back
Syntax
Resources:
MyCustomResource:
Type: Custom::AnyNameYouWant # "Custom::" prefix required
Properties:
ServiceToken: !GetAtt MyLambda.Arn # Required: Lambda ARN
CustomParam1: value1 # Your custom inputs
CustomParam2: value2
Lambda Receives
{
"RequestType": "Create",
"ResourceProperties": {
"CustomParam1": "value1",
"CustomParam2": "value2"
},
"ResponseURL": "https://..."
}
Lambda Must
- Check RequestType (Create, Update, Delete)
- Do the work
- Send success/failure to ResponseURL (see the sketch below)
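A minimal responder sketch using only the standard library (the PhysicalResourceId and Data values here are illustrative; your actual work goes where noted):
import json
import urllib.request

def handler(event, context):
    request_type = event['RequestType']          # Create, Update or Delete
    props = event.get('ResourceProperties', {})

    # ... do the custom work here based on request_type and props ...

    # Report the result back to CloudFormation via the pre-signed ResponseURL
    response = {
        'Status': 'SUCCESS',                          # or 'FAILED'
        'Reason': 'See CloudWatch Logs',
        'PhysicalResourceId': 'my-custom-resource',   # stable ID for this resource
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        'Data': {'Output1': 'value'}                  # available via Fn::GetAtt
    }
    req = urllib.request.Request(
        event['ResponseURL'],
        data=json.dumps(response).encode('utf-8'),
        method='PUT',
        headers={'Content-Type': ''}
    )
    urllib.request.urlopen(req)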
Use Cases
- Create resources in other regions
- Call external APIs during deployment
- Complex logic CloudFormation can’t express
Kubernetes Namespace
Virtual cluster division within a Kubernetes cluster. Groups and isolates resources.
┌─────────────────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: default │ │
│ │ Deployment: web Service: web-service │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: production │ │
│ │ Deployment: api ConfigMap: prod-config │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: kube-system (Kubernetes internal) │ │
│ │ ConfigMap: aws-auth DaemonSet: kube-proxy │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Use Cases
| Purpose | Example |
|---|---|
| Environment separation | dev, staging, production namespaces |
| Team separation | team-a, team-b namespaces |
| Access control | Team A can only access team-a namespace |
DNS with Namespaces
Service DNS: <service-name>.<namespace>.svc.cluster.local
Examples:
- api-service.default.svc.cluster.local
- api-service.production.svc.cluster.local
Container Insights
CloudWatch feature that collects metrics and logs from containerized applications (ECS, EKS).
How It Works (EKS)
┌─────────────────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │CloudWatch│ │ │ │CloudWatch│ │ │ │CloudWatch│ │ │
│ │ │Agent │ │ │ │Agent │ │ │ │Agent │ │ │
│ │ │(DaemonSet)│ │ │ │(DaemonSet)│ │ │ │(DaemonSet)│ │ │
│ │ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │ │
│ └──────┼───────┘ └──────┼───────┘ └──────┼───────┘ │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ CloudWatch Metrics │
│ (namespace: ContainerInsights) │
└─────────────────────────────────────────────────────────────────┘
Key Metrics
| Metric | What it measures |
|---|---|
| pod_memory_utilization | % of memory limit used |
| pod_cpu_utilization | % of CPU limit used |
| pod_memory_working_set | Actual bytes in use |
Dimensions
Filter metrics by:
- ClusterName
- Namespace (Kubernetes namespace)
- Service (Kubernetes Service name)
- PodName
- NodeName
AWS Glue Crawler
Automatically scans data sources and creates table definitions in Glue Data Catalog.
┌─────────────────────────────────────────────────────────────────┐
│ S3 Bucket (/data/) │
│ sales.csv │
│ orders.json │
│ │ │
│ ▼ │
│ ┌─────────────┐ Detects: │
│ │ Crawler │ - File format (CSV, JSON, Parquet) │
│ │ │ - Column names │
│ │ │ - Data types │
│ │ │ - Partitions (year=2024/month=01/) │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Glue Data Catalog │ │
│ │ Database: my_database │ │
│ │ ├── Table: sales (id, product, amount) │ │
│ │ └── Table: orders (order_id, customer) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Now queryable with Athena: │
│ SELECT * FROM my_database.sales WHERE amount > 100 │
└─────────────────────────────────────────────────────────────────┘
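Creating and running a crawler with boto3 (crawler name, role, database, and path are placeholders):
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='sales-crawler',                                       # placeholder
    Role='arn:aws:iam::111111111111:role/glue-crawler-role',    # placeholder role
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/data/'}]}   # placeholder path
)

glue.start_crawler(Name='sales-crawler')   # tables appear in the Data Catalog when it finishes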
AWS Glue ETL
Serverless data transformation jobs. ETL = Extract, Transform, Load.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Extract │ │ Transform │ │ Load │
│ │ │ │ │ │
│ Read from │ → │ Clean, │ → │ Write to │
│ S3, RDS │ │ filter, │ │ S3, Redshift│
│ │ │ join │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
Key Points
| Aspect | Description |
|---|---|
| Serverless | Pay per second of job runtime |
| Engine | Apache Spark |
| Write in | Python (PySpark) or Scala |
| Triggers | On-demand, scheduled, or event-based |
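A minimal PySpark job sketch (database, table, and output path are placeholders):
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Extract: read the table the crawler created
sales = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='sales')             # placeholders

# Transform: keep only rows with amount > 100
big_sales = Filter.apply(frame=sales, f=lambda row: row['amount'] > 100)

# Load: write cleaned data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=big_sales,
    connection_type='s3',
    connection_options={'path': 's3://my-bucket/clean/'},   # placeholder
    format='parquet')

job.commit()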
Glue Components Together
S3 (raw) ──► Crawler ──► Data Catalog ──► ETL Job ──► S3 (clean)
│
▼
Athena (query)
SigV4 (Signature Version 4)
AWS’s method for authenticating API requests. Every AWS API call must be signed.
What It Proves
- You have valid AWS credentials
- Request hasn’t been modified in transit
- Request is recent (not replay attack)
The 4 Steps
1. Create Canonical Request
- Standardize HTTP method, path, headers, body hash
2. Create String to Sign
- Algorithm + timestamp + scope + hash of step 1
3. Calculate Signing Key
- Chain HMAC-SHA256 from Secret Key → date → region → service
4. Calculate Signature
- HMAC(signing_key, string_to_sign)
Result: Authorization Header
Authorization: AWS4-HMAC-SHA256
Credential=AKIAIOSFODNN7EXAMPLE/20241229/us-east-1/s3/aws4_request,
SignedHeaders=host;x-amz-date,
Signature=abc123def456...
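Step 3 (the signing-key chain) in plain Python; the secret key shown is fake:
import hmac
import hashlib

def _sign(key, msg):
    return hmac.new(key, msg.encode('utf-8'), hashlib.sha256).digest()

def signing_key(secret_key, date_stamp, region, service):
    # Chain of HMAC-SHA256: secret -> date -> region -> service -> "aws4_request"
    k_date = _sign(('AWS4' + secret_key).encode('utf-8'), date_stamp)
    k_region = _sign(k_date, region)
    k_service = _sign(k_region, service)
    return _sign(k_service, 'aws4_request')

key = signing_key('wJalrXUtnFEMI/K7MDENG/EXAMPLEKEY', '20241229', 'us-east-1', 's3')
# Step 4: signature = HMAC-SHA256(key, string_to_sign), hex-encoded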
You don’t do this manually. AWS SDKs and CLI handle it automatically.
CodeArtifact Domain
Container that groups multiple repositories. Provides shared storage, permissions, encryption.
┌─────────────────────────────────────────────────────────────────┐
│ CodeArtifact Domain │
│ (name: my-company) │
│ │
│ Shared: KMS key, IAM policies, deduplication │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Repository: │ │ Repository: │ │ Repository: │ │
│ │ npm-prod │ │ npm-dev │ │ python-internal │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Upstream Repository
Repository that another repository pulls from when package not found locally.
Developer: npm install lodash
│
▼
my-npm-repo (not found) ──► npm-public-proxy (not found) ──► npmjs.org
│ │ │
│◄─────────────────────────────┼────────────────────────────┘
│ Package cached at each level
Benefits
- Single endpoint for internal + public packages
- Caching (faster installs, works if npmjs down)
- Audit all package downloads
FSx Types Comparison
| FSx Type | Protocol | Best For |
|---|---|---|
| Windows File Server | SMB | Windows apps, Active Directory |
| Lustre | Lustre | HPC, ML, high-throughput |
| NetApp ONTAP | NFS, SMB, iSCSI | Enterprise, multi-protocol |
| OpenZFS | NFS | Linux workloads, snapshots |
Key Differences
| | Windows | Lustre | NetApp ONTAP | OpenZFS |
|---|---|---|---|---|
| OS support | Windows | Linux only | All | Linux, macOS |
| AD required | Yes | No | Optional | No |
| S3 integration | No | Yes (native) | No | No |
| Multi-protocol | No | No | Yes | No |
| Snapshots | Shadow copies | No | Yes | Yes |
| Multi-AZ | Yes | No | Yes | No |
EFS vs FSx for OpenZFS
Both are NFS for Linux, different design goals.
| Aspect | EFS | FSx for OpenZFS |
|---|---|---|
| Capacity | Auto-scales | You provision |
| Performance | Scales with size | Up to 1M IOPS |
| Latency | Milliseconds | Sub-millisecond |
| Snapshots | No | Yes (instant) |
| Clones | No | Yes (instant) |
| Multi-AZ | Yes | No |
| Best for | Shared storage, CMS | Databases, analytics |
AWS Storage Gateway
Hybrid storage connecting on-premises to AWS cloud storage.
┌─────────────────────────────────────────────────────────────────┐
│ On-Premises │
│ │
│ Application ──NFS/SMB/iSCSI──► Storage Gateway ──► AWS (S3, │
│ (VM or hardware) EBS, │
│ Local cache Glacier) │
└─────────────────────────────────────────────────────────────────┘
Gateway Types
| Type | Protocol | Backend | Use Case |
|---|---|---|---|
| S3 File Gateway | NFS, SMB | S3 | File shares backed by S3 |
| FSx File Gateway | SMB | FSx for Windows | Low-latency FSx access |
| Volume Gateway | iSCSI | S3 + EBS | Block storage, DR |
| Tape Gateway | iSCSI (VTL) | S3, Glacier | Backup (replaces tapes) |
EFS Mount Target
Network endpoint (ENI) in a specific AZ for EC2 to connect to EFS.
┌─────────────────────────────────────────────────────────────────┐
│ VPC │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ AZ-a │ │ AZ-b │ │
│ │ │ │ │ │
│ │ EC2 ──► Mount Target │ │ EC2 ──► Mount Target │ │
│ │ (ENI) │ │ (ENI) │ │
│ │ 10.0.1.25 │ │ 10.0.2.30 │ │
│ └────────────┬────────────┘ └────────────┬────────────┘ │
│ └────────────┬─────────────────┘ │
│ ▼ │
│ EFS │
└─────────────────────────────────────────────────────────────────┘
Key Points
- One mount target per AZ (for low latency, no cross-AZ costs)
- Has its own security group (allow NFS port 2049)
- EFS DNS resolves to nearest mount target
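Creating one mount target per AZ with boto3 (file system, subnet, and security group IDs are placeholders):
import boto3

efs = boto3.client('efs')

# One mount target per AZ; the security group must allow inbound NFS (TCP 2049)
for subnet in ['subnet-az-a', 'subnet-az-b']:          # placeholder subnet IDs
    efs.create_mount_target(
        FileSystemId='fs-0123456789abcdef0',           # placeholder
        SubnetId=subnet,
        SecurityGroups=['sg-0123456789abcdef0']        # placeholder SG
    )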
AWS SAM (Serverless Application Model)
Framework for building serverless applications. Simplified CloudFormation + CLI tools.
SAM Template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31 # ← SAM marker
Resources:
MyFunction:
Type: AWS::Serverless::Function # ← SAM resource
Properties:
Handler: index.handler
Runtime: python3.11
CodeUri: ./src
Events:
Api:
Type: Api
Properties:
Path: /hello
Method: GET
Automatically creates: Lambda + API Gateway + IAM role + permissions
SAM CLI Commands
sam init # Create new project
sam build # Install dependencies
sam local invoke # Run Lambda locally
sam local start-api # Local API Gateway
sam deploy # Deploy to AWS
cloudformation package
Uploads local files to S3 and rewrites template with S3 URLs.
BEFORE (template.yaml):
Code: ./src ← Local path
│
│ aws cloudformation package \
│ --template-file template.yaml \
│ --s3-bucket my-bucket \
│ --output-template-file packaged.yaml
▼
AFTER (packaged.yaml):
Code:
S3Bucket: my-bucket ← S3 reference
S3Key: abc123...
Workflow
# 1. Package: upload to S3, generate new template
aws cloudformation package \
--template-file template.yaml \
--s3-bucket my-bucket \
--output-template-file packaged.yaml
# 2. Deploy: use packaged template
aws cloudformation deploy \
--template-file packaged.yaml \
--stack-name my-stack
deploy reads local packaged.yaml file. S3 bucket/key is embedded in the template.
Trusted Advisor Service Limits
Checks that compare your current AWS resource usage against default service quotas (limits).
What It Does
Trusted Advisor Service Limits Check:
Your Usage Service Quota Status
─────────────────────────────────────────────────
45 EC2 instances 50 (default limit) ⚠️ 90% - Yellow (warning)
3 VPCs 5 (default limit) ✓ 60% - Green (OK)
5 Elastic IPs 5 (default limit) 🔴 100% - Red (at limit)
Status Thresholds
| Status | Meaning |
|---|---|
| 🟢 Green | Usage < 80% of limit |
| 🟡 Yellow | Usage ≥ 80% of limit (warning) |
| 🔴 Red | Usage ≥ 100% of limit (at or over) |
Example Checks
- EC2 On-Demand instances (per instance type, per region)
- VPCs per region
- Elastic IP addresses
- EBS volumes
- RDS instances
- IAM roles, users, groups
- S3 buckets
- Lambda concurrent executions
- Auto Scaling groups
Important Limitations
| Limitation | Detail |
|---|---|
| Default quotas only | Doesn’t know about quota increases you’ve requested |
| Not real-time | Refreshes periodically (manual refresh available) |
| Subset of services | Doesn’t cover all AWS services/quotas |
| Basic Support | Service Limits checks are free (unlike most Trusted Advisor checks) |
Service Limits vs Service Quotas
| | Trusted Advisor Service Limits | Service Quotas (service) |
|---|---|---|
| Purpose | Monitor usage vs limits | View/request quota increases |
| Shows current usage | Yes | Yes |
| Shows applied quotas | No (default only) | Yes (actual applied quota) |
| Request increases | No | Yes |
| API | support:DescribeTrustedAdvisorChecks | service-quotas:* |
Better Alternative: Service Quotas + CloudWatch
For accurate monitoring including custom quota increases:
Service Quotas → CloudWatch Metrics → CloudWatch Alarm
Metric: AWS/Usage → ResourceCount
Alarm: When usage > 80% of AppliedQuota
This reflects your actual quota (including increases), not just defaults.
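Reading applied quotas programmatically with boto3 (a sketch; pagination omitted, and quota names vary by service):
import boto3

quotas = boto3.client('service-quotas')

# List applied (not default) quotas for EC2 and print them
for quota in quotas.list_service_quotas(ServiceCode='ec2')['Quotas']:
    print(quota['QuotaName'], quota['Value'])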
When Trusted Advisor Service Limits Is Useful
- Quick overview across many services
- Accounts without quota increases (defaults apply)
- Free tier / Basic Support accounts (Service Limits checks are free)
cfn-init and cfn-hup
CloudFormation helper scripts that run on EC2 instances to configure them based on metadata in your template.
| Script | What it does |
|---|---|
| cfn-init | Reads metadata from template, configures the instance (install packages, create files, run commands) |
| cfn-hup | Daemon that watches for metadata changes and re-runs cfn-init when template is updated |
hup = HangUP signal (SIGHUP). In Unix, sending SIGHUP to a daemon tells it to reload configuration. cfn-hup = daemon that watches for config changes and reloads.
The Problem They Solve
UserData (imperative): cfn-init (declarative):
───────────────────── ────────────────────────
yum install -y httpd packages:
systemctl start httpd yum:
echo "hello" > /var/www/html/index httpd: []
services:
sysvinit:
httpd: {enabled: true}
files:
/var/www/html/index.html:
content: "hello"
Where Configuration Lives
In the Metadata section of your EC2 resource in the CloudFormation template:
Resources:
MyInstance:
Type: AWS::EC2::Instance
Metadata: # ← cfn-init reads this
AWS::CloudFormation::Init:
config:
packages:
yum:
httpd: []
files:
/var/www/html/index.html:
content: "Hello World"
services:
sysvinit:
httpd:
enabled: true
ensureRunning: true
Properties:
ImageId: ami-xxxxx
UserData: # ← Calls cfn-init
Fn::Base64: !Sub |
#!/bin/bash
yum install -y aws-cfn-bootstrap
/opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}
cfn-init Configuration Sections
| Section | What it configures |
|---|---|
| packages | Install packages (yum, apt, rpm, python, rubygems) |
| groups | Create Linux groups |
| users | Create Linux users |
| sources | Download and extract archives (tar, zip) |
| files | Create files with content, permissions, owner |
| commands | Run shell commands |
| services | Enable/start/stop services (sysvinit, systemd) |
Execution order: packages → groups → users → sources → files → commands → services
cfn-hup Configuration Files
Two files needed on the instance:
| File | Purpose |
|---|---|
| /etc/cfn/cfn-hup.conf | Main config: which stack to watch, poll interval |
| /etc/cfn/hooks.d/*.conf | Hook definitions: what to run when changes detected |
files:
/etc/cfn/cfn-hup.conf:
content: !Sub |
[main]
stack=${AWS::StackId}
region=${AWS::Region}
interval=5
mode: "000400"
owner: root
group: root
/etc/cfn/hooks.d/cfn-auto-reloader.conf:
content: !Sub |
[cfn-auto-reloader-hook]
triggers=post.update
path=Resources.MyInstance.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}
runas=root
mode: "000400"
owner: root
group: root
services:
sysvinit:
cfn-hup:
enabled: true
ensureRunning: true
files:
- /etc/cfn/cfn-hup.conf
- /etc/cfn/hooks.d/cfn-auto-reloader.conf
What Actually Happens
Initial Deployment (cfn-init):
CloudFormation creates EC2 instance
↓
EC2 boots, runs UserData script
↓
UserData calls: /opt/aws/bin/cfn-init -s MyStack -r MyInstance
↓
cfn-init fetches Metadata from CloudFormation API
↓
cfn-init executes: packages → files → commands → services
↓
Instance configured and running
Stack Update (cfn-hup):
You update CloudFormation template (change Metadata)
↓
CloudFormation updates stack
↓
cfn-hup daemon polls every N minutes, detects change
↓
cfn-hup runs action: /opt/aws/bin/cfn-init ...
↓
cfn-init re-applies configuration
Summary
| Script | When it runs | Purpose |
|---|---|---|
| cfn-init | Once at instance launch (from UserData) | Initial configuration |
| cfn-hup | Continuously as daemon | Detect metadata changes, re-run cfn-init |
Without cfn-hup: Metadata changes require instance replacement or manual intervention.
With cfn-hup: Instance automatically reconfigures itself when you update the stack.
VPC Endpoints and PrivateLink
VPC Endpoint: Private connection from your VPC to a service (no internet needed).
PrivateLink: The underlying AWS technology that powers VPC endpoints.
Two Types of VPC Endpoints
| Type | What it connects to | How it works | Cost |
|---|---|---|---|
| Gateway Endpoint | S3, DynamoDB only | Route table entry (no ENI) | Free |
| Interface Endpoint | Most AWS services + your own services | Creates ENI in your subnet | ~$0.01/hr + data |
┌─────────────────────────────────────────────────────────────────┐
│ Your VPC │
│ │
│ Gateway Endpoint (S3/DynamoDB): │
│ - Entry in route table │
│ - No ENI, no IP address │
│ - Free │
│ │
│ Interface Endpoint (PrivateLink): │
│ - Creates ENI with private IP │
│ - Works for 100+ AWS services │
│ - Works for your own services (via NLB) │
└─────────────────────────────────────────────────────────────────┘
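Creating a free Gateway Endpoint for S3 with boto3 (VPC and route table IDs are placeholders):
import boto3

ec2 = boto3.client('ec2')

# Adds an S3 route to the listed route tables; no ENI is created
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',                  # placeholder
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0']         # placeholder
)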
Is NLB Needed?
| Connecting to | NLB needed? |
|---|---|
| AWS services (S3, SQS, Lambda, etc.) | No—AWS manages it |
| Your own service in another VPC/account | Yes—you create NLB + Endpoint Service |
Cross-Account Connectivity
VPC endpoints connect to services, not VPCs directly. To connect to another VPC/account:
Provider Account Consumer Account
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ │ │ │
│ App ← NLB ← Endpoint │◄─────────│ Interface → App │
│ Service │PrivateLink│ Endpoint │
│ │ │ │
└─────────────────────────────┘ └─────────────────────────────┘
Provider creates: NLB + Endpoint Service + allow consumer accounts
Consumer creates: Interface Endpoint using provider’s service name
Setup Flow
# Provider: Create endpoint service
aws ec2 create-vpc-endpoint-service-configuration \
--network-load-balancer-arns <nlb-arn> \
--acceptance-required
# Returns: ServiceName: com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx
# Provider: Allow consumer account
aws ec2 modify-vpc-endpoint-service-permissions \
--service-id vpce-svc-xxxxxxxxx \
--add-allowed-principals arn:aws:iam::222222222222:root
# Consumer: Create interface endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-consumer \
--service-name com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx \
--vpc-endpoint-type Interface \
--subnet-ids subnet-aaa
# Provider: Accept connection (if acceptance-required)
aws ec2 accept-vpc-endpoint-connections \
--service-id vpce-svc-xxxxxxxxx \
--vpc-endpoint-ids vpce-xxxxxxxxx
How Consumer Routes Requests
Consumer app uses endpoint DNS or ENI private IP:
# Endpoint DNS (auto-provided)
curl http://vpce-xxx.vpce-svc-xxx.us-east-1.vpce.amazonaws.com
# Or ENI private IP directly
curl http://10.1.0.50
No route table changes needed—Interface Endpoint ENI handles routing automatically.
Summary
| Want to connect to… | What you need |
|---|---|
| S3 / DynamoDB | Gateway Endpoint (free, same region) |
| AWS services (SQS, Lambda, etc.) | Interface Endpoint |
| Another VPC/account’s app | Interface Endpoint → their Endpoint Service + NLB |