Aurora
AWS-managed relational database (MySQL/PostgreSQL compatible) with cloud-native architecture. Storage and compute are separated.
Aurora Cluster (Single Region)
One primary instance (read/write) + optional read replicas sharing the same storage.
Writer Endpoint Reader Endpoint
│ │
▼ ▼
┌──────────────┐ ┌──────────────┬──────────────┐
│ Primary │ │ Replica 1 │ Replica 2 │
│ (Writer) │ │ (Reader) │ (Reader) │
└──────┬───────┘ └──────┬───────┴──────┬───────┘
│ │ │
└─────────────┬───────────────┴──────────────┘
▼
┌────────────────────────────────────────┐
│ Shared Cluster Storage │
│ (6 copies across 3 AZs) │
│ Auto-grows up to 128 TB │
└────────────────────────────────────────┘
- All instances share the same storage (no storage-level replication lag)
- Replicas can be promoted to primary if primary fails (~30 seconds failover)
- Up to 15 read replicas
- Single region only
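A minimal boto3 sketch for looking up the writer and reader endpoints of a cluster (the cluster identifier is a placeholder):
import boto3

rds = boto3.client('rds')

# Look up the cluster and print both endpoints
cluster = rds.describe_db_clusters(
    DBClusterIdentifier='my-aurora-cluster'   # placeholder cluster name
)['DBClusters'][0]

print("Writer endpoint:", cluster['Endpoint'])        # read/write traffic
print("Reader endpoint:", cluster['ReaderEndpoint'])  # load-balances across replicas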
Aurora Storage
One logical storage automatically replicated across 3 AZs (6 copies total, 2 per AZ).
- Write: Need 4 of 6 copies to acknowledge (can lose 2)
- Read: Need 3 of 6 copies to respond (can lose 3)
- Even if entire AZ fails (2 copies gone), writes still work
Aurora Global Database (Multi-Region)
Multiple Aurora clusters across different AWS regions with replication between them.
Primary Region (us-east-1) Secondary Region (eu-west-1)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Primary Cluster │ │ Secondary Cluster │
│ ┌────────┐ ┌────────┐ │ │ ┌────────┐ ┌────────┐ │
│ │Primary │ │Replica │ │ │ │Replica │ │Replica │ │
│ │(R/W) │ │(R) │ │ │ │(R only)│ │(R only)│ │
│ └────┬───┘ └────┬───┘ │ │ └────┬───┘ └────┬───┘ │
│ └─────┬────┘ │ │ └─────┬────┘ │
│ ▼ │ Async │ ▼ │
│ ┌──────────────────┐ │ <1 sec │ ┌──────────────────┐ │
│ │ Cluster Storage │───┼───────────►│ │ Cluster Storage │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
└─────────────────────────┘ └─────────────────────────┘
- Cross-region disaster recovery
- Replication lag typically < 1 second
- Secondary region is read-only until promoted
- Up to 5 secondary regions
Comparison
| Aspect | Aurora Cluster | Aurora Global Database |
|---|---|---|
| Scope | Single region | Multiple regions |
| Write location | Primary instance | Primary region only |
| Replication | Shared storage (instant) | Cross-region async (<1 sec) |
| Failover | ~30 seconds (within region) | Minutes (cross-region) |
| Use case | HA within region | DR + global reads |
See AWS RDS, Aurora, and EBS Storage Basics for details.
Auto Scaling Group (ASG)
Maintains a fleet of EC2 instances: launches when needed, terminates when not, replaces unhealthy ones.
Core Concept
Capacity Settings:
Minimum: 2 (never go below)
Desired: 4 (try to maintain)
Maximum: 10 (never exceed)
Components Relationship
ALB ──► Target Group ◄─── ASG registers/deregisters instances automatically
│ │
▼ │
┌─────────┐ │
│ EC2-1 │ ◄─────────────┤ ASG launches
│ EC2-2 │ ◄─────────────┤
│ EC2-3 │ ◄─────────────┘
└─────────┘
- Launch Template: Defines instance config (AMI, instance type, SG, user data)
- Target Group: List of instances ALB sends traffic to
- ASG: Creates/terminates instances, registers them to Target Group
Scaling Types
| Type | How It Works |
|---|---|
| Manual | You change desired capacity |
| Dynamic | CloudWatch alarm triggers scaling policy |
| Scheduled | Time-based (e.g., scale up at 9 AM) |
| Predictive | ML-based, scales proactively based on patterns |
Dynamic Scaling Policies
| Policy | Description |
|---|---|
| Target Tracking | “Keep CPU at 50%” - ASG figures out instance count |
| Step Scaling | Different actions at different thresholds |
| Simple Scaling | Single action when alarm triggers |
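For example, a Target Tracking policy can be attached with boto3 (a sketch; the ASG name is a placeholder):
import boto3

autoscaling = boto3.client('autoscaling')

# "Keep average CPU at 50%" - ASG adds/removes instances to stay near the target
autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-asg',          # placeholder ASG name
    PolicyName='keep-cpu-at-50',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        },
        'TargetValue': 50.0
    }
)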
Useful Metrics for Scaling
| Workload | Recommended Metric |
|---|---|
| Web app behind ALB | RequestCountPerTarget (ALB) |
| API servers | CPUUtilization (EC2) |
| Queue workers | ApproximateNumberOfMessages (SQS) |
| Memory-intensive | mem_used_percent (requires CloudWatch Agent) |
Note: Memory and disk space metrics require CloudWatch Agent because hypervisor cannot see inside VM. See EC2 CloudWatch Metrics - Why Some Require Agent for details.
Health Checks
ASG determines instance health from EC2 status checks and, optionally, ELB health checks (the ALB marks instances Unhealthy).
| Type | Source | Use Case |
|---|---|---|
| EC2 | EC2 status checks | Basic - is instance running? |
| ELB | ALB health check | App-level - is app responding? |
Unhealthy instance → ASG terminates → launches replacement
Grace Period: Time after launch before health checks start (default 300s)
Key Features
| Feature | Purpose |
|---|---|
| AZ Balancing | Distributes instances evenly across AZs |
| Termination Policies | Controls which instance to remove when scaling in |
| Lifecycle Hooks | Run custom actions during launch/terminate |
| Instance Refresh | Rolling update all instances (e.g., new AMI) |
| Warm Pools | Pre-initialized instances for faster scaling |
| Mixed Instances | Multiple instance types + Spot/On-Demand mix |
| Cooldown | Prevents rapid scale in/out oscillation |
Mixed Instances Policy
Configured on ASG (not Launch Template). Allows multiple instance types and purchase options.
Instance Types: [t3.medium, t3.large, t3a.medium]
Purchase Options:
On-Demand base: 2 instances
Spot percentage: 80%
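A boto3 sketch of that configuration (names, subnets, and the launch template ID are placeholders; "80% Spot above the base" maps to OnDemandPercentageAboveBaseCapacity=20):
import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='my-mixed-asg',            # placeholder name
    MinSize=2, MaxSize=10, DesiredCapacity=4,
    VPCZoneIdentifier='subnet-aaa,subnet-bbb',      # placeholder subnets
    MixedInstancesPolicy={
        'LaunchTemplate': {
            'LaunchTemplateSpecification': {
                'LaunchTemplateId': 'lt-0123456789abcdef0',  # placeholder
                'Version': '$Latest'
            },
            # Instance type overrides
            'Overrides': [
                {'InstanceType': 't3.medium'},
                {'InstanceType': 't3.large'},
                {'InstanceType': 't3a.medium'}
            ]
        },
        'InstancesDistribution': {
            'OnDemandBaseCapacity': 2,                  # first 2 instances On-Demand
            'OnDemandPercentageAboveBaseCapacity': 20,  # rest: 80% Spot / 20% On-Demand
            'SpotAllocationStrategy': 'capacity-optimized'
        }
    }
)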
Spot vs On-Demand
| Aspect | On-Demand | Spot |
|---|---|---|
| Price | Full price | 60-90% discount |
| Availability | Always | When spare capacity exists |
| Interruption | Never | Can be interrupted (2-min warning) |
| Use case | Critical workloads | Batch jobs, fault-tolerant apps |
ECS Task
A Task is a running instance of your containers - the actual process running based on a Task Definition.
Task vs Task Definition
Task Definition (blueprint): Task (running instance):
┌─────────────────────────┐ ┌─────────────────────────┐
│ "Use nginx image" │ │ nginx container running │
│ "Give it 512MB RAM" │ ──run──► │ Using 512MB RAM │
│ "Open port 80" │ │ Listening on port 80 │
│ "Set ENV=production" │ │ ENV=production set │
└─────────────────────────┘ └─────────────────────────┘
(JSON config) (actual process)
Two Ways to Run Tasks
| Method | Behavior | Use Case |
|---|---|---|
| Service | Keeps desired count always running | Web servers, APIs |
| Standalone Task | Run once, then stop | Batch jobs, migrations |
ECS Service (desired: 3 tasks):
┌─────────────────────────────────────────────┐
│ Task 1 (running) ✓ │
│ Task 2 (running) ✓ │
│ Task 3 (running) ✓ │
│ │
│ If Task 2 crashes → Service starts new one │
└─────────────────────────────────────────────┘
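A boto3 sketch of both methods (cluster and task definition names are placeholders):
import boto3

ecs = boto3.client('ecs')

# Service: ECS keeps 3 copies of the task running and replaces crashed ones
ecs.create_service(
    cluster='my-cluster',                 # placeholder cluster
    serviceName='web',
    taskDefinition='web-app:1',           # family:revision
    desiredCount=3,
    launchType='EC2'
)

# Standalone task: runs once (e.g. a migration), then stops
ecs.run_task(
    cluster='my-cluster',
    taskDefinition='db-migration:1',
    launchType='EC2',
    count=1
)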
What’s Inside a Task
A task can have multiple containers that share network, storage, and lifecycle.
Task
┌─────────────────────────────────────────────┐
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Container 1 │ │ Container 2 │ │
│ │ (nginx) │◄──►│ (php-fpm) │ │
│ │ port 80 │ │ port 9000 │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └──── localhost ───┘ │
│ │
│ Shared: IP address, volumes, lifecycle │
│ Task IP: 10.0.1.50 │
└─────────────────────────────────────────────┘
Task Placement: One Task = One Instance
A task runs on exactly one EC2 instance. Cannot span multiple instances.
✓ Correct:
┌─────────────────┐ ┌─────────────────┐
│ EC2 Instance A │ │ EC2 Instance B │
│ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Task 1 │ │ │ │ Task 2 │ │
│ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘
✗ Not possible (task cannot span instances):
┌─────────────────┐ ┌─────────────────┐
│ EC2 Instance A │ │ EC2 Instance B │
│ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Task 1 │◄┼────┼►│ Task 1 │ │
│ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘
To scale, run multiple tasks across instances with a load balancer.
ECS on EC2 vs Fargate
| | ECS on EC2 | Fargate |
|---|---|---|
| Infrastructure | You manage EC2 instances | AWS manages |
| Kernel sharing | Tasks share EC2’s OS kernel | Each task has own micro-VM |
| Isolation | Process-level (namespaces) | Hardware-level (hypervisor) |
ECS on EC2:
┌─────────────────────────────────────────────────────────┐
│ EC2 Instance (Guest OS) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Docker Engine │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Task 1 │ │ Task 2 │ ← Share OS │ │
│ │ │ (container) │ │ (container) │ kernel │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Fargate (using Firecracker micro-VMs):
┌─────────────────────────────────────────────────────────┐
│ AWS-managed infrastructure │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ micro-VM 1 │ │ micro-VM 2 │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ Minimal Linux │ │ │ │ Minimal Linux │ │ │
│ │ │ Kernel │ │ │ │ Kernel │ │ │
│ │ ├───────────────┤ │ │ ├───────────────┤ │ │
│ │ │ Container │ │ │ │ Container │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └───────────────────┘ └───────────────────┘ │
│ ↑ ↑ │
│ └── Separate kernels, fully isolated ─────────┘
└─────────────────────────────────────────────────────────┘
Fargate uses micro-VMs for multi-tenant security - your task can’t access other customers’ tasks.
Task Lifecycle
PROVISIONING → PENDING → RUNNING → STOPPED
│ │ │ │
│ │ │ └─ Container exited or stopped
│ │ └─ Containers running
│ └─ Waiting for resources
└─ Preparing to launch
EventBridge Task State Detection
ECS sends task state change events to EventBridge.
{
"source": "aws.ecs",
"detail-type": "ECS Task State Change",
"detail": {
"lastStatus": "STOPPED",
"stoppedReason": "Essential container in task exited",
"containers": [{ "name": "web", "exitCode": 1 }]
}
}
EventBridge rule pattern:
{
"source": ["aws.ecs"],
"detail-type": ["ECS Task State Change"],
"detail": {
"lastStatus": ["STOPPED"]
}
}
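A boto3 sketch that creates the rule and points it at an SNS topic for alerting (the topic ARN is a placeholder):
import boto3, json

events = boto3.client('events')

# Rule matching stopped ECS tasks
events.put_rule(
    Name='ecs-task-stopped',
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"lastStatus": ["STOPPED"]}
    })
)

# Send matching events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule='ecs-task-stopped',
    Targets=[{'Id': 'notify', 'Arn': 'arn:aws:sns:us-east-1:111111111111:ecs-alerts'}]
)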
Kinesis Data Streams
Collect and process large amounts of real-time data (logs, events, clicks, IoT data).
Producers Kinesis Data Stream Consumers
┌─────────┐ ┌─────────────────────┐ ┌─────────┐
│ App 1 │────► │ │ ────►│ Lambda │
│ App 2 │────► records │ Stream │ records ─►│ EC2 App │
│ IoT │────► │ │ ────►│ Firehose│
└─────────┘ └─────────────────────┘ └─────────┘
Data is retained in the stream for 24 hours by default, configurable up to 365 days. Multiple consumers can read the same data.
Shard
A shard is a unit of capacity. More shards = more throughput.
Kinesis Data Stream (3 shards)
┌─────────────────────────────────────────────────────┐
│ ┌─────────────────┐ Shard 1: 1 MB/s in, 2 MB/s out│
│ │ Shard 1 │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ Shard 2: 1 MB/s in, 2 MB/s out│
│ │ Shard 2 │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ Shard 3: 1 MB/s in, 2 MB/s out│
│ │ Shard 3 │ │
│ └─────────────────┘ │
│ Total: 3 MB/s in, 6 MB/s out │
└─────────────────────────────────────────────────────┘
Per shard limits:
| Direction | Limit |
|---|---|
| Write (in) | 1 MB/sec or 1,000 records/sec |
| Read (out) | 2 MB/sec |
Partition key determines which shard receives each record (hash-based).
Record with partition_key="user123"
↓
hash("user123") → Falls into Shard 2's range
↓
Record stored in Shard 2
Same partition key → same shard → ordered processing for that key.
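A put_record sketch showing the partition key (stream name is a placeholder):
import boto3, json

kinesis = boto3.client('kinesis')

# Records with the same PartitionKey land on the same shard, preserving order per key
kinesis.put_record(
    StreamName='my-stream',                          # placeholder stream
    Data=json.dumps({"event": "click", "page": "/home"}).encode('utf-8'),
    PartitionKey='user123'
)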
Enhanced Fan-Out
Gives each consumer dedicated throughput instead of sharing.
Standard (shared):
Shard ──────────────────────────────────────────────
2 MB/sec total shared
┌──────────┼──────────┐
▼ ▼ ▼
Consumer A Consumer B Consumer C
~0.67 MB/s ~0.67 MB/s ~0.67 MB/s
Enhanced Fan-Out (dedicated):
Shard ──────────────────────────────────────────────
2 MB/sec each dedicated
┌──────────┼──────────┐
▼ ▼ ▼
Consumer A Consumer B Consumer C
2 MB/sec 2 MB/sec 2 MB/sec
| | Standard | Enhanced Fan-Out |
|---|---|---|
| Throughput per shard | 2 MB/sec shared | 2 MB/sec per consumer |
| Delivery | Pull (GetRecords) | Push (SubscribeToShard) |
| Latency | ~200ms | ~70ms |
| Consumer registration | Not needed | Required |
| ARN used | Stream ARN | Consumer ARN |
Standard mode: No consumer registration needed. GetRecords API and Lambda use stream ARN directly.
Enhanced Fan-Out: Must register consumer first, then use consumer ARN.
# Standard - no registration, use stream ARN
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream \
--starting-position LATEST
# Enhanced Fan-Out - register first, then use consumer ARN
aws kinesis register-stream-consumer \
--stream-arn arn:aws:kinesis:us-east-1:123456789:stream/my-stream \
--consumer-name my-consumer
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream/consumer/my-consumer:123 \
--starting-position LATEST
Batch Size and Batching Window
Control how records are delivered to Lambda.
aws lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn arn:aws:kinesis:...:stream/my-stream \
--batch-size 100 \
--maximum-batching-window-in-seconds 30 \
--starting-position LATEST
Lambda invokes when EITHER condition is met:
- batch-size records collected (default: 100, max: 10,000)
- maximum-batching-window-in-seconds elapsed (default: 0, max: 300)
| Records in 30 sec | What happens |
|---|---|
| 150 records | Invokes at 100 records (batch size hit first) |
| 50 records | Invokes at 30 seconds with 50 records (timeout hit first) |
| 0 records | No invocation |
Lambda Concurrency and Processing Settings
Key concept: 1 invocation = 1 Lambda instance. Multiple concurrent invocations = multiple instances.
Concurrency Quota: 1000 per region (default), which means 1000 Lambda instances at the same time.
Reserved Concurrency
Guarantee and limit concurrency for a specific function.
Without reserved concurrency:
Function A spike could starve other functions
With reserved concurrency:
Function A: reserved 100 (guaranteed, max 100)
Function B: reserved 200 (guaranteed, max 200)
Function C: unreserved (uses remaining 700)
Set to 0 = function disabled.
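Setting reserved concurrency with boto3 (function name is a placeholder):
import boto3

lambda_client = boto3.client('lambda')

# Guarantee (and cap) 100 concurrent executions for this function
lambda_client.put_function_concurrency(
    FunctionName='function-a',                 # placeholder
    ReservedConcurrentExecutions=100
)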
ParallelizationFactor
Process one Kinesis/DynamoDB shard with multiple Lambda instances in parallel.
ParallelizationFactor = 1 (default):
Shard 1 ──► Instance 1
Shard 2 ──► Instance 2
Total instances = 2
ParallelizationFactor = 3:
Shard 1 ──► Instance 1, Instance 2, Instance 3
Shard 2 ──► Instance 4, Instance 5, Instance 6
Total instances = shards × factor = 2 × 3 = 6
Max: 10
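Changing the factor on an existing event source mapping (the UUID is a placeholder):
import boto3

lambda_client = boto3.client('lambda')

# Process each shard with up to 3 concurrent Lambda instances
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',   # placeholder mapping UUID
    ParallelizationFactor=3
)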
ReportBatchItemFailures
Retry only failed records, not entire batch.
Without ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry ALL [1,2,3,4,5]
With ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry from 3: [3,4,5]
How it works:
┌─────────────────────────────────────────────────────────────────┐
│ Lambda Service (AWS managed) │
│ │
│ 1. Pulls records from Kinesis shard │
│ 2. Invokes your function with batch of records │
│ 3. Reads your function's return value │
│ 4. Retries only failed records based on your response │
└─────────────────────────────────────────────────────────────────┘
│ ▲
│ event.Records │ return {"batchItemFailures": [...]}
▼ │
┌─────────────────────────────────────────────────────────────────┐
│ Your Lambda Function Code │
│ - Receives records (doesn't pull from Kinesis) │
│ - Processes them │
│ - Returns which ones failed │
└─────────────────────────────────────────────────────────────────┘
Enable:
aws lambda update-event-source-mapping \
--uuid <mapping-uuid> \
--function-response-types "ReportBatchItemFailures"
Lambda response:
def handler(event, context):
    failures = []
    for record in event['Records']:  # records delivered by the event source mapping
        try:
            process(record)  # your processing logic (placeholder)
        except Exception:
            failures.append({"itemIdentifier": record['kinesis']['sequenceNumber']})
    return {"batchItemFailures": failures}  # tell Lambda which records failed
Kinesis Data Firehose
Fully managed delivery service. No consumer code needed.
Producers ──► Firehose ──► S3 / Redshift / OpenSearch / Splunk / HTTP
When to Use Firehose vs Data Streams
| | Data Streams | Firehose |
|---|---|---|
| Purpose | Real-time processing | Delivery to storage |
| You write | Consumer code | Nothing |
| Latency | Milliseconds | 60+ seconds (buffered) |
| Retention | 24h - 365 days | None (buffers briefly, then delivers) |
Batching
Firehose buffers records and delivers as batched files, not individual records.
Without Firehose: With Firehose:
Record 1 → file1.json Record 1 ─┐
Record 2 → file2.json Record 2 ─┼─► Buffer ──► one-big-file.json
Record 3 → file3.json Record 3 ─┘
(millions of tiny files) (fewer, larger files)
Buffer Settings
| Setting | Range | Behavior |
|---|---|---|
| Buffer size | 1-128 MB | Flush when size reached |
| Buffer interval | 60-900 seconds | Flush when time elapsed |
Whichever comes first triggers delivery.
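A create_delivery_stream sketch with explicit buffering hints (stream name, role, and bucket are placeholders):
import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='clickstream-to-s3',          # placeholder
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::111111111111:role/firehose-role',   # placeholder
        'BucketARN': 'arn:aws:s3:::my-data-lake',                    # placeholder
        'Prefix': 'raw/',
        # Flush whichever comes first: 64 MB or 5 minutes
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300}
    }
)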
Format Conversion
Firehose can convert JSON to columnar formats automatically:
JSON records ──► Firehose ──► Parquet/ORC files in S3
- Better for Athena/Redshift queries (faster, cheaper)
- Requires schema (from AWS Glue Data Catalog)
Optional Lambda Transform
Transform records before delivery:
Producers ──► Firehose ──► Lambda (transform) ──► S3
│
└── Add fields, filter, convert format
import base64

def handler(event, context):
output = []
for record in event['records']:
payload = base64.b64decode(record['data']).decode('utf-8')
# Transform the data
transformed = payload.upper()
output.append({
'recordId': record['recordId'],
'result': 'Ok',
'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8')
})
return {'records': output}
ECR Image Scanning
ECR scanning analyzes container images for security vulnerabilities (CVEs - Common Vulnerabilities and Exposures).
Two Scanning Options
| | Basic Scanning | Enhanced Scanning |
|---|---|---|
| Engine | Clair (open source) | Amazon Inspector |
| Scope | OS packages only | OS packages + application dependencies |
| When | On-push or manual | Continuous (auto re-scan on new CVEs) |
| Cost | Free | Pay per image scanned |
Basic Scanning
Uses Clair scanner. Only scans OS-level packages (apt, yum).
Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip) │ ← NOT scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...) │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04) │ ← Scanned
└─────────────────────────────────────┘
- Triggered on image push or manual API call
- Results are static until next scan
- New CVE discovered tomorrow → won’t know until re-scan
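Triggering a manual scan and reading the results with boto3 (repository and tag are placeholders):
import boto3

ecr = boto3.client('ecr')

# Kick off a basic scan of one image
ecr.start_image_scan(
    repositoryName='my-app',                 # placeholder
    imageId={'imageTag': 'latest'}
)

# Later: read the findings summary
findings = ecr.describe_image_scan_findings(
    repositoryName='my-app',
    imageId={'imageTag': 'latest'}
)
print(findings['imageScanFindings']['findingSeverityCounts'])   # e.g. {'HIGH': 2, 'MEDIUM': 5}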
Enhanced Scanning
Uses Amazon Inspector. Scans OS packages AND application dependencies.
Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip) │ ← Scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...) │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04) │ ← Scanned
└─────────────────────────────────────┘
- Continuous monitoring - auto re-scans when new CVEs published
- Integrates with EventBridge for alerts
- Supports: Java (Maven), JavaScript (npm), Python (pip), Go, .NET
Key Terms
- CVE: Publicly known vulnerability with unique ID (e.g., CVE-2021-44228 = Log4Shell)
- Clair: Open-source container vulnerability scanner
- Amazon Inspector: AWS service for automated vulnerability management
Building Container Images
Two main AWS services for building container images.
CodeBuild
General-purpose build service. Most common for container CI/CD.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ CodeCommit │────►│ CodeBuild │────►│ ECR │
│ (source) │ │ docker build │ │ (registry) │
│ │ │ docker push │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
buildspec.yml example:
version: 0.2
phases:
pre_build:
commands:
- aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
build:
commands:
- docker build -t $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
post_build:
commands:
- docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
- Full control over build process
- Integrates with CodePipeline
- Can run tests, multi-stage builds, any custom logic
EC2 Image Builder
Automated image creation service. Can build AMIs or container images.
┌─────────────────────────────────────────────────────────────┐
│ EC2 Image Builder Pipeline │
│ │
│ Recipe ──► Build ──► Test ──► Distribute to ECR │
└─────────────────────────────────────────────────────────────┘
Container Recipe options:
- Use components (no Dockerfile) - Image Builder applies changes to base image
- Provide Dockerfile from S3
Key terms:
- Recipe: Base image + components or Dockerfile
- Component: Reusable build/test action (install packages, configure, etc.)
- Pipeline: Automated workflow with schedule
Console steps for container image:
- Create Container Recipe - base image + components or Dockerfile S3 path + target ECR repo
- Create Infrastructure Configuration - instance type, IAM role, VPC/subnet for build
- Create Distribution Settings - target ECR repositories (can be cross-region/cross-account)
- Create Pipeline - link recipe + infrastructure + distribution + schedule
- Run Pipeline - builds and pushes to ECR
When to Use Which
| Use Case | Better Choice |
|---|---|
| CI/CD triggered by code commits | CodeBuild |
| Scheduled golden image builds | EC2 Image Builder |
| Need component library (CIS benchmarks, etc.) | EC2 Image Builder |
| Custom build logic, tests, multi-stage | CodeBuild |
| Part of CodePipeline | CodeBuild |
AWS App Runner
Fully managed service to run web apps/APIs. You provide code or container → App Runner handles everything.
You provide: App Runner handles:
┌─────────────────┐ ┌─────────────────────────────┐
│ Source code │ │ Build │
│ (GitHub repo) │───────────►│ Deploy │
│ OR │ │ Scale (auto, including to 0)│
│ Container image │ │ Load balancing │
│ (ECR) │ │ HTTPS/TLS certificate │
└─────────────────┘ │ Health checks │
└─────────────────────────────┘
│
▼
https://abc123.awsapprunner.com
Two Source Types
| Source | How It Works |
|---|---|
| Source code (GitHub) | App Runner builds container automatically |
| Container image (ECR) | App Runner pulls and runs directly |
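A hedged create_service sketch for the ECR image case (service name, image, access role, and port are assumptions):
import boto3

apprunner = boto3.client('apprunner')

apprunner.create_service(
    ServiceName='my-api',                                            # placeholder
    SourceConfiguration={
        'ImageRepository': {
            'ImageIdentifier': '111111111111.dkr.ecr.us-east-1.amazonaws.com/my-api:latest',
            'ImageRepositoryType': 'ECR',
            'ImageConfiguration': {'Port': '8080'}                   # app listens on 8080
        },
        'AutoDeploymentsEnabled': True,                              # redeploy on ECR push
        'AuthenticationConfiguration': {
            'AccessRoleArn': 'arn:aws:iam::111111111111:role/apprunner-ecr-access'  # placeholder
        }
    }
)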
Comparison with Other Compute
| | App Runner | ECS Fargate | Lambda |
|---|---|---|---|
| You manage | Almost nothing | Task definitions, services, ALB | Function code |
| Scaling | Automatic | You configure | Automatic |
| Min instances | Can scale to 0 | Min 1 task | N/A (event-driven) |
| Use case | Simple web apps | Complex container workloads | Event processing |
| Pricing | Per vCPU/memory hour | Per vCPU/memory hour | Per request + duration |
Key Features
- Auto scaling: Based on concurrent requests, can scale to zero
- Auto deployments: Trigger on ECR push or GitHub commit
- VPC Connector: Access private resources (RDS, ElastiCache) in VPC
- Custom domain: Bring your own domain with automatic TLS
When to Use App Runner
- Simple web apps, APIs, microservices
- Want zero infrastructure management
- Don’t need ECS features (service mesh, complex networking)
- Acceptable to use App Runner’s opinionated defaults
AWS Backup
Centralized service to manage backups across multiple AWS services from one place.
Without AWS Backup: With AWS Backup:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────┐
│ EC2 │ │ RDS │ │ EFS │ │ AWS Backup │
│ snapshot│ │ snapshot│ │ backup │ │ One backup plan for all │
│ config │ │ config │ │ config │ │ ┌─────┬─────┬─────┐ │
└─────────┘ └─────────┘ └─────────┘ │ │ EC2 │ RDS │ EFS │ │
↓ ↓ ↓ │ └─────┴─────┴─────┘ │
Manage each separately └─────────────────────────────┘
Supported: EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3, etc.
Core Concepts
| Concept | What It Is |
|---|---|
| Backup Plan | When and how to backup (schedule, retention, copy rules) |
| Resource Assignment | What to backup (by resource ID or tags) |
| Backup Vault | Where backups are stored (container for recovery points) |
| Recovery Point | The actual backup data (snapshot, AMI, etc.) |
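A sketch of a daily backup plan plus a tag-based resource assignment with boto3 (plan, vault, role, and tag values are placeholders):
import boto3

backup = boto3.client('backup')

# Backup Plan: daily at 05:00 UTC, keep 30 days
plan = backup.create_backup_plan(BackupPlan={
    'BackupPlanName': 'daily-backups',                       # placeholder
    'Rules': [{
        'RuleName': 'daily',
        'TargetBackupVaultName': 'my-vault',                 # placeholder vault
        'ScheduleExpression': 'cron(0 5 * * ? *)',
        'Lifecycle': {'DeleteAfterDays': 30}
    }]
})

# Resource Assignment: back up everything tagged backup=daily
backup.create_backup_selection(
    BackupPlanId=plan['BackupPlanId'],
    BackupSelection={
        'SelectionName': 'tagged-resources',
        'IamRoleArn': 'arn:aws:iam::111111111111:role/aws-backup-role',   # placeholder
        'ListOfTags': [{
            'ConditionType': 'STRINGEQUALS',
            'ConditionKey': 'backup',
            'ConditionValue': 'daily'
        }]
    }
)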
Backup Vault Features
| Feature | Purpose |
|---|---|
| Encryption | All backups encrypted with KMS key |
| Access Policy | Control who can backup/restore/delete |
| Vault Lock | WORM - prevent deletion even by root (compliance) |
Cross-Account Backup Copy
Copy recovery points to another AWS account for disaster recovery.
Source Account (111) Destination Account (222)
┌─────────────────────┐ ┌─────────────────────┐
│ Backup Plan │ │ Backup Vault │
│ ┌───────────────┐ │ │ │
│ │ Copy Rule: │ │ copy │ Access Policy: │
│ │ Dest Vault ARN│──┼───────────────►│ Allow 111 to │
│ └───────────────┘ │ │ CopyIntoBackupVault│
│ │ │ │
│ Source Vault │ │ Recovery Point │
│ (30 days retention)│ │ (90 days retention)│
└─────────────────────┘ └─────────────────────┘
Setup required:
- Source account: Backup plan with copy rule pointing to destination vault ARN
- Destination account: Vault access policy allowing backup:CopyIntoBackupVault
Cross-Account KMS Encryption
Behavior depends on whether AWS Backup can encrypt the service's backups independently ("independent encryption").
Services WITH independent encryption (DynamoDB advanced, EFS):
- AWS Backup handles encryption at vault level
- No KMS key sharing needed
Services WITHOUT independent encryption (RDS, EC2/EBS):
- Backup encrypted with data source’s KMS key (not vault key)
- Destination account’s AWSServiceRoleForBackup performs the copy
- Source KMS key must grant kms:Decrypt to the destination account’s service-linked role
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::222222222222:role/aws-service-role/backup.amazonaws.com/AWSServiceRoleForBackup"
},
"Action": ["kms:Decrypt", "kms:CreateGrant"],
"Resource": "*"
}
Destination vault re-encrypts with its own KMS key → each account controls its own copy independently.
IAM Roles Anywhere
Lets workloads outside AWS (on-premises, other clouds) get temporary AWS credentials using X.509 certificates.
Problem It Solves
| Method | Issue |
|---|---|
| IAM User access keys | Long-term, can leak, manual rotation |
| EC2 Instance Profile | Only works on EC2 |
IAM Roles Anywhere = temporary credentials for external workloads.
How It Works
On-Premises Server AWS
┌─────────────────────────┐ ┌─────────────────────────────┐
│ │ │ IAM Roles Anywhere │
│ X.509 Certificate │ 1. Present │ │
│ (issued by your CA) │─────cert───────►│ 2. Validate cert against │
│ │ │ Trust Anchor (your CA) │
│ │◄──temp creds────│ 3. Return temporary │
│ AWS CLI / SDK │ │ credentials for Role │
└─────────────────────────┘ └─────────────────────────────┘
Key Components
| Component | What It Is |
|---|---|
| Trust Anchor | Your CA that AWS trusts (own CA or AWS Private CA) |
| Profile | Links Trust Anchor to IAM Role(s) |
| Role | IAM role with trust policy for rolesanywhere.amazonaws.com |
| X.509 Certificate | Installed on server, issued by your CA |
Credential Helper Usage
# Direct command
aws_signing_helper credential-process \
--certificate /path/to/cert.pem \
--private-key /path/to/key.pem \
--trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
--profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
--role-arn arn:aws:iam::111111111111:role/MyRole
# ~/.aws/config
[profile onprem]
credential_process = aws_signing_helper credential-process \
--certificate /path/to/cert.pem \
--private-key /path/to/key.pem \
--trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
--profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
--role-arn arn:aws:iam::111111111111:role/MyRole
Then use: aws s3 ls --profile onprem
EFS (Elastic File System)
Managed NFS file system that multiple EC2 instances can access simultaneously.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ EC2 (AZ-a) │ │ EC2 (AZ-b) │ │ EC2 (AZ-c) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└───────────────────┼───────────────────┘
│ NFS protocol (port 2049)
▼
┌─────────────────────────┐
│ EFS │
│ /shared-files/ │
└─────────────────────────┘
- Shared storage: Multiple instances read/write same files
- Auto-scaling: Grows/shrinks automatically
- Protocol: NFS v4 (Linux only)
- Mount:
sudo mount -t nfs4 fs-xxx.efs.region.amazonaws.com:/ /mnt/efs
On-Premises Access
On-prem servers can mount EFS over Direct Connect or VPN.
On-Premises ──── Direct Connect/VPN ──── VPC ──── EFS
FSx (Managed File Systems)
Managed file systems for specific use cases.
| FSx Type | Protocol | Use Case |
|---|---|---|
| FSx for Windows File Server | SMB | Windows workloads, Active Directory |
| FSx for Lustre | Lustre | High-performance computing, ML |
| FSx for NetApp ONTAP | NFS, SMB, iSCSI | Enterprise, multi-protocol |
| FSx for OpenZFS | NFS | Linux workloads needing ZFS features |
EFS vs FSx:
- EFS = Simple NFS for Linux
- FSx = Specialized file systems (Windows, HPC, enterprise)
Site-to-Site VPN
Encrypted tunnel over public internet connecting on-premises to AWS VPC.
On-Premises AWS
┌─────────────────┐ ┌─────────────────┐
│ Your Router │ Public Internet │ Virtual Private│
│ (Customer GW) │───── Encrypted Tunnel ────│ Gateway (VGW) │
│ 10.0.0.0/16 │ │ 172.31.0.0/16 │
└─────────────────┘ └─────────────────┘
Components
| Component | What It Is |
|---|---|
| Customer Gateway (CGW) | AWS resource representing your on-prem router |
| Virtual Private Gateway (VGW) | VPN endpoint attached to one VPC |
| VPN Connection | Links CGW ↔ VGW, creates two tunnels for redundancy |
How VPN Works (Encapsulation)
VPN wraps original packet inside encrypted outer packet. Original private IPs preserved.
Original: src=10.0.1.50 dst=172.31.1.100
After VPN encapsulation:
┌─────────────────────────────────────────────────────────────┐
│ Outer: src=203.0.113.50 dst=52.x.x.x (AWS VPN endpoint) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ENCRYPTED: src=10.0.1.50 dst=172.31.1.100 (preserved) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Routing Required
Both sides need routes pointing to VPN:
On-prem router: 172.31.0.0/16 → VPN tunnel
VPC route table: 10.0.0.0/16 → vgw-xxxxx
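On the AWS side, the VPC route can be added with boto3 (route table and gateway IDs are placeholders):
import boto3

ec2 = boto3.client('ec2')

# Send traffic destined for the on-prem network through the Virtual Private Gateway
ec2.create_route(
    RouteTableId='rtb-0123456789abcdef0',    # placeholder VPC route table
    DestinationCidrBlock='10.0.0.0/16',      # on-premises CIDR
    GatewayId='vgw-0123456789abcdef0'        # placeholder VGW
)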
VPN vs Direct Connect
| Aspect | VPN | Direct Connect |
|---|---|---|
| Connection | Over public internet | Dedicated physical cable |
| Setup time | Minutes | Weeks to months |
| Cost | Low | High |
| Bandwidth | Up to ~1.25 Gbps | 1-100 Gbps |
| Latency | Variable | Consistent |
| Encryption | Built-in (IPsec) | Not by default |
Transit Gateway (TGW)
Hub connecting multiple VPCs and on-premises networks.
┌─────┐ ┌─────┐ ┌─────┐
│VPC-A│ │VPC-B│ │VPC-C│
└──┬──┘ └──┬──┘ └──┬──┘
└──────┼───────┘
│
┌─────────▼─────────┐
│ Transit Gateway │
└─────────┬─────────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
VPN to On-Prem Direct Connect
- Central hub - add new VPCs easily
- VPN/Direct Connect connects once to TGW, reaches all VPCs
- Route tables control which networks can communicate
S3 Event Notifications
Triggers actions when events happen in S3 bucket.
S3 Bucket ──► Event Notification ──► Lambda / SQS / SNS / EventBridge
Event Types
| Category | Examples |
|---|---|
| Object created | s3:ObjectCreated:Put, s3:ObjectCreated:Copy |
| Object removed | s3:ObjectRemoved:Delete |
| Replication | s3:Replication:OperationFailedReplication |
| Lifecycle | s3:LifecycleExpiration:*, s3:LifecycleTransition |
| Restore | s3:ObjectRestore:Completed |
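Wiring a bucket to a Lambda function with boto3 (bucket, function ARN, and filters are placeholders; the function's resource policy must already allow S3 to invoke it):
import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='my-bucket',                                   # placeholder
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111111111111:function:process-upload',
            'Events': ['s3:ObjectCreated:*'],
            # Only objects under uploads/ ending in .jpg
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'uploads/'},
                {'Name': 'suffix', 'Value': '.jpg'}
            ]}}
        }]
    }
)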
S3 Notifications vs EventBridge
| | S3 Event Notifications | S3 → EventBridge |
|---|---|---|
| Destinations | Lambda, SQS, SNS only | 20+ AWS services |
| Filtering | Prefix/suffix only | Advanced (metadata, size) |
S3 Batch Operations
Run operations on billions of objects at once.
Manifest (list of objects) ──► S3 Batch Job ──► Operation on all objects
Operations
| Operation | Use Case |
|---|---|
| Copy | Migrate objects to another bucket |
| Invoke Lambda | Custom processing per object |
| Replace tags | Bulk update tags |
| Restore from Glacier | Bulk restore archived objects |
| Delete | Bulk delete |
DAX (DynamoDB Accelerator)
In-memory cache for DynamoDB. Microsecond latency for reads.
Application
│
│ Same DynamoDB API
▼
┌─────────────┐
│ DAX │ ← Microsecond (cache hit)
│ Cluster │
└──────┬──────┘
│ Cache miss
▼
┌─────────────┐
│ DynamoDB │ ← Millisecond
└─────────────┘
- API-compatible with DynamoDB (just change endpoint)
- Use case: Read-heavy workloads needing microsecond latency
RDS Proxy
Connection pooler for RDS/Aurora. Solves connection exhaustion.
Lambda (100s concurrent)
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────┐
│ RDS Proxy │ ← Pools connections
└────────┬────────┘
│ Few persistent connections
▼
┌─────────────────┐
│ RDS / Aurora │
└─────────────────┘
- Problem: Lambda spawns many connections, DB has limits
- Solution: Proxy reuses connections from pool
- Bonus: Faster failover for Aurora
DAX vs RDS Proxy
| | DAX | RDS Proxy |
|---|---|---|
| For | DynamoDB | RDS / Aurora |
| Purpose | Caching (latency) | Connection pooling |
AWS Service Catalog
Catalog of approved, pre-configured AWS resources for users to deploy.
Admin creates Products ──► Users see approved products only ──► Launch
(CloudFormation templates) (from shared Portfolios)
Key Concepts
| Term | What It Is |
|---|---|
| Product | CloudFormation template packaged for deployment |
| Portfolio | Collection of products, shared with users/accounts |
| Constraint | Rules (allowed parameters, launch role) |
Restrictions
| What | How |
|---|---|
| Allowed regions | Portfolio exists only in allowed regions |
| Allowed parameters | Template Constraint or AllowedValues in template |
| Permissions | Launch Constraint (IAM role used to deploy) |
Template Constraint Example
{
"Rules": {
"InstanceTypeRule": {
"Assertions": [{
"Assert": {
"Fn::Contains": [["t3.micro", "t3.small"], {"Ref": "InstanceType"}]
},
"AssertDescription": "Only t3.micro or t3.small allowed"
}]
}
}
}
CloudFormation Custom Resource
Run your own Lambda code during stack operations. For things CloudFormation doesn’t natively support.
CloudFormation ──► Your Lambda ──► Does custom work ──► Reports back
Syntax
Resources:
MyCustomResource:
Type: Custom::AnyNameYouWant # "Custom::" prefix required
Properties:
ServiceToken: !GetAtt MyLambda.Arn # Required: Lambda ARN
CustomParam1: value1 # Your custom inputs
CustomParam2: value2
Lambda Receives
{
"RequestType": "Create",
"ResourceProperties": {
"CustomParam1": "value1",
"CustomParam2": "value2"
},
"ResponseURL": "https://..."
}
Lambda Must
- Check RequestType (Create, Update, Delete)
- Do the work
- Send success/failure to ResponseURL (see the sketch below)
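A minimal responder sketch using only the standard library (the PhysicalResourceId and Data values here are illustrative; your actual work goes where noted):
import json
import urllib.request

def handler(event, context):
    request_type = event['RequestType']          # Create, Update or Delete
    props = event.get('ResourceProperties', {})

    # ... do the custom work here based on request_type and props ...

    # Report the result back to CloudFormation via the pre-signed ResponseURL
    response = {
        'Status': 'SUCCESS',                          # or 'FAILED'
        'Reason': 'See CloudWatch Logs',
        'PhysicalResourceId': 'my-custom-resource',   # stable ID for this resource
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        'Data': {'Output1': 'value'}                  # available via Fn::GetAtt
    }
    req = urllib.request.Request(
        event['ResponseURL'],
        data=json.dumps(response).encode('utf-8'),
        method='PUT',
        headers={'Content-Type': ''}
    )
    urllib.request.urlopen(req)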
Use Cases
- Create resources in other regions
- Call external APIs during deployment
- Complex logic CloudFormation can’t express
Kubernetes Namespace
Virtual cluster division within a Kubernetes cluster. Groups and isolates resources.
┌─────────────────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: default │ │
│ │ Deployment: web Service: web-service │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: production │ │
│ │ Deployment: api ConfigMap: prod-config │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Namespace: kube-system (Kubernetes internal) │ │
│ │ ConfigMap: aws-auth DaemonSet: kube-proxy │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Use Cases
| Purpose | Example |
|---|---|
| Environment separation | dev, staging, production namespaces |
| Team separation | team-a, team-b namespaces |
| Access control | Team A can only access team-a namespace |
DNS with Namespaces
Service DNS: <service-name>.<namespace>.svc.cluster.local
Examples:
- api-service.default.svc.cluster.local
- api-service.production.svc.cluster.local
Container Insights
CloudWatch feature that collects metrics and logs from containerized applications (ECS, EKS).
How It Works (EKS)
┌─────────────────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │CloudWatch│ │ │ │CloudWatch│ │ │ │CloudWatch│ │ │
│ │ │Agent │ │ │ │Agent │ │ │ │Agent │ │ │
│ │ │(DaemonSet)│ │ │ │(DaemonSet)│ │ │ │(DaemonSet)│ │ │
│ │ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │ │
│ └──────┼───────┘ └──────┼───────┘ └──────┼───────┘ │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ CloudWatch Metrics │
│ (namespace: ContainerInsights) │
└─────────────────────────────────────────────────────────────────┘
Key Metrics
| Metric | What it measures |
|---|---|
| pod_memory_utilization | % of memory limit used |
| pod_cpu_utilization | % of CPU limit used |
| pod_memory_working_set | Actual bytes in use |
Dimensions
Filter metrics by:
- ClusterName
- Namespace (Kubernetes namespace)
- Service (Kubernetes Service name)
- PodName
- NodeName
AWS Glue Crawler
Automatically scans data sources and creates table definitions in Glue Data Catalog.
┌─────────────────────────────────────────────────────────────────┐
│ S3 Bucket (/data/) │
│ sales.csv │
│ orders.json │
│ │ │
│ ▼ │
│ ┌─────────────┐ Detects: │
│ │ Crawler │ - File format (CSV, JSON, Parquet) │
│ │ │ - Column names │
│ │ │ - Data types │
│ │ │ - Partitions (year=2024/month=01/) │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Glue Data Catalog │ │
│ │ Database: my_database │ │
│ │ ├── Table: sales (id, product, amount) │ │
│ │ └── Table: orders (order_id, customer) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Now queryable with Athena: │
│ SELECT * FROM my_database.sales WHERE amount > 100 │
└─────────────────────────────────────────────────────────────────┘
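Creating and running a crawler with boto3 (crawler name, role, database, and path are placeholders):
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='sales-crawler',                                       # placeholder
    Role='arn:aws:iam::111111111111:role/glue-crawler-role',    # placeholder role
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/data/'}]}   # placeholder path
)

glue.start_crawler(Name='sales-crawler')   # tables appear in the Data Catalog when it finishes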
AWS Glue ETL
Serverless data transformation jobs. ETL = Extract, Transform, Load.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Extract │ │ Transform │ │ Load │
│ │ │ │ │ │
│ Read from │ → │ Clean, │ → │ Write to │
│ S3, RDS │ │ filter, │ │ S3, Redshift│
│ │ │ join │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
Key Points
| Aspect | Description |
|---|---|
| Serverless | Pay per second of job runtime |
| Engine | Apache Spark |
| Write in | Python (PySpark) or Scala |
| Triggers | On-demand, scheduled, or event-based |
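A minimal PySpark job sketch (database, table, and output path are placeholders):
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Extract: read the table the crawler created
sales = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='sales')             # placeholders

# Transform: keep only rows with amount > 100
big_sales = Filter.apply(frame=sales, f=lambda row: row['amount'] > 100)

# Load: write cleaned data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=big_sales,
    connection_type='s3',
    connection_options={'path': 's3://my-bucket/clean/'},   # placeholder
    format='parquet')

job.commit()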
Glue Components Together
S3 (raw) ──► Crawler ──► Data Catalog ──► ETL Job ──► S3 (clean)
│
▼
Athena (query)
SigV4 (Signature Version 4)
AWS’s method for authenticating API requests. Every AWS API call must be signed.
What It Proves
- You have valid AWS credentials
- Request hasn’t been modified in transit
- Request is recent (not replay attack)
The 4 Steps
1. Create Canonical Request
- Standardize HTTP method, path, headers, body hash
2. Create String to Sign
- Algorithm + timestamp + scope + hash of step 1
3. Calculate Signing Key
- Chain HMAC-SHA256 from Secret Key → date → region → service
4. Calculate Signature
- HMAC(signing_key, string_to_sign)
Result: Authorization Header
Authorization: AWS4-HMAC-SHA256
Credential=AKIAIOSFODNN7EXAMPLE/20241229/us-east-1/s3/aws4_request,
SignedHeaders=host;x-amz-date,
Signature=abc123def456...
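Step 3 (the signing-key chain) in plain Python; the secret key shown is fake:
import hmac
import hashlib

def _sign(key, msg):
    return hmac.new(key, msg.encode('utf-8'), hashlib.sha256).digest()

def signing_key(secret_key, date_stamp, region, service):
    # Chain of HMAC-SHA256: secret -> date -> region -> service -> "aws4_request"
    k_date = _sign(('AWS4' + secret_key).encode('utf-8'), date_stamp)
    k_region = _sign(k_date, region)
    k_service = _sign(k_region, service)
    return _sign(k_service, 'aws4_request')

key = signing_key('wJalrXUtnFEMI/K7MDENG/EXAMPLEKEY', '20241229', 'us-east-1', 's3')
# Step 4: signature = HMAC-SHA256(key, string_to_sign), hex-encoded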
You don’t do this manually. AWS SDKs and CLI handle it automatically.
CodeArtifact Domain
Container that groups multiple repositories. Provides shared storage, permissions, encryption.
┌─────────────────────────────────────────────────────────────────┐
│ CodeArtifact Domain │
│ (name: my-company) │
│ │
│ Shared: KMS key, IAM policies, deduplication │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Repository: │ │ Repository: │ │ Repository: │ │
│ │ npm-prod │ │ npm-dev │ │ python-internal │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Upstream Repository
Repository that another repository pulls from when package not found locally.
Developer: npm install lodash
│
▼
my-npm-repo (not found) ──► npm-public-proxy (not found) ──► npmjs.org
│ │ │
│◄─────────────────────────────┼────────────────────────────┘
│ Package cached at each level
Benefits
- Single endpoint for internal + public packages
- Caching (faster installs, works if npmjs down)
- Audit all package downloads
FSx Types Comparison
| FSx Type | Protocol | Best For |
|---|---|---|
| Windows File Server | SMB | Windows apps, Active Directory |
| Lustre | Lustre | HPC, ML, high-throughput |
| NetApp ONTAP | NFS, SMB, iSCSI | Enterprise, multi-protocol |
| OpenZFS | NFS | Linux workloads, snapshots |
Key Differences
| | Windows | Lustre | NetApp ONTAP | OpenZFS |
|---|---|---|---|---|
| OS support | Windows | Linux only | All | Linux, macOS |
| AD required | Yes | No | Optional | No |
| S3 integration | No | Yes (native) | No | No |
| Multi-protocol | No | No | Yes | No |
| Snapshots | Shadow copies | No | Yes | Yes |
| Multi-AZ | Yes | No | Yes | No |
EFS vs FSx for OpenZFS
Both are NFS for Linux, different design goals.
| Aspect | EFS | FSx for OpenZFS |
|---|---|---|
| Capacity | Auto-scales | You provision |
| Performance | Scales with size | Up to 1M IOPS |
| Latency | Milliseconds | Sub-millisecond |
| Snapshots | No | Yes (instant) |
| Clones | No | Yes (instant) |
| Multi-AZ | Yes | No |
| Best for | Shared storage, CMS | Databases, analytics |
AWS Storage Gateway
Hybrid storage connecting on-premises to AWS cloud storage.
┌─────────────────────────────────────────────────────────────────┐
│ On-Premises │
│ │
│ Application ──NFS/SMB/iSCSI──► Storage Gateway ──► AWS (S3, │
│ (VM or hardware) EBS, │
│ Local cache Glacier) │
└─────────────────────────────────────────────────────────────────┘
Gateway Types
| Type | Protocol | Backend | Use Case |
|---|---|---|---|
| S3 File Gateway | NFS, SMB | S3 | File shares backed by S3 |
| FSx File Gateway | SMB | FSx for Windows | Low-latency FSx access |
| Volume Gateway | iSCSI | S3 + EBS | Block storage, DR |
| Tape Gateway | iSCSI (VTL) | S3, Glacier | Backup (replaces tapes) |
EFS Mount Target
Network endpoint (ENI) in a specific AZ for EC2 to connect to EFS.
┌─────────────────────────────────────────────────────────────────┐
│ VPC │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ AZ-a │ │ AZ-b │ │
│ │ │ │ │ │
│ │ EC2 ──► Mount Target │ │ EC2 ──► Mount Target │ │
│ │ (ENI) │ │ (ENI) │ │
│ │ 10.0.1.25 │ │ 10.0.2.30 │ │
│ └────────────┬────────────┘ └────────────┬────────────┘ │
│ └────────────┬─────────────────┘ │
│ ▼ │
│ EFS │
└─────────────────────────────────────────────────────────────────┘
Key Points
- One mount target per AZ (for low latency, no cross-AZ costs)
- Has its own security group (allow NFS port 2049)
- EFS DNS resolves to nearest mount target
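Creating one mount target per AZ with boto3 (file system, subnet, and security group IDs are placeholders):
import boto3

efs = boto3.client('efs')

# One mount target per AZ; the security group must allow inbound NFS (TCP 2049)
for subnet in ['subnet-az-a', 'subnet-az-b']:          # placeholder subnet IDs
    efs.create_mount_target(
        FileSystemId='fs-0123456789abcdef0',           # placeholder
        SubnetId=subnet,
        SecurityGroups=['sg-0123456789abcdef0']        # placeholder SG
    )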
AWS SAM (Serverless Application Model)
Framework for building serverless applications. Simplified CloudFormation + CLI tools.
SAM Template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31 # ← SAM marker
Resources:
MyFunction:
Type: AWS::Serverless::Function # ← SAM resource
Properties:
Handler: index.handler
Runtime: python3.11
CodeUri: ./src
Events:
Api:
Type: Api
Properties:
Path: /hello
Method: GET
Automatically creates: Lambda + API Gateway + IAM role + permissions
SAM CLI Commands
sam init # Create new project
sam build # Install dependencies
sam local invoke # Run Lambda locally
sam local start-api # Local API Gateway
sam deploy # Deploy to AWS
cloudformation package
Uploads local files to S3 and rewrites template with S3 URLs.
BEFORE (template.yaml):
Code: ./src ← Local path
│
│ aws cloudformation package \
│ --template-file template.yaml \
│ --s3-bucket my-bucket \
│ --output-template-file packaged.yaml
▼
AFTER (packaged.yaml):
Code:
S3Bucket: my-bucket ← S3 reference
S3Key: abc123...
Workflow
# 1. Package: upload to S3, generate new template
aws cloudformation package \
--template-file template.yaml \
--s3-bucket my-bucket \
--output-template-file packaged.yaml
# 2. Deploy: use packaged template
aws cloudformation deploy \
--template-file packaged.yaml \
--stack-name my-stack
deploy reads local packaged.yaml file. S3 bucket/key is embedded in the template.
Trusted Advisor Service Limits
Checks that compare your current AWS resource usage against default service quotas (limits).
What It Does
Trusted Advisor Service Limits Check:
Your Usage Service Quota Status
─────────────────────────────────────────────────
45 EC2 instances 50 (default limit) ⚠️ 90% - Yellow (warning)
3 VPCs 5 (default limit) ✓ 60% - Green (OK)
5 Elastic IPs 5 (default limit) 🔴 100% - Red (at limit)
Status Thresholds
| Status | Meaning |
|---|---|
| 🟢 Green | Usage < 80% of limit |
| 🟡 Yellow | Usage ≥ 80% of limit (warning) |
| 🔴 Red | Usage ≥ 100% of limit (at or over) |
Example Checks
- EC2 On-Demand instances (per instance type, per region)
- VPCs per region
- Elastic IP addresses
- EBS volumes
- RDS instances
- IAM roles, users, groups
- S3 buckets
- Lambda concurrent executions
- Auto Scaling groups
Important Limitations
| Limitation | Detail |
|---|---|
| Default quotas only | Doesn’t know about quota increases you’ve requested |
| Not real-time | Refreshes periodically (manual refresh available) |
| Subset of services | Doesn’t cover all AWS services/quotas |
| Basic Support | Service Limits checks are free (unlike most Trusted Advisor checks) |
Service Limits vs Service Quotas
| | Trusted Advisor Service Limits | Service Quotas (service) |
|---|---|---|
| Purpose | Monitor usage vs limits | View/request quota increases |
| Shows current usage | Yes | Yes |
| Shows applied quotas | No (default only) | Yes (actual applied quota) |
| Request increases | No | Yes |
| API | support:DescribeTrustedAdvisorChecks | service-quotas:* |
Better Alternative: Service Quotas + CloudWatch
For accurate monitoring including custom quota increases:
Service Quotas → CloudWatch Metrics → CloudWatch Alarm
Metric: AWS/Usage → ResourceCount
Alarm: When usage > 80% of AppliedQuota
This reflects your actual quota (including increases), not just defaults.
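Reading applied quotas programmatically with boto3 (a sketch; pagination omitted, and quota names vary by service):
import boto3

quotas = boto3.client('service-quotas')

# List applied (not default) quotas for EC2 and print them
for quota in quotas.list_service_quotas(ServiceCode='ec2')['Quotas']:
    print(quota['QuotaName'], quota['Value'])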
When Trusted Advisor Service Limits Is Useful
- Quick overview across many services
- Accounts without quota increases (defaults apply)
- Free tier / Basic Support accounts (Service Limits checks are free)
cfn-init and cfn-hup
CloudFormation helper scripts that run on EC2 instances to configure them based on metadata in your template.
| Script | What it does |
|---|---|
| cfn-init | Reads metadata from template, configures the instance (install packages, create files, run commands) |
| cfn-hup | Daemon that watches for metadata changes and re-runs cfn-init when template is updated |
hup = HangUP signal (SIGHUP). In Unix, sending SIGHUP to a daemon tells it to reload configuration. cfn-hup = daemon that watches for config changes and reloads.
The Problem They Solve
UserData (imperative): cfn-init (declarative):
───────────────────── ────────────────────────
yum install -y httpd packages:
systemctl start httpd yum:
echo "hello" > /var/www/html/index httpd: []
services:
sysvinit:
httpd: {enabled: true}
files:
/var/www/html/index.html:
content: "hello"
Where Configuration Lives
In the Metadata section of your EC2 resource in the CloudFormation template:
Resources:
MyInstance:
Type: AWS::EC2::Instance
Metadata: # ← cfn-init reads this
AWS::CloudFormation::Init:
config:
packages:
yum:
httpd: []
files:
/var/www/html/index.html:
content: "Hello World"
services:
sysvinit:
httpd:
enabled: true
ensureRunning: true
Properties:
ImageId: ami-xxxxx
UserData: # ← Calls cfn-init
Fn::Base64: !Sub |
#!/bin/bash
yum install -y aws-cfn-bootstrap
/opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}
cfn-init Configuration Sections
| Section | What it configures |
|---|---|
| packages | Install packages (yum, apt, rpm, python, rubygems) |
| groups | Create Linux groups |
| users | Create Linux users |
| sources | Download and extract archives (tar, zip) |
| files | Create files with content, permissions, owner |
| commands | Run shell commands |
| services | Enable/start/stop services (sysvinit, systemd) |
Execution order: packages → groups → users → sources → files → commands → services
cfn-hup Configuration Files
Two files needed on the instance:
| File | Purpose |
|---|---|
| /etc/cfn/cfn-hup.conf | Main config: which stack to watch, poll interval |
| /etc/cfn/hooks.d/*.conf | Hook definitions: what to run when changes detected |
files:
/etc/cfn/cfn-hup.conf:
content: !Sub |
[main]
stack=${AWS::StackId}
region=${AWS::Region}
interval=5
mode: "000400"
owner: root
group: root
/etc/cfn/hooks.d/cfn-auto-reloader.conf:
content: !Sub |
[cfn-auto-reloader-hook]
triggers=post.update
path=Resources.MyInstance.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}
runas=root
mode: "000400"
owner: root
group: root
services:
sysvinit:
cfn-hup:
enabled: true
ensureRunning: true
files:
- /etc/cfn/cfn-hup.conf
- /etc/cfn/hooks.d/cfn-auto-reloader.conf
What Actually Happens
Initial Deployment (cfn-init):
CloudFormation creates EC2 instance
↓
EC2 boots, runs UserData script
↓
UserData calls: /opt/aws/bin/cfn-init -s MyStack -r MyInstance
↓
cfn-init fetches Metadata from CloudFormation API
↓
cfn-init executes: packages → files → commands → services
↓
Instance configured and running
Stack Update (cfn-hup):
You update CloudFormation template (change Metadata)
↓
CloudFormation updates stack
↓
cfn-hup daemon polls every N minutes, detects change
↓
cfn-hup runs action: /opt/aws/bin/cfn-init ...
↓
cfn-init re-applies configuration
Summary
| Script | When it runs | Purpose |
|---|---|---|
| cfn-init | Once at instance launch (from UserData) | Initial configuration |
| cfn-hup | Continuously as daemon | Detect metadata changes, re-run cfn-init |
Without cfn-hup: Metadata changes require instance replacement or manual intervention.
With cfn-hup: Instance automatically reconfigures itself when you update the stack.
VPC Endpoints and PrivateLink
VPC Endpoint: Private connection from your VPC to a service (no internet needed).
PrivateLink: The underlying AWS technology that powers VPC endpoints.
Two Types of VPC Endpoints
| Type | What it connects to | How it works | Cost |
|---|---|---|---|
| Gateway Endpoint | S3, DynamoDB only | Route table entry (no ENI) | Free |
| Interface Endpoint | Most AWS services + your own services | Creates ENI in your subnet | ~$0.01/hr + data |
┌─────────────────────────────────────────────────────────────────┐
│ Your VPC │
│ │
│ Gateway Endpoint (S3/DynamoDB): │
│ - Entry in route table │
│ - No ENI, no IP address │
│ - Free │
│ │
│ Interface Endpoint (PrivateLink): │
│ - Creates ENI with private IP │
│ - Works for 100+ AWS services │
│ - Works for your own services (via NLB) │
└─────────────────────────────────────────────────────────────────┘
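Creating a free Gateway Endpoint for S3 with boto3 (VPC and route table IDs are placeholders):
import boto3

ec2 = boto3.client('ec2')

# Adds an S3 route to the listed route tables; no ENI is created
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',                  # placeholder
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0']         # placeholder
)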
Is NLB Needed?
| Connecting to | NLB needed? |
|---|---|
| AWS services (S3, SQS, Lambda, etc.) | No—AWS manages it |
| Your own service in another VPC/account | Yes—you create NLB + Endpoint Service |
Cross-Account Connectivity
VPC endpoints connect to services, not VPCs directly. To connect to another VPC/account:
Provider Account Consumer Account
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ │ │ │
│ App ← NLB ← Endpoint │◄─────────│ Interface → App │
│ Service │PrivateLink│ Endpoint │
│ │ │ │
└─────────────────────────────┘ └─────────────────────────────┘
Provider creates: NLB + Endpoint Service + allow consumer accounts
Consumer creates: Interface Endpoint using provider’s service name
Setup Flow
# Provider: Create endpoint service
aws ec2 create-vpc-endpoint-service-configuration \
--network-load-balancer-arns <nlb-arn> \
--acceptance-required
# Returns: ServiceName: com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx
# Provider: Allow consumer account
aws ec2 modify-vpc-endpoint-service-permissions \
--service-id vpce-svc-xxxxxxxxx \
--add-allowed-principals arn:aws:iam::222222222222:root
# Consumer: Create interface endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-consumer \
--service-name com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx \
--vpc-endpoint-type Interface \
--subnet-ids subnet-aaa
# Provider: Accept connection (if acceptance-required)
aws ec2 accept-vpc-endpoint-connections \
--service-id vpce-svc-xxxxxxxxx \
--vpc-endpoint-ids vpce-xxxxxxxxx
How Consumer Routes Requests
Consumer app uses endpoint DNS or ENI private IP:
# Endpoint DNS (auto-provided)
curl http://vpce-xxx.vpce-svc-xxx.us-east-1.vpce.amazonaws.com
# Or ENI private IP directly
curl http://10.1.0.50
No route table changes needed—Interface Endpoint ENI handles routing automatically.
Summary
| Want to connect to… | What you need |
|---|---|
| S3 / DynamoDB | Gateway Endpoint (free, same region) |
| AWS services (SQS, Lambda, etc.) | Interface Endpoint |
| Another VPC/account’s app | Interface Endpoint → their Endpoint Service + NLB |