Aurora

AWS-managed relational database (MySQL/PostgreSQL compatible) with cloud-native architecture. Storage and compute are separated.

Aurora Cluster (Single Region)

One primary instance (read/write) + optional read replicas sharing the same storage.

      Writer Endpoint                    Reader Endpoint
            │                                  │
            ▼                                  ▼
     ┌──────────────┐              ┌──────────────┬──────────────┐
     │   Primary    │              │  Replica 1   │  Replica 2   │
     │  (Writer)    │              │  (Reader)    │  (Reader)    │
     └──────┬───────┘              └──────┬───────┴──────┬───────┘
            │                             │              │
            └─────────────┬───────────────┴──────────────┘
                          ▼
             ┌────────────────────────────────────────┐
             │     Shared Cluster Storage            │
             │     (6 copies across 3 AZs)           │
             │     Auto-grows up to 128 TB           │
             └────────────────────────────────────────┘
  • All instances share same storage (no replication lag for storage)
  • Replicas can be promoted to primary if primary fails (~30 seconds failover)
  • Up to 15 read replicas
  • Single region only
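
A minimal sketch of adding a reader instance to an existing cluster (instance, cluster, and class names are placeholders):

# Add a reader to an existing Aurora MySQL cluster (illustrative identifiers)
aws rds create-db-instance \
  --db-instance-identifier my-aurora-reader-1 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-mysql \
  --db-instance-class db.r6g.large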

Aurora Storage

A single logical storage volume, automatically replicated across 3 AZs (6 copies total, 2 per AZ).

  • Write: Need 4 of 6 copies to acknowledge (can lose 2)
  • Read: Need 3 of 6 copies to respond (can lose 3)
  • Even if entire AZ fails (2 copies gone), writes still work

Aurora Global Database (Multi-Region)

Multiple Aurora clusters across different AWS regions with replication between them.

Primary Region (us-east-1)              Secondary Region (eu-west-1)
┌─────────────────────────┐            ┌─────────────────────────┐
│  Primary Cluster        │            │  Secondary Cluster      │
│  ┌────────┐ ┌────────┐  │            │  ┌────────┐ ┌────────┐  │
│  │Primary │ │Replica │  │            │  │Replica │ │Replica │  │
│  │(R/W)   │ │(R)     │  │            │  │(R only)│ │(R only)│  │
│  └────┬───┘ └────┬───┘  │            │  └────┬───┘ └────┬───┘  │
│       └─────┬────┘      │            │       └─────┬────┘      │
│             ▼           │   Async    │             ▼           │
│  ┌──────────────────┐   │  <1 sec    │  ┌──────────────────┐   │
│  │ Cluster Storage  │───┼───────────►│  │ Cluster Storage  │   │
│  └──────────────────┘   │            │  └──────────────────┘   │
└─────────────────────────┘            └─────────────────────────┘
  • Cross-region disaster recovery
  • Replication lag typically < 1 second
  • Secondary region is read-only until promoted
  • Up to 5 secondary regions

Comparison

Aspect         | Aurora Cluster               | Aurora Global Database
Scope          | Single region                | Multiple regions
Write location | Primary instance             | Primary region only
Replication    | Shared storage (instant)     | Cross-region async (<1 sec)
Failover       | ~30 seconds (within region)  | Minutes (cross-region)
Use case       | HA within region             | DR + global reads

See AWS RDS, Aurora, and EBS Storage Basics for details.

Auto Scaling Group (ASG)

Maintains a fleet of EC2 instances: launches when needed, terminates when not, replaces unhealthy ones.

Core Concept

Capacity Settings:
  Minimum: 2    (never go below)
  Desired: 4    (try to maintain)
  Maximum: 10   (never exceed)

Components Relationship

ALB ──► Target Group ◄─── ASG registers/deregisters instances automatically
              │                    │
              ▼                    │
         ┌─────────┐               │
         │ EC2-1   │ ◄─────────────┤ ASG launches
         │ EC2-2   │ ◄─────────────┤
         │ EC2-3   │ ◄─────────────┘
         └─────────┘
  • Launch Template: Defines instance config (AMI, instance type, SG, user data)
  • Target Group: List of instances ALB sends traffic to
  • ASG: Creates/terminates instances, registers them to Target Group
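
A minimal sketch of wiring these pieces together (ASG name, launch template, subnet IDs, and target group ARN are placeholders):

# Create an ASG from a launch template and attach it to a target group
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template LaunchTemplateName=my-template,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 4 \
  --vpc-zone-identifier "subnet-aaa,subnet-bbb" \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/my-tg/abc123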

Scaling Types

Type       | How It Works
Manual     | You change desired capacity
Dynamic    | CloudWatch alarm triggers scaling policy
Scheduled  | Time-based (e.g., scale up at 9 AM)
Predictive | ML-based, scales proactively based on patterns

Dynamic Scaling Policies

Policy          | Description
Target Tracking | “Keep CPU at 50%” - ASG figures out instance count
Step Scaling    | Different actions at different thresholds
Simple Scaling  | Single action when alarm triggers
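
A target tracking sketch (ASG and policy names are placeholders):

# Keep average CPU across the group at 50%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name keep-cpu-at-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 50.0
  }'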

Useful Metrics for Scaling

Workload           | Recommended Metric
Web app behind ALB | RequestCountPerTarget (ALB)
API servers        | CPUUtilization (EC2)
Queue workers      | ApproximateNumberOfMessages (SQS)
Memory-intensive   | mem_used_percent (requires CloudWatch Agent)

Note: Memory and disk space metrics require CloudWatch Agent because hypervisor cannot see inside VM. See EC2 CloudWatch Metrics - Why Some Require Agent for details.

Health Checks

ASG determines instance health from EC2 status checks and/or ELB health checks (the ALB marks instances as unhealthy).

Type | Source            | Use Case
EC2  | EC2 status checks | Basic - is instance running?
ELB  | ALB health check  | App-level - is app responding?

Unhealthy instance → ASG terminates → launches replacement

Grace Period: Time after launch before health checks start (default 300s)
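
A minimal sketch of switching an ASG to ELB health checks (ASG name is a placeholder):

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type ELB \
  --health-check-grace-period 300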

Key Features

Feature              | Purpose
AZ Balancing         | Distributes instances evenly across AZs
Termination Policies | Controls which instance to remove when scaling in
Lifecycle Hooks      | Run custom actions during launch/terminate
Instance Refresh     | Rolling update all instances (e.g., new AMI)
Warm Pools           | Pre-initialized instances for faster scaling
Mixed Instances      | Multiple instance types + Spot/On-Demand mix
Cooldown             | Prevents rapid scale in/out oscillation

Mixed Instances Policy

Configured on ASG (not Launch Template). Allows multiple instance types and purchase options.

Instance Types: [t3.medium, t3.large, t3a.medium]
Purchase Options:
  On-Demand base: 2 instances
  Spot percentage: 80%

Spot vs On-Demand

Aspect       | On-Demand          | Spot
Price        | Full price         | 60-90% discount
Availability | Always             | When spare capacity exists
Interruption | Never              | Can be interrupted (2-min warning)
Use case     | Critical workloads | Batch jobs, fault-tolerant apps

ECS Task

A Task is a running instance of your containers - the actual process running based on a Task Definition.

Task vs Task Definition

Task Definition (blueprint):          Task (running instance):
┌─────────────────────────┐           ┌─────────────────────────┐
│ "Use nginx image"       │           │ nginx container running │
│ "Give it 512MB RAM"     │  ──run──► │ Using 512MB RAM         │
│ "Open port 80"          │           │ Listening on port 80    │
│ "Set ENV=production"    │           │ ENV=production set      │
└─────────────────────────┘           └─────────────────────────┘
       (JSON config)                       (actual process)
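
A sketch of registering a blueprint like the one above (family, image, and values are illustrative):

# Minimal Fargate-compatible task definition
aws ecs register-task-definition \
  --family web \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --cpu 256 --memory 512 \
  --container-definitions '[{
    "name": "nginx",
    "image": "nginx:latest",
    "essential": true,
    "portMappings": [{"containerPort": 80}],
    "environment": [{"name": "ENV", "value": "production"}]
  }]'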

Two Ways to Run Tasks

Method          | Behavior                           | Use Case
Service         | Keeps desired count always running | Web servers, APIs
Standalone Task | Run once, then stop                | Batch jobs, migrations

ECS Service (desired: 3 tasks):
┌─────────────────────────────────────────────┐
│  Task 1 (running)  ✓                        │
│  Task 2 (running)  ✓                        │
│  Task 3 (running)  ✓                        │
│                                             │
│  If Task 2 crashes → Service starts new one │
└─────────────────────────────────────────────┘
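
A sketch of both ways to run tasks (cluster, subnets, and security group IDs are placeholders):

# Service: keep 3 tasks running
aws ecs create-service \
  --cluster my-cluster \
  --service-name web \
  --task-definition web \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaa],securityGroups=[sg-xxx],assignPublicIp=ENABLED}'

# Standalone task: run once (e.g., a migration), then it stops
aws ecs run-task --cluster my-cluster --task-definition migrate --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaa],securityGroups=[sg-xxx]}'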

What’s Inside a Task

A task can have multiple containers that share network, storage, and lifecycle.

Task
┌─────────────────────────────────────────────┐
│  ┌─────────────┐    ┌─────────────┐         │
│  │ Container 1 │    │ Container 2 │         │
│  │ (nginx)     │◄──►│ (php-fpm)   │         │
│  │ port 80     │    │ port 9000   │         │
│  └─────────────┘    └─────────────┘         │
│         │                  │                │
│         └──── localhost ───┘                │
│                                             │
│  Shared: IP address, volumes, lifecycle     │
│  Task IP: 10.0.1.50                         │
└─────────────────────────────────────────────┘

Task Placement: One Task = One Instance

A task runs on exactly one EC2 instance. Cannot span multiple instances.

✓ Correct:
┌─────────────────┐    ┌─────────────────┐
│ EC2 Instance A  │    │ EC2 Instance B  │
│ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │   Task 1    │ │    │ │   Task 2    │ │
│ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘

✗ Not possible (task cannot span instances):
┌─────────────────┐    ┌─────────────────┐
│ EC2 Instance A  │    │ EC2 Instance B  │
│ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │   Task 1    │◄┼────┼►│   Task 1    │ │
│ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘

To scale, run multiple tasks across instances with a load balancer.

ECS on EC2 vs Fargate

               | ECS on EC2                  | Fargate
Infrastructure | You manage EC2 instances    | AWS manages
Kernel sharing | Tasks share EC2’s OS kernel | Each task has own micro-VM
Isolation      | Process-level (namespaces)  | Hardware-level (hypervisor)

ECS on EC2:
┌─────────────────────────────────────────────────────────┐
│  EC2 Instance (Guest OS)                                │
│  ┌─────────────────────────────────────────────────┐   │
│  │  Docker Engine                                   │   │
│  │  ┌─────────────┐  ┌─────────────┐               │   │
│  │  │ Task 1      │  │ Task 2      │  ← Share OS   │   │
│  │  │ (container) │  │ (container) │    kernel     │   │
│  │  └─────────────┘  └─────────────┘               │   │
│  └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Fargate (using Firecracker micro-VMs):
┌─────────────────────────────────────────────────────────┐
│  AWS-managed infrastructure                             │
│  ┌───────────────────┐    ┌───────────────────┐        │
│  │ micro-VM 1        │    │ micro-VM 2        │        │
│  │ ┌───────────────┐ │    │ ┌───────────────┐ │        │
│  │ │ Minimal Linux │ │    │ │ Minimal Linux │ │        │
│  │ │ Kernel        │ │    │ │ Kernel        │ │        │
│  │ ├───────────────┤ │    │ ├───────────────┤ │        │
│  │ │ Container     │ │    │ │ Container     │ │        │
│  │ └───────────────┘ │    │ └───────────────┘ │        │
│  └───────────────────┘    └───────────────────┘        │
│           ↑                        ↑                    │
│           └── Separate kernels, fully isolated ─────────┘
└─────────────────────────────────────────────────────────┘

Fargate uses micro-VMs for multi-tenant security - your task can’t access other customers’ tasks.

Task Lifecycle

PROVISIONING → PENDING → RUNNING → STOPPED
     │            │          │         │
     │            │          │         └─ Container exited or stopped
     │            │          └─ Containers running
     │            └─ Waiting for resources
     └─ Preparing to launch

EventBridge Task State Detection

ECS sends task state change events to EventBridge.

{
  "source": "aws.ecs",
  "detail-type": "ECS Task State Change",
  "detail": {
    "lastStatus": "STOPPED",
    "stoppedReason": "Essential container in task exited",
    "containers": [{ "name": "web", "exitCode": 1 }]
  }
}

EventBridge rule pattern:

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["STOPPED"]
  }
}

Kinesis Data Streams

Collect and process large amounts of real-time data (logs, events, clicks, IoT data).

Producers                    Kinesis Data Stream                 Consumers
┌─────────┐                 ┌─────────────────────┐             ┌─────────┐
│ App 1   │────►            │                     │        ────►│ Lambda  │
│ App 2   │────►  records   │   Stream            │  records  ─►│ EC2 App │
│ IoT     │────►            │                     │        ────►│ Firehose│
└─────────┘                 └─────────────────────┘             └─────────┘

Data stays in the stream for 24 hours by default (configurable up to 365 days). Multiple consumers can read the same data.

Shard

A shard is a unit of capacity. More shards = more throughput.

Kinesis Data Stream (3 shards)
┌─────────────────────────────────────────────────────┐
│  ┌─────────────────┐  Shard 1: 1 MB/s in, 2 MB/s out│
│  │     Shard 1     │                               │
│  └─────────────────┘                               │
│  ┌─────────────────┐  Shard 2: 1 MB/s in, 2 MB/s out│
│  │     Shard 2     │                               │
│  └─────────────────┘                               │
│  ┌─────────────────┐  Shard 3: 1 MB/s in, 2 MB/s out│
│  │     Shard 3     │                               │
│  └─────────────────┘                               │
│  Total: 3 MB/s in, 6 MB/s out                      │
└─────────────────────────────────────────────────────┘

Per shard limits:

Direction  | Limit
Write (in) | 1 MB/sec or 1,000 records/sec
Read (out) | 2 MB/sec

Partition key determines which shard receives each record (hash-based).

Record with partition_key="user123"
        ↓
    hash("user123") → Falls into Shard 2's range
        ↓
    Record stored in Shard 2

Same partition key → same shard → ordered processing for that key.
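
A minimal write sketch (stream name and payload are placeholders):

# The partition key controls shard placement
aws kinesis put-record \
  --stream-name my-stream \
  --partition-key user123 \
  --data '{"event":"click"}' \
  --cli-binary-format raw-in-base64-out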

Enhanced Fan-Out

Gives each consumer dedicated throughput instead of sharing.

Standard (shared):
Shard ──────────────────────────────────────────────
              2 MB/sec total shared
              ┌──────────┼──────────┐
              ▼          ▼          ▼
         Consumer A  Consumer B  Consumer C
         ~0.67 MB/s  ~0.67 MB/s  ~0.67 MB/s

Enhanced Fan-Out (dedicated):
Shard ──────────────────────────────────────────────
              2 MB/sec each dedicated
              ┌──────────┼──────────┐
              ▼          ▼          ▼
         Consumer A  Consumer B  Consumer C
         2 MB/sec    2 MB/sec    2 MB/sec

                      | Standard          | Enhanced Fan-Out
Throughput per shard  | 2 MB/sec shared   | 2 MB/sec per consumer
Delivery              | Pull (GetRecords) | Push (SubscribeToShard)
Latency               | ~200ms            | ~70ms
Consumer registration | Not needed        | Required
ARN used              | Stream ARN        | Consumer ARN

Standard mode: No consumer registration needed. GetRecords API and Lambda use stream ARN directly.

Enhanced Fan-Out: Must register consumer first, then use consumer ARN.

# Standard - no registration, use stream ARN
aws lambda create-event-source-mapping \
  --function-name my-function \
  --event-source-arn arn:aws:kinesis:...:stream/my-stream \
  --starting-position LATEST

# Enhanced Fan-Out - register first, then use consumer ARN
aws kinesis register-stream-consumer \
  --stream-arn arn:aws:kinesis:us-east-1:123456789:stream/my-stream \
  --consumer-name my-consumer

aws lambda create-event-source-mapping \
  --function-name my-function \
  --event-source-arn arn:aws:kinesis:...:stream/my-stream/consumer/my-consumer:123 \
  --starting-position LATEST

Batch Size and Batching Window

Control how records are delivered to Lambda.

aws lambda create-event-source-mapping \
  --function-name my-function \
  --event-source-arn arn:aws:kinesis:...:stream/my-stream \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 30 \
  --starting-position LATEST

Lambda invokes when EITHER condition is met:

  • batch-size records collected (default: 100, max: 10,000)
  • maximum-batching-window-in-seconds passed (default: 0, max: 300)

Records in 30 sec | What happens
150 records       | Invokes at 100 records (batch size hit first)
50 records        | Invokes at 30 seconds with 50 records (timeout hit first)
0 records         | No invocation

Lambda Concurrency and Processing Settings

Key concept: 1 invocation = 1 Lambda instance. Multiple concurrent invocations = multiple instances.

Concurrency Quota: 1000 per region (default), which means 1000 Lambda instances at the same time.

Reserved Concurrency

Guarantee and limit concurrency for a specific function.

Without reserved concurrency:
  Function A spike could starve other functions

With reserved concurrency:
  Function A: reserved 100 (guaranteed, max 100)
  Function B: reserved 200 (guaranteed, max 200)
  Function C: unreserved (uses remaining 700)

Set to 0 = function disabled.
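
A minimal sketch (function name and value are placeholders):

# Reserve 100 concurrent executions for one function
aws lambda put-function-concurrency \
  --function-name my-function \
  --reserved-concurrent-executions 100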

ParallelizationFactor

Process one Kinesis/DynamoDB shard with multiple Lambda instances in parallel.

ParallelizationFactor = 1 (default):
Shard 1 ──► Instance 1
Shard 2 ──► Instance 2
Total instances = 2

ParallelizationFactor = 3:
Shard 1 ──► Instance 1, Instance 2, Instance 3
Shard 2 ──► Instance 4, Instance 5, Instance 6
Total instances = shards × factor = 2 × 3 = 6

Max: 10
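
A minimal sketch of changing the factor on an existing mapping (the UUID placeholder stays as-is):

aws lambda update-event-source-mapping \
  --uuid <mapping-uuid> \
  --parallelization-factor 3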

ReportBatchItemFailures

Retry only failed records, not entire batch.

Without ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry ALL [1,2,3,4,5]

With ReportBatchItemFailures:
Batch [1,2,3,4,5] → record 3 fails → retry from 3: [3,4,5]

How it works:

┌─────────────────────────────────────────────────────────────────┐
│ Lambda Service (AWS managed)                                    │
│                                                                 │
│  1. Pulls records from Kinesis shard                            │
│  2. Invokes your function with batch of records                 │
│  3. Reads your function's return value                          │
│  4. Retries only failed records based on your response          │
└─────────────────────────────────────────────────────────────────┘
         │                              ▲
         │ event.Records                │ return {"batchItemFailures": [...]}
         ▼                              │
┌─────────────────────────────────────────────────────────────────┐
│ Your Lambda Function Code                                       │
│  - Receives records (doesn't pull from Kinesis)                 │
│  - Processes them                                               │
│  - Returns which ones failed                                    │
└─────────────────────────────────────────────────────────────────┘

Enable:

aws lambda update-event-source-mapping \
  --uuid <mapping-uuid> \
  --function-response-types "ReportBatchItemFailures"

Lambda response:

def handler(event, context):
    failures = []
    for record in event['Records']:  # records from Kinesis
        try:
            process(record)
        except Exception:
            failures.append({"itemIdentifier": record['kinesis']['sequenceNumber']})
    return {"batchItemFailures": failures}  # tell Lambda which failed

Kinesis Data Firehose

Fully managed delivery service. No consumer code needed.

Producers ──► Firehose ──► S3 / Redshift / OpenSearch / Splunk / HTTP

When to Use Firehose vs Data Streams

          | Data Streams         | Firehose
Purpose   | Real-time processing | Delivery to storage
You write | Consumer code        | Nothing
Latency   | Milliseconds         | 60+ seconds (buffered)
Retention | 24h - 365 days       | None (delivered after buffering, not stored for replay)

Batching

Firehose buffers records and delivers as batched files, not individual records.

Without Firehose:           With Firehose:
Record 1 → file1.json       Record 1 ─┐
Record 2 → file2.json       Record 2 ─┼─► Buffer ──► one-big-file.json
Record 3 → file3.json       Record 3 ─┘
(millions of tiny files)    (fewer, larger files)

Buffer Settings

Setting         | Range          | Behavior
Buffer size     | 1-128 MB       | Flush when size reached
Buffer interval | 60-900 seconds | Flush when time elapsed

Whichever comes first triggers delivery.
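
A minimal sketch of a delivery stream to S3 with explicit buffering hints (role ARN, bucket, and stream name are placeholders; only the relevant fields shown):

aws firehose create-delivery-stream \
  --delivery-stream-name my-firehose \
  --extended-s3-destination-configuration '{
    "RoleARN": "arn:aws:iam::111111111111:role/firehose-role",
    "BucketARN": "arn:aws:s3:::my-bucket",
    "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300}
  }'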

Format Conversion

Firehose can convert JSON to columnar formats automatically:

JSON records ──► Firehose ──► Parquet/ORC files in S3
  • Better for Athena/Redshift queries (faster, cheaper)
  • Requires schema (from AWS Glue Data Catalog)

Optional Lambda Transform

Transform records before delivery:

Producers ──► Firehose ──► Lambda (transform) ──► S3
                              │
                              └── Add fields, filter, convert format
import base64

def handler(event, context):
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        # Transform the data
        transformed = payload.upper()
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8')
        })
    return {'records': output}

ECR Image Scanning

ECR scanning analyzes container images for security vulnerabilities (CVEs - Common Vulnerabilities and Exposures).

Two Scanning Options

       | Basic Scanning      | Enhanced Scanning
Engine | Clair (open source) | Amazon Inspector
Scope  | OS packages only    | OS packages + application dependencies
When   | On-push or manual   | Continuous (auto re-scan on new CVEs)
Cost   | Free                | Pay per image scanned
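
A sketch of enabling each option (repository name is a placeholder):

# Basic: scan-on-push for a single repository
aws ecr put-image-scanning-configuration \
  --repository-name my-app \
  --image-scanning-configuration scanOnPush=true

# Enhanced: registry-wide continuous scanning via Amazon Inspector
aws ecr put-registry-scanning-configuration \
  --scan-type ENHANCED \
  --rules '[{"scanFrequency":"CONTINUOUS_SCAN","repositoryFilters":[{"filter":"*","filterType":"WILDCARD"}]}]'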

Basic Scanning

Uses Clair scanner. Only scans OS-level packages (apt, yum).

Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip)        │ ← NOT scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...)   │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04)           │ ← Scanned
└─────────────────────────────────────┘
  • Triggered on image push or manual API call
  • Results are static until next scan
  • New CVE discovered tomorrow → won’t know until re-scan

Enhanced Scanning

Uses Amazon Inspector. Scans OS packages AND application dependencies.

Image layers scanned:
┌─────────────────────────────────────┐
│ App code (node_modules, pip)        │ ← Scanned
├─────────────────────────────────────┤
│ OS packages (apt-get install ...)   │ ← Scanned
├─────────────────────────────────────┤
│ Base image (ubuntu:22.04)           │ ← Scanned
└─────────────────────────────────────┘
  • Continuous monitoring - auto re-scans when new CVEs published
  • Integrates with EventBridge for alerts
  • Supports: Java (Maven), JavaScript (npm), Python (pip), Go, .NET

Key Terms

  • CVE: Publicly known vulnerability with unique ID (e.g., CVE-2021-44228 = Log4Shell)
  • Clair: Open-source container vulnerability scanner
  • Amazon Inspector: AWS service for automated vulnerability management

Building Container Images

Two main AWS services for building container images.

CodeBuild

General-purpose build service. Most common for container CI/CD.

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  CodeCommit  │────►│  CodeBuild   │────►│     ECR      │
│  (source)    │     │ docker build │     │  (registry)  │
│              │     │ docker push  │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

buildspec.yml example:

version: 0.2
phases:
  pre_build:
    commands:
      - aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
  build:
    commands:
      - docker build -t $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .
  post_build:
    commands:
      - docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
  • Full control over build process
  • Integrates with CodePipeline
  • Can run tests, multi-stage builds, any custom logic

EC2 Image Builder

Automated image creation service. Can build AMIs or container images.

┌─────────────────────────────────────────────────────────────┐
│  EC2 Image Builder Pipeline                                 │
│                                                             │
│  Recipe ──► Build ──► Test ──► Distribute to ECR            │
└─────────────────────────────────────────────────────────────┘

Container Recipe options:

  1. Use components (no Dockerfile) - Image Builder applies changes to base image
  2. Provide Dockerfile from S3

Key terms:

  • Recipe: Base image + components or Dockerfile
  • Component: Reusable build/test action (install packages, configure, etc.)
  • Pipeline: Automated workflow with schedule

Console steps for container image:

  1. Create Container Recipe - base image + components or Dockerfile S3 path + target ECR repo
  2. Create Infrastructure Configuration - instance type, IAM role, VPC/subnet for build
  3. Create Distribution Settings - target ECR repositories (can be cross-region/cross-account)
  4. Create Pipeline - link recipe + infrastructure + distribution + schedule
  5. Run Pipeline - builds and pushes to ECR

When to Use Which

Use Case                                      | Better Choice
CI/CD triggered by code commits               | CodeBuild
Scheduled golden image builds                 | EC2 Image Builder
Need component library (CIS benchmarks, etc.) | EC2 Image Builder
Custom build logic, tests, multi-stage        | CodeBuild
Part of CodePipeline                          | CodeBuild

AWS App Runner

Fully managed service to run web apps/APIs. You provide code or container → App Runner handles everything.

You provide:                    App Runner handles:
┌─────────────────┐            ┌─────────────────────────────┐
│ Source code     │            │ Build                       │
│ (GitHub repo)   │───────────►│ Deploy                      │
│       OR        │            │ Scale (auto, including to 0)│
│ Container image │            │ Load balancing              │
│ (ECR)           │            │ HTTPS/TLS certificate       │
└─────────────────┘            │ Health checks               │
                               └─────────────────────────────┘
                                         │
                                         ▼
                               https://abc123.awsapprunner.com

Two Source Types

Source                | How It Works
Source code (GitHub)  | App Runner builds container automatically
Container image (ECR) | App Runner pulls and runs directly
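
A sketch of the ECR-image path (service name, image URI, access role, and port are placeholders):

aws apprunner create-service \
  --service-name my-app \
  --source-configuration '{
    "AuthenticationConfiguration": {"AccessRoleArn": "arn:aws:iam::111111111111:role/AppRunnerECRAccessRole"},
    "AutoDeploymentsEnabled": true,
    "ImageRepository": {
      "ImageIdentifier": "111111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "ImageRepositoryType": "ECR",
      "ImageConfiguration": {"Port": "8080"}
    }
  }'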

Comparison with Other Compute

              | App Runner           | ECS Fargate                     | Lambda
You manage    | Almost nothing       | Task definitions, services, ALB | Function code
Scaling       | Automatic            | You configure                   | Automatic
Min instances | Can scale to 0       | Min 1 task                      | N/A (event-driven)
Use case      | Simple web apps      | Complex container workloads     | Event processing
Pricing       | Per vCPU/memory hour | Per vCPU/memory hour            | Per request + duration

Key Features

  • Auto scaling: Based on concurrent requests, can scale to zero
  • Auto deployments: Trigger on ECR push or GitHub commit
  • VPC Connector: Access private resources (RDS, ElastiCache) in VPC
  • Custom domain: Bring your own domain with automatic TLS

When to Use App Runner

  • Simple web apps, APIs, microservices
  • Want zero infrastructure management
  • Don’t need ECS features (service mesh, complex networking)
  • Acceptable to use App Runner’s opinionated defaults

AWS Backup

Centralized service to manage backups across multiple AWS services from one place.

Without AWS Backup:                    With AWS Backup:
┌─────────┐ ┌─────────┐ ┌─────────┐   ┌─────────────────────────────┐
│   EC2   │ │   RDS   │ │   EFS   │   │       AWS Backup            │
│ snapshot│ │ snapshot│ │ backup  │   │  One backup plan for all    │
│ config  │ │ config  │ │ config  │   │  ┌─────┬─────┬─────┐        │
└─────────┘ └─────────┘ └─────────┘   │  │ EC2 │ RDS │ EFS │        │
     ↓           ↓           ↓        │  └─────┴─────┴─────┘        │
  Manage each separately              └─────────────────────────────┘

Supported: EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3, etc.

Core Concepts

Concept             | What It Is
Backup Plan         | When and how to backup (schedule, retention, copy rules)
Resource Assignment | What to backup (by resource ID or tags)
Backup Vault        | Where backups are stored (container for recovery points)
Recovery Point      | The actual backup data (snapshot, AMI, etc.)

Backup Vault Features

Feature       | Purpose
Encryption    | All backups encrypted with KMS key
Access Policy | Control who can backup/restore/delete
Vault Lock    | WORM - prevent deletion even by root (compliance)
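
A sketch of creating and locking a vault (vault name and key ARN are placeholders):

# Vault encrypted with a specific KMS key
aws backup create-backup-vault \
  --backup-vault-name my-vault \
  --encryption-key-arn arn:aws:kms:us-east-1:111111111111:key/abcd-1234

# Vault Lock: recovery points must be kept at least 30 days
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name my-vault \
  --min-retention-days 30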

Cross-Account Backup Copy

Copy recovery points to another AWS account for disaster recovery.

Source Account (111)                    Destination Account (222)
┌─────────────────────┐                ┌─────────────────────┐
│  Backup Plan        │                │  Backup Vault       │
│  ┌───────────────┐  │                │                     │
│  │ Copy Rule:    │  │   copy         │  Access Policy:     │
│  │ Dest Vault ARN│──┼───────────────►│  Allow 111 to       │
│  └───────────────┘  │                │  CopyIntoBackupVault│
│                     │                │                     │
│  Source Vault       │                │  Recovery Point     │
│  (30 days retention)│                │  (90 days retention)│
└─────────────────────┘                └─────────────────────┘

Setup required:

  1. Source account: Backup plan with copy rule pointing to destination vault ARN
  2. Destination account: Vault access policy allowing backup:CopyIntoBackupVault

Cross-Account KMS Encryption

Behavior depends on whether the service supports “independent encryption” by AWS Backup.

Services WITH independent encryption (DynamoDB advanced, EFS):

  • AWS Backup handles encryption at vault level
  • No KMS key sharing needed

Services WITHOUT independent encryption (RDS, EC2/EBS):

  • Backup encrypted with data source’s KMS key (not vault key)
  • Destination account’s AWSServiceRoleForBackup performs the copy
  • Source KMS key must grant kms:Decrypt to destination account’s service-linked role
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::222222222222:role/aws-service-role/backup.amazonaws.com/AWSServiceRoleForBackup"
  },
  "Action": ["kms:Decrypt", "kms:CreateGrant"],
  "Resource": "*"
}

Destination vault re-encrypts with its own KMS key → each account controls its own copy independently.

IAM Roles Anywhere

Lets workloads outside AWS (on-premises, other clouds) get temporary AWS credentials using X.509 certificates.

Problem It Solves

Method               | Issue
IAM User access keys | Long-term, can leak, manual rotation
EC2 Instance Profile | Only works on EC2

IAM Roles Anywhere = temporary credentials for external workloads.

How It Works

On-Premises Server                           AWS
┌─────────────────────────┐                 ┌─────────────────────────────┐
│                         │                 │  IAM Roles Anywhere         │
│  X.509 Certificate      │   1. Present    │                             │
│  (issued by your CA)    │─────cert───────►│  2. Validate cert against   │
│                         │                 │     Trust Anchor (your CA)  │
│                         │◄──temp creds────│  3. Return temporary        │
│  AWS CLI / SDK          │                 │     credentials for Role    │
└─────────────────────────┘                 └─────────────────────────────┘

Key Components

Component         | What It Is
Trust Anchor      | Your CA that AWS trusts (own CA or AWS Private CA)
Profile           | Links Trust Anchor to IAM Role(s)
Role              | IAM role with trust policy for rolesanywhere.amazonaws.com
X.509 Certificate | Installed on server, issued by your CA

Credential Helper Usage

# Direct command
aws_signing_helper credential-process \
  --certificate /path/to/cert.pem \
  --private-key /path/to/key.pem \
  --trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
  --profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
  --role-arn arn:aws:iam::111111111111:role/MyRole
# ~/.aws/config
[profile onprem]
credential_process = aws_signing_helper credential-process \
  --certificate /path/to/cert.pem \
  --private-key /path/to/key.pem \
  --trust-anchor-arn arn:aws:rolesanywhere:us-east-1:111111111111:trust-anchor/abc \
  --profile-arn arn:aws:rolesanywhere:us-east-1:111111111111:profile/xyz \
  --role-arn arn:aws:iam::111111111111:role/MyRole

Then use: aws s3 ls --profile onprem

EFS (Elastic File System)

Managed NFS file system that multiple EC2 instances can access simultaneously.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  EC2 (AZ-a) │     │  EC2 (AZ-b) │     │  EC2 (AZ-c) │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │ NFS protocol (port 2049)
                           ▼
              ┌─────────────────────────┐
              │         EFS             │
              │   /shared-files/        │
              └─────────────────────────┘
  • Shared storage: Multiple instances read/write same files
  • Auto-scaling: Grows/shrinks automatically
  • Protocol: NFS v4 (Linux only)
  • Mount: sudo mount -t nfs4 fs-xxx.efs.region.amazonaws.com:/ /mnt/efs

On-Premises Access

On-prem servers can mount EFS over Direct Connect or VPN.

On-Premises ──── Direct Connect/VPN ──── VPC ──── EFS

FSx (Managed File Systems)

Managed file systems for specific use cases.

FSx Type                    | Protocol        | Use Case
FSx for Windows File Server | SMB             | Windows workloads, Active Directory
FSx for Lustre              | Lustre          | High-performance computing, ML
FSx for NetApp ONTAP        | NFS, SMB, iSCSI | Enterprise, multi-protocol
FSx for OpenZFS             | NFS             | Linux workloads needing ZFS features

EFS vs FSx:

  • EFS = Simple NFS for Linux
  • FSx = Specialized file systems (Windows, HPC, enterprise)

Site-to-Site VPN

Encrypted tunnel over public internet connecting on-premises to AWS VPC.

On-Premises                                    AWS
┌─────────────────┐                           ┌─────────────────┐
│  Your Router    │      Public Internet      │  Virtual Private│
│  (Customer GW)  │───── Encrypted Tunnel ────│  Gateway (VGW)  │
│  10.0.0.0/16    │                           │  172.31.0.0/16  │
└─────────────────┘                           └─────────────────┘

Components

Component                     | What It Is
Customer Gateway (CGW)        | AWS resource representing your on-prem router
Virtual Private Gateway (VGW) | VPN endpoint attached to one VPC
VPN Connection                | Links CGW ↔ VGW, creates two tunnels for redundancy

How VPN Works (Encapsulation)

VPN wraps original packet inside encrypted outer packet. Original private IPs preserved.

Original: src=10.0.1.50 dst=172.31.1.100

After VPN encapsulation:
┌─────────────────────────────────────────────────────────────┐
│ Outer: src=203.0.113.50 dst=52.x.x.x (AWS VPN endpoint)     │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ENCRYPTED: src=10.0.1.50 dst=172.31.1.100 (preserved)   │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Routing Required

Both sides need routes pointing to VPN:

On-prem router:     172.31.0.0/16 → VPN tunnel
VPC route table:    10.0.0.0/16   → vgw-xxxxx

VPN vs Direct Connect

Aspect     | VPN                  | Direct Connect
Connection | Over public internet | Dedicated physical cable
Setup time | Minutes              | Weeks to months
Cost       | Low                  | High
Bandwidth  | Up to ~1.25 Gbps     | 1-100 Gbps
Latency    | Variable             | Consistent
Encryption | Built-in (IPsec)     | Not by default

Transit Gateway (TGW)

Hub connecting multiple VPCs and on-premises networks.

                    ┌─────┐ ┌─────┐ ┌─────┐
                    │VPC-A│ │VPC-B│ │VPC-C│
                    └──┬──┘ └──┬──┘ └──┬──┘
                       └──────┼───────┘
                              │
                    ┌─────────▼─────────┐
                    │  Transit Gateway  │
                    └─────────┬─────────┘
                              │
                    ┌─────────┴─────────┐
                    │                   │
                    ▼                   ▼
              VPN to On-Prem     Direct Connect
  • Central hub - add new VPCs easily
  • VPN/Direct Connect connects once to TGW, reaches all VPCs
  • Route tables control which networks can communicate
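
A minimal sketch of creating a TGW and attaching one VPC (IDs are placeholders):

aws ec2 create-transit-gateway --description "central hub"

aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id tgw-0abc123 \
  --vpc-id vpc-0aaa111 \
  --subnet-ids subnet-0bbb222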

S3 Event Notifications

Triggers actions when events happen in S3 bucket.

S3 Bucket ──► Event Notification ──► Lambda / SQS / SNS / EventBridge

Event Types

Category       | Examples
Object created | s3:ObjectCreated:Put, s3:ObjectCreated:Copy
Object removed | s3:ObjectRemoved:Delete
Replication    | s3:Replication:OperationFailedReplication
Lifecycle      | s3:LifecycleExpiration:*, s3:LifecycleTransition
Restore        | s3:ObjectRestore:Completed
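
A sketch of routing new .jpg objects to a Lambda function (bucket name and function ARN are placeholders):

aws s3api put-bucket-notification-configuration \
  --bucket my-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111111111111:function:process-upload",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".jpg"}]}}
    }]
  }'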

S3 Notifications vs EventBridge

S3 Event NotificationsS3 → EventBridge
DestinationsLambda, SQS, SNS only20+ AWS services
FilteringPrefix/suffix onlyAdvanced (metadata, size)

S3 Batch Operations

Run operations on billions of objects at once.

Manifest (list of objects) ──► S3 Batch Job ──► Operation on all objects

Operations

Operation            | Use Case
Copy                 | Migrate objects to another bucket
Invoke Lambda        | Custom processing per object
Replace tags         | Bulk update tags
Restore from Glacier | Bulk restore archived objects
Delete               | Bulk delete

DAX (DynamoDB Accelerator)

In-memory cache for DynamoDB. Microsecond latency for reads.

Application
     │
     │ Same DynamoDB API
     ▼
┌─────────────┐
│    DAX      │ ← Microsecond (cache hit)
│   Cluster   │
└──────┬──────┘
       │ Cache miss
       ▼
┌─────────────┐
│  DynamoDB   │ ← Millisecond
└─────────────┘
  • API-compatible with DynamoDB (just change endpoint)
  • Use case: Read-heavy workloads needing microsecond latency
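
A sketch of standing up a cluster (cluster name, role, node type, and subnet group are placeholders):

aws dax create-cluster \
  --cluster-name my-dax \
  --node-type dax.r5.large \
  --replication-factor 3 \
  --iam-role-arn arn:aws:iam::111111111111:role/DAXServiceRole \
  --subnet-group-name my-dax-subnets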

RDS Proxy

Connection pooler for RDS/Aurora. Solves connection exhaustion.

Lambda (100s concurrent)
     │ │ │ │ │
     ▼ ▼ ▼ ▼ ▼
┌─────────────────┐
│   RDS Proxy     │ ← Pools connections
└────────┬────────┘
         │ Few persistent connections
         ▼
┌─────────────────┐
│   RDS / Aurora  │
└─────────────────┘
  • Problem: Lambda spawns many connections, DB has limits
  • Solution: Proxy reuses connections from pool
  • Bonus: Faster failover for Aurora

DAX vs RDS Proxy

        | DAX               | RDS Proxy
For     | DynamoDB          | RDS / Aurora
Purpose | Caching (latency) | Connection pooling

AWS Service Catalog

Catalog of approved, pre-configured AWS resources for users to deploy.

Admin creates Products ──► Users see approved products only ──► Launch
(CloudFormation templates)     (from shared Portfolios)

Key Concepts

Term       | What It Is
Product    | CloudFormation template packaged for deployment
Portfolio  | Collection of products, shared with users/accounts
Constraint | Rules (allowed parameters, launch role)

Restrictions

What               | How
Allowed regions    | Portfolio exists only in allowed regions
Allowed parameters | Template Constraint or AllowedValues in template
Permissions        | Launch Constraint (IAM role used to deploy)

Template Constraint Example

{
  "Rules": {
    "InstanceTypeRule": {
      "Assertions": [{
        "Assert": {
          "Fn::Contains": [["t3.micro", "t3.small"], {"Ref": "InstanceType"}]
        },
        "AssertDescription": "Only t3.micro or t3.small allowed"
      }]
    }
  }
}

CloudFormation Custom Resource

Run your own Lambda code during stack operations. For things CloudFormation doesn’t natively support.

CloudFormation ──► Your Lambda ──► Does custom work ──► Reports back

Syntax

Resources:
  MyCustomResource:
    Type: Custom::AnyNameYouWant      # "Custom::" prefix required
    Properties:
      ServiceToken: !GetAtt MyLambda.Arn   # Required: Lambda ARN
      CustomParam1: value1                  # Your custom inputs
      CustomParam2: value2

Lambda Receives

{
  "RequestType": "Create",           
  "ResourceProperties": {
    "CustomParam1": "value1",
    "CustomParam2": "value2"
  },
  "ResponseURL": "https://..."       
}

Lambda Must

  1. Check RequestType (Create, Update, Delete)
  2. Do the work
  3. Send success/failure to ResponseURL

Use Cases

  • Create resources in other regions
  • Call external APIs during deployment
  • Complex logic CloudFormation can’t express

Kubernetes Namespace

Virtual cluster division within a Kubernetes cluster. Groups and isolates resources.

┌─────────────────────────────────────────────────────────────────┐
│                        EKS Cluster                               │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Namespace: default                                       │    │
│  │  Deployment: web       Service: web-service              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Namespace: production                                    │    │
│  │  Deployment: api       ConfigMap: prod-config            │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Namespace: kube-system  (Kubernetes internal)            │    │
│  │  ConfigMap: aws-auth    DaemonSet: kube-proxy            │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Use Cases

Purpose                | Example
Environment separation | dev, staging, production namespaces
Team separation        | team-a, team-b namespaces
Access control         | Team A can only access team-a namespace

DNS with Namespaces

Service DNS: <service-name>.<namespace>.svc.cluster.local

Examples:
- api-service.default.svc.cluster.local
- api-service.production.svc.cluster.local
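
A few common namespace commands (namespace name is a placeholder):

kubectl create namespace production
kubectl get pods -n production
kubectl config set-context --current --namespace=production   # default namespace for this context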

Container Insights

CloudWatch feature that collects metrics and logs from containerized applications (ECS, EKS).

How It Works (EKS)

┌─────────────────────────────────────────────────────────────────┐
│                        EKS Cluster                               │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │ Node 1       │  │ Node 2       │  │ Node 3       │           │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │           │
│  │ │CloudWatch│ │  │ │CloudWatch│ │  │ │CloudWatch│ │           │
│  │ │Agent     │ │  │ │Agent     │ │  │ │Agent     │ │           │
│  │ │(DaemonSet)│ │  │ │(DaemonSet)│ │  │ │(DaemonSet)│ │          │
│  │ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │           │
│  └──────┼───────┘  └──────┼───────┘  └──────┼───────┘           │
│         └─────────────────┼─────────────────┘                    │
│                           ▼                                      │
│                  CloudWatch Metrics                              │
│                  (namespace: ContainerInsights)                  │
└─────────────────────────────────────────────────────────────────┘

Key Metrics

Metric                 | What it measures
pod_memory_utilization | % of memory limit used
pod_cpu_utilization    | % of CPU limit used
pod_memory_working_set | Actual bytes in use

Dimensions

Filter metrics by:

  • ClusterName
  • Namespace (Kubernetes namespace)
  • Service (Kubernetes Service name)
  • PodName
  • NodeName

AWS Glue Crawler

Automatically scans data sources and creates table definitions in Glue Data Catalog.

┌─────────────────────────────────────────────────────────────────┐
│  S3 Bucket (/data/)                                             │
│    sales.csv                                                    │
│    orders.json                                                  │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────┐  Detects:                                      │
│  │   Crawler   │  - File format (CSV, JSON, Parquet)            │
│  │             │  - Column names                                 │
│  │             │  - Data types                                   │
│  │             │  - Partitions (year=2024/month=01/)            │
│  └──────┬──────┘                                                │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Glue Data Catalog                                        │    │
│  │  Database: my_database                                   │    │
│  │  ├── Table: sales (id, product, amount)                 │    │
│  │  └── Table: orders (order_id, customer)                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│         │                                                        │
│         ▼                                                        │
│  Now queryable with Athena:                                     │
│  SELECT * FROM my_database.sales WHERE amount > 100             │
└─────────────────────────────────────────────────────────────────┘
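
A sketch of the crawler setup (crawler name, role, and S3 path are placeholders):

aws glue create-crawler \
  --name sales-crawler \
  --role AWSGlueServiceRole-sales \
  --database-name my_database \
  --targets '{"S3Targets": [{"Path": "s3://my-bucket/data/"}]}'

aws glue start-crawler --name sales-crawler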

AWS Glue ETL

Serverless data transformation jobs. ETL = Extract, Transform, Load.

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Extract    │    │  Transform  │    │    Load     │
│             │    │             │    │             │
│ Read from   │ →  │ Clean,      │ →  │ Write to    │
│ S3, RDS     │    │ filter,     │    │ S3, Redshift│
│             │    │ join        │    │             │
└─────────────┘    └─────────────┘    └─────────────┘

Key Points

Aspect     | Description
Serverless | Pay per second of job runtime
Engine     | Apache Spark
Write in   | Python (PySpark) or Scala
Triggers   | On-demand, scheduled, or event-based

Glue Components Together

S3 (raw) ──► Crawler ──► Data Catalog ──► ETL Job ──► S3 (clean)
                              │
                              ▼
                          Athena (query)

SigV4 (Signature Version 4)

AWS’s method for authenticating API requests. Every AWS API call must be signed.

What It Proves

  • You have valid AWS credentials
  • Request hasn’t been modified in transit
  • Request is recent (not replay attack)

The 4 Steps

1. Create Canonical Request
   - Standardize HTTP method, path, headers, body hash

2. Create String to Sign
   - Algorithm + timestamp + scope + hash of step 1

3. Calculate Signing Key
   - Chain HMAC-SHA256 from Secret Key → date → region → service

4. Calculate Signature
   - HMAC(signing_key, string_to_sign)

Result: Authorization Header

Authorization: AWS4-HMAC-SHA256
  Credential=AKIAIOSFODNN7EXAMPLE/20241229/us-east-1/s3/aws4_request,
  SignedHeaders=host;x-amz-date,
  Signature=abc123def456...

You don’t do this manually. AWS SDKs and CLI handle it automatically.
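
If you ever want to see signing outside an SDK, recent curl versions (7.75+) can produce SigV4 themselves; bucket, region, and credentials below are placeholders:

curl --aws-sigv4 "aws:amz:us-east-1:s3" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  "https://my-bucket.s3.us-east-1.amazonaws.com/hello.txt"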


CodeArtifact Domain

Container that groups multiple repositories. Provides shared storage, permissions, encryption.

┌─────────────────────────────────────────────────────────────────┐
│                    CodeArtifact Domain                           │
│                    (name: my-company)                            │
│                                                                  │
│  Shared: KMS key, IAM policies, deduplication                   │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Repository:     │  │ Repository:     │  │ Repository:     │  │
│  │ npm-prod        │  │ npm-dev         │  │ python-internal │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Upstream Repository

Repository that another repository pulls from when package not found locally.

Developer: npm install lodash
     │
     ▼
my-npm-repo (not found) ──► npm-public-proxy (not found) ──► npmjs.org
     │                              │                            │
     │◄─────────────────────────────┼────────────────────────────┘
     │         Package cached at each level

Benefits

  • Single endpoint for internal + public packages
  • Caching (faster installs, works if npmjs down)
  • Audit all package downloads
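
A sketch of the upstream/proxy setup (domain and repository names are placeholders):

# Connect the proxy repository to public npmjs
aws codeartifact associate-external-connection \
  --domain my-company --repository npm-public-proxy \
  --external-connection "public:npmjs"

# Point npm at the CodeArtifact repository, then install as usual
aws codeartifact login --tool npm --domain my-company --repository npm-prod
npm install lodash   # resolved via npm-prod and its upstreams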

FSx Types Comparison

FSx Type            | Protocol        | Best For
Windows File Server | SMB             | Windows apps, Active Directory
Lustre              | Lustre          | HPC, ML, high-throughput
NetApp ONTAP        | NFS, SMB, iSCSI | Enterprise, multi-protocol
OpenZFS             | NFS             | Linux workloads, snapshots

Key Differences

               | Windows       | Lustre       | NetApp ONTAP | OpenZFS
OS support     | Windows       | Linux only   | All          | Linux, macOS
AD required    | Yes           | No           | Optional     | No
S3 integration | No            | Yes (native) | No           | No
Multi-protocol | No            | No           | Yes          | No
Snapshots      | Shadow copies | No           | Yes          | Yes
Multi-AZ       | Yes           | No           | Yes          | No

EFS vs FSx for OpenZFS

Both are NFS for Linux, different design goals.

Aspect      | EFS                 | FSx for OpenZFS
Capacity    | Auto-scales         | You provision
Performance | Scales with size    | Up to 1M IOPS
Latency     | Milliseconds        | Sub-millisecond
Snapshots   | No                  | Yes (instant)
Clones      | No                  | Yes (instant)
Multi-AZ    | Yes                 | No
Best for    | Shared storage, CMS | Databases, analytics

AWS Storage Gateway

Hybrid storage connecting on-premises to AWS cloud storage.

┌─────────────────────────────────────────────────────────────────┐
│  On-Premises                                                     │
│                                                                  │
│  Application ──NFS/SMB/iSCSI──► Storage Gateway ──► AWS (S3,    │
│                                 (VM or hardware)     EBS,       │
│                                 Local cache          Glacier)   │
└─────────────────────────────────────────────────────────────────┘

Gateway Types

Type             | Protocol    | Backend         | Use Case
S3 File Gateway  | NFS, SMB    | S3              | File shares backed by S3
FSx File Gateway | SMB         | FSx for Windows | Low-latency FSx access
Volume Gateway   | iSCSI       | S3 + EBS        | Block storage, DR
Tape Gateway     | iSCSI (VTL) | S3, Glacier     | Backup (replaces tapes)

EFS Mount Target

Network endpoint (ENI) in a specific AZ for EC2 to connect to EFS.

┌─────────────────────────────────────────────────────────────────┐
│                           VPC                                    │
│                                                                  │
│  ┌─────────────────────────┐    ┌─────────────────────────┐     │
│  │      AZ-a               │    │      AZ-b               │     │
│  │                         │    │                         │     │
│  │  EC2 ──► Mount Target   │    │  EC2 ──► Mount Target   │     │
│  │          (ENI)          │    │          (ENI)          │     │
│  │          10.0.1.25      │    │          10.0.2.30      │     │
│  └────────────┬────────────┘    └────────────┬────────────┘     │
│               └────────────┬─────────────────┘                   │
│                            ▼                                     │
│                          EFS                                     │
└─────────────────────────────────────────────────────────────────┘

Key Points

  • One mount target per AZ (for low latency, no cross-AZ costs)
  • Has its own security group (allow NFS port 2049)
  • EFS DNS resolves to nearest mount target
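
A sketch of creating one mount target (file system, subnet, and security group IDs are placeholders):

aws efs create-mount-target \
  --file-system-id fs-12345678 \
  --subnet-id subnet-az-a \
  --security-groups sg-nfs-2049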

AWS SAM (Serverless Application Model)

Framework for building serverless applications. Simplified CloudFormation + CLI tools.

SAM Template

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31    # ← SAM marker

Resources:
  MyFunction:
    Type: AWS::Serverless::Function       # ← SAM resource
    Properties:
      Handler: index.handler
      Runtime: python3.11
      CodeUri: ./src
      Events:
        Api:
          Type: Api
          Properties:
            Path: /hello
            Method: GET

Automatically creates: Lambda + API Gateway + IAM role + permissions

SAM CLI Commands

sam init          # Create new project
sam build         # Install dependencies
sam local invoke  # Run Lambda locally
sam local start-api  # Local API Gateway
sam deploy        # Deploy to AWS

cloudformation package

Uploads local files to S3 and rewrites template with S3 URLs.

BEFORE (template.yaml):
  Code: ./src              ← Local path

        │
        │ aws cloudformation package \
        │   --template-file template.yaml \
        │   --s3-bucket my-bucket \
        │   --output-template-file packaged.yaml
        ▼

AFTER (packaged.yaml):
  Code:
    S3Bucket: my-bucket    ← S3 reference
    S3Key: abc123...

Workflow

# 1. Package: upload to S3, generate new template
aws cloudformation package \
  --template-file template.yaml \
  --s3-bucket my-bucket \
  --output-template-file packaged.yaml

# 2. Deploy: use packaged template
aws cloudformation deploy \
  --template-file packaged.yaml \
  --stack-name my-stack

deploy reads local packaged.yaml file. S3 bucket/key is embedded in the template.


Trusted Advisor Service Limits

Checks that compare your current AWS resource usage against default service quotas (limits).

What It Does

Trusted Advisor Service Limits Check:

  Your Usage          Service Quota          Status
  ─────────────────────────────────────────────────
  45 EC2 instances    50 (default limit)     ⚠️ 90% - Yellow (warning)
  3 VPCs              5 (default limit)      ✓ 60% - Green (OK)
  5 Elastic IPs       5 (default limit)      🔴 100% - Red (at limit)

Status Thresholds

Status    | Meaning
🟢 Green  | Usage < 80% of limit
🟡 Yellow | Usage ≥ 80% of limit (warning)
🔴 Red    | Usage ≥ 100% of limit (at or over)

Example Checks

  • EC2 On-Demand instances (per instance type, per region)
  • VPCs per region
  • Elastic IP addresses
  • EBS volumes
  • RDS instances
  • IAM roles, users, groups
  • S3 buckets
  • Lambda concurrent executions
  • Auto Scaling groups

Important Limitations

Limitation          | Detail
Default quotas only | Doesn’t know about quota increases you’ve requested
Not real-time       | Refreshes periodically (manual refresh available)
Subset of services  | Doesn’t cover all AWS services/quotas
Basic Support       | Service Limits checks are free (unlike most Trusted Advisor checks)

Service Limits vs Service Quotas

                     | Trusted Advisor Service Limits       | Service Quotas (service)
Purpose              | Monitor usage vs limits              | View/request quota increases
Shows current usage  | Yes                                  | Yes
Shows applied quotas | No (default only)                    | Yes (actual applied quota)
Request increases    | No                                   | Yes
API                  | support:DescribeTrustedAdvisorChecks | service-quotas:*

Better Alternative: Service Quotas + CloudWatch

For accurate monitoring including custom quota increases:

Service Quotas → CloudWatch Metrics → CloudWatch Alarm

  Metric: AWS/Usage → ResourceCount
  Alarm: When usage > 80% of AppliedQuota

This reflects your actual quota (including increases), not just defaults.
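
A sketch of that alarm using CloudWatch's SERVICE_QUOTA() metric math function; the dimensions below assume the running On-Demand standard vCPU quota and would need adjusting for other quotas:

aws cloudwatch put-metric-alarm \
  --alarm-name vcpu-quota-above-80pct \
  --evaluation-periods 1 \
  --comparison-operator GreaterThanThreshold \
  --threshold 80 \
  --metrics '[
    {"Id": "usage", "ReturnData": false, "MetricStat": {"Period": 300, "Stat": "Maximum",
      "Metric": {"Namespace": "AWS/Usage", "MetricName": "ResourceCount",
        "Dimensions": [
          {"Name": "Service", "Value": "EC2"},
          {"Name": "Resource", "Value": "vCPU"},
          {"Name": "Type", "Value": "Resource"},
          {"Name": "Class", "Value": "Standard/OnDemand"}]}}},
    {"Id": "pct", "Expression": "usage / SERVICE_QUOTA(usage) * 100", "ReturnData": true}
  ]'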

When Trusted Advisor Service Limits Is Useful

  • Quick overview across many services
  • Accounts without quota increases (defaults apply)
  • Free tier / Basic Support accounts (Service Limits checks are free)

cfn-init and cfn-hup

CloudFormation helper scripts that run on EC2 instances to configure them based on metadata in your template.

Script   | What it does
cfn-init | Reads metadata from template, configures the instance (install packages, create files, run commands)
cfn-hup  | Daemon that watches for metadata changes and re-runs cfn-init when template is updated

hup = HangUP signal (SIGHUP). In Unix, sending SIGHUP to a daemon tells it to reload configuration. cfn-hup = daemon that watches for config changes and reloads.

The Problem They Solve

UserData (imperative):              cfn-init (declarative):
─────────────────────               ────────────────────────
yum install -y httpd                packages:
systemctl start httpd                 yum:
echo "hello" > /var/www/html/index    httpd: []
                                    services:
                                      sysvinit:
                                        httpd: {enabled: true}
                                    files:
                                      /var/www/html/index.html:
                                        content: "hello"

Where Configuration Lives

In the Metadata section of your EC2 resource in the CloudFormation template:

Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Metadata:                          # ← cfn-init reads this
      AWS::CloudFormation::Init:
        config:
          packages:
            yum:
              httpd: []
          files:
            /var/www/html/index.html:
              content: "Hello World"
          services:
            sysvinit:
              httpd:
                enabled: true
                ensureRunning: true
    Properties:
      ImageId: ami-xxxxx
      UserData:                        # ← Calls cfn-init
        Fn::Base64: !Sub |
          #!/bin/bash
          yum install -y aws-cfn-bootstrap
          /opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}

cfn-init Configuration Sections

Section  | What it configures
packages | Install packages (yum, apt, rpm, python, rubygems)
groups   | Create Linux groups
users    | Create Linux users
sources  | Download and extract archives (tar, zip)
files    | Create files with content, permissions, owner
commands | Run shell commands
services | Enable/start/stop services (sysvinit, systemd)

Execution order: packages → groups → users → sources → files → commands → services

cfn-hup Configuration Files

Two files needed on the instance:

File                    | Purpose
/etc/cfn/cfn-hup.conf   | Main config: which stack to watch, poll interval
/etc/cfn/hooks.d/*.conf | Hook definitions: what to run when changes detected

files:
  /etc/cfn/cfn-hup.conf:
    content: !Sub |
      [main]
      stack=${AWS::StackId}
      region=${AWS::Region}
      interval=5
    mode: "000400"
    owner: root
    group: root
  
  /etc/cfn/hooks.d/cfn-auto-reloader.conf:
    content: !Sub |
      [cfn-auto-reloader-hook]
      triggers=post.update
      path=Resources.MyInstance.Metadata.AWS::CloudFormation::Init
      action=/opt/aws/bin/cfn-init -s ${AWS::StackName} -r MyInstance --region ${AWS::Region}
      runas=root
    mode: "000400"
    owner: root
    group: root

services:
  sysvinit:
    cfn-hup:
      enabled: true
      ensureRunning: true
      files:
        - /etc/cfn/cfn-hup.conf
        - /etc/cfn/hooks.d/cfn-auto-reloader.conf

What Actually Happens

Initial Deployment (cfn-init):

CloudFormation creates EC2 instance
        ↓
EC2 boots, runs UserData script
        ↓
UserData calls: /opt/aws/bin/cfn-init -s MyStack -r MyInstance
        ↓
cfn-init fetches Metadata from CloudFormation API
        ↓
cfn-init executes: packages → files → commands → services
        ↓
Instance configured and running

Stack Update (cfn-hup):

You update CloudFormation template (change Metadata)
        ↓
CloudFormation updates stack
        ↓
cfn-hup daemon polls every N minutes, detects change
        ↓
cfn-hup runs action: /opt/aws/bin/cfn-init ...
        ↓
cfn-init re-applies configuration

Summary

Script   | When it runs                            | Purpose
cfn-init | Once at instance launch (from UserData) | Initial configuration
cfn-hup  | Continuously as daemon                  | Detect metadata changes, re-run cfn-init

Without cfn-hup: Metadata changes require instance replacement or manual intervention.

With cfn-hup: Instance automatically reconfigures itself when you update the stack.


VPC Endpoint: Private connection from your VPC to a service (no internet needed).

PrivateLink: The underlying AWS technology that powers VPC endpoints.

Two Types of VPC Endpoints

Type               | What it connects to                    | How it works               | Cost
Gateway Endpoint   | S3, DynamoDB only                      | Route table entry (no ENI) | Free
Interface Endpoint | Most AWS services + your own services  | Creates ENI in your subnet | ~$0.01/hr + data

┌─────────────────────────────────────────────────────────────────┐
│  Your VPC                                                       │
│                                                                 │
│  Gateway Endpoint (S3/DynamoDB):                                │
│    - Entry in route table                                       │
│    - No ENI, no IP address                                      │
│    - Free                                                       │
│                                                                 │
│  Interface Endpoint (PrivateLink):                              │
│    - Creates ENI with private IP                                │
│    - Works for 100+ AWS services                                │
│    - Works for your own services (via NLB)                      │
└─────────────────────────────────────────────────────────────────┘

Is NLB Needed?

Connecting to                            | NLB needed?
AWS services (S3, SQS, Lambda, etc.)     | No, AWS manages it
Your own service in another VPC/account  | Yes, you create NLB + Endpoint Service
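
For completeness, a sketch of the Gateway Endpoint case (VPC and route table IDs are placeholders); the Interface Endpoint case is shown in the setup flow further below:

# Free Gateway Endpoint for S3, wired into a route table
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0aaa111 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0bbb222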

Cross-Account Connectivity

VPC endpoints connect to services, not VPCs directly. To connect to another VPC/account:

Provider Account                          Consumer Account
┌─────────────────────────────┐          ┌─────────────────────────────┐
│                             │          │                             │
│  App ← NLB ← Endpoint       │◄─────────│  Interface    → App         │
│             Service         │PrivateLink│  Endpoint                   │
│                             │          │                             │
└─────────────────────────────┘          └─────────────────────────────┘

Provider creates: NLB + Endpoint Service + allow consumer accounts

Consumer creates: Interface Endpoint using provider’s service name

Setup Flow

# Provider: Create endpoint service
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <nlb-arn> \
  --acceptance-required
# Returns: ServiceName: com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx

# Provider: Allow consumer account
aws ec2 modify-vpc-endpoint-service-permissions \
  --service-id vpce-svc-xxxxxxxxx \
  --add-allowed-principals arn:aws:iam::222222222222:root

# Consumer: Create interface endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-consumer \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxx \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-aaa

# Provider: Accept connection (if acceptance-required)
aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-xxxxxxxxx \
  --vpc-endpoint-ids vpce-xxxxxxxxx

How Consumer Routes Requests

Consumer app uses endpoint DNS or ENI private IP:

# Endpoint DNS (auto-provided)
curl http://vpce-xxx.vpce-svc-xxx.us-east-1.vpce.amazonaws.com

# Or ENI private IP directly
curl http://10.1.0.50

No route table changes needed—Interface Endpoint ENI handles routing automatically.

Summary

Want to connect to…              | What you need
S3 / DynamoDB                    | Gateway Endpoint (free, same region)
AWS services (SQS, Lambda, etc.) | Interface Endpoint
Another VPC/account’s app        | Interface Endpoint → their Endpoint Service + NLB