↖↑↓ SysOps Administrator Associate
5/2018 - 9/2018
↖↑↓ Monitoring And Metrics
↖↑↓ Virtualization Types
Linux Amazon Machine Images use one of two types of virtualization:
AMI |
Type |
Effect |
PV |
Paravirtual |
Historically better performance than HVM, but no longer the case |
HVM |
Hardware virtual machine |
More modern, same or better performance than PV |
↖↑↓ EC2 Instance Types
General Purpose |
Balance of computer, memory and networking |
M5 (2017) |
* Require HVM AMIs * Instance store via EBS or NVMe SSD (physically connected to to the host server) |
M4 (2015) |
* Allows enhanced networking * EBS-optimized |
M3 (2012) |
* SSD (instance) store |
T3 (2018) |
* 30% better price performance |
T2 (2014) |
* Intented for workloads that do not use the full CPU constantly (e.g. web server) * Allows burstable performance * Burst credits allow to 'burst' past the baseline performance up to 100% * 1 credit = 100% load per core per minute * Credits are earned per hour, expire after 24h * EBS storage only |
Compute optimized |
Lowest prize for compute performance |
C5 (2016) |
* Intel Skylake * Use Nitro, Amazon’s lightweight hardware accelerated hypervisor * Better performance and pricing than C4 |
C4 (2015) |
* Intel Haswell * Optimized for EC2 * Allows enhanced networking and clustering * EBS-optimized |
C3 (2013) |
* SSD (instance) store * Allows enhanced networking and clustering |
Memory optimized |
Lowest prize for memory performance |
Z1d (2018) |
* Offer both high compute capacity and a high memory footprint * Ideal for workloads with high per-core licensing costs |
X1 (2016) |
* One of the lowest price per GiB of RAM * SSD storage and EBS-optimized by default * X1e has even more RAM |
R5 (2018) |
* Use Nitro, Amazon’s lightweight hardware accelerated hypervisor |
R4 (2016) |
* Improved networking and EBS performance |
R3 (2014) |
* SSD (instance) store * High memory capacity * Allows enhanced networking |
GPU optimized |
. |
P3 (2017) |
* Faster than P2 |
P2 (2016) |
* Intended for general-purpose GPU compute applications |
G3 (2017) |
* Optimized for graphics-intensive applications * Faster then G2 |
G2 (2013) |
* High frequency processors * High-performce NVIDIA GPUs |
Storage optimized |
Very fast SSD-backed instance storage optimized for high random I/O and high IOPS |
H1 (2017) |
* HDD-based local storage * deliver high disk throughput * Balance of compute and memory |
I3 (2016) |
* (NVMe) SSD-backed instance storage optimized for low latency * very high random I/O performance |
D2 (2015) |
* Lowest price per disk throughput performance |
I2 (2013) |
* SSD (instance) store * Allows enhanced networking * Supports TRIM (more efficient SSD operations) |
RDS instance types |
Optimized to fit different relational database use cases |
db. |
General purpose, memory optimized, burstable performance |
.*
↖↑↓ EC2 Monitoring
EC2 Status Checks
- AWS performs automated checks on every running EC2 instance
- Performed every minute
- Each returns a pass or a fail status
System Status Check
- Loss of network connectivity
- Loss of system power
- Hardware/software issues on physical host
- Solution
- Stop and start instance
- Terminate and re-launch instance
- Contact AWS
- Can configure for auto-recovery
- Instance will be rebooted and retain instance id, (e)ip address, EBS volumes et al
Instance Status Check
- Failed system status check
- Network/startup configuration issues
- Memory/disk problems
- Kernel compatability issues
- Solution
- Fix problem
- Stop and start instance
- Terminate and re-launch instance, potentially with more memory/network/disk/...
↖↑↓ EBS Monitoring
EBS Status Checks
IOPS (Input/Output Operations Per Second) is a common performance measurement used to benchmark
computer storage devices like hard disk drives (HDD), solid state drives (SSD), and storage area
networks (SAN).
- I/O size is capped at 256 KiB for SSD volumes and 1,024 KiB for HDD volumes because SSD volumes handle
small or random I/O much more efficiently than HDD volumes.
- SSDs deliver constant performance for both random and sequential I/O
- HDDs have optimal performance for large and sequential I/O
- HDD can deliver more throughput put drastically less IOPS
. |
gp2 |
io1 |
st1 |
sc1 |
Volume type |
General purpose SSD |
Provisioned IOPS SSD |
Throughput optimized HDD |
Cold HDD |
Purpose |
Balances price and performance |
For mission-critical low-latency or high-throughput workloads |
Low cost HDD volume designed for frequently accessed, throughput-intensive workloads |
Lowest cost HDD volume designed for less frequently accessed workloads |
Volume Size |
1 GiB - 16 TiB |
4 GiB - 16 TiB |
500 GiB - 16 TiB |
500 GiB - 16 TiB |
Max. IOPS(1)/Volume |
10,000 |
32,000 |
500 |
250 |
Max. Throughput/Volume |
160 MiB/s |
500 MiB/s |
500 MiB/s |
250 MiB/s |
IOPS |
* 3 IOPS per GB (larger volume means more IOPS) * 100 IOPS <-> 10,000 IOPS * Can burst to 3,000 IOPS if volume size is < 1TB * Requires credits that are acquired per 3 IOPS/GB/second * Max 5.4 miilion credit (also intitial value), enough for 3,000 IOPS for 30min * Running out of credits reverts volume back to baseline performance |
* 30 IOPS per GB (larger volume means more IOPS), up to 20,000 * Does not burst, delivers consistent IOPS rate instead |
. |
. |
(1) gp2/io1 based on 16 KiB I/O size, st1/sc1 based on 1 MiB I/O size
- Using EBS optimized instances guarantees optimal networking between EBS and EC2
- Pre-warming/intialization
- No longer needed for new EBS volumes
- Storage blocks on volumes restored from snapshots do need to be initialized (read from)
↖↑↓ EFS Monitoring
- Two throughput modes to choose from for your file system
- Bursting Throughput - throughput on Amazon EFS scales as your file system grows
- Provisioned Throughput - you can instantly provision the throughput of your file system (in MiB/s) independent of the amount of data stored.
. |
Amazon EFS |
Amazon EBS Provisioned IOPS (io1 ) |
Per-operation latency |
Low, consistent latency. |
Lowest, consistent latency. |
Throughput scale |
10+ GB per second. |
Up to 2 GB per second. |
Storage Characteristics Comparison
. |
Amazon EFS |
Amazon EBS Provisioned IOPS |
Availability and durability |
Data is stored redundantly across multiple AZs. |
Data is stored redundantly in a single AZ. |
Access |
Up to thousands of Amazon EC2 instances, from multiple AZs, can connect concurrently to a file system. |
A single Amazon EC2 instance in a single AZ can connect to a file system. |
Use cases |
Big data and analytics, media processing workflows, content management, web serving, and home directories. |
Boot volumes, transactional and NoSQL databases, data warehousing, and ETL. |
S3 vs EFS vs EBS Comparison
Amazon S3 |
Amazon EBS |
Amazon EFS |
Can be publicly accessible |
Accessible only via the given EC2 Machine |
Accessible via several EC2 machines and AWS services |
Web interface |
File System interface |
Web and file system interface |
Object Storage |
Block Storage |
Object storage |
Scalable |
Hardly scalable |
Scalable |
Slower than EBS and EFS |
Faster than S3 and EFS |
Faster than S3, slower than EBS |
Good for storing backups |
Is meant to be EC2 drive |
Good for shareable applications and workloads |
↖↑↓ CloudWatch
Monitoring service that plugs into many other services
- Metrics
- Based on currently used service
- Not everything is available out of the box, e.g. no data on memory usage of EC2 instances
- Alarms
- Based on thresholds defined on metrics
- Can be added to dashboard
- Invoke Lambda, SNS, email, ...
- Takes place once, at a specific point in time
- Disable with
mon-disable-alarm-actions
via CLI
- Logs
- Events
- Define actions on things that happened
- Define
cron
-based events
- Events are recorded constantly over time
Key metrics for EC2
- EC2 metrics are based on what is exposed to the hypervisor.
- Basic Monitoring (default) submits values every 5 minutes, Detailed Monitoring every minute
- Can install Cloudwatch agent (new)
- Provides access to more metrics
- Can use Cloudwatch monitoring scripts (old) to provide more metrics
- Perl-scripts provided by AWS, need to manually install on instance
- Use
cron
to automate sending data to CloudWatch
Metric |
Effect |
CPUUtilization |
The total CPU resources utilized within an instance at a given time. |
DiskReadOps ,DiskWriteOps |
The number of read (write) operations performed on all instance store volumes. This metric is applicable for instance store-backed AMI instances. |
DiskReadBytes ,DiskWriteBytes |
The number of bytes read (written) on all instance store volumes. This metric is applicable for instance store-backed AMI instances. |
NetworkIn ,NetworkOut |
The number of bytes received (sent) on all network interfaces by the instance |
NetworkPacketsIn ,NetworkPacketsOut |
The number of packets received (sent) on all network interfaces by the instance |
StatusCheckFailed ,StatusCheckFailed_Instance ,StatusCheckFailed_System |
Reports whether the instance has passed both/instance/system status check in the last minute. |
- Can not monitor memory usage, available disk space, swap usage
Key metrics for EBS
Metric |
Effect |
VolumeReadBytes ,VolumeWriteBytes |
sum reports total bytes transferred, average also useful |
VolumeReadOps ,VolumeWriteOps |
total number of IO operations |
VolumeQueueLength |
Number of read/write operation requests waiting to finish |
VolumeTotalReadTime ,VolumeTotalWriteTime |
Total number of seconds spent by all operations in a given time |
VolumeThroughputPercentage |
Percentage of IOPS that was achieved out of total provisioned IOPS |
VolumeConsumedReadWriteOps |
Total amount of r/w operations consumed within a specific time period |
- Can not monitor disk usage percentage
Key metrics for EFS
Metric |
Effect |
BurstCreditBalance |
The number of burst credits that a file system has. |
ClientConnections |
The number of client connections to a file system. |
DataReadIOBytes ,DataWriteIOBytes |
The number of bytes for each file system read(write) operation. |
MetadataIOBytes |
The number of bytes for each metadata operation. |
PercentIOLimit |
Shows how close a file system is to reaching the I/O limit of the General Purpose performance mode. |
PermittedThroughput |
The maximum amount of throughput a file system is allowed. |
TotalIOBytes |
The number of bytes for each file system operation, including data read, data write, and metadata operations. |
Key metrics for ELB (classic load balancer)
Metric |
Effect |
Latency |
Time it takes to receive an response. Measure max and average |
BackendConnectionErrorr |
Number of not successfully established connections to registered instances, measure sum and look at difference between min and max |
SurgeQueueLength |
Total number of request waiting to get routed, look at max and average |
SpilloverCount |
Dropped requests because of exceeded surge queue. Look at sum |
HTTPCode_ELB_3XX_Count
HTTPCode_ELB_4XX_Count
HTTPCode_ELB_5XX_Count |
The number of HTTP XXX server error codes that originate from the load balancer. This count does not include any response codes generated by the targets. |
RequestCount |
Number of completed requests |
HealthyHostCount ,UnhealthyHostCount |
Self explainatory |
- In case of sudden and very large increases in traffic it's possible to contact AWS and have them
'pre-warm' the ELB.
spillover and surge queue give an indication of the ELB being overloaded
- Typically this means that the backend system cannot process requests as fast as they are coming in
- Ideally load balance into an autoscaling group.
Key metrics for ALB (active load balancer)
Metric |
Effect |
RequestCount |
Number of completed requests |
HealthyHostCount ,UnhealthyHostCount |
Self explainatory |
TargetResponseTime |
The time elapsed after the request leaves the load balancer until a response from the target is received. |
HTTPCode_ELB_3XX_Count
HTTPCode_ELB_4XX_Count
HTTPCode_ELB_5XX_Count |
The number of HTTP XXX server error codes that originate from the load balancer. This count does not include any response codes generated by the targets. |
Key metrics for NLB (network load balancer)
Metric |
Effect |
processedbyte |
The total number of bytes processed by the load balancer, including TCP/IP headers. |
tcp_client_reset_count |
the total number of reset (rst) packets sent from a client to a target. |
tcp_elb_reset_count |
the total number of reset (rst) packets generated by the load balancer. |
tcp_target_reset_coun |
the total number of reset (rst) packets sent from a target to a client. |
Key metrics for elasticache
Supports memcached and redis
Metric |
memcached |
redis |
. |
Designed for simplicity |
Supports a much richer set of features. can be backed up if in cluster mode |
cpu utilization |
* multithreaded * stay under 90%/#cores * -> increase # read replicase or use larger cache instance |
* single threaded * stay under 90% * -> increase size of node or add more nodes |
evictions |
* -> increase size or add nodes to cluster |
* -> increase node size |
concurrent connections |
* -> check application logic |
* -> check application logic |
swap usage |
* avoid swapping -> increase memcached_connections_overhead |
avoid swapping * -> increase node size * -> increase memory connection overhead (will decrease memory available for cache) |
.*
Key metrics for RDS
Metric |
Effect |
CPUUtilization |
Percentage of CPU utilization |
DatabaseConnections |
Number of connections that we have at a given point in time |
DiskQueueDepth |
Number of read/write requests waiting to access the disk |
FreeableMemory |
Amount of available RAM |
FreeStorageSpace |
Amount of available storage space |
SwapUsage |
When data is stored in memory on disk |
Increase |
In this usually has to do with running out of available RAMReadIOPS/WriteIOPS |
IOPS |
Represent the number of I/O operations completed per secondIf we don’t have enough IOPS, performance will slow down |
ReadLatency/WriteLatency |
* Average amount of time taken per disk I/O operation (input/output) * High latency can be solved with more IOPSReadThroughput/WriteThroughput * Average is number of bytes read or written to or from disk per second |
.*
↖↑↓ Consolidated Billing
Set up a billing account to pay for multiple linked accounts at the same time.
- Allows for consolidated billing. Does not give IAM visibility into linked accounts.
- Enables volume discounts across linked accounts.
- If one account uses reserved instances, other accounts running on similar on demand instances
will be billed under the reserved instance price. Similar for RDS instances.
- All credits earned while linked will be applied to consolidated bill.
Limits:
↖↑↓ Billing Metrics & Alarms
- Only shows metrics of services that have been used.
- Set up billing alarms based on billing metrics.
- Overall billing alarm, or service-specific alarms
- Can still be account-specific, even with consolidated billing
↖↑↓ Costs Optimization
- Purchase EC2 Reserved Instances
- Commit for 1-3 years and get a discount
- Minimize the number of running instances
- Set up CloudWatch alarms to spin down underutilized instances
- Find balance between acceptable downtime & costs to eleminate this downtime
- Remove unused Load Balancers
- Look for idle (unattached) EBS volumes
- Delete unused volumes
- Take a snapshot to keep the data
- Downsize volumes that aren't near full capacity
- Look for over-provisoned IOPS
- Look for unassociated Elastic IP addresses
- Look for idle RDS instances
↖↑↓ Cost Explorer
- Costs per time frame per service, various grouping and filtering options
- Provides forecasts
- Pricing API allows to download pricing information for specific services
↖↑↓ High Availability
↖↑↓ Scalability & Elasticity Fundamentals
- Pay only for what you need when you need it
- Define minimum capacity
- Define what needs to stretch out
. |
Elasticity |
Scalability |
. |
Scaling up/down on demand |
Scaling for growth in order to meet long term requirements typically does not focus on shrinking back |
DynamoDb |
Can provision more or less throughput |
Stores as much data as we like, scales transparently |
EC2 |
Use autoscaling |
More instances or bigger instance types |
RDS |
./. |
Bigger instances, more read replicas |
↖↑↓ Reserved Instances
- Reserve instances for a specific period of time
- Standard reserved instances (fixed instance type)
- Convertible reserved instances (can be exchanged against another convertible instance type)
- Scheduled reserved instances (purchased by the hour on a set schedule with a set instance type)
- Up to 50% cheaper than a fully utilized on-demand instance (because we commit upfront to a certain usage)
- Guarantees to not run into 'insufficent instance capacity' issues if AWS is unable to provision instances in that AZ
- Can resell reserved capacity on Reserved Instance Marketplace
- Available for:
- EC2
- RDS (reserved instances)
- DynamoDB (reserved capacity)
- ElastiCache (reserved nodes)
- CloudFront (reserved capacity)
- Elastic MapReduce (reserved EC2 instances)
- ECR (reserved EC2 instances)
↖↑↓ Autoscaling vs Resizing
↖↑↓ Load Balancers
. |
ALB |
NLB |
ELB |
. |
Active Load Balancer |
Network Load Balancer |
Classic Load Balancer |
Layer |
7 (application layer) |
4 (transport layer) |
EC2-classic network (deprecated) |
Protocoll |
HTTP, HTTPS |
TCP |
TCP, SSL, HTTP, HTTPS |
Health checks |
✔ |
✔ |
✔ |
Cloudwatch metrics |
✔ |
✔ |
✔ |
Logging |
✔ |
✔ |
✔ |
Zone failover |
✔ |
✔ |
✔ |
Connection draining |
✔ |
✔ |
✔ |
Load balancing to different ports on the same instance |
✔ |
✔ |
. |
WebSockets |
✔ |
✔ |
. |
IP Addresses as targets |
✔ |
✔ |
. |
Load balancing deletion protection |
✔ |
✔ |
. |
Path-based routing |
✔ |
. |
. |
Host-based routing |
✔ |
. |
. |
Native http/2 |
✔ |
. |
. |
Configurable idle connection timeout |
✔ |
. |
✔ |
Cross zone load-balancing |
✔ |
✔ |
✔ |
SSl-offloading |
✔ |
. |
✔ |
Server-name indication |
✔ |
. |
✔ |
Sticky-sessions |
✔ |
. |
✔ |
Backend server encryption |
✔ |
. |
✔ |
Static IP |
. |
✔ |
. |
Elastic IP |
. |
✔ |
. |
Preserve source IP address |
. |
✔ |
. |
Resource-based IAM permissions |
✔ |
✔ |
✔ |
Tag-based IAM permissions |
✔ |
✔ |
. |
Slow start |
✔ |
. |
. |
User authenticaion |
✔ |
. |
. |
Redirects |
✔ |
. |
. |
Fixed responses |
✔ |
. |
. |
Elastic Load Balancer ('Classic LB')
Overview
- External load balancer
- Public facing
- Often used to distribute load between web servers
- Provides public DNS host name
- Internal load balancer
- Often used to Distribute load between backend servers
- Provides internal DNS host name
- Configure (in AWS console)
- Internal and external load balancer
- Subnets for each AZ that traffic should be routed to
- Can route into private subnets
- Cross-zone load balancing
- Connection draining (maximum time for the load balancer to keep connections alive before reporting the instance as
de-registered)
Sticky Sessions
- Need to make sure that session is maintained between instances
- Load Balancer generated stickiness (duration based session stickiness)
- Application generated stickiness (application based session stickiness)
- For HA, use ElastiCache to persist and share session state. So maintaining
stickiness doesn't matter any more
↖↑↓ RDS HA
- Create subnets in different AZs
- Create subnet group in RDS dashboard
- Collection of subnets (typically private) in a VPC that is desgnated for DB instances
- Should have subnets in at least two Availability Zones in a given region
- Configure RDS for multi-AZ-deployments and turn replication on
- Keeps a synchronous standby replica in a different AZ
- Recommendation is use of Provisioned IOPS
- Automatic failover in case of planned or unplanned outage of the first AZ
- Most likely still has downtime
- Can force failover by rebooting
- Other benefits
- Aurora can replicate accross 3 AZs
- Failover process is automated
- AWS detects an issue and starts the failover process
- DNS records are modified to point to the standby instance
- Application re-establishes existing DB connections
↖↑↓ HA for IP-based Applications
- If the application requires specific IPs (that are hardcoded somewhere), autoscaling cannot be used
- Use Elastic IP and standby instances in different AZs instead
- Cannot use Elastic IP across different regions though
- Scale by increasing instance size (vertical scaling)
↖↑↓ HA/Fault Tolerance for Bastion Hosts
- Assign Elastic IP to bastion host in AZ 1
- This IP can also be whitelisted to comply with corporate regulations
- Have another instance on standby in different AZ
- Could be in ASG (min/max 1), so that it gets immediately replaced
- Place 2 instances behind ELB and enable SSH Keep Alive
- Place 1 instance behind ELB, configure auto recovery
↖↑↓ Analysis
Offloading database workload
Looking at EBS volumes
- EBS pre-warming
- Used to be required for maximum performance
- Performance is reduced the very first time each block is accessed
- Has been renamed to initialization and is no longer required if new EBS volumes are used
- Still required for volumes that are restored from snapshots
- Storage blocks must be initialized (pulled down from Amazon S3 and written to the volume)
- Use
dd
or fio
to read from every block
- Only required if performance matters, obviously
Prewarming ELBs
- ELB is designed to increase its resource capacity gradually
- Prevents
http 503
(ELB cannot handle anymore requests)
- Can contact AWS to
pre-warm
ELB
- This should not really be required. Maybe if TV ads are running or so.
- Use load testing tools to get a rough estimate of what the current ELB can handle
- Increase at a rate no more than 50% per 5min.
Resizing or changing EBS root volumes
- If EBS is at capacity
- Either upgrade volume size to increase the amount of IOPS available
- Or switch to provisiones IOPS volumes (
io1
)
- Resizing
- Create snapshot of EBS volume first
- Incrementally stored on S3
- Can continue to use EBS volume while the snapshot is taking place
- Create new volume from snapshot
- Stop instance
- Attach new volume
Setting up certificates for Elastic Load Balancers
- Offloading overhead from the instances behind the ELB
- Create ELB and configure https
- Certificate from
- ACM (AWS managed)
- IAM (for external certificiates)
- Upload directly
Network bottlenecks
- Primary network bottlenecks
- EC2 instances
- Instances in different AZs or regions
- Different instance types get different bandwith capacities
- No absolute numbers communicated by AWS though
- Not using enhanced network capabilities (not supported by some instance types)
- Check for performance issues with
iperf3
(github)
- Measures performance for ip-based networks
- Use VPC Peering to create a reliable connection
- No single point of failure
- Connection to on-prem networks
↖↑↓ Identify Potential Issues on a Given Application Deployment
EBS Root Devices on Terminated Instances - Ensuring Data Durability
- EBS root volumes will be deleted on instance termination as per default option
- Could create snapshot before termination to backup data
- Could change default settings
- Instance store root volumes will be left untouched on instance termination
Troubleshooting Auto Scaling Issues
- Attempting to use wrong subnet
- AZ no longer available or supported (outage)
- Security group does not exist
- Associated keypair does not exist
- Auto scaling configuration is not working correctly
- Instance type specification does not exist in that AZ
- Auto scaling is not enabled on that subnet
- Invalid EBS device mapping
- Attempt to attach EBS block device to instance-store AMI
- AMI issues
- Attempt to use placement groups with instance types that don't support that
- AWS running out of capacity in that AZ
- If an instance is stopped, e.g. for updating it, autoscaling will consider it unhealthy and
terminate - restart it. Need to suspend autoscaling first.
↖↑↓ OpsWorks
↖↑↓ Overview and components
- Declarative desired state engine
- Automate, monitor and maintain deployments
- Cookbooks define recipes
- AWS' implementation of Chef
- Original Chef
- AWS-bespoke orchestration components
- Components
- Stack
- Set of resources that is managed as a group
- Layer
- Represent and configure components of a stack
- E.g. loadbalancer layer, app layer, db layer
- Share common configuration elements
- Instance
- Units of compute within the platform
- Must be associated with at least one layer
- Can run
- 24/7
- Load-based
- Time-based
- Application
- Applications that are deployed on one or more instances
- Deployed through source code repo or S3
- Recipes
- Created in ruby, used to customize different layers
- Run at stack lifecycle events
setup
- Instance has finished booting
configure
- Instance enters or leaves the
online
state
- Elastic IP is associated or disassociated
- Load balancer is attached or detached
- Event is executed on all instances, not only the impacted one
deploy
- Deploy command is run on an instance
undeploy
- Undeploy command is run on an instance
- App is deleted
shutdown
- When instance is shutdown, before termination
- Allows cleanup
- Under the hood
- OpsWorks agent
- Configuration of machines
- OpsWorks automation engine
- Create, update & delete of various AWS components
- Handles loadbalancing, autoscaling and autohealing
- Supports lifecycle events
BerkShelf
- Addresses an OpsWorks shortcoming from old versions - only one repository for recipes
- Was added in OpsWorks 11.10 and allows to install cookbooks from many repositories
TODO: Quickstart OpsWorks
Overview
- Allows to create and provision resources in a reusable template fashion
- A CloudFormation template is a
JSON
or YAML
formatted text file
- Related resources are managed in a single unit called a stack
- Controls lifecycle of managed resources
- All the resources in a stack are defined by the stack's CloudFormation template
- Stack has
name
& id
- Two ways to update a stack
- Direct update
- Directly applies changes (if any)
- Change set
- Summary of proposed changes, can be applied or rejected
- Will rollback stack if it fails to create (can be disabled via API/console)
- A stack policy is an IAM-style policy statements that governs who can do what
Templates
AWSTemplateFormatVersion
Description
Metadata
- Details about the template
Parameters
- Values to pass in right before template creation
- Type
String
, Number
, List
, CommaDelimitedList
- AWS-specific types like
AWS::EC2::KeyPair::KeyName
- Description
- Default Value
- Allowed Values
- Allowed Pattern
- Validation per regular expression
- MinLength/MaxLength
- MinValue/MaxValue
- Problem:
- Usage of parameters might make it hard to instantiate stacks without human interaction
- CloudFormation is able to auto-generate many resources attributes, e.g. name
Mappings
- Maps keys to values (eg different values for different regions)
Conditions
- Check values before deciding what to do
Resources
- Creates resources. Only mandatory section in a template.
- Can have
Condition
element to toggle creation
Outputs
- Values to be exposed from the console or from API calls.
- Can be used in a different stack (cross stack references)
- Can be:
- Constructed value
- Parameter reference
- Pseudo parameter
- Output from a function like
fn::getAtt
or Ref
Intrinsic Functions
- Used to pass in values that are not available until runtime
- Usable in
resource
properties, metadata
attributes, and update policy
attributes (auto-scaling)
Ref
- Returns the default value of the specified parameter or resource, usually instance id
Fn::GetAtt
- Returns the value of an attribute from an object, either the default or the specified attribute
- Object is either from the same or a nested template
Fn::Join
- Joins a set of values into a single value separated by the specified delimiter
Fn::Sub
- Substitutes variables in an input string with values that you specify
Fn::FindInMap
- Returns the value corresponding to keys in a two-level map that is declared in the Mappings
section
Fn::Select
- Returns a single object from a list of objects by index
Fn::Base64
- Provides encoding, converts from plain text into base64
Fn::GetAZs
- Returns an array that lists Availability Zones for a specified region
- If region is omitted return AZs from the region the template is applied in
Fn::ImportValue
- Returns the value of an Output exported by another stack
Fn::Split
- Split a string into a list of string values so that you can select an element from the resulting
string list
Fn::If
- Takes a list of arguments (
boolean
, string1
, string2
)
- Returns
string1
if boolean
is true
, string2
otherwise
Fn::And
, Fn::Equals
, Fn::Or
, Fn::Not
- Good for
condition
element
↖↑↓ Backups & Recovery
↖↑↓ AWS Services with automated backups
-
RDS
-
Backups
- Transactional storage engine recommended as DB engine
- Degrades performance if multi-AZ is not enabled (taken from slave if enabled)
- Deleting an instance deletes all automated backups
- Backups are stored internaly on S3
- PITR 5 minutes
-
Restoring
- When restoring, only default parameters and security groups are associated with instance
- Can change to different storage engine if closely related and enough space available
-
Elasticache
- Backups
- Available to Redis cluster only
- Taking snaphots can degrade performance, should be performed on read replica
- Backups are stored internaly on S3
-
Redshift
- Backups
- Provides free storage equal to the storage capacity of the cluster
- Snapshots can be automated or manual and are incremental
- Backups are stored internaly on S3
- Restoring
- Creates a new cluster and imports the data
-
EC2
- Backups
- No built-in automated backup solution
- Snapshots of EBS volumes are incremental, causing performance degradation
- Every snapshot will restore all data, even if older snapshots are deleted
- Backups are stored internaly on S3
↖↑↓ Disaster Recovery Scenarios
DR of on-prem infra
- Use AWS as backup solution by storing VMs, snapshots and other data
- 'Pilot light' - have bare minimum infra always ready and scale up as required
- 'Hot standby' (aka 'multi site') - has everything ready to go
DR of cloud infra
- Duplicate the environment from one region to another
DR of RDS data
- Protection from multiple AZs being down
- Reduce latency for global audience
- Replica lag will most likely go up
- Data transfer across regions is getting charged
- May potentially run into bandwith issues
- Create read replica from existing DB instance, pick different region
- Trigger setup process that will take some time
↖↑↓ Storing log files and backups
↖↑↓ Security
↖↑↓ Implement and Manage Security Policies
IAM
IAM is a global service that helps to securely control access to AWS resources.
- Users hold credentials
- Groups hold users, typically only provides permission to assume a role
- Roles hold policies.
- Can have trust relationships with trusted entities that can assume this role
- Policies can be attached to users, groups or roles (preferred)
- An instance profile is a container for an IAM role that you can use to pass role information to an
EC2 instance when the instance starts.
- Users and/or services assume roles
Policies
- Any actions on resources that are not explicitly allowed are denied by default
- Structure
- E -
effect
(allow/deny)
- What the effect will be when the user requests the specific action
- P -
prinicpal
(ARN)
- The account or user who is allowed access to the actions and resources in the statement
- IAM policies do not have a principal (because they are attached to users, groups or roles)
- A -
action
or notaction
- Describes the specific action or actions that will be allowed or denied
- R -
resource
or notresource
- Specifies the object or objects that the statement covers
- C -
condition
- Specifies conditions for when a policy is in effect
- Can use policy variables
aws:currentTime
, aws:userid
, ...
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListAllMyBuckets",
"Resource": "arn:aws:s3:::*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::productionapp"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::productionapp/*"
}
]
}
IAM Policies
- Managed policies (the new way)
- Can be attached to multiple users, groups and roles
- AWS managed policies
- Updated by AWS if new API come out
- Customer managed policies
- Inline policies (the old way)
IAM roles and EC2
S3 IAM and bucket policy concepts
Defaults
- Bucket is owned by the AWS account that created it
- Bucket ownership is not transferable
- Bucket owner gets full permission (ACL)
- The person paying the bills always has full control.
- A person uploading an object into a bucket owns it by default.
Bucket policies (resource level)
- Specify what actions are allowed or denied for which principals on the bucket that the policy
is attached to
- Attached only to S3 buckets. Can however effect object in buckets.
- Contains principal element (unnecessary for IAM policies)
- Use if you’re more interested in “Who can access this S3 bucket?”
- Easiest way to grant cross-account permissions for all
s3:*
permission. (Cannot do this
with ACLs.)
- Explicit deny in bucket policy overwrites explicite allow in IAM policy
- Defined as JSON
{
"Version":"2012-10-17",
"Statement":
[
{
"Sid":"PutObjectAcl",
"Effect":"Allow",
"Principal":
{
"AWS":
[
"arn:aws:iam::111122223333:tom", "arn:aws:iam::444455556666:chris"
]
},
"Action":
[
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource":
[
"arn:aws:s3:::examplebucket/*"
]
}
]
}
ACLs
- Defined as XML. Legacy, not recomended any more.
- Can
- be attached to individual objects (bucket policies only bucket level)
- control access to object uploaded into a bucket from a different account.
- Cannot..
- have conditions
- cannot explicitely deny actions
- grant permission to bucket sub-resources (eg. lifecycle or static website configurations)
- Other than object ACLs there are bucket ACLs as well - only for writing access log objects to a
bucket.
<?xml version="1.0" encoding="UTF-8"?>
<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Owner>
<ID>*** Owner-Canonical-User-ID ***</ID>
<DisplayName>owner-display-name</DisplayName>
</Owner>
<AccessControlList>
<Grant>
<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:type="Canonical User">
<ID>*** Owner-Canonical-User-ID ***</ID>
<DisplayName>display-name</DisplayName>
</Grantee>
<Permission>FULL_CONTROL</Permission>
</Grant>
</AccessControlList>
</AccessControlPolicy>
IAM policies (user level)
- IAM policies (in general) specify what actions are allowed or denied on what AWS resources
- Attached to IAM users, groups, or roles (so they cannot grant access to anonymous users)
- Use if you’re more interested in “What can this user do in AWS?”
. |
. |
arn:partition:service:region:namespace:relative-id |
arn:aws:s3:::mybucket |
arn:aws:s3:::* |
All buckets and objects in account |
arn:aws:s3:::mybucket |
mybucket |
arn:aws:s3:::mybucket/* |
All objects in mybucket |
arn:aws:s3:::mybucket/mykey |
mykey in mybucket |
arn:aws:s3:::mybucket/developers/($aws:username)/ |
folder matching the accessing user's name |
Cloudfront
- Can use Cloudfront Origin Access Identity to restrict access to S3 objects
MFA
-
Should be turned on for all console access
-
Can be enabled for API access as well
- The administrator configures an AWS MFA device for each user who needs to make API requests that
require MFA authentication. This process is described at Enabling MFA Devices.
- The administrator creates policies for the users that include a Condition element that checks
whether the user authenticated with an AWS MFA device.
- The user calls one of the AWS STS API operations that support the MFA parameters
AssumeRole
or
GetSessionToken
, depending on the scenario for MFA protection, as explained later. As part of the
call, the user includes the device identifier for the device that's associated with the user. The
user also includes the time-based one-time password (TOTP) that the device generates. In either case,
the user gets back temporary security credentials that the user can then use to make additional
requests to AWS.
- This is not supported by all services (support by SQS, SNS, S3)
-
MFA delete can be enabled for root accounts (bucket owners) before permanently deleting an object
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"AWS": ["ALICE", "BOB"]},
"Action": [ "s3:PutObject", "s3:DeleteObject" ],
"Resource": ["arn:aws:s3:::Alice-Bucket/*"],
"Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
}]
}
Secure Token Service (STS)
- Allows to grant temporary access to authenticated users
- IAM users
- Web-based identity providers (google, facebook, ...)
- Organization's existing identity system
- Returns temporary credentials that expire after some time:
Terms
- Federation
- Trust relationship between identity provider and AWS
- Identity broker
- Broker in charge of mapping user to the right set of credentials
- Identity store
- Identities
Scenarios
- Temporary credentials with EC2
- Assign IAM role to instance
- Get temp credentials from instance metadata
- Temporary credentials with SDK
- Call
assumeRole
, extract temp credentials
- Options for temporary credentials with API calls
- Sign request with temp credentials
- Add AC/SK to request (header or query string)
↖↑↓ Share responsibility model
- Shared responsibility environment
- AWS is responsible for:
- Server/Host level and below
- Physical environment security
- Hardware decommissioning
- Traffic security (Networks, ACLs, SSL, DDOS-protection)
- EC2 hypervisor isolation
- User is responsible for:
- IAM
- MFA
- Password/key-rotation
- Access advisor (shows used permissions)
- Trusted advisor (validates best practices)
- Security groups
- ACL (resource based policy)
- VPC
↖↑↓ AWS and IT Audits
- AWS performs self audits of changes to key services to monitor quality, maintain high standards, and
facilitate continuous improvement of the change management process
- For audits, AWS provides:
- Security of the cloud
- Information regarding their global infrastructure
- From the host operating system and virtualization layer down to the physical security of facilities
- Annual certifications and reports: (like the Service Organization Control (SOC) reports, ISO 27001
cert, PCI assessments)
- For audits, the customer provides:
- Security in the cloud
- Anything their organization puts on (or connects to) their AWS assets
Examples: guest operating system, apps on virtual machine instances, objects in S3, database like RDS,
etc...
↖↑↓ Networking
↖↑↓ Route53 Routing Policies
- Simple
- Weighted
- Latency
- Failover
- Geolocation
DNS Failover
- Can set up health checks for endpoints or domains from within Route53
- Route 53 has health checkers in locations around the world. When you create a health check that
monitors an endpoint, health checkers start to send requests to the endpoint that you specify
to determine whether the endpoint is healthy.
evaluate target health
- DNS entries are then being associated with health checks and can be configured to failover as
well (1 primary and n secondary recordsets)
Weighted
- Control distribution of traffic with DNS entries
- This can be based on a certain percentage
- Set routing policy to weighted (instead of failover)
Latency-based
- Control distribution of traffic based on latency.
↖↑↓ VPC Essentials
- Provisions a logically isolated section of the AWS cloud
- Spans over all AZs in a region
- Allows to create layered architecture
- Shared or dedicated tenancy (exclusive hardware or not)
- Security groups and subnet network ACLs
- Ability to extend on-premise network to cloud
Default VPC (Amazon specific)
- Gives easy access to a VPC without having to configure it from scratch
- Has different subnets in different AZs and an internet gateway per AZ
- Each instance launched automatically receives a public IP (very different to non-default VPC)
- Cannot be restored if deleted
Non-default VPC (regular VPC)
- Only has private IP addresses
- Resources only accessible through Elastic IP, VPN or internet gateways
- Does not have a gateway attached
VPC Peering
- Connect VPCs through direct network routing
- Can occur between different accounts and VPCs, but must be in the same region
- Allows instances to communicate with each other as if they were in the same network
- CIDRs must not overlap
VPC Scenarios
- VPC with private subnet only -> single tier apps
- VPC with public and private subnets -> layered apps
- VPC with public, private subnets and hardware connected VPN -> extending apps to on-premise
- VPC with private subnets and hardware connected VPN -> extended VPN
Components
- Subnet
- In exactly one AZ
- If a subnet doesn't have a route to the Internet gateway, it's known as a private subnet
- Instances receive
- Private IP address
- Internal DNS hostname
- If traffic is routed to an Internet gateway, the subnet is known as a public subnet
- Instances receive
- Public IP address
- External DNS hostname
- EC2 instances are launched into subnets
- Use ssh-agent forwarding to connect from public to private instances
- Sometimes grouped into Subnet Groups, e.g. for caching or DB. Typically across AZs
- Route Table
- Contains a set of rules, called routes that determine where network traffic is directed to
- Each VPC automatically comes with a main route table that can be configured
- Each subnet in a VPC must be associated with a route table; the table controls the routing
for the subnet. A subnet can only be associated with one route table at a time, but multiple
subnets can be associated with the same route table
- Each route in a table specifies a destination CIDR and a target
- Every route table contains a local route for communication within the VPC
- Can have a default route 0.0.0.0/0 to route everything that doesn't have a specific rule
- Elastic IP
- Static IPv4 address mapped to an instance or network interface
- If attached to network interface it's decoupled from the instance's lifecycle
- Routes to private IP address of instance
- Can be remapped in case of failure.
- For use in a specific region only
- Can only map to instances in public subnets
- Gateways
- Internet Gateway
- Horizontally scaled, redundant, and highly available VPC component that allows communication
between instances in a VPC and the internet
- Provides a target in VPC route tables for internet-routable traffic
- Performs network address translation (NAT) for instances that have been assigned public
IPv4 addresses
- Virtual Private Gateway
- Has VPN connection to customer gateway attached
- Serves as VPN concentrator on the Amazon side of the VPN connection
- Customer Gateway
- A physical device or software application on your side of the VPN connection
- NAT
- NAT Instances
- Manually configured instance from an NAT AMI
- NAT Gateway
Security
Network ACL
- Subnet level, acting as firewall
- Rules for inbound and outbound traffic
- Rules have numbers and are evaluated from low to high, first matching rule wins, others are not evaluated
- Stateless
Security Groups
- Acts as a virtual firewall to control inbound and outbound traffic to instances
- Acts on instance level, not subnet level
- Rules for inbound and outbound traffic
- Stateful - will always allow response to (allowed) outbound traffic
- Can refer to other security group, e.g. allow traffic from there
Structure & package flow
-
VPC (has CIDR)
- Gateway (Internet or VPN)
- Routes (one per subnet, can be shared)
- Network ACL (one per subnet, can be shared)
- Subnets (CIDRs match VPC's CIDR)
- Security Group (on VPC level)
- Instance (needs public IP for internet communication, either ELB or Elastic IP)
-
Flow from internet
- Internet Gateway
- VPC Router (routes into desired subnet)
- Route Table (of that subnet)
- NACL
- Security Group
- Instance
Connection To On-prem Network/Direct Connect
- VPC
- (has attached) Virtual Private Gateway
- (has attached) VPN Connection
- (has attached) Customer Gateway
TODO: VPN vs direct connect. Can I use VPN instead of DC?
↖↑↓ Limits:
. |
. |
VPCs per region |
5 |
Subnets per VPC |
200 |
Customer gateways per region |
50 |
Virtual private gateways per region |
5 |
Virtual private gateways per VPC |
1 |
Gateway per region |
5 Internet |
Elastic IPs per account per region |
5 |
VPN connections per region |
50 |
Route tables per region |
200 |
Security groups per region |
500 |
↖↑↓ Accessing the OS
- Services that allow access the the underlaying OS
- EC2
- ECS
- EB (Elastic Bean Stalk)
- EMR (Elastic Map Reduce)
- OpsWorks
- Services that hide the OS away (managed services)
- Default message retention period: 4 days (max 14 days)
DelaySeconds
will delay a message appearing in the queue
- Setting
WaitTimeSeconds
will enable long polling (can be more cost efficient)
↖↑↓ DynamoDb
- Prefix partition key with hash to enforce even distribution of IO across many partitions