AWS

Creating a cluster via the AWS CLI.

Creating a Cluster via the AWS CLI

In this guide we will create an HA Kubernetes cluster with 3 control plane nodes across 3 availability zones. You should have an existing AWS account and have the AWS CLI installed and configured. If you need more information on AWS specifics, please see the official AWS documentation.

To install the dependencies for this tutorial you can use homebrew on macOS or Linux:

brew install siderolabs/tap/talosctl kubectl jq curl xz

If you would like to create infrastructure via terraform or opentofu please see the exmaple in the contrib repository.

Note: this guide is not a production set up and steps were tested in bash and zsh shells.

Create AWS Resources

We will be creating a control plane with 3 Ec2 instances spread across 3 availability zones. It is recommended to not use the default VPC so we will create a new one for this tutorial.

Change to your desired region and CIDR block and create a VPC:

Make sure your subnet does not overlap with 10.244.0.0/16 or 10.96.0.0/12 the default pod and services subnets in Kubernetes.

AWS_REGION="us-west-2"
IPV4_CIDR="10.1.0.0/18"
VPC_ID=$(aws ec2 create-vpc \
    --cidr-block $IPV4_CIDR \
    --output text --query 'Vpc.VpcId')

Create the Subnets

Create 3 smaller CIDRs to use for each subnet in different availability zones. Make sure to adjust these CIDRs if you changed the default value from the last command.

IPV4_CIDRS=( "10.1.0.0/22" "10.1.4.0/22" "10.1.8.0/22" )

Next create a subnet in each availability zones.

Note: If you’re using zsh you need to run setopt KSH_ARRAYS to have arrays referenced properly.

CIDR=0
declare -a SUBNETS
AZS=($(aws ec2 describe-availability-zones \
    --query 'AvailabilityZones[].ZoneName' \
    --filter "Name=state,Values=available" \
    --output text | tr -s '\t' '\n' | head -n3))

for AZ in ${AZS[@]}; do
        SUBNETS[$CIDR]=$(aws ec2 create-subnet \
            --vpc-id $VPC_ID \
            --availability-zone $AZ \
            --cidr-block ${IPV4_CIDRS[$CIDR]} \
            --query 'Subnet.SubnetId' \
            --output text)
        aws ec2 modify-subnet-attribute \
            --subnet-id ${SUBNETS[$CIDR]} \
            --private-dns-hostname-type-on-launch resource-name
        echo ${SUBNETS[$CIDR]}
        ((CIDR++))
done

Create an internet gateway and attach it to the VPC:

IGW_ID=$(aws ec2 create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId' \
    --output text)

aws ec2 attach-internet-gateway \
    --vpc-id $VPC_ID \
    --internet-gateway-id $IGW_ID

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
        --filters "Name=vpc-id,Values=$VPC_ID" \
        --query 'RouteTables[].RouteTableId' \
        --output text)

aws ec2 create-route \
    --route-table-id $ROUTE_TABLE_ID \
    --destination-cidr-block 0.0.0.0/0 \
    --gateway-id $IGW_ID

Official AMI Images

Official AMI image ID can be found in the cloud-images.json file attached to the Talos release.

AMI=$(curl -sL https://github.com/siderolabs/talos/releases/download/v1.7.6/cloud-images.json | \
    jq -r '.[] | select(.region == "'$AWS_REGION'") | select (.arch == "amd64") | .id')
echo $AMI

If using the official AMIs, you can skip to Creating the Security group

Create your own AMIs

The use of the official Talos AMIs are recommended, but if you wish to build your own AMIs, follow the procedure below.

Create the S3 Bucket

aws s3api create-bucket \
    --bucket $BUCKET \
    --create-bucket-configuration LocationConstraint=$AWS_REGION \
    --acl private

Create the vmimport Role

In order to create an AMI, ensure that the vmimport role exists as described in the official AWS documentation.

Note that the role should be associated with the S3 bucket we created above.

Create the Image Snapshot

First, download the AWS image from a Talos release:

curl -L https://github.com/siderolabs/talos/releases/download/v1.7.6/aws-amd64.raw.xz | xz -d > disk.raw

Copy the RAW disk to S3 and import it as a snapshot:

aws s3 cp disk.raw s3://$BUCKET/talos-aws-tutorial.raw
$SNAPSHOT_ID=$(aws ec2 import-snapshot \
    --region $REGION \
    --description "Talos kubernetes tutorial" \
    --disk-container "Format=raw,UserBucket={S3Bucket=$BUCKET,S3Key=talos-aws-tutorial.raw}" \
    --query 'SnapshotId' \
    --output text)

To check on the status of the import, run:

aws ec2 describe-import-snapshot-tasks \
    --import-task-ids

Once the SnapshotTaskDetail.Status indicates completed, we can register the image.

Register the Image

AMI=$(aws ec2 register-image \
    --block-device-mappings "DeviceName=/dev/xvda,VirtualName=talos,Ebs={DeleteOnTermination=true,SnapshotId=$SNAPSHOT_ID,VolumeSize=4,VolumeType=gp2}" \
    --root-device-name /dev/xvda \
    --virtualization-type hvm \
    --architecture x86_64 \
    --ena-support \
    --name talos-aws-tutorial-ami \
    --query 'ImageId' \
    --output text)

We now have an AMI we can use to create our cluster.

Create a Security Group

SECURITY_GROUP_ID=$(aws ec2 create-security-group \
    --vpc-id $VPC_ID \
    --group-name talos-aws-tutorial-sg \
    --description "Security Group for EC2 instances to allow ports required by Talos" \
    --query 'GroupId' \
    --output text)

Using the security group from above, allow all internal traffic within the same security group:

aws ec2 authorize-security-group-ingress \
    --group-id $SECURITY_GROUP_ID \
    --protocol all \
    --port 0 \
    --source-group $SECURITY_GROUP_ID

Expose the Talos (50000) and Kubernetes API.

Note: This is only required for the control plane nodes. For a production environment you would want separate private subnets for worker nodes.

aws ec2 authorize-security-group-ingress \
    --group-id $SECURITY_GROUP_ID \
    --ip-permissions \
        IpProtocol=tcp,FromPort=50000,ToPort=50000,IpRanges="[{CidrIp=0.0.0.0/0}]" \
        IpProtocol=tcp,FromPort=6443,ToPort=6443,IpRanges="[{CidrIp=0.0.0.0/0}]" \
    --query 'SecurityGroupRules[].SecurityGroupRuleId' \
    --output text

We will bootstrap Talos with a MachineConfig via user-data it will never be exposed to the internet without certificate authentication.

We enable KubeSpan in this tutorial so you need to allow inbound UDP for the Wireguard port:

aws ec2 authorize-security-group-ingress \
    --group-id $SECURITY_GROUP_ID \
    --ip-permissions \
        IpProtocol=tcp,FromPort=51820,ToPort=51820,IpRanges="[{CidrIp=0.0.0.0/0}]" \
    --query 'SecurityGroupRules[].SecurityGroupRuleId' \
    --output text

Create a Load Balancer

The load balancer is used for a stable Kubernetes API endpoint.

LOAD_BALANCER_ARN=$(aws elbv2 create-load-balancer \
    --name talos-aws-tutorial-lb \
    --subnets $(echo ${SUBNETS[@]}) \
    --type network \
    --ip-address-type ipv4 \
    --query 'LoadBalancers[].LoadBalancerArn' \
    --output text)

LOAD_BALANCER_DNS=$(aws elbv2 describe-load-balancers \
    --load-balancer-arns $LOAD_BALANCER_ARN \
    --query 'LoadBalancers[].DNSName' \
    --output text)

Now create a target group for the load balancer:

TARGET_GROUP_ARN=$(aws elbv2 create-target-group \
    --name talos-aws-tutorial-tg \
    --protocol TCP \
    --port 6443 \
    --target-type instance \
    --vpc-id $VPC_ID \
    --query 'TargetGroups[].TargetGroupArn' \
    --output text)

LISTENER_ARN=$(aws elbv2 create-listener \
    --load-balancer-arn $LOAD_BALANCER_ARN \
    --protocol TCP \
    --port 6443 \
    --default-actions Type=forward,TargetGroupArn=$TARGET_GROUP_ARN \
    --query 'Listeners[].ListenerArn' \
    --output text)

Create the Machine Configuration Files

We will create a machine config patch to use the AWS time servers. You can create additional patches to customize the configuration as needed.

cat <<EOF > time-server-patch.yaml
machine:
  time:
    servers:
      - 169.254.169.123
EOF

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines.

talosctl gen config talos-k8s-aws-tutorial https://${LOAD_BALANCER_DNS}:6443 \
    --with-examples=false \
    --with-docs=false \
    --with-kubespan \
    --install-disk /dev/xvda \
    --config-patch '@time-server-patch.yaml'

Note that the generated configs are too long for AWS userdata field if the --with-examples and --with-docs flags are not passed.

Create the EC2 Instances

Note: There is a known issue that prevents Talos from running on T2 instance types. Please use T3 if you need burstable instance types.

Create the Control Plane Nodes

declare -a CP_INSTANCES
INSTANCE_INDEX=0
for SUBNET in ${SUBNETS[@]}; do
    CP_INSTANCES[${INSTANCE_INDEX}]=$(aws ec2 run-instances \
        --image-id $AMI \
        --subnet-id $SUBNET \
        --instance-type t3.small \
        --user-data file://controlplane.yaml \
        --associate-public-ip-address \
        --security-group-ids $SECURITY_GROUP_ID \
        --count 1 \
        --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=talos-aws-tutorial-cp-$INSTANCE_INDEX}]" \
        --query 'Instances[].InstanceId' \
        --output text)
    echo ${CP_INSTANCES[${INSTANCE_INDEX}]}
    ((INSTANCE_INDEX++))
done

Create the Worker Nodes

For the worker nodes we will create a new launch template with the worker.yaml machine configuration and create an autoscaling group.

WORKER_LAUNCH_TEMPLATE_ID=$(aws ec2 create-launch-template \
    --launch-template-name talos-aws-tutorial-worker \
    --launch-template-data '{
        "ImageId":"'$AMI'",
        "InstanceType":"t3.small",
        "UserData":"'$(base64 -w0 worker.yaml)'",
        "NetworkInterfaces":[{
            "DeviceIndex":0,
            "AssociatePublicIpAddress":true,
            "Groups":["'$SECURITY_GROUP_ID'"],
            "DeleteOnTermination":true
        }],
        "BlockDeviceMappings":[{
            "DeviceName":"/dev/xvda",
            "Ebs":{
                "VolumeSize":20,
                "VolumeType":"gp3",
                "DeleteOnTermination":true
            }
        }],
        "TagSpecifications":[{
            "ResourceType":"instance",
            "Tags":[{
          "Key":"Name",
          "Value":"talos-aws-tutorial-worker"
          }]
        }]}' \
    --query 'LaunchTemplate.LaunchTemplateId' \
    --output text)
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name talos-aws-tutorial-worker \
    --min-size 1 \
    --max-size 3 \
    --desired-capacity 1 \
    --availability-zones $(echo ${AZS[@]}) \
    --target-group-arns $TARGET_GROUP_ARN \
    --launch-template "LaunchTemplateId=${WORKER_LAUNCH_TEMPLATE_ID}" \
    --vpc-zone-identifier $(echo ${SUBNETS[@]} | tr ' ' ',')

Configure the Load Balancer

Now, using the load balancer target group’s ARN, and the PrivateIpAddress from the controlplane instances that you created :

for INSTANCE in ${CP_INSTANCES[@]}; do
    aws elbv2 register-targets \
    --target-group-arn $TARGET_GROUP_ARN \
    --targets Id=$(aws ec2 describe-instances \
        --instance-ids $INSTANCE \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text)
done

Bootstrap etcd

Export the talosconfig file so commands sent to Talos will be authenticated.

export TALOSCONFIG=$(pwd)/talosconfig

WORKER_INSTANCES=( $(aws autoscaling \
    describe-auto-scaling-instances \
    --query 'AutoScalingInstances[?AutoScalingGroupName==`talos-aws-tutorial-worker`].InstanceId' \
    --output text) )

Set the endpoints (the control plane node to which talosctl commands are sent) and nodes (the nodes that the command operates on):

talosctl config endpoints $(aws ec2 describe-instances \
    --instance-ids ${CP_INSTANCES[*]} \
    --query 'Reservations[].Instances[].PublicIpAddress' \
    --output text)

talosctl config nodes $(aws ec2 describe-instances \
    --instance-ids $(echo ${CP_INSTANCES[1]}) \
    --query 'Reservations[].Instances[].PublicIpAddress' \
    --output text)

Bootstrap etcd:

talosctl bootstrap

You can now watch as your cluster bootstraps, by using

talosctl health

This command will take a few minutes for the nodes to start etcd, reach quarom and start the Kubernetes control plane.

You can also watch the performance of a node, via:

talosctl dashboard

Retrieve the kubeconfig

When the cluster is healthy you can retrieve the admin kubeconfig by running:

talosctl kubeconfig .
export KUBECONFIG=$(pwd)/kubeconfig

And use standard kubectl commands.

kubectl get nodes

Cleanup resources

If you would like to delete all of the resources you created during this tutorial you can run the following commands.

aws elbv2 delete-listener --listener-arn $LISTENER_ARN
aws elbv2 delete-target-group --target-group-arn $TARGET_GROUP_ARN
aws elbv2 delete-load-balancer --load-balancer-arn $LOAD_BALANCER_ARN

aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name talos-aws-tutorial-worker \
    --min-size 0 \
    --max-size 0 \
    --desired-capacity 0

aws ec2 terminate-instances --instance-ids ${CP_INSTANCES[@]} ${WORKER_INSTANCES[@]} \
    --query 'TerminatingInstances[].InstanceId' \
    --output text

aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name talos-aws-tutorial-worker \
    --force-delete

aws ec2 delete-launch-template --launch-template-id $WORKER_LAUNCH_TEMPLATE_ID

while $(aws ec2 describe-instances \
    --instance-ids ${CP_INSTANCES[@]} ${WORKER_INSTANCES[@]} \
    --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
    --output text | grep -q shutting-down); do \
        echo "waiting for instances to terminate"; sleep 5s
done

aws ec2 detach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID
aws ec2 delete-internet-gateway --internet-gateway-id $IGW_ID

aws ec2 delete-security-group --group-id $SECURITY_GROUP_ID

for SUBNET in ${SUBNETS[@]}; do
    aws ec2 delete-subnet --subnet-id $SUBNET
done

aws ec2 delete-vpc --vpc-id $VPC_ID

rm -f controlplane.yaml worker.yaml talosconfig kubeconfig time-server-patch.yaml disk.raw