Creating AWS EC2 instances with a specific public IP via AWS Data Pipeline

Problem

Sometimes we want the EC2 instances launched by AWS Data Pipeline to have specific public IP addresses. One simple use case is connecting from the pipeline's EC2 instance to a Redshift cluster in a different AWS account, where that cluster's security group only allows access from specific IP addresses. In such cases, the EC2 instances launched by Data Pipeline need to have public IP addresses that the Redshift cluster in the other account allows.

Understanding EIP, VPC, Subnet, NAT, Internet Gateway

VPC (Virtual Private Cloud): a virtual network dedicated to the AWS account. While creating a VPC, one specifies the range of IP addresses (CIDR block) that can be assigned inside the virtual network. For example, a CIDR of 10.0.0.0/26 gives 2^(32-26) = 2^6 = 64 IP addresses, ranging from 10.0.0.0 to 10.0.0.63, within the VPC.

Subnets within a VPC: containers within the VPC that segment off a slice of the CIDR block defined for the VPC. Subnets let us define different access rules and place resources in the container where those rules should apply. For the above VPC, we can have the subnets 10.0.0.0/28, 10.0.0.16/28, 10.0.0.32/28 and 10.0.0.48/28 (each having 16 IP addresses within the VPC 10.0.0.0/26).
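The CIDR arithmetic above can be checked with a few lines of shell. This is a minimal sketch: the helper names cidr_size and list_subnets are made up for illustration, and list_subnets assumes the parent block starts at a.b.c.0 and fits inside the last octet (prefix 24 or longer), which holds for the 10.0.0.0/26 example.

```shell
#!/bin/sh
# Number of addresses in an IPv4 block with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# Print the equal-sized subnets of a parent block.
# Arguments: first three octets (e.g. 10.0.0), parent prefix, subnet prefix.
# Assumes the parent block starts at offset 0 of the last octet.
list_subnets() {
  base="$1"; parent_prefix="$2"; subnet_prefix="$3"
  parent_size=$(cidr_size "$parent_prefix")
  subnet_size=$(cidr_size "$subnet_prefix")
  offset=0
  while [ "$offset" -lt "$parent_size" ]; do
    echo "$base.$offset/$subnet_prefix"
    offset=$(( offset + subnet_size ))
  done
}

cidr_size 26               # prints 64 (addresses in the VPC block)
cidr_size 28               # prints 16 (addresses in each subnet)
list_subnets 10.0.0 26 28  # prints the four /28 subnets listed above
```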

Private subnet: a subnet that has no direct route to the internet or to external systems.

Public subnet: a subnet that can connect to the external internet via an Internet Gateway, and vice versa. This creates risk, since external systems can reach (and attack) resources in the public subnet.

EIP (Elastic IP): If we require a persistent public IP address that can be assigned to and removed from instances as required, we use an Elastic IP address. To do this, we must allocate an Elastic IP address for use with the VPC, and then associate that Elastic IP address with a private IP address specified by the network interface attached to the instance.
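As a rough illustration of the allocation part, the two AWS CLI calls below allocate an Elastic IP and look up the public address behind it. The function names are hypothetical, and the sketch assumes the AWS CLI is installed with credentials that may call ec2:AllocateAddress and ec2:DescribeAddresses.

```shell
#!/bin/sh
# Allocate a new Elastic IP for use in a VPC and print its allocation ID.
allocate_eip() {
  aws ec2 allocate-address --domain vpc \
    --query 'AllocationId' --output text
}

# Print the public IPv4 address behind an allocation ID. This is the
# address the other account would add to its security group.
# Argument: allocation ID.
eip_public_ip() {
  aws ec2 describe-addresses --allocation-ids "$1" \
    --query 'Addresses[0].PublicIp' --output text
}

# Usage (needs AWS credentials, so it is left commented out):
# ALLOC_ID=$(allocate_eip)
# eip_public_ip "$ALLOC_ID"
```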

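NAT (Network Address Translation): gives instances in a private subnet outbound access to the internet. External systems cannot initiate connections into the private subnet, since NAT only allows connections that originate from the private subnet. A NAT gateway can be associated with an EIP, which then acts as a fixed public IP for all the traffic leaving the subnet via NAT. Please note that a NAT gateway takes one of the private IPs of the subnet it is created in as its network interface; for a NAT gateway that reaches the internet, this is a public subnet of the VPC.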

Route Tables: where we specify how each subnet's traffic is routed, for example to a NAT gateway (private subnet) or to an Internet Gateway (public subnet).


Solutions

1. Associating a specific EIP with the instance: Within a Data Pipeline activity, after the EC2 resource has launched, we can associate the Elastic IP with the instance. The EIP is whitelisted in the inbound rules of the ABBD Redshift security group to allow connections to the cluster. Sample shell script:

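#!/bin/sh
# Look up this instance's ID from the EC2 instance metadata service.
INSTANCE=$(wget -q -O - http://instance-data/latest/meta-data/instance-id)
# Associate the whitelisted Elastic IP (by allocation ID) with this instance.
aws ec2 associate-address --instance-id "$INSTANCE" --allocation-id <elastic IP-eipalloc-ID>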

Pros:

Simple to implement: we just need to add a new activity with the shell script that assigns the whitelisted EIP to the instance.

Cons:

We cannot have multiple pipelines running at the same time, since an EIP can be associated with only one instance at a time. If another EC2 instance requests the same EIP, that can lead to failures. This limits the number of data pipelines that can execute concurrently. We could work around this by creating multiple EIPs (one for each pipeline), but that is not a scalable solution.
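To make the limitation concrete, here is a hedged sketch of a guard that refuses to associate the EIP when it is already attached somewhere. The function name associate_if_free is made up for illustration. The check and the association are two separate calls, so two pipelines starting at the same moment could still collide; the guard only makes the failure explicit and does not remove the limitation.

```shell
#!/bin/sh
# Associate the Elastic IP only if nothing currently holds it.
# Arguments: instance ID, allocation ID.
# Returns non-zero if the EIP is already associated.
associate_if_free() {
  instance_id="$1"; allocation_id="$2"
  # With --output text the CLI prints "None" when the field is absent,
  # i.e. when the address is not associated with anything.
  current=$(aws ec2 describe-addresses --allocation-ids "$allocation_id" \
    --query 'Addresses[0].AssociationId' --output text) || return 1
  if [ -n "$current" ] && [ "$current" != "None" ]; then
    echo "EIP $allocation_id is already in use ($current)" >&2
    return 1
  fi
  aws ec2 associate-address --instance-id "$instance_id" \
    --allocation-id "$allocation_id"
}
```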

2. Launching an instance with a whitelisted IP and running Task Runner on it: We can manually launch an EC2 instance, install Task Runner, associate an EIP with the instance, and then use this instance to perform the Data Pipeline activities. Details here

Cons:

This approach is more scalable than the previous one, but the number of Data Pipeline activities that can run concurrently is still limited by the CPU and memory of the single EC2 box where Task Runner is running.
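For reference, starting Task Runner on such a box looks roughly like the sketch below. The wrapper function and its argument values are hypothetical; it assumes the TaskRunner-1.0.jar file is in the current directory and a credentials file exists at ~/credentials.json.

```shell
#!/bin/sh
# Start Task Runner so that it polls for work addressed to a worker group.
# Arguments: worker group name, region, S3 URI for logs.
start_task_runner() {
  worker_group="$1"; region="$2"; log_uri="$3"
  java -jar TaskRunner-1.0.jar \
    --config "$HOME/credentials.json" \
    --workerGroup="$worker_group" \
    --region="$region" \
    --logUri="$log_uri"
}

# Usage (blocks while Task Runner is polling, so it is left commented out):
# start_task_runner wg-demo us-east-1 s3://example-bucket/task-runner-logs
```

The pipeline activities then name the same worker group in their workerGroup field instead of pointing at an Ec2Resource through runsOn.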

3. Using a NAT gateway: This is the most scalable of the solutions discussed here. In this approach, we create a NAT gateway and associate a whitelisted EIP with it. After that, we create a private subnet in the same VPC (where the NAT gateway has been created) and route all outbound traffic of the private subnet through the NAT gateway. In Data Pipeline, we then create EC2 instances only in this private subnet. This way, all the traffic from any instance created by Data Pipeline goes through the NAT gateway, whose public IP is whitelisted by the other AWS account.
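One way to confirm the setup from inside a pipeline instance is to compare the address that outbound traffic appears to come from with the EIP of the NAT gateway. The sketch below uses the public checkip.amazonaws.com endpoint for that; the function name and the example IP are assumptions for illustration.

```shell
#!/bin/sh
# Compare the public IP seen by the outside world with the expected EIP.
# Argument: the expected public IP (the EIP attached to the NAT gateway).
check_egress_ip() {
  expected="$1"
  actual=$(curl -s https://checkip.amazonaws.com)
  if [ "$actual" = "$expected" ]; then
    echo "OK: egress IP is $actual"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}

# Usage from a shell activity in the pipeline (example IP is a placeholder):
# check_egress_ip 203.0.113.10
```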

Steps for Solution 3

1. Create a new VPC.

2. Create private and public subnets within the VPC. Please note that AWS reserves 5 IP addresses in each subnet for its internal use.

3. Create an internet gateway, attach it to the VPC, and add a route to it in the route table of the public subnet created in the above step.

4. Create a NAT gateway and associate a new EIP with it. Choose the public subnet created in the above step: the NAT gateway takes one of the private IPs of that subnet and needs the subnet's route to the internet gateway to reach the internet.

5. Route all outbound traffic of the private subnet through the NAT gateway by adding a 0.0.0.0/0 route to it in the private subnet's route table.

6. Use the private subnet ID in Data Pipeline (the subnetId field of the Ec2Resource). The EC2 instances are now created inside the private subnet, and all of their outbound traffic goes via the NAT gateway.

7. Whitelist the NAT gateway's EIP in the inbound rules of the Redshift cluster's security group.
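The console steps above can also be sketched with the AWS CLI. This is an illustrative script and not the exact setup used here: the function names are made up, the CIDR blocks reuse the example values from earlier in the post, and the NAT gateway is placed in the public subnet, which it needs in order to reach the internet gateway. The last function has to be run by the account that owns the Redshift cluster, and it assumes the default Redshift port 5439.

```shell
#!/bin/sh
# Create the VPC, subnets, gateways and routes, then print the private
# subnet ID (for Data Pipeline) and the NAT public IP (to be whitelisted).
setup_nat_egress() {
  # Step 1: VPC
  vpc_id=$(aws ec2 create-vpc --cidr-block 10.0.0.0/26 \
    --query 'Vpc.VpcId' --output text) || return 1

  # Step 2: one public and one private subnet
  public_subnet=$(aws ec2 create-subnet --vpc-id "$vpc_id" \
    --cidr-block 10.0.0.0/28 --query 'Subnet.SubnetId' --output text) || return 1
  private_subnet=$(aws ec2 create-subnet --vpc-id "$vpc_id" \
    --cidr-block 10.0.0.16/28 --query 'Subnet.SubnetId' --output text) || return 1

  # Step 3: internet gateway and default route for the public subnet
  igw_id=$(aws ec2 create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId' --output text) || return 1
  aws ec2 attach-internet-gateway --internet-gateway-id "$igw_id" \
    --vpc-id "$vpc_id" >/dev/null || return 1
  public_rt=$(aws ec2 create-route-table --vpc-id "$vpc_id" \
    --query 'RouteTable.RouteTableId' --output text) || return 1
  aws ec2 create-route --route-table-id "$public_rt" \
    --destination-cidr-block 0.0.0.0/0 --gateway-id "$igw_id" >/dev/null || return 1
  aws ec2 associate-route-table --route-table-id "$public_rt" \
    --subnet-id "$public_subnet" >/dev/null || return 1

  # Step 4: NAT gateway with a new Elastic IP, placed in the public subnet
  alloc_id=$(aws ec2 allocate-address --domain vpc \
    --query 'AllocationId' --output text) || return 1
  nat_id=$(aws ec2 create-nat-gateway --subnet-id "$public_subnet" \
    --allocation-id "$alloc_id" \
    --query 'NatGateway.NatGatewayId' --output text) || return 1
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$nat_id" || return 1

  # Step 5: send all outbound traffic of the private subnet through NAT
  private_rt=$(aws ec2 create-route-table --vpc-id "$vpc_id" \
    --query 'RouteTable.RouteTableId' --output text) || return 1
  aws ec2 create-route --route-table-id "$private_rt" \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat_id" >/dev/null || return 1
  aws ec2 associate-route-table --route-table-id "$private_rt" \
    --subnet-id "$private_subnet" >/dev/null || return 1

  # Values needed for the Data Pipeline definition and for whitelisting
  nat_ip=$(aws ec2 describe-addresses --allocation-ids "$alloc_id" \
    --query 'Addresses[0].PublicIp' --output text) || return 1
  echo "private_subnet_id=$private_subnet"
  echo "nat_public_ip=$nat_ip"
}

# Run with credentials of the account that owns the Redshift cluster.
# Arguments: Redshift security group ID, NAT public IP.
allow_on_redshift() {
  aws ec2 authorize-security-group-ingress --group-id "$1" \
    --protocol tcp --port 5439 --cidr "$2/32"
}

# Usage (creates billable resources, so it is left commented out):
# setup_nat_egress
# allow_on_redshift sg-EXAMPLE 203.0.113.10
```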

Published by

Shlok Chaurasia

Software developer at Amazon passionate about building high scale systems.
