AWS Interview Preparation

Data Pipeline

- AWS Data Pipeline helps you move, integrate, and process data across AWS compute and storage resources, as well as your on-premises resources. AWS Data Pipeline supports integration of data and activities across multiple AWS regions.
- AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.
AWS Data Pipeline Functional Overview

The following components of AWS Data Pipeline work together to manage your data:

  • A pipeline definition specifies the business logic of your data management.

  • A pipeline schedules and runs tasks by creating Amazon EC2 instances to perform the defined work activities. You upload your pipeline definition to the pipeline, and then activate the pipeline. You can edit the pipeline definition for a running pipeline and activate the pipeline again for the changes to take effect. You can deactivate the pipeline, modify a data source, and then activate the pipeline again. When you are finished with your pipeline, you can delete it.

  • Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. 

A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:

  • Names, locations, and formats of your data sources

  • Activities that transform the data

  • The schedule for those activities

  • Resources that run your activities and preconditions

  • Preconditions that must be satisfied before the activities can be scheduled

  • Ways to alert you with status updates as pipeline execution proceeds

From your pipeline definition, AWS Data Pipeline determines the tasks, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.

For example, in your pipeline definition, you might specify that log files generated by your application are archived each month in 2013 to an Amazon S3 bucket. AWS Data Pipeline would then create 12 tasks, each copying over a month's worth of data, regardless of whether the month contained 30, 31, 28, or 29 days.

You can create a pipeline definition in the following ways:

  • Graphically, by using the AWS Data Pipeline console

  • Textually, by writing a JSON file in the format used by the command line interface

  • Programmatically, by calling the web service with either one of the AWS SDKs or the AWS Data Pipeline API
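As a sketch of the JSON format, the following hypothetical definition copies a file between two Amazon S3 locations once a day. The object IDs, bucket paths, and schedule values are illustrative, not prescribed names:

```json
{
  "objects": [
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startDateTime": "2013-01-01T00:00:00"
    },
    {
      "id": "InputData",
      "type": "S3DataNode",
      "schedule": { "ref": "DailySchedule" },
      "filePath": "s3://example-bucket/input/data.csv"
    },
    {
      "id": "OutputData",
      "type": "S3DataNode",
      "schedule": { "ref": "DailySchedule" },
      "filePath": "s3://example-bucket/output/data.csv"
    },
    {
      "id": "WorkerInstance",
      "type": "Ec2Resource",
      "schedule": { "ref": "DailySchedule" },
      "instanceType": "m1.small",
      "terminateAfter": "1 hour"
    },
    {
      "id": "CopyData",
      "type": "CopyActivity",
      "schedule": { "ref": "DailySchedule" },
      "runsOn": { "ref": "WorkerInstance" },
      "input": { "ref": "InputData" },
      "output": { "ref": "OutputData" }
    }
  ]
}
```

Note how components refer to one another by ID using the { "ref": ... } syntax; this is how relationships among pipeline components are expressed.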

Pipeline Components

A pipeline definition can contain the following types of components:

  • Data Nodes — The location of input data for a task or the location where output data is to be stored.

  • Activities — A definition of work to perform on a schedule, using a computational resource and typically input and output data nodes.

  • Preconditions — A conditional statement that must be true before an action can run.

  • Schedules — Defines the timing of a scheduled event, such as when an activity runs.

  • Resources — The computational resource that performs the work that a pipeline defines.

  • Actions — An action that is triggered when specified conditions are met, such as the failure of an activity.

Scheduling Pipelines

There are three types of items associated with a scheduled pipeline:

  • Pipeline Components — Pipeline components represent the business logic of the pipeline and are represented by the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components. Relationships among components are defined by reference. Pipeline components define the rules of data management.

  • Instances — When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information for performing a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.

  • Attempts — To provide robust data management, AWS Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the various attempts, results, and failure reasons if applicable. Essentially, it is the instance with a counter. AWS Data Pipeline performs retries using the same resources from the previous attempts, such as Amazon EMR clusters and EC2 instances.


Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure, because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may also accrue additional charges if they run on AWS resources. As a result, carefully consider when it is appropriate to exceed the default settings that AWS Data Pipeline uses to control retries and related behavior.
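Retry behavior is configured per object. For example, an activity definition might cap retries with the maximumRetries and retryDelay fields, as in this hypothetical fragment (the object IDs and references are illustrative):

```json
{
  "id": "CopyData",
  "type": "CopyActivity",
  "schedule": { "ref": "DailySchedule" },
  "input": { "ref": "InputData" },
  "output": { "ref": "OutputData" },
  "maximumRetries": "2",
  "retryDelay": "10 minutes"
}
```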

[Diagram: AWS Data Pipeline components, instances, and attempts]

For more information, see Pipeline Definition File Syntax.

Task Runners

A task runner is an application that polls AWS Data Pipeline for tasks and then performs those tasks.

Task Runner is a default implementation of a task runner that is provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs that task and reports its status back to AWS Data Pipeline.

The following diagram illustrates how AWS Data Pipeline and a task runner interact to process a scheduled task. A task is a discrete unit of work that the AWS Data Pipeline service shares with a task runner. It differs from a pipeline, which is a general definition of activities and resources that usually yields several tasks.

[Diagram: AWS Data Pipeline task lifecycle]

There are two ways you can use Task Runner to process your pipeline:

  • AWS Data Pipeline installs Task Runner for you on resources that are launched and managed by the AWS Data Pipeline web service.

  • You install Task Runner on a computational resource that you manage, such as a long-running EC2 instance, or an on-premises server.

For more information about working with Task Runner, see Working with Task Runner.

Data Nodes

In AWS Data Pipeline, a data node defines the location and type of data that a pipeline activity uses as input or output. AWS Data Pipeline supports the following types of data nodes:

  • DynamoDBDataNode — A DynamoDB table that contains data for HiveActivity or EmrActivity to use.

  • SqlDataNode — An SQL table and database query that represent data for a pipeline activity to use. (Previously, MySqlDataNode was used; use SqlDataNode instead.)

  • RedshiftDataNode — An Amazon Redshift table that contains data for RedshiftCopyActivity to use.

  • S3DataNode — An Amazon S3 location that contains one or more files for a pipeline activity to use.
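For instance, an S3DataNode sketch might look like the following. The bucket and schedule names are hypothetical, and the #{...} expression selects a date-stamped folder for each scheduled run:

```json
{
  "id": "InputLogs",
  "type": "S3DataNode",
  "schedule": { "ref": "DailySchedule" },
  "directoryPath": "s3://example-bucket/logs/#{format(@scheduledStartTime, 'YYYY-MM-dd')}"
}
```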

Databases

AWS Data Pipeline supports the following types of databases:

  • JdbcDatabase — A JDBC database.

  • RdsDatabase — An Amazon RDS database.

  • RedshiftDatabase — An Amazon Redshift database.

Activities

In AWS Data Pipeline, an activity is a pipeline component that defines the work to perform. AWS Data Pipeline provides several pre-packaged activities that accommodate common scenarios, such as moving data from one location to another, running Hive queries, and so on. Activities are extensible, so you can run your own custom scripts to support endless combinations.

AWS Data Pipeline supports the following types of activities:

  • CopyActivity — Copies data from one location to another.

  • EmrActivity — Runs an Amazon EMR cluster.

  • HiveActivity — Runs a Hive query on an Amazon EMR cluster.

  • HiveCopyActivity — Runs a Hive query on an Amazon EMR cluster, with support for advanced data filtering and for S3DataNode and DynamoDBDataNode.

  • PigActivity — Runs a Pig script on an Amazon EMR cluster.

  • RedshiftCopyActivity — Copies data to and from Amazon Redshift tables.

  • ShellCommandActivity — Runs a custom UNIX/Linux shell command as an activity.

  • SqlActivity — Runs a SQL query on a database.

Some activities have special support for staging data and database tables. For more information, see Staging Data and Tables with Pipeline Activities.
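As a hedged sketch of staging, a ShellCommandActivity with stage enabled can read from and write to staging directories that AWS Data Pipeline populates from its input and output data nodes. The IDs, bucket contents, and command below are illustrative:

```json
{
  "id": "CountRequests",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "WorkerInstance" },
  "stage": "true",
  "input": { "ref": "InputLogs" },
  "output": { "ref": "OutputCounts" },
  "command": "grep -c GET ${INPUT1_STAGING_DIR}/*.log > ${OUTPUT1_STAGING_DIR}/output.txt"
}
```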

Preconditions

In AWS Data Pipeline, a precondition is a pipeline component containing conditional statements that must be true before an activity can run. For example, a precondition can check whether source data is present before a pipeline activity attempts to copy it. AWS Data Pipeline provides several pre-packaged preconditions that accommodate common scenarios, such as whether a database table exists, whether an Amazon S3 key is present, and so on. However, preconditions are extensible and allow you to run your own custom scripts to support endless combinations.

There are two types of preconditions: system-managed preconditions and user-managed preconditions. System-managed preconditions are run by the AWS Data Pipeline web service on your behalf and do not require a computational resource. User-managed preconditions run only on the computational resource that you specify using the runsOn or workerGroup fields. The workerGroup resource is derived from the activity that uses the precondition.

System-Managed Preconditions

  • DynamoDBDataExists — Checks whether data exists in a specific DynamoDB table.

  • DynamoDBTableExists — Checks whether a DynamoDB table exists.

  • S3KeyExists — Checks whether an Amazon S3 key exists.

  • S3PrefixNotEmpty — Checks whether an Amazon S3 prefix contains at least one object.

User-Managed Preconditions

  • Exists — Checks whether a data node exists.

  • ShellCommandPrecondition — Runs a custom Unix/Linux shell command as a precondition.
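For example, a copy activity can be gated on an S3KeyExists precondition via the precondition field. The object IDs and S3 key in this sketch are hypothetical:

```json
[
  {
    "id": "InputReady",
    "type": "S3KeyExists",
    "s3Key": "s3://example-bucket/input/ready.trigger"
  },
  {
    "id": "CopyData",
    "type": "CopyActivity",
    "precondition": { "ref": "InputReady" },
    "input": { "ref": "InputData" },
    "output": { "ref": "OutputData" }
  }
]
```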

Resources

In AWS Data Pipeline, a resource is the computational resource that performs the work that a pipeline activity specifies. AWS Data Pipeline supports the following types of resources:

  • Ec2Resource — An EC2 instance that performs the work defined by a pipeline activity.

  • EmrCluster — An Amazon EMR cluster that performs the work defined by a pipeline activity, such as EmrActivity.

Resources can run in the same region as their working dataset, even a region different from the one in which AWS Data Pipeline operates. For more information, see Using a Pipeline with Resources in Multiple Regions.
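A minimal Ec2Resource sketch that pins the instance to a specific region and limits its lifetime might look like this (the ID and values are illustrative):

```json
{
  "id": "WorkerInstance",
  "type": "Ec2Resource",
  "instanceType": "m1.small",
  "region": "eu-west-1",
  "terminateAfter": "2 hours"
}
```

The terminateAfter field is a useful safeguard against automatically created instances running, and accruing charges, longer than intended.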

Resource Limits

AWS Data Pipeline scales to accommodate a huge number of concurrent tasks and you can configure it to automatically create the resources necessary to handle large workloads. These automatically created resources are under your control and count against your AWS account resource limits. For example, if you configure AWS Data Pipeline to create a 20-node Amazon EMR cluster automatically to process data and your AWS account has an EC2 instance limit set to 20, you may inadvertently exhaust your available backfill resources. As a result, consider these resource restrictions in your design or increase your account limits accordingly. For more information about service limits, see AWS Service Limits in the AWS General Reference.


Note that the limit is one instance per Ec2Resource component object.

Supported Platforms

Pipelines can launch your resources into the following platforms:

  • EC2-Classic — Your resources run in a single, flat network that you share with other customers.

  • EC2-VPC — Your resources run in a virtual private cloud (VPC) that's logically isolated to your AWS account.

Your AWS account can launch resources either into both platforms or only into EC2-VPC, on a region by region basis. For more information, see Supported Platforms in the Amazon EC2 User Guide for Linux Instances.

If your AWS account supports only EC2-VPC, we create a default VPC for you in each AWS Region. By default, we launch your resources into a default subnet of your default VPC. Alternatively, you can create a nondefault VPC and specify one of its subnets when you configure your resources, and then we launch your resources into the specified subnet of the nondefault VPC.

When you launch an instance into a VPC, you must specify a security group created specifically for that VPC. You can't specify a security group that you created for EC2-Classic when you launch an instance into a VPC. In addition, you must use the security group ID and not the security group name to identify a security group for a VPC.

For more information about using a VPC with AWS Data Pipeline, see Launching Resources for Your Pipeline into a VPC.

Amazon EC2 Spot Instances with Amazon EMR Clusters and AWS Data Pipeline

Pipelines can use Amazon EC2 Spot Instances for the task nodes in their Amazon EMR cluster resources. By default, pipelines use On-Demand Instances. Spot Instances let you use spare Amazon EC2 capacity, typically at a lower price than On-Demand Instances. The Spot Instance pricing model complements the On-Demand and Reserved Instance pricing models, potentially providing the most cost-effective option for obtaining compute capacity, depending on your application. For more information, see the Amazon EC2 Spot Instances product page.

When you use Spot Instances, AWS Data Pipeline submits your Spot Instance maximum price to Amazon EMR when your cluster is launched. It automatically allocates the cluster's work to the number of Spot Instance task nodes that you define using the taskInstanceCount field. AWS Data Pipeline limits Spot Instances for task nodes to ensure that on-demand core nodes are available to run your pipeline.

You can edit a failed or completed pipeline resource instance to add Spot Instances. When the pipeline re-launches the cluster, it uses Spot Instances for the task nodes.
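An EmrCluster resource sketch that requests Spot task nodes via the taskInstanceCount and taskInstanceBidPrice fields might look like the following. The values are illustrative, not recommendations:

```json
{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "masterInstanceType": "m1.small",
  "coreInstanceType": "m1.small",
  "coreInstanceCount": "2",
  "taskInstanceType": "m1.small",
  "taskInstanceCount": "4",
  "taskInstanceBidPrice": "0.10"
}
```

The master and core nodes remain On-Demand Instances, which is why a Spot interruption of the task nodes does not cause data loss.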

Spot Instances Considerations

When you use Spot Instances with AWS Data Pipeline, the following considerations apply:

  • Your Spot Instances can terminate when the Spot Instance price goes above your maximum price for the instance, or due to Amazon EC2 capacity reasons. However, you do not lose your data because AWS Data Pipeline employs clusters with core nodes that are always On-Demand Instances and not subject to termination.

  • Spot Instances can take longer to start because they are fulfilled asynchronously. Therefore, a Spot Instance pipeline could run more slowly than an equivalent On-Demand Instance pipeline.

  • Your cluster might not run if you do not receive your Spot Instances, such as when your maximum price is too low.

Actions

AWS Data Pipeline actions are steps that a pipeline component takes when certain events occur, such as success, failure, or late activities. The event field of an activity refers to an action; for example, the onLateAction field of EmrActivity can contain a reference to an SnsAlarm.

AWS Data Pipeline relies on Amazon SNS notifications as the primary way to indicate the status of pipelines and their components in an unattended manner. For more information, see Amazon SNS. In addition to SNS notifications, you can use the AWS Data Pipeline console and CLI to obtain pipeline status information.

AWS Data Pipeline supports the following actions:

  • SnsAlarm — An action that sends an SNS notification to a topic, based on onSuccess, onFail, and onLateAction events.

  • Terminate — An action that triggers the cancellation of a pending or unfinished activity, resource, or data node. You cannot terminate actions that include onSuccess, onFail, or onLateAction.
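For example, an SnsAlarm can be wired to an activity's onFail event. The topic ARN and object IDs in this sketch are hypothetical:

```json
[
  {
    "id": "FailureAlarm",
    "type": "SnsAlarm",
    "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
    "subject": "Pipeline activity failed",
    "message": "Activity #{node.name} failed."
  },
  {
    "id": "CopyData",
    "type": "CopyActivity",
    "onFail": { "ref": "FailureAlarm" },
    "input": { "ref": "InputData" },
    "output": { "ref": "OutputData" }
  }
]
```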

Getting Started with AWS Data Pipeline

AWS Data Pipeline helps you sequence, schedule, run, and manage recurring data processing workloads reliably and cost-effectively. This service makes it easy for you to design extract-transform-load (ETL) activities using structured and unstructured data, both on-premises and in the cloud, based on your business logic.

To use AWS Data Pipeline, you create a pipeline definition that specifies the business logic for your data processing. A typical pipeline definition consists of activities that define the work to perform, data nodes that define the location and type of input and output data, and a schedule that determines when the activities are performed.

In this tutorial, you run a shell command script that counts the number of GET requests in Apache web server logs. This pipeline runs every 15 minutes for an hour, and writes output to Amazon S3 on each iteration.


Before you begin, complete the tasks in Setting up for AWS Data Pipeline.

Pipeline Objects

The pipeline uses the following objects:

  • ShellCommandActivity — Reads the input log file and counts the number of GET requests.

  • S3DataNode (input) — The S3 bucket that contains the input log file.

  • S3DataNode (output) — The S3 bucket for the output.

  • Ec2Resource — The compute resource that AWS Data Pipeline uses to perform the activity. Note that if you have a large amount of log file data, you can configure your pipeline to use an EMR cluster instead of an EC2 instance to process the files.

  • Schedule — Defines that the activity is performed every 15 minutes for an hour.

Create the Pipeline

The quickest way to get started with AWS Data Pipeline is to use a pipeline definition called a template.

To create the pipeline

  1. Open the AWS Data Pipeline console at https://console.aws.amazon.com/datapipeline/.

  2. From the navigation bar, select a region. You can select any region that's available to you, regardless of your location. Many AWS resources are specific to a region, but AWS Data Pipeline enables you to use resources that are in a different region than the pipeline.

  3. The first screen that you see depends on whether you've created a pipeline in the current region.

    1. If you haven't created a pipeline in this region, the console displays an introductory screen. Choose Get started now.

    2. If you've already created a pipeline in this region, the console displays a page that lists your pipelines for the region. Choose Create new pipeline.

  4. In Name, enter a name for your pipeline.

  5. (Optional) In Description, enter a description for your pipeline.

  6. For Source, select Build using a template, and then select the following template: Getting Started using ShellCommandActivity.

  7. Under the Parameters section, which opened when you selected the template, leave S3 input folder and Shell command to run with their default values. Click the folder icon next to S3 output folder, select one of your buckets or folders, and then click Select.

  8. Under Schedule, leave the default values. When you activate the pipeline, the runs start and then continue every 15 minutes for an hour.

    If you prefer, you can select Run once on pipeline activation instead.

  9. Under Pipeline Configuration, leave logging enabled. Choose the folder icon under S3 location for logs, select one of your buckets or folders, and then choose Select.

    If you prefer, you can disable logging instead.

  10. Under Security/Access, leave IAM roles set to Default.

  11. Click Activate.

    If you prefer, you can choose Edit in Architect to modify this pipeline. For example, you can add preconditions.

Monitor the Running Pipeline

After you activate your pipeline, you are taken to the Execution details page where you can monitor the progress of your pipeline.

To monitor the progress of your pipeline

  1. Click Update or press F5 to update the status displayed.


    If there are no runs listed, ensure that Start (in UTC) and End (in UTC) cover the scheduled start and end of your pipeline, and then click Update.

  2. When the status of every object in your pipeline is FINISHED, your pipeline has successfully completed the scheduled tasks.

  3. If your pipeline doesn't complete successfully, check your pipeline settings for issues. For more information about troubleshooting failed or incomplete instance runs of your pipeline, see Resolving Common Problems.

View the Output

Open the Amazon S3 console and navigate to your bucket. If you ran your pipeline every 15 minutes for an hour, you'll see four time-stamped subfolders. Each subfolder contains output in a file named output.txt. Because we ran the script on the same input file each time, the output files are identical.

Delete the Pipeline

To stop incurring charges, delete your pipeline. Deleting your pipeline deletes the pipeline definition and all associated objects.

To delete your pipeline

  1. On the List Pipelines page, select your pipeline.

  2. Click Actions, and then choose Delete.

  3. When prompted for confirmation, choose Delete.

If you are finished with the output from this tutorial, delete the output folders from your Amazon S3 bucket.

For example, you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon EMR cluster over those logs to generate traffic reports. AWS Data Pipeline schedules the daily tasks to copy data and the weekly task to launch the Amazon EMR cluster. AWS Data Pipeline also ensures that Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.

Working with AWS Data Pipeline involves three main steps:

  • Define Data Nodes — Select input and output data from S3, DynamoDB, Redshift, RDS, and JDBC sources.

  • Schedule Compute Activities — Configure the activities that will process data using EMR, Hive, Pig, SQL, and Shell scripts.

  • Activate & Monitor — Activate your pipeline, then let AWS Data Pipeline manage the pipeline execution, resources, retry logic, and failure notifications for you.

Accessing AWS Data Pipeline

You can create, access, and manage your pipelines using any of the following interfaces:

  • AWS Management Console — Provides a web interface that you can use to access AWS Data Pipeline.

  • AWS Command Line Interface (AWS CLI) — Provides commands for a broad set of AWS services, including AWS Data Pipeline, and is supported on Windows, macOS, and Linux. For more information about installing the AWS CLI, see AWS Command Line Interface. For a list of commands for AWS Data Pipeline, see datapipeline.

  • AWS SDKs — Provides language-specific APIs and takes care of many of the connection details, such as calculating signatures, handling request retries, and error handling. For more information, see AWS SDKs.

  • Query API — Provides low-level APIs that you call using HTTPS requests. Using the Query API is the most direct way to access AWS Data Pipeline, but it requires that your application handle low-level details such as generating the hash to sign the request, and error handling. For more information, see the AWS Data Pipeline API Reference.


Pricing

With Amazon Web Services, you pay only for what you use. For AWS Data Pipeline, you pay for your pipeline based on how often your activities and preconditions are scheduled to run and where they run. For more information, see AWS Data Pipeline Pricing.

If your AWS account is less than 12 months old, you are eligible to use the free tier. The free tier includes three low-frequency preconditions and five low-frequency activities per month at no charge. For more information, see AWS Free Tier.

AWS Data Pipeline works with the following services to store data: Amazon S3, Amazon DynamoDB, Amazon RDS, and Amazon Redshift.

AWS Data Pipeline works with the following compute services to transform data.

  • Amazon EC2 — Provides resizable computing capacity—literally, servers in Amazon's data centers—that you use to build and host your software systems. For more information, see Amazon EC2 User Guide for Linux Instances.

  • Amazon EMR — Makes it easy, fast, and cost-effective for you to distribute and process vast amounts of data across Amazon EC2 servers, using a framework such as Apache Hadoop or Apache Spark. For more information, see Amazon EMR Developer Guide.


EC2 instances come in different configurations, which are known as instance types. Each instance type has a different CPU, input/output, and storage capacity. In addition to specifying the instance type for an activity, you can choose different purchasing options. Not all instance types are available in all AWS Regions. If an instance type is not available, your pipeline may fail to provision or may be stuck provisioning. For information about instance availability, see the Amazon EC2 Pricing Page. Open the link for your instance purchasing option and filter by Region to see if an instance type is available in the Region. For more information about these instance types, families, and virtualization types, see Amazon EC2 Instances and Amazon Linux AMI Instance Type Matrix.

The following tables describe the instance types that AWS Data Pipeline supports. You can use AWS Data Pipeline to launch Amazon EC2 instances in any Region, including Regions where AWS Data Pipeline is not supported. For information about Regions where AWS Data Pipeline is supported, see AWS Regions and Endpoints.

Default Amazon EC2 Instances by AWS Region

If you do not specify an instance type in your pipeline definition, AWS Data Pipeline launches an instance by default.

The following table lists the Amazon EC2 instances that AWS Data Pipeline uses by default in those Regions where AWS Data Pipeline is supported.

Region Name             Region           Instance Type
US East (N. Virginia)   us-east-1        m1.small
US West (Oregon)        us-west-2        m1.small
Asia Pacific (Sydney)   ap-southeast-2   m1.small
Asia Pacific (Tokyo)    ap-northeast-1   m1.small
EU (Ireland)            eu-west-1        m1.small

The following table lists the Amazon EC2 instances that AWS Data Pipeline launches by default in those Regions where AWS Data Pipeline is not supported.

Region Name                 Region           Instance Type
US East (Ohio)              us-east-2        t2.small
US West (N. California)     us-west-1        m1.small
Asia Pacific (Mumbai)       ap-south-1       t2.small
Asia Pacific (Singapore)    ap-southeast-1   m1.small
Asia Pacific (Seoul)        ap-northeast-2   t2.small
Canada (Central)            ca-central-1     t2.small
EU (Frankfurt)              eu-central-1     t2.small
EU (London)                 eu-west-2        t2.small
EU (Paris)                  eu-west-3        t2.small
South America (São Paulo)   sa-east-1        m1.small