Deploying Immuta on Amazon EMR
Audience: System Administrators
Content Summary: This tutorial walks you through launching an Amazon EMR (Elastic MapReduce) cluster with Immuta's Hadoop and Spark security plugins installed.
Introduction
This tutorial contains examples using the AWS CLI. These examples are conceptual in nature and will require modification to adapt to your exact deployment needs. If you wish to quickly familiarize yourself with Immuta's EMR integration, please visit the Quickstart Installation Guide for Immuta on AWS EMR.
Supported EMR Versions
This deployment is tested and known to work on the EMR releases listed below.
- 5.17.0
- 5.18.0
- 5.19.0
- 5.20.0
- 5.21.0
- 5.22.0
- 5.23.0
- 5.24.0
- 5.25.0
- 5.26.0
- 5.27.0
- 5.28.0
- 5.29.0
- 5.30.0
- 5.31.0
- 5.32.0
Create Prerequisite AWS Resources
In addition to the EMR cluster itself, Immuta requires a handful of additional AWS resources in order to function properly.
Immuta Bootstrap Bucket
In order to bootstrap the EMR cluster with Immuta's software bundle and startup scripts, you will need to create an S3 bucket to hold these artifacts.
In this guide, the bucket is referenced by the placeholder $BOOTSTRAP_BUCKET.
Replace this placeholder with a unique bucket name of your choosing.
The bucket must contain all of the artifacts listed below. These artifacts can be found at Immuta Downloads.
s3://$BOOTSTRAP_BUCKET/immuta-bootstrap
s3://$BOOTSTRAP_BUCKET/immuta-bootstrap.tar.gz
s3://$BOOTSTRAP_BUCKET/immuta_bundle-$IMMUTA_VERSION.tar.gz
s3://$BOOTSTRAP_BUCKET/install.sh
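The listing above can be staged with the AWS CLI. The following is a minimal sketch, assuming the four artifacts have already been downloaded into the current directory. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences that are not part of Immuta or the AWS CLI, and the default bucket name and version are placeholders.

```shell
# Placeholder defaults; replace with your bucket name and Immuta version.
BOOTSTRAP_BUCKET="${BOOTSTRAP_BUCKET:-my-immuta-bootstrap}"
IMMUTA_VERSION="${IMMUTA_VERSION:-x.y.z}"

# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the bucket, then upload the four artifacts from the current directory.
run aws s3 mb "s3://${BOOTSTRAP_BUCKET}"
for artifact in immuta-bootstrap immuta-bootstrap.tar.gz \
    "immuta_bundle-${IMMUTA_VERSION}.tar.gz" install.sh; do
  run aws s3 cp "${artifact}" "s3://${BOOTSTRAP_BUCKET}/${artifact}"
done

# List the bucket to confirm the uploads.
run aws s3 ls "s3://${BOOTSTRAP_BUCKET}/"
```

By default the script only prints the commands it would run. Setting `DRY_RUN=0` executes them against your AWS account.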
Immuta Data IAM Role
Immuta's Spark integration relies on an IAM role policy that has access to the S3 buckets where your sensitive data is stored. Note that the EC2 Instance Roles for your EMR cluster should not have access to these buckets. Immuta will broker access to the data in these buckets to authorized users.
Create Immuta Data IAM Policy
Modify the JSON data below to include the correct name of your data bucket(s), and save it as immuta_data_iam_policy.json.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        }
    ]
}
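If you keep the policy as a template, the placeholder substitution can be scripted rather than done by hand. This is a rough sketch only: `render_policy` is a hypothetical helper, and `sales-data` and `hr-data` are illustrative bucket names, not values from this guide.

```shell
# Hypothetical helper: replace the bucket placeholders in a policy template
# read from stdin. $1 and $2 are the real names of your two data buckets.
render_policy() {
  sed -e 's|\$DATA_BUCKET_1|'"$1"'|g' \
      -e 's|\$DATA_BUCKET_2|'"$2"'|g'
}

# Demonstrate on a single Resource line; "sales-data" and "hr-data" are
# illustrative bucket names only.
printf '%s\n' '"arn:aws:s3:::$DATA_BUCKET_1/*"' | render_policy sales-data hr-data
```

To render a whole file, redirect a saved copy of the template into the function and write the result to immuta_data_iam_policy.json.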
If you are leveraging Immuta's Native S3 Workspace capability, you must also give the Immuta data IAM role full control of the workspace bucket or folder. In that case, use the expanded policy below, which adds a statement for $WORKSPACE_BUCKET.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Head*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::$DATA_BUCKET_1",
                "arn:aws:s3:::$DATA_BUCKET_2",
                "arn:aws:s3:::$DATA_BUCKET_1/*",
                "arn:aws:s3:::$DATA_BUCKET_2/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::$WORKSPACE_BUCKET",
                "arn:aws:s3:::$WORKSPACE_BUCKET/*"
            ]
        }
    ]
}
Now you can run the following command to create the Immuta data IAM policy.
aws iam create-policy \
    --policy-name immuta_emr_data_policy \
    --policy-document file://immuta_data_iam_policy.json
Create Immuta Data IAM Role
The IAM role that brokers access to S3 data must be able to assume the cluster node instance role, and vice versa. Since this is a cycle, you will need to create the data role with a generic trust policy first, and then update it once both roles have been created.
Create a file called immuta_data_role_trust_policy_generic.json as seen below.
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/EMR_EC2_DefaultRole"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
After creating the immuta_data_role_trust_policy_generic.json file from above, run the following command to create the Immuta data IAM role. This uses the generic IAM role trust policy from the previous step. You will update it once both the data and instance IAM roles have been created.
aws iam create-role \
  --role-name immuta_emr_data_role \
  --assume-role-policy-document "file://immuta_data_role_trust_policy_generic.json"
Next you will need to attach the IAM policy that allows access to your protected data in S3.
aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_data_policy \
    --role-name immuta_emr_data_role
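The commands in this guide reference $AWS_ACCOUNT_ID. One way to fill it in, sketched below, is to ask AWS STS for the account behind the active credentials. The `policy_arn` helper is hypothetical and only illustrates how the policy ARN used above is assembled.

```shell
# Hypothetical helper: build the ARN of a customer-managed IAM policy.
policy_arn() {
  printf 'arn:aws:iam::%s:policy/%s\n' "$1" "$2"
}

# Resolve the account ID from the active credentials if it is not already set.
if [ -z "${AWS_ACCOUNT_ID:-}" ] && command -v aws >/dev/null 2>&1; then
  AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text 2>/dev/null)" \
    || echo "could not resolve the AWS account ID" >&2
fi

# Print the ARN of the data policy once the account ID is known.
if [ -n "${AWS_ACCOUNT_ID:-}" ]; then
  policy_arn "${AWS_ACCOUNT_ID}" immuta_emr_data_policy
fi
```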
Create Immuta Instance IAM Policy
Modify the JSON data below to include the correct name of your bootstrap bucket, and save as
immuta_emr_instance_policy.json.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "ec2:Describe*",
                "elasticmapreduce:Describe*",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListSteps"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": "arn:aws:sqs:*:$AWS_ACCOUNT_ID:AWS-ElasticMapReduce-*",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:PurgeQueue",
                "sqs:ReceiveMessage"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::$BOOTSTRAP_BUCKET"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "secretsmanager:*",
            "Resource": [
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-emr-secret-??????",
                "arn:aws:secretsmanager:$AWS_REGION:$AWS_ACCOUNT_ID:secret:immuta-kerberos-secret-??????"
            ]
        }
    ]
}
Note that the above policy is derived from the Minimal EMR role for EC2 (instance profile) policy described in Amazon's Best Practices for Securing Amazon EMR guide. You may need to tune this policy based on your organization's environment and needs.
After creating the immuta_emr_instance_policy.json file from above, run the following command to create the Immuta EMR instance policy.
aws iam create-policy \
    --policy-name immuta_emr_instance_policy \
    --policy-document file://immuta_emr_instance_policy.json
Create Immuta Instance IAM Role
The node instance IAM role must be able to assume the IAM role that brokers access to S3 data, and vice versa.
Assuming you have already created the immuta_emr_data_role, create a JSON file called
instance_role_trust_policy.json as shown below.
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
            "Service": "ec2.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
Now you can create the instance role with the policy document from above.
aws iam create-role \
  --role-name immuta_emr_instance_role \
  --assume-role-policy-document "file://instance_role_trust_policy.json"
Next you will need to attach the IAM policy that allows access to required resources for your cluster.
aws iam attach-role-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/immuta_emr_instance_policy \
    --role-name immuta_emr_instance_role
Create Immuta EMR Instance Profile
After creating the role and policy for the Immuta instances, you can create the Immuta EC2 Instance Profile.
aws iam create-instance-profile \
    --instance-profile-name immuta_emr_instance_profile
After creating the Instance Profile, you can attach the newly created Role.
aws iam add-role-to-instance-profile \
    --instance-profile-name immuta_emr_instance_profile \
    --role-name immuta_emr_instance_role
Update Immuta Data IAM Role Trust Policy
Now that both the data and instance IAM roles are created, you can update the trust policy of the data IAM role to include the instance role.
Create a file called data_role_trust_policy.json as shown below.
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_instance_role"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
Now you can update the trust policy of the data IAM role.
aws iam update-assume-role-policy \
  --role-name immuta_emr_data_role \
  --policy-document "file://data_role_trust_policy.json"
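Because the two roles must trust each other, it can be worth confirming the result before moving on. The sketch below prints the principals in each role's trust policy. The `run` helper and `DRY_RUN` variable are hypothetical and simply echo the commands unless `DRY_RUN=0` is set.

```shell
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show which principals each role trusts.
for role in immuta_emr_data_role immuta_emr_instance_role; do
  run aws iam get-role \
    --role-name "${role}" \
    --query 'Role.AssumeRolePolicyDocument.Statement[].Principal' \
    --output json
done
```

If the steps above were followed, the data role should list the immuta_emr_instance_role ARN, and the instance role should list the immuta_emr_data_role ARN together with the ec2.amazonaws.com service.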
Immuta HDFS System Token in AWS Secrets Manager
Navigate to the App Settings page and generate an Immuta HDFS System Token. Copy the value generated by Immuta, and create a new secret in AWS Secrets Manager as shown below.
aws secretsmanager create-secret \
    --name immuta-emr-secret \
    --secret-string "$HDFS_SYSTEM_TOKEN"
Create EMR Cluster
EC2 Attributes Configuration File
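Complete the JSON template below and save it as ec2_attributes.json. You may remove keys where you would like to use default values. The "string" values and the ... entries are placeholders that must be replaced or removed, since the template is not valid JSON as written.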
When choosing security groups for your master and worker nodes, be sure that they provide bi-directional access between the nodes and your Immuta instance.
{
  "ServiceAccessSecurityGroup": "string",
  "AvailabilityZone": "string",
  "AdditionalSlaveSecurityGroups": ["string", ...],
  "EmrManagedMasterSecurityGroup": "string",
  "KeyName": "<the name of your SSH public key stored in AWS>",
  "InstanceProfile": "immuta_emr_instance_profile",
  "SubnetId": "string",
  "AdditionalMasterSecurityGroups": ["string", ...],
  "AvailabilityZones": ["string", ...],
  "EmrManagedSlaveSecurityGroup": "string"
}
Cluster Configuration File
Immuta requires a custom configuration file for Hadoop services to be passed in to the cluster.
The required configurations are displayed below. Modify the JSON data to match your environment and
save as cluster_configuration.json.
[
   {
      "Classification":"hdfs-site",
      "Properties":{
         "dfs.namenode.inode.attributes.provider.class":"com.immuta.hadoop.ImmutaInodeAttributeProvider",
         "dfs.namenode.acls.enabled":"true",
         "immuta.extra.name.node.plugin.config":"file:///opt/immuta/hadoop/name-node-conf.xml"
      },
      "Configurations":[]
   },
   {
      "Classification":"emrfs-site",
      "Properties":{
         "fs.s3.customAWSCredentialsProvider":"com.immuta.emr.ImmutaEMRAWSCredentialsProvider"
      },
      "Configurations":[]
   },
   {
      "Classification":"core-site",
      "Properties":{
         "immuta.permission.users.to.ignore":"hdfs,yarn,hive,impala,llama,mapred,spark,oozie,hue,hbase,hadoop",
         "fs.immuta.impl":"com.immuta.hadoop.ImmutaFileSystem",
         "hadoop.proxyuser.immuta_emr.groups":"*",
         "hadoop.proxyuser.immuta_emr.users":"*",
         "hadoop.proxyuser.immuta_emr.hosts":"*",
         "hadoop.proxyuser.immuta.groups":"*",
         "hadoop.proxyuser.immuta.users":"*",
         "hadoop.proxyuser.immuta.hosts":"*",
         "immuta.cluster.name":"my_cluster",
         "immuta.spark.partition.generator.user":"immuta_emr",
         "immuta.credentials.dir":"/user",
         "immuta.base.url":"https://immuta.mycompany.com"
      },
      "Configurations":[]
   },
   {
      "Classification":"hadoop-env",
      "Properties":{},
      "Configurations":[
         {
            "Classification":"export",
            "Properties":{
               "HADOOP_CLASSPATH":"$HADOOP_CLASSPATH:/opt/immuta/hadoop/lib/immuta-inode-attribute-provider.jar:/opt/immuta/hadoop/lib/immuta-hadoop-filesystem.jar:/opt/immuta/hadoop/lib/immuta-emrfs-credential-provider.jar",
               "JAVA_HOME":"/usr/lib/jvm/java-1.8.0"
            },
            "Configurations":[]
         }
      ]
   },
   {
      "Classification":"hive-site",
      "Properties":{
         "hive.server2.enable.doAs":"true",
         "hive.security.metastore.authorization.auth.reads": "false",
         "hive.compute.query.using.stats": "true"
      },
      "Configurations":[]
   },
   {
      "Classification": "capacity-scheduler",
      "Properties": {
         "yarn.scheduler.capacity.root.default.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.immuta_spark.default-node-label-expression": "CORE",
         "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity": "30",
         "yarn.scheduler.capacity.root.queues": "default,immuta_spark",
         "yarn.scheduler.capacity.root.immuta_spark.accessible-node-labels.CORE.capacity": "70",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-applications": "100",
         "yarn.scheduler.capacity.root.immuta_spark.maximum-am-resource-percent": "0.1",
         "yarn.scheduler.capacity.root.immuta_spark.capacity": "0",
         "yarn.scheduler.capacity.root.default.capacity": "100"
      },
      "Configurations": []
   }
]
Immuta Bootstrap Configuration File
Next, create a file called bootstrap_actions.json to configure the Immuta bootstrap action. If you have any additional bootstrap actions to run outside of Immuta, add them here as well. The S3 paths in Args must match the names of the objects you uploaded to the bootstrap bucket.
[
  {
    "Path": "s3://$BOOTSTRAP_BUCKET/immuta-bootstrap",
    "Args": [
        "--immuta-instance-url=https://immuta.mycompany.com",
        "--immuta-secret-name=immuta-emr-secret",
        "--immuta-user-name=immuta_emr",
        "--immuta-bootstrap-archive=s3://$BOOTSTRAP_BUCKET/immuta_bootstrap.tar.gz",
        "--immuta-software-bundle=s3://$BOOTSTRAP_BUCKET/immuta_bundle.tar.gz",
        "--immuta-install-script=s3://$BOOTSTRAP_BUCKET/install.sh",
        "--kerberos",
        "--kerberos-secret-name=immuta-kerberos-secret"
    ],
    "Name": "Immuta Bootstrap"
  }
]
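The JSON files created so far are passed to the AWS CLI with file:// references, so a syntax error in any of them will cause the cluster launch to fail. The following is a small pre-flight sketch, assuming either jq or python3 is installed. `is_valid_json` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file parses as JSON, using jq or
# python3, whichever is installed.
is_valid_json() {
  if command -v jq >/dev/null 2>&1; then
    jq empty "$1" >/dev/null 2>&1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$1" >/dev/null 2>&1
  else
    echo "jq or python3 is required" >&2
    return 2
  fi
}

# Check each file from this section, skipping any that do not exist yet.
for f in ec2_attributes.json cluster_configuration.json bootstrap_actions.json; do
  if [ ! -f "$f" ]; then
    echo "skip: $f (not found)"
  elif is_valid_json "$f"; then
    echo "ok: $f"
  else
    echo "invalid JSON: $f"
  fi
done
```

This only checks syntax. It does not verify that the keys or values are ones EMR accepts.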
(Optional) Kerberos Attributes Configuration File
If you wish to deploy a kerberized cluster, create a kerberos_attributes.json file with your desired Kerberos configurations. Note that although Kerberos is not strictly required, a cluster without Kerberos should not be considered secure for production.
{
  "Realm": "EC2.INTERNAL",
  "KdcAdminPassword": "secret"
}
Security Configuration
You will need to create a security configuration before creating the EMR cluster so that Immuta's EMRFS integration can leverage the IAM role you created to access data in S3.
First, create a security_configuration.json file with your desired security settings. A basic example with a cluster-dedicated KDC for Kerberos is shown below. Note that you are allowing the following system users to use the data IAM role: hadoop, hive, and immuta_emr. Data Owners must also have access to this data to use the Immuta Query Engine. This example grants access to any user in the fictional data_owners group.
See the official AWS Documentation for more details on configuring IAM roles for EMRFS.
{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24
      }
    }
  },
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "User",
          "Identifiers": ["hadoop","hive","immuta_emr"]
        },
        {
          "Role": "arn:aws:iam::$AWS_ACCOUNT_ID:role/immuta_emr_data_role",
          "IdentifierType": "Group",
          "Identifiers": ["data_owners"]
        }
      ]
    }
  }
}
Next, create your security configuration with the following command.
aws emr create-security-configuration \
    --name immuta_emr_security_configuration \
    --security-configuration file://security_configuration.json
Create EMR Cluster Command
You can now spin up an EMR cluster with Immuta's security plugins.
aws emr create-cluster \
    --name immuta-emr \
    --release-label emr-5.28.0 \
    --configurations file://cluster_configuration.json \
    --ec2-attributes file://ec2_attributes.json \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions file://bootstrap_actions.json \
    --kerberos-attributes file://kerberos_attributes.json \
    --security-configuration immuta_emr_security_configuration \
    --service-role EMR_DefaultRole
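Once the cluster has been requested, you can poll its state to see when the bootstrap actions have finished. The sketch below assumes that `CLUSTER_ID` holds the cluster ID returned by `aws emr create-cluster`. Both that variable and the `bootstrap_finished` helper are illustrative names. It treats the RUNNING and WAITING states as meaning bootstrapping is complete, since EMR runs bootstrap actions during the earlier BOOTSTRAPPING state.

```shell
# Hypothetical helper: succeed when a cluster state indicates that the
# BOOTSTRAPPING phase is over and the cluster is usable.
bootstrap_finished() {
  case "$1" in
    RUNNING|WAITING) return 0 ;;
    *) return 1 ;;
  esac
}

# CLUSTER_ID is an assumption: the ID returned by `aws emr create-cluster`.
if [ -n "${CLUSTER_ID:-}" ]; then
  state="$(aws emr describe-cluster --cluster-id "${CLUSTER_ID}" \
    --query 'Cluster.Status.State' --output text)"
  if bootstrap_finished "${state}"; then
    echo "Cluster ${CLUSTER_ID} is ${state}; bootstrap actions have completed."
  else
    echo "Cluster ${CLUSTER_ID} is ${state}; not ready yet."
  fi
fi
```

As an alternative to polling by hand, `aws emr wait cluster-running --cluster-id "$CLUSTER_ID"` blocks until the cluster is up or the launch fails.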
Remove Secrets
To protect the Immuta user's AWS credentials as well as the kadmin password (if using Kerberos), it is recommended that you overwrite the secret values that were created during the cluster deployment process. If you leave the secret values in AWS Secrets Manager, cluster users may be able to assume the instance role of the EMR nodes and read these values.
It is safe to remove these values after the cluster has finished bootstrapping.
The example below overwrites the relevant secrets with null values.
aws secretsmanager put-secret-value \
    --secret-id immuta-emr-secret \
    --secret-string null
aws secretsmanager put-secret-value \
    --secret-id immuta-kerberos-secret \
    --secret-string null
Note that if you are using an external KDC without a cross-realm trust (no KDC on the cluster), you should put the kadmin password back into the immuta-kerberos-secret. This is required to clean up the Immuta service principals that will have been created on the external KDC.