Manual Databricks Installation
Audience: System Administrators
Content Summary: This guide details the manual installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.
Databricks Unity Catalog
If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
The immuta_conf.xml is no longer required.
The immuta_conf.xml file that was previously used to configure the native Databricks integration is no longer required to install Immuta, so it is no longer staged as a deployment artifact. However, you can use these snippets if you wish to deploy an immuta_conf.xml file to set properties.
The required Immuta base URL and Immuta system API key properties, along with any other valid properties, can still be specified as Spark environment variables or in the optional immuta_conf.xml file. As before, if the same property is specified in both locations, the Spark environment variable takes precedence.
If you have an existing immuta_conf.xml file, you can continue using it. However, it's recommended that you delete any default properties from the file that you have not explicitly overridden, or remove the file completely and rely on Spark environment variables. Either method will ensure that any property defaults changed in upcoming Immuta releases are propagated to your environment.
1 - Download and Configure Immuta Artifacts
- Navigate to the Immuta archives page. If you are prompted to log in and need basic authentication credentials, reach out to your Immuta support professional.
- Navigate to the Databricks folder for your Immuta version. Ex: https://archives.immuta.com/hadoop/databricks/2022.5.13/
- Download the .jar file (Immuta plugin) as well as the other scripts listed below, which will load the plugin at cluster startup.
allowedCallingClasses.json
immuta-benchmark-suite.dbc
immuta-spark-hive-X.X.X_YYYYMMDD-hadoop-Z.Z.Z-public.jar
immuta_cluster_init_script.sh
obscuredCommands.yaml
The immuta-benchmark-suite.dbc is a collection of notebooks packaged as a .dbc file. After you have added cluster policies to your cluster, you can import this file into Databricks to run performance tests and compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook, which will require an Immuta and non-Immuta cluster to generate test data and perform queries.
Spark Version
Use Spark 2 with Databricks Runtime prior to 7.x. Use Spark 3 with Databricks Runtime 7.x or later. Attempting to use an incompatible jar and Databricks Runtime will fail.
- Specify the following properties as Spark environment variables or in the optional immuta_conf.xml file. If the same property is specified in both locations, the Spark environment variable takes precedence. The variable names are the config names in all upper case with _ instead of . characters. For example, to set the value of immuta.base.url via an environment variable, you would set the following in the Environment Variables section of cluster configuration:
IMMUTA_BASE_URL=https://immuta.mycompany.com
- immuta.system.api.key: Obtain this value from the Immuta Configuration UI under HDFS > System API Key. You will need to be a user with the APPLICATION_ADMIN role to complete this action.
Danger
Generating a key will destroy any previously generated HDFS keys. This will cause previously integrated HDFS systems to lose access to your Immuta console. The key will only be shown once when generated.
- immuta.base.url: The full URL for the target Immuta instance. Ex: https://immuta.mycompany.com
- immuta.user.mapping.iamid: If users authenticate to Immuta using an IAM different from Immuta's built-in IAM, you need to update the configuration file to reflect the ID of that IAM. The IAM ID is shown within the Immuta App Settings page within the Identity Management section. See Databricks to Immuta User Mapping for more details.
Environment Variables with Google Cloud Platform
Do not use environment variables to set sensitive properties when using Google Cloud Platform. Set them directly in immuta_conf.xml.
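The naming rule described above (upper case, with _ in place of .) can be sketched in a few lines of Python. This is only an illustration of the convention; the helper name property_to_env_var is hypothetical and is not part of Immuta or Databricks.

```python
def property_to_env_var(property_name: str) -> str:
    """Convert a property name such as immuta.base.url to its
    environment variable form: upper case, with "_" instead of "."."""
    return property_name.upper().replace(".", "_")


if __name__ == "__main__":
    # Property names taken from this guide.
    for name in (
        "immuta.base.url",
        "immuta.system.api.key",
        "immuta.user.mapping.iamid",
    ):
        print(f"{name} -> {property_to_env_var(name)}")
```

Running the script prints one line per property, pairing each property name with the variable name to enter in the Environment Variables section.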
2 - Stage Immuta Artifacts
When configuring the Databricks cluster, a path will need to be provided to each of the artifacts downloaded/created in the previous step. To do this, those artifacts must be hosted somewhere that your Databricks instance can access. The following methods can be used for this step:
- Host files in AWS/S3 and provide access by the cluster
- Host files in Azure ADL Gen 1 or Gen 2 and provide access by the cluster
- Host files on an HTTPS server accessible by the cluster
- Host files in DBFS (Not recommended for production)
These artifacts will be downloaded to the required location within the cluster's file system by the init script downloaded in the previous step. For the init script to find these files, a URI must be provided through environment variables configured on the cluster. Each method's URI structure and setup is explained below.
AWS/S3
URI Structure: s3://[bucket]/[path]
- Create an instance profile for clusters by following Databricks documentation.
- Upload the configuration file, JSON file, and JAR file to an S3 bucket that the role from step 1 has access to.
Authenticating with Access Keys or Session Tokens (Optional)
If you wish to authenticate using access keys, add the following items to the cluster's environment variables:
IMMUTA_INIT_AWS_SECRET_ACCESS_KEY=<aws secret key>
IMMUTA_INIT_AWS_ACCESS_KEY_ID=<aws access key id>
If you've assumed a role and received a session token, that can be added here as well:
IMMUTA_INIT_AWS_SESSION_TOKEN=<aws session token>
Azure
ADL Gen 2
URI Structure: abfs(s)://[container]@[account].dfs.core.windows.net/[path]
Upload the configuration file, JSON file, and JAR file to an ADL Gen 2 blob container.
Environment Variables:
If you want to authenticate using an account key, add the following to your cluster's environment variables:
IMMUTA_INIT_AZCOPY_CRED_TYPE=SharedKey
IMMUTA_INIT_ACCOUNT_NAME=<ADLg2 account name>
IMMUTA_INIT_ACCOUNT_KEY=<ADLg2 account key>
If you want to authenticate using an Azure SAS token, add the following to your cluster's environment variables:
IMMUTA_INIT_AZURE_SAS_TOKEN=<SAS token>
ADL Gen 1
URI Structure: adl://[account].azuredatalakestore.net/[path]
Upload the configuration file, JSON file, and JAR file to ADL Gen 1.
Environment Variables:
If authenticating as an AD user,
IMMUTA_INIT_AZURE_AD_USER=<Microsoft Entra ID username>
IMMUTA_INIT_AZURE_PASSWORD=<Microsoft Entra ID password>
If authenticating using a service principal,
IMMUTA_INIT_AZURE_SERVICE_PRINCIPAL=<azure service principal>
IMMUTA_INIT_AZURE_PASSWORD=<azure service principal password>
IMMUTA_INIT_AZURE_TENANT=<tenant ID where principal was created>
HTTPS
URI Structure: http(s)://[host](:port)/[path]
Artifacts are available for download from https://archives.immuta.com. Your basic authentication credentials can be obtained from your Immuta support professional.
Environment Variables (Optional)
IMMUTA_INIT_HTTPS_USER=<basic auth username>
IMMUTA_INIT_HTTPS_PASSWORD=<basic auth password>
# Note: Credentials can also be included as part of the artifact URI. For example,
IMMUTA_INIT_JAR_URI=https://user:password@archives.immuta.com/path/to/file
DBFS
Warning
DBFS does not support access control. Any Databricks user can access DBFS via the Databricks command line utility. Files containing sensitive materials (such as Immuta API keys) should not be stored there in plain text. Use other methods described herein to properly secure such materials.
URI Structure: dbfs:/[path]
Upload the artifacts directly to DBFS using the Databricks CLI.
Since any user has access to everything in DBFS:
- The artifacts can be stored anywhere in DBFS.
- It's best to have a cluster-specific place for your artifacts in DBFS if you are testing to avoid overwriting or reusing someone else's artifacts accidentally.
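As a quick sanity check before entering artifact URIs on the cluster, the URI structures listed above can be matched by scheme. The sketch below is illustrative only: the function name staging_method and the sample URIs are hypothetical, and it checks the scheme alone, not whether the cluster can actually reach the location.

```python
from urllib.parse import urlparse

# URI schemes taken from the "URI Structure" lines in this section.
STAGING_METHODS = {
    "s3": "AWS/S3",
    "abfs": "Azure ADL Gen 2",
    "abfss": "Azure ADL Gen 2",
    "adl": "Azure ADL Gen 1",
    "http": "HTTPS",
    "https": "HTTPS",
    "dbfs": "DBFS",
}


def staging_method(uri: str) -> str:
    """Return the staging method that matches the URI's scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in STAGING_METHODS:
        raise ValueError(f"Unsupported artifact URI scheme: {scheme!r}")
    return STAGING_METHODS[scheme]


if __name__ == "__main__":
    # Hypothetical example locations.
    for uri in (
        "s3://my-bucket/immuta/allowedCallingClasses.json",
        "dbfs:/immuta/obscuredCommands.yaml",
    ):
        print(uri, "->", staging_method(uri))
```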
3 - Protect Immuta Environment Variables with Databricks Secrets
It is important that non-administrator users on an Immuta-enabled Databricks cluster do not have access to view or modify Immuta configuration or the immuta-spark-hive.jar file, as this would potentially pose a security loophole around Immuta policy enforcement. Therefore, use Databricks secrets to apply environment variables to an Immuta-enabled cluster in a secure way.
Databricks secrets can be used in the Environment Variables configuration section for a cluster by referencing the secret path rather than the actual value of the environment variable. For example, if a user wanted to make the following value secret
MY_SECRET_ENV_VAR=super_secret_stuff
they could instead create a Databricks secret and reference it as the value of that variable. For instance, if the secret scope my_secrets was created, and the user added a secret with the key my_secret_env_var containing the desired sensitive environment variable, they would reference it in the Environment Variables section:
MY_SECRET_ENV_VAR={{secrets/my_secrets/my_secret_env_var}}
Then, at runtime, {{secrets/my_secrets/my_secret_env_var}} would be replaced with the actual value of the secret if the owner of the cluster has access to that secret.
Best Practice: Replace Sensitive Variables with Secrets
Immuta recommends that ANY SENSITIVE environment variables listed below in the various artifact deployment instructions be replaced with secrets.
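When several sensitive variables need to be replaced, it can help to generate the Environment Variables lines rather than type each secret path by hand. The following is a minimal sketch that only formats the {{secrets/<scope>/<key>}} reference described above. The helper names, the scope immuta_secrets, and the key names are hypothetical, and the secrets themselves must still be created in Databricks separately.

```python
def secret_reference(scope: str, key: str) -> str:
    """Format a Databricks secret path for use as an env var value."""
    return "{{secrets/" + scope + "/" + key + "}}"


def env_line(variable: str, scope: str, key: str) -> str:
    """Build one line for the cluster's Environment Variables section."""
    return f"{variable}={secret_reference(scope, key)}"


if __name__ == "__main__":
    # Hypothetical scope and key names; the variable names come from this guide.
    print(env_line("IMMUTA_SYSTEM_API_KEY", "immuta_secrets", "system_api_key"))
    print(env_line("IMMUTA_INIT_HTTPS_PASSWORD", "immuta_secrets", "https_password"))
```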
4 - Create and Configure the Cluster
Cluster creation in an Immuta-enabled organization or Databricks workspace should be limited to administrative users to avoid allowing users to create non-Immuta enabled clusters.
- Create a cluster in Databricks by following the Databricks documentation.
- Select the Custom Access mode.
- Optionally, adjust the Autopilot Options and Worker Type settings. The default values provided here may be more than what is necessary for non-production or smaller use cases. To reduce resource usage, you can enable or disable autoscaling, limit the size and number of workers, and set the inactivity timeout to a lower value.
- In the Advanced Options section, click the Instances tab.
- IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
- Click the Spark tab. In the Spark Config field, add your configuration.
Cluster Configuration Requirements:
spark.executor.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager / -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json / -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.driver.extraJavaOptions -Djava.security.manager=com.immuta.security.ImmutaSecurityManager / -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json / -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service
spark.databricks.repl.allowedLanguages python,sql
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.isv.product Immuta
- In the Environment Variables section, add the environment variables necessary for your configuration. Remember that these variables should be protected with Databricks secrets as mentioned above.
# Specify the URI to the artifacts that were hosted in the previous steps
# The URI must adhere to the supported types for each service mentioned above
IMMUTA_INIT_JAR_URI=<Full URI to immuta-spark-hive.jar>
IMMUTA_INIT_CONF_URI=<Full URI to Immuta configuration file>
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=<full URI to allowedCallingClasses.json>
IMMUTA_INIT_OBSCURED_COMMANDS_URI=<full URI to obscuredCommands.yaml>
# (OPTIONAL)
# Specify an additional configuration file to be added to the spark.sparkContext.hadoopConfiguration.
# This file allows administrators to add sensitive configuration needed by the SparkSession that
# should not be viewable by users.
# Further explanation of this variable as well as examples are provided below.
IMMUTA_INIT_ADDITIONAL_CONF_URI=<full URI to additional configuration file>
- Click the Init Scripts tab and set the following configurations:
- Destination: Specify the service you used to host the Immuta artifacts.
- File Path: Specify the full URI to the immuta_cluster_init_script.sh.
- Add the new key/value to the configuration.
- Click the Permissions tab and configure the following setting:
- Who has access: Users or groups will need to have the permission Can Attach To to execute queries against Immuta configured data sources.
- (Re)start the cluster.
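Before restarting the cluster, it may be useful to confirm that none of the Spark properties listed under Cluster Configuration Requirements were dropped while pasting. The sketch below assumes the Spark Config field holds one "key value" pair per line, separated by the first space. The function names are hypothetical, and the check only confirms that each key is present, not that its value is correct.

```python
from typing import Dict, List

# Keys listed under "Cluster Configuration Requirements" in this guide.
REQUIRED_KEYS = (
    "spark.executor.extraJavaOptions",
    "spark.driver.extraJavaOptions",
    "spark.databricks.repl.allowedLanguages",
    "spark.databricks.pyspark.enableProcessIsolation",
    "spark.databricks.isv.product",
)


def parse_spark_config(text: str) -> Dict[str, str]:
    """Parse "key value" lines (assumed format) into a dictionary."""
    config: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        config[key] = value.strip()
    return config


def missing_keys(text: str) -> List[str]:
    """Return the required keys that do not appear in the config text."""
    config = parse_spark_config(text)
    return [key for key in REQUIRED_KEYS if key not in config]


if __name__ == "__main__":
    pasted = (
        "spark.databricks.repl.allowedLanguages python,sql\n"
        "spark.databricks.isv.product Immuta\n"
    )
    print("Missing:", missing_keys(pasted))
```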
Additional Hadoop Configuration File (Optional)
As mentioned in the "Environment Variables" section of the cluster configuration, there may be some cases where it is necessary to add sensitive configuration to SparkSession.sparkContext.hadoopConfiguration in order to read the data composing Immuta data sources.
As an example, when accessing external tables stored in Azure Data Lake Gen 2, Spark must have credentials to access the target containers/filesystems in ADLg2, but users must not have access to those credentials. In this case, an additional configuration file may be provided with a storage account key that the cluster may use to access ADLg2.
To use an additional Hadoop configuration file, you will need to set the IMMUTA_INIT_ADDITIONAL_CONF_URI environment variable referenced in the Create and configure the cluster section to be the full URI to this file.
The additional configuration file looks very similar to the Immuta Configuration file referenced above. Some example configuration files for accessing different storage layers are below.
Amazon S3
IAM Role for S3 Access
S3 can also be accessed using an IAM role attached to the cluster. See the Databricks documentation for more details.
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>[AWS access key ID]</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>[AWS secret key]</value>
</property>
</configuration>
Azure Data Lake Gen 2
<configuration>
<property>
<name>fs.azure.account.key.[storage account name].dfs.core.windows.net</name>
<value>[storage account key]</value>
</property>
</configuration>
Azure Data Lake Gen 1
ADL Prefix
Prior to Databricks Runtime version 6, the following configuration items should have a prefix of dfs.adls rather than fs.adl.
<configuration>
<property>
<name>fs.adl.oauth2.refresh.url</name>
<value>https://login.microsoftonline.com/[directory ID]/oauth2/token</value>
</property>
<property>
<name>fs.adl.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>fs.adl.oauth2.credential</name>
<value>[client secret from Azure]</value>
</property>
<property>
<name>fs.adl.oauth2.client.id</name>
<value>[client ID from Azure]</value>
</property>
</configuration>
Azure Blob Storage
<configuration>
<property>
<name>fs.azure.account.key.[storage account name].blob.core.windows.net</name>
<value>[storage account key]</value>
</property>
</configuration>
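The example files above all share the same shape: a configuration element containing property elements, each with a name and a value. If you prefer to generate such a file rather than write it by hand, a minimal sketch using Python's standard xml.etree.ElementTree module follows. The function name build_hadoop_conf, the storage account name, and the placeholder value are hypothetical; where the real key comes from and where the file is written are left to you.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_hadoop_conf(properties: Dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical storage account name and placeholder key.
    xml_text = build_hadoop_conf(
        {
            "fs.azure.account.key.mystorageaccount.dfs.core.windows.net":
                "[storage account key]",
        }
    )
    print(xml_text)
```

ElementTree escapes special characters in values, which avoids hand-editing mistakes when a key contains characters such as & or <.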
5 - Query Immuta Data
When the Immuta-enabled Databricks cluster has been successfully started, users will see a new database labeled "immuta". This database is the virtual layer provided to access data sources configured within the connected Immuta instance.
Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster and GRANT the user access to the immuta database.
The following SQL query can be run as an administrator within a notebook to give the user access to the immuta database:
%sql
GRANT SELECT,READ_METADATA ON DATABASE immuta TO `user@company.com`
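If many users need access, the same GRANT statement can be generated per user and then pasted into a notebook cell. The sketch below only builds the SQL text shown above and does not execute anything. The function name grant_statement and the sample addresses are hypothetical.

```python
def grant_statement(principal: str, database: str = "immuta") -> str:
    """Build the GRANT statement from this guide for one user."""
    if "`" in principal:
        # The principal is wrapped in backticks, so reject embedded ones.
        raise ValueError("principal must not contain backticks")
    return f"GRANT SELECT,READ_METADATA ON DATABASE {database} TO `{principal}`"


if __name__ == "__main__":
    # Hypothetical users.
    for user in ("analyst1@company.com", "analyst2@company.com"):
        print(grant_statement(user))
```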
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your queries like the first example. Instead, you can run queries like the second example, which does not reference the immuta database.
%sql
select * from immuta.my_data_source limit 5;
%sql
select * from my_data_source limit 5;
Creating a Databricks Data Source
See the Databricks Data Source Creation guide for a detailed walkthrough.
Databricks to Immuta User Mapping
By default, the IAM used to map users between Databricks and Immuta is the BIM (Immuta's internal IAM). The Immuta Spark plugin will check the Databricks username against the username within the BIM to determine access. For a basic integration, this means the user's email address in Databricks and the connected Immuta instance must match.
It is possible within Immuta to have multiple users share the same username if they exist within different IAMs. In this case, the cluster can be configured to look up users from a specified IAM. To do this, the value of immuta.user.mapping.iamid in the configuration created and hosted in the previous steps must be updated to the targeted IAM ID configured within the Immuta instance. The IAM ID can be found on the App Settings page. Each Databricks cluster can only be mapped to one IAM.