Simplified Databricks Configuration
Audience: System Administrators
Content Summary: This guide details the simplified installation method for enabling native access to Databricks with Immuta policies enforced.
Prerequisites: Ensure your Databricks workspace, instance, and permissions meet the guidelines outlined in the Installation Introduction.
Databricks Unity Catalog
If Unity Catalog is enabled in a Databricks workspace, you must use an Immuta cluster policy when you set up the integration to create an Immuta-enabled cluster.
1 - Add the Integration on the App Settings Page
- Log in to Immuta and click the App Settings icon in the left sidebar.
- Scroll to the System API Key subsection under HDFS and click Generate Key.
- Click Save and then Confirm.
- Scroll to the Native Integrations section, and click + Add a Native Integration.
- Select Databricks Integration from the dropdown menu.
- Complete the Hostname field.
- Enter a Unique ID for the integration. By default, your Immuta instance URL populates this field. This ID is used to tie the set of cluster policies to your instance of Immuta and allows multiple instances of Immuta to access the same Databricks workspace without cluster policy conflicts.
- Select your configured Immuta IAM from the dropdown menu.
- Choose one of the following options for your data access model:
- Protected until made available by policy: All tables are hidden until a user is granted access through an Immuta policy. This is how most databases work; it assumes least-privilege access and means you must register all tables with Immuta.
- Available until protected by policy: All tables are open until explicitly registered and protected by Immuta. This model makes sense if most of your tables are non-sensitive and you want to pick and choose which to protect.
- Select the Storage Access Type from the dropdown menu.
- Opt to add any Additional Hadoop Configuration Files.
- Click Add Native Integration.
2 - Configure Cluster Policies
Several cluster policies are available on the App Settings page when configuring this integration. Read the documentation for each of these cluster policies before continuing with the tutorial.
- Click Configure Cluster Policies.
- Select one or more cluster policies in the matrix by clicking the Select button(s).
- Opt to make changes to these cluster policies by clicking Additional Policy Changes and editing the text field.
- Use one of the two installation types described below to apply the policies to your cluster:
Automatically Push Cluster Policies
This option allows you to automatically push the cluster policies to the configured Databricks workspace. This will overwrite any cluster policy templates previously applied to this workspace. (A sketch of the equivalent API call follows the steps below.)
- Select the Automatically Push Cluster Policies radio button.
- Enter your Admin Token. This token must be for a user who can create cluster policies in Databricks.
- Click Apply Policies.
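For context, pushing a cluster policy boils down to a call to the Databricks Cluster Policies API using the admin token. The sketch below is illustrative only; Immuta performs this call for you, and the workspace URL, token, and trivial policy definition are hypothetical placeholders:

```python
# Minimal sketch of pushing one cluster policy via the Databricks REST API.
# Immuta does this automatically; shown here only to illustrate the mechanism.
import json
import requests

WORKSPACE = "https://my-workspace.cloud.databricks.com"  # hypothetical hostname
ADMIN_TOKEN = "dapiXXXXXXXXXXXX"  # must belong to a user who can create cluster policies

# Trivial placeholder definition; the real Immuta policies pin the Spark
# configuration, init script, and environment variables needed for enforcement.
definition = {"spark_version": {"type": "fixed", "value": "11.3.x-scala2.12"}}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "name": "Immuta Cluster Policy (example)",
        "definition": json.dumps(definition),  # the API expects a JSON string
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. {"policy_id": "..."}
```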
Manually Push Cluster Policies
This option allows you to manually push the cluster policies to the configured Databricks workspace. You will download several files and add them to the workspace yourself.
- Select the Manually Push Cluster Policies radio button.
- Click Download Init Script.
- Follow the steps in the Instructions to upload the init script to DBFS section. (A sketch of a scripted upload follows this list.)
- Click Download Policies, and then manually add these Cluster Policies in Databricks.
- Opt to click Download the Benchmarking Suite to compare a regular Databricks cluster to one protected by Immuta. Detailed instructions are available in the first notebook; generating test data and performing the queries requires both an Immuta and a non-Immuta cluster.
- Click Close, and then click Save and Confirm.
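If you would rather script the init script upload than click through the Databricks UI, a minimal sketch against the DBFS REST API is below. The workspace URL, token, file name, and DBFS path are hypothetical placeholders; the Databricks CLI offers an equivalent file-copy command.

```python
# Minimal sketch of uploading the downloaded init script to DBFS.
import base64
import requests

WORKSPACE = "https://my-workspace.cloud.databricks.com"  # hypothetical hostname
TOKEN = "dapiXXXXXXXXXXXX"  # token with permission to write to DBFS

# Read the init script downloaded from the App Settings page (name may differ).
with open("immuta_cluster_init_script.sh", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")  # DBFS API expects base64

resp = requests.post(
    f"{WORKSPACE}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/immuta/immuta_cluster_init_script.sh",  # hypothetical DBFS path
        "contents": contents,
        "overwrite": True,
    },
)
resp.raise_for_status()
```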
3 - Add Policies to Your Cluster
- Create a cluster in Databricks by following the Databricks documentation. (A sketch of an equivalent API call follows this list.)
- In the Policy dropdown, select the Cluster Policies you pushed or manually added from Immuta.
- Select the Custom Access mode.
- Opt to adjust the Autopilot Options and Worker Type settings. The default values may be more than necessary for non-production or smaller use cases; to reduce resource usage, you can enable or disable autoscaling, limit the size and number of workers, and lower the inactivity timeout.
- Opt to configure the Instances tab in the Advanced Options section:
  - IAM Role (AWS ONLY): Select the instance role you created for this cluster. (For access key authentication, you should instead use the environment variables listed in the AWS section.)
- Click Create Cluster.
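For reference, the same cluster can be created with the Databricks Clusters API by passing the ID of the Immuta cluster policy from step 2. Everything below (workspace URL, token, policy ID, node type, sizing) is a hypothetical placeholder chosen for a small, non-production cluster:

```python
# Minimal sketch of creating an Immuta-enabled cluster programmatically.
import requests

WORKSPACE = "https://my-workspace.cloud.databricks.com"  # hypothetical hostname
TOKEN = "dapiXXXXXXXXXXXX"

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "immuta-enabled-cluster",
        "policy_id": "ABCD1234EF567890",          # hypothetical Immuta policy ID
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",              # modest AWS worker type
        "autoscale": {"min_workers": 1, "max_workers": 2},  # limit cluster size
        "autotermination_minutes": 30,            # low inactivity timeout
        # For AWS access key authentication, set the environment variables from
        # the AWS section here instead of using an instance role, e.g.:
        # "spark_env_vars": {"AWS_ACCESS_KEY_ID": "...", "AWS_SECRET_ACCESS_KEY": "..."},
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. {"cluster_id": "..."}
```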
4 - Query Immuta Data
When the Immuta-enabled Databricks cluster has been successfully started, Immuta will create an immuta database, which allows Immuta to track Immuta-managed data sources separately from remote Databricks tables so that policies and other security features can be applied. However, users can query sources with their original database or table name without referencing the immuta database. Additionally, when configuring a Databricks cluster you can hide immuta from any calls to SHOW DATABASES so that users aren't misled or confused by its presence. For more details, see the Hiding the immuta Database in Databricks page.
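If you enable hiding, a quick notebook check should confirm that the database no longer appears; for example:
%python
spark.sql("SHOW DATABASES").show()  # with hiding enabled, immuta should be absent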
- Before users can query an Immuta data source, an administrator must give the user Can Attach To permissions on the cluster.
- See the Databricks Data Source Creation guide for a detailed walkthrough of creating Databricks data sources in Immuta.
Example Queries
Below are example queries that can be run to obtain data from an Immuta-configured data source. Because Immuta
supports raw tables in Databricks, you do not have to use Immuta-qualified table names in your
queries like the first example. Instead, you can run queries like the second example, which does not reference the
immuta
database.
%sql
select * from immuta.my_data_source limit 5;
%sql
select * from my_data_source limit 5;
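The same policies are enforced from other notebook languages as well; for example, a Python cell querying the same (hypothetical) data source:
%python
display(spark.sql("select * from my_data_source limit 5"))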