
You are viewing documentation for Immuta version 2022.5.


Immuta Hadoop Filesystem Integration

Audience: Data Owners and Data Users

Content Summary: Immuta integrates with your Hadoop cluster to provide policy-compliant access to data sources directly through HDFS. This page explains how to access data through the HDFS integration, which enforces only file-level controls on data. For more information on installing and configuring the Immuta Hadoop plugin, see the installation tutorial. If you need to enforce row-level and column-level controls on data, use the Spark SQL integration instead.

The Immuta Hadoop plugin can also be integrated with an existing Kerberos setup so that users can access HDFS data using their existing Kerberos principals, with data access and policy enforcement managed by Immuta.

Immuta HDFS Principal

When Immuta is installed on the cluster, users can access data through HDFS only with the HDFS principal that has been set for them in Immuta. This principal can only be set by an Immuta Administrator or imported from an external Identity Manager, but Immuta users can view their principal on their profile page.

Authentication

To access data through Immuta's HDFS integration, you must be authenticated as the user or principal that is set as your Immuta HDFS principal.

  • For clusters secured with Kerberos, you must successfully run kinit with your Immuta HDFS principal before attempting to access data.
  • For insecure clusters, you must be logged in to the cluster as the system user that is assigned to your HDFS principal.
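
The two authentication cases above can be checked from a script before any data access is attempted. The following is a minimal sketch, not part of Immuta: the helper names and the example principal are hypothetical, and it assumes the standard kinit and klist command-line tools are installed on a Kerberos-secured cluster. Note that klist -s only reports whether a valid ticket exists; it does not confirm that the ticket belongs to your Immuta HDFS principal.

```python
import getpass
import subprocess


def kinit_command(principal, keytab=None):
    """Build the kinit argument list for a principal (hypothetical helper)."""
    if keytab:
        # -k -t <keytab> requests a ticket from a keytab instead of prompting
        # for a password.
        return ["kinit", "-k", "-t", keytab, principal]
    return ["kinit", principal]


def has_valid_ticket():
    """Return True if `klist -s` finds a readable, unexpired ticket cache."""
    try:
        # -s runs silently; the exit status is 0 only when the cache is valid.
        return subprocess.run(["klist", "-s"]).returncode == 0
    except FileNotFoundError:
        # klist is not installed, so no Kerberos ticket can be confirmed.
        return False


def is_expected_system_user(expected_user, current_user=None):
    """For insecure clusters: compare the login user to the expected user."""
    if current_user is None:
        current_user = getpass.getuser()
    return current_user == expected_user
```

On a secured cluster, a script might run subprocess.run(kinit_command("alice@EXAMPLE.COM")) and then confirm has_valid_ticket() before reading any files; the principal shown here is an assumption for illustration only.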

Accessing Data

Immuta's HDFS integration allows you to access data in two ways:

  • The immuta:/// namespace allows you to access files relative to the Immuta data source they belong to. For example, to access a file called december_report.csv that is part of an Immuta data source called reports, use the following path:

    immuta:///immuta/reports/december_report.csv

    Note that the path to the file is relative to the Immuta data source that it falls under, not the real path in HDFS. Also, immuta:/// is restricted to paths that a user can see; files that the user is not authorized to access will not be visible.

  • The HDFS integration also allows users to access data using native HDFS paths. Authorized data source subscribers can access the file december_report.csv through its native path in HDFS:

    hdfs:///actual/path/in/hdfs/december_report.csv

    Note that for a user to access data using hdfs:/// paths, there must be an hdfs:///user/<user>/ directory where <user> corresponds to the user's Immuta HDFS principal. Also, hdfs:/// paths let users see the locations of all files, but users can read only the files that they have access to in Immuta.
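
The two path styles above can be assembled programmatically. The following is a minimal sketch with hypothetical helper names. It assumes that the immuta:///immuta/<data source>/<relative path> pattern in the example generalizes to other data sources, and that the hdfs command-line client on the machine is configured with the Immuta Hadoop plugin so that it can resolve immuta:/// paths.

```python
import posixpath


def immuta_path(data_source, relative_path):
    """Build an immuta:/// path from a data source name and a path
    relative to that data source (pattern assumed from the example)."""
    rel = relative_path.lstrip("/")
    return "immuta:///" + posixpath.join("immuta", data_source, rel)


def native_hdfs_path(absolute_path):
    """Build an hdfs:/// path from the file's real absolute path in HDFS."""
    if not absolute_path.startswith("/"):
        raise ValueError("native HDFS paths must be absolute")
    return "hdfs://" + absolute_path


def required_home_dir(user):
    """Directory that must exist before a user can use hdfs:/// paths."""
    return f"hdfs:///user/{user}/"


def cat_command(path):
    """Argument list for printing a file with the standard hdfs client."""
    return ["hdfs", "dfs", "-cat", path]
```

For the file used in the examples, immuta_path("reports", "december_report.csv") and native_hdfs_path("/actual/path/in/hdfs/december_report.csv") produce the two paths shown above, and either one can be passed to cat_command to build a read command.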

Both methods of accessing data are audited and comply with data source policies. If users are not subscribed to the data source that a file in HDFS falls under, or are restricted by its policies, they cannot access the file using either namespace.

HDFS User Impersonation

Immuta users with the IMPERSONATE_HDFS_USER permission can create HDFS, Hive, and Impala data sources as any HDFS user (provided that they have the proper credentials). For more information, see the tutorial for HDFS data sources. For Impala and Hive data sources, see the Query-backed Data Source tutorial.