Immuta Hadoop Filesystem Integration
Audience: Data Owners and Data Users
Content Summary: Immuta integrates with your Hadoop cluster to provide policy-compliant access to data sources directly through HDFS. This page explains how to access data through the HDFS integration, which enforces only file-level controls on data. For more information on installing and configuring the Immuta Hadoop plugin, see the installation tutorial. If you need to enforce row-level and column-level controls on data, use the Spark SQL integration instead.
The Immuta Hadoop plugin can also be integrated with an existing Kerberos setup to allow users to access HDFS data using their existing Kerberos principals, with data access and policy enforcement managed by Immuta.
Immuta HDFS Principal
When Immuta is installed on the cluster, users can only access data through HDFS using the HDFS principal that has been set for them in Immuta. This principal can only be set by an Immuta Administrator or imported from an external Identity Manager, but Immuta users can view their principal via the profile page.
Authentication
In order to access data through Immuta's HDFS Integration, you must be authenticated as the user or principal that is assigned to your Immuta HDFS principal.
- For clusters secured with Kerberos, you must successfully kinit with your Immuta HDFS principal before attempting to access data.
- For insecure clusters, you must be logged in to the cluster as the system user that is assigned to your HDFS principal.
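As a minimal sketch of the Kerberos step, the Python helpers below wrap the kinit command. The function names and the principal jdoe@EXAMPLE.COM are hypothetical and not part of Immuta; substitute the principal shown on your Immuta profile page.

```python
import subprocess
from typing import List


def kinit_command(principal: str) -> List[str]:
    """Build the kinit invocation for a Kerberos principal."""
    if not principal.strip():
        raise ValueError("principal must not be empty")
    return ["kinit", principal]


def authenticate(principal: str) -> None:
    """Run kinit, which prompts for the password on the terminal.

    Raises subprocess.CalledProcessError if kinit exits with an error.
    """
    subprocess.run(kinit_command(principal), check=True)
```

Calling authenticate("jdoe@EXAMPLE.COM") is intended to behave like running kinit jdoe@EXAMPLE.COM in a shell. On an insecure cluster this step does not apply; you log in as the assigned system user instead.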
Accessing Data
Immuta's HDFS integration allows you to access data in two different ways:
- The immuta:/// namespace allows you to access files in relation to the Immuta data source that they are part of. For example, if you want to access a file called december_report.csv that is part of an Immuta data source called reports, you can access it with the following path:
  immuta:///immuta/reports/december_report.csv
  Note that the path to the file is relative to the Immuta data source that it falls under, not the real path in HDFS. Also, immuta:/// is restricted to only the paths that a user can see: files that the user is not authorized for will not be visible.
- The HDFS integration also allows users to access data using native HDFS paths. Authorized data source subscribers can access the file december_report.csv through its native path in HDFS:
  hdfs:///actual/path/in/hdfs/december_report.csv
  Note that in order for a user to access data using hdfs:/// paths, there must be an hdfs:///user/<user>/ directory where <user> corresponds to the user's Immuta HDFS principal. Also, hdfs:/// paths will allow users to see the locations of all files, but they will only be able to read files that they have access to in Immuta.
Both methods of accessing data will be audited and compliant with data source policies. If users are not subscribed to or are policy-restricted by the data source that a file in HDFS falls under, they will not be able to access the file using either namespace.
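The two access methods can be sketched as small Python helpers that build paths and Hadoop shell commands. This is an illustration under stated assumptions, not Immuta's API: the function names are hypothetical, the immuta:///immuta/<data source>/<file> layout is inferred from the single example above, and the commands assume the Immuta plugin is installed so that the Hadoop client recognizes the immuta:/// scheme.

```python
import posixpath
from typing import List


def immuta_path(data_source: str, relative_path: str) -> str:
    """Path to a file relative to the Immuta data source it belongs to.

    Assumes the immuta:///immuta/<data source>/<file> layout (inferred).
    """
    return "immuta:///" + posixpath.join(
        "immuta", data_source, relative_path.lstrip("/")
    )


def cat_command(path: str) -> List[str]:
    """Hadoop shell command that prints a file to standard output."""
    return ["hdfs", "dfs", "-cat", path]


def home_dir_check_command(user: str) -> List[str]:
    """Command that exits with status 0 if hdfs:///user/<user>/ is a directory."""
    return ["hdfs", "dfs", "-test", "-d", "hdfs:///user/" + user + "/"]
```

For example, cat_command(immuta_path("reports", "december_report.csv")) builds a read through the Immuta namespace, while cat_command("hdfs:///actual/path/in/hdfs/december_report.csv") builds the same read through the native path. Each list can be passed to subprocess.run on a cluster node after authenticating.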
HDFS User Impersonation
Immuta users with the IMPERSONATE_HDFS_USER permission can create HDFS, Hive, and Impala data sources as any HDFS user (provided that they have the proper credentials). For more information, see the tutorial for HDFS data sources. For Impala and Hive data sources, see the Query-backed Data Source tutorial.