Spark Integration

Audience: Data Owners and Data Users

Content Summary: Users can access subscribed data sources within their Spark jobs by using SparkSQL with the ImmutaSession class (Spark 2.4). Immuta enforces SparkSQL controls on data storage technologies that support batch processing workloads. With this integration, all tables are virtual and remain empty until a query is materialized.

When a query is materialized, the access path depends on the data source type. For metastore-backed data sources (such as Hive and Impala), standard Spark libraries retrieve the data from the underlying files stored in HDFS. For all other data source types, data is accessed through the Query Engine, which proxies the query to the native database technology and automatically enforces policies for each data source.
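The two access paths described above can be pictured as a simple dispatch on data source type. The following is a minimal illustrative sketch in plain Python, not Immuta code: the `DataSource` class, the `access_path` and `plan` functions, the path labels, and the `postgresql` example are all hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration only; these are not Immuta identifiers.
METASTORE_BACKED = {"hive", "impala"}


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str  # e.g. "hive", "impala", "postgresql" (assumed example values)


def access_path(source: DataSource) -> str:
    """Return which read path a materialized query would take for this source."""
    if source.kind.lower() in METASTORE_BACKED:
        # Metastore-backed sources: Spark libraries read the underlying HDFS files.
        return "spark-hdfs"
    # All other sources: the query is proxied to the native database
    # through the Query Engine, which enforces policies.
    return "query-engine"


def plan(sources):
    """Map each data source name to its read path."""
    return {s.name: access_path(s) for s in sources}


if __name__ == "__main__":
    demo = [DataSource("claims", "hive"), DataSource("customers", "postgresql")]
    for name, path in plan(demo).items():
        print(f"{name}: {path}")
```

The sketch only models the routing decision; it does not read any data or apply any policy.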

Data source security is enforced both server-side and client-side. Server-side security is provided by an external partitioning service, and client-side security is provided by a Java SecurityManager that moderates access to sensitive information.

Spark Integration Specific to CDH and EMR

The Spark integration is supported only with the CDH and EMR integrations.


Section Contents