Spark Integration
Audience: Data Owners and Data Users
Content Summary: Users can access subscribed data sources within their Spark jobs by using SparkSQL with the ImmutaSession class (Spark 2.4). Immuta enforces SparkSQL controls on data storage technologies that support batch processing workloads. Through this process, all tables are virtual and empty until a query is materialized. When a query is materialized, standard Spark libraries access data from metastore-backed data sources (like Hive and Impala) to retrieve the data from the underlying files stored in HDFS. Other data source types access data using the Query Engine, which proxies the query to the native database technology and automatically enforces policies for each data source.
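The materialization behavior described above can be pictured with a small conceptual model. The sketch below is plain Python, not Immuta or Spark code: the VirtualTable class, the load_rows and policy callables, and the path labels are all hypothetical names invented for illustration. It only shows the idea that defining a table reads nothing, and that the access path is chosen by data source type once a query is materialized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class VirtualTable:
    """Toy stand-in for a virtual table: it holds no rows until materialized."""

    name: str
    metastore_backed: bool  # True for sources like Hive or Impala
    load_rows: Callable[[], List[Row]]  # hypothetical reader for the underlying data
    policy: Callable[[Row], bool]  # hypothetical row-level policy

    def materialize(self) -> Tuple[str, List[Row]]:
        """Read the underlying data and return (access path, permitted rows)."""
        # Pick the access path by data source type, as the summary describes.
        path = "spark-libraries" if self.metastore_backed else "query-engine"
        # Simplification: the policy is applied as a plain filter on either path.
        rows = [row for row in self.load_rows() if self.policy(row)]
        return path, rows


reads: List[str] = []  # records every read of the underlying data


def read_sales() -> List[Row]:
    reads.append("sales")
    return [{"id": 1, "region": "US"}, {"id": 2, "region": "EU"}]


sales = VirtualTable(
    name="sales",
    metastore_backed=True,
    load_rows=read_sales,
    policy=lambda row: row["region"] == "US",
)

reads_before_query = list(reads)  # still empty: defining the table reads nothing
path, rows = sales.materialize()
print(reads_before_query, path, rows)
```

In this model the underlying data is read only when materialize() is called, and only rows permitted by the policy are returned. How Immuta actually enforces policies on each path is not shown here.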
Security of data sources is enforced both server-side and client-side. Server-side security is provided by an external partitioning service and client-side security is provided by a Java SecurityManager to moderate access to sensitive information.
Spark Integration Specific to CDH and EMR
The Spark integration is supported only with the CDH and EMR integrations.