Run spark-submit Jobs on Databricks
Audience: System Administrators
Content Summary: This guide illustrates how to run R and Scala spark-submit jobs on Databricks, including prerequisites and caveats.
Language Support
R and Scala are supported, but require advanced configuration; work with your Immuta support professional to use these languages. Python spark-submit jobs are not supported by the Databricks Spark integration.
Using R in a Notebook
Because of how some user properties are populated in Databricks, users should load the SparkR library in a separate cell before attempting to use any SparkR functions:
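For example, run the following in its own cell before calling any other SparkR functions:

```r
# Load SparkR in its own cell before using any SparkR functions
library(SparkR)
```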
R spark-submit
Prerequisites
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.

- Initialize the Spark session by entering these settings into the R submit script: immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false". This will enable the R script to access Immuta data sources, scratch paths, and workspace tables. A minimal sketch of such a script follows this list.
-
- Once the script is written, upload the script to a location in dbfs/S3/ABFS to give the Databricks cluster access to it.
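Below is a minimal sketch of such an R submit script. It assumes the settings above are passed through sparkConfig when the session is created; the data source name is a placeholder, so adapt the query and paths to your environment:

```r
# Minimal sketch of an R spark-submit script (data source name is a placeholder)
library(SparkR)

# Initialize the Spark session with the settings required by Immuta
sparkR.session(sparkConfig = list(
  "immuta.spark.acl.assume.not.privileged" = "true",
  "spark.hadoop.immuta.databricks.config.update.service.enabled" = "false"
))

# Query an Immuta data source
df <- sql("SELECT * FROM immuta.<YOUR DATASOURCE> LIMIT 10")
showDF(df)

# Stop the session at the end of the job
sparkR.session.stop()
```

Once written, the script can be uploaded with the Databricks CLI (for example, databricks fs cp) or any other method that places it at a path the cluster can read.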
Create the R spark-submit Job

To create the R spark-submit job:
- Go to the Databricks jobs page.
- Create a new job, and select Configure spark-submit.
- Set up the parameters:
[ "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service", "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service", "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r", "dbfs:/path/to/script.R", "arg1", "arg2", "..." ]
  Note: The path dbfs:/path/to/script.R can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
- Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
- Configure the Environment Variables section as you normally would for an Immuta cluster.
Scala spark-submit
Prerequisites
Before you can run spark-submit jobs on Databricks, you must initialize the Spark session with the settings outlined below.
- Configure the Spark session with immuta.spark.acl.assume.not.privileged="true" and spark.hadoop.immuta.databricks.config.update.service.enabled="false".
  Note: Stop your Spark session (spark.stop()) at the end of your job or the cluster will not terminate.
- The spark-submit job needs to be launched using a different classloader that points at the designated user JARs directory. The following Scala template can be used to launch your submit code using a separate classloader:
```scala
package com.example.job

import java.net.URLClassLoader
import java.io.File

import org.apache.spark.sql.SparkSession

object ImmutaSparkSubmitExample {
  def main(args: Array[String]): Unit = {
    val jarDir = new File("/databricks/immuta/jars/")
    val urls = jarDir.listFiles.map(_.toURI.toURL)

    // Configure a new ClassLoader which will load jars from the additional jars directory
    val cl = new URLClassLoader(urls)
    val jobClass = cl.loadClass(classOf[ImmutaSparkSubmitExample].getName)
    val job = jobClass.newInstance
    jobClass.getMethod("runJob").invoke(job)
  }
}

class ImmutaSparkSubmitExample {
  def getSparkSession(): SparkSession = {
    SparkSession.builder()
      .appName("Example Spark Submit")
      .enableHiveSupport()
      .config("immuta.spark.acl.assume.not.privileged", "true")
      .config("spark.hadoop.immuta.databricks.config.update.service.enabled", "false")
      .getOrCreate()
  }

  def runJob(): Unit = {
    val spark = getSparkSession
    try {
      val df = spark.table("immuta.<YOUR DATASOURCE>")
      // Run Immuta Spark queries...
    } finally {
      spark.stop()
    }
  }
}
```
Create the Scala spark-submit Job

To create the Scala spark-submit job:
- Build and upload your JAR to dbfs/S3/ABFS where the cluster has access to it.
- Select Configure spark-submit, and configure the parameters:
[ "--conf","spark.driver.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service", "--conf","spark.executor.extraJavaOptions=-Djava.security.manager=com.immuta.security.ImmutaSecurityManager -Dimmuta.security.manager.classes.config=file:///databricks/immuta/allowedCallingClasses.json -Dimmuta.spark.encryption.fpe.class=com.immuta.spark.encryption.ff1.ImmutaFF1Service", "--conf","spark.databricks.repl.allowedLanguages=python,sql,scala,r", "--class","org.youorg.package.MainClass", "dbfs:/path/to/code.jar", "arg1", "arg2", "..." ]
  Note: Pass the fully-qualified class name of the class whose main function will be used as the entry point for your code in the --class parameter.
  Note: The path dbfs:/path/to/code.jar can be in S3 or ABFS (on Azure Databricks), assuming the cluster is configured with access to that path.
- Edit the cluster configuration, and change the Databricks Runtime to a supported version (5.5, 6.4, 7.3, or 7.4).
- Include IMMUTA_INIT_ADDITIONAL_JARS_URI=dbfs:/path/to/code.jar in the "Environment Variables" section (where dbfs:/path/to/code.jar is the path to your JAR) so that the JAR is uploaded to all the cluster nodes.
Caveats
- The user mapping works differently from notebooks because spark-submit clusters are not configured with access to the Databricks SCIM API. The cluster tags are read to get the cluster creator and match that user to an Immuta user.
- Privileged users (Databricks Admins and Whitelisted Users) must be tied to an Immuta user and given access through Immuta to access data through spark-submit jobs, because the setting immuta.spark.acl.assume.not.privileged="true" is used.
- Alternatively, you can use the immuta.api.key setting with an Immuta API key generated on the Immuta Profile Page (see the sketch after this list).
- Currently, generating an API key invalidates the previous key. This can cause issues if a user is using multiple clusters in parallel, since each cluster will generate a new API key for that Immuta user. To avoid these issues, manually generate the API key in Immuta and set immuta.api.key on all the clusters, or use a specified job user for the submit job.
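For example, a manually generated key could be added to each cluster's Spark configuration as a single line; the value below is only a placeholder:

```
immuta.api.key <generated-immuta-api-key>
```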