DBFS Access
Audience: System Administrators
Content Summary: This page outlines how to access DBFS in Databricks for non-sensitive data. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or the immuta_conf.xml file (not recommended).
DBFS FUSE Mount
DBFS FUSE Mount Limitation
This feature cannot be used in environments with E2 Private Link enabled.
This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although it is disabled when process isolation is in use, this feature can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other's files. When enabled, the entirety of DBFS essentially becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.
For example,
%sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt
In Python,
%python
with open("/dbfs/my/newfile.txt", "w") as f:
f.write("I'm creating a new file in DBFS")
Note: This solution also works in R and Scala.
Enable DBFS FUSE Mount
To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.
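If the setting is placed in the immuta_conf.xml file rather than the Spark environment variables, it can be expressed with the same property layout used for scratch paths later on this page; this is a sketch of that form:
<property>
<name>immuta.spark.databricks.dbfs.mount.enabled</name>
<value>true</value>
</property>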
Mounting a Bucket
- Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount.
- Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
- Mounting must be performed from a non-Immuta cluster (see the sketch below).
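For illustration, a bucket can be mounted with the standard Databricks dbutils.fs.mount API from a notebook on a non-Immuta cluster. This is a minimal sketch; the bucket name and mount point are placeholders, and any credentials or instance-profile configuration your environment requires is omitted:
%python
# Run from a non-Immuta cluster; the bucket name and mount point are placeholders.
dbutils.fs.mount(
    source="s3a://my-bucket",
    mount_point="/mnt/my-bucket"
)
Once mounted, the bucket is reachable through the FUSE mount on any cluster in the workspace, e.g., /dbfs/mnt/my-bucket/path/to/file.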
Scala DBUtils (and %fs magic) with Scratch Paths
Scratch paths will work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,
%fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
%scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")
Configure Scala DBUtils (and %fs magic) with Scratch Paths
To support %fs magic and Scala DBUtils with scratch paths, configure the following property:
<property>
<name>immuta.spark.databricks.scratch.paths</name>
<value>s3://my-bucket/my/scratch/path</value>
</property>
Configure DBUtils in Python
To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
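With that setting in place, dbutils.fs operations can be issued from Python cells as well. The sketch below assumes the scratch path placeholder used elsewhere on this page and standard dbutils.fs calls:
%python
# The scratch path is a placeholder and must be listed in immuta.spark.databricks.scratch.paths.
dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")
print(dbutils.fs.head("s3://my-bucket/my/scratch/path/mynewfile.txt"))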
Example Workflow
This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.
%python
import os
import shutil
s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
localScratchDir = os.environ.get("IMMUTA_LOCAL_SCRATCH_DIR")
localScratchFile = "{}/myfile.txt".format(localScratchDir)
localScratchFileCopy = "{}/myfile_copy.txt".format(localScratchDir)
- Get the file from remote storage:

  dbutils.fs.cp(s3ScratchFile, "file://{}".format(localScratchFile))

- Make a copy if you want to explicitly edit localScratchFile, as it will be read-only and owned by root:

  shutil.copy(localScratchFile, localScratchFileCopy)
  with open(localScratchFileCopy, "a") as f:
      f.write("Some appended file content")

- Write the new file back to remote storage:

  dbutils.fs.cp("file://{}".format(localScratchFileCopy), s3ScratchFile)