Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.
Note
In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either standard or dedicated access modes (formerly shared and single-user access modes).
Directory listing mode is supported by default. File notification mode is only supported on compute with dedicated access mode.
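As a sketch of how file notification mode is opted into, the `cloudFiles.useNotifications` option can be set on the read (this assumes compute with dedicated access mode; `checkpoint_path` and `source_path` are placeholder names, not defined by this page):

```python
# Sketch: opt into file notification mode instead of the default
# directory listing mode. Requires dedicated access mode.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")   # file notification mode
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(source_path))
```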
Specify locations for Auto Loader resources for Unity Catalog
The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
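One common layout that satisfies this guidance is to keep a `_checkpoint` directory per target table under a Unity Catalog external location, separate from any table directory. The helper below is a hypothetical illustration of that convention (the bucket path is the placeholder used elsewhere on this page):

```python
# Base path of a Unity Catalog external location (placeholder values).
BASE = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net"

def checkpoint_location(table_name: str) -> str:
    # One checkpoint directory per target table; the checkpoint also
    # holds schema inference and evolution files, so it must not be
    # nested under the table directory itself.
    return f"{BASE}/_checkpoint/{table_name}"

print(checkpoint_location("dev_table"))
```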
Ingest data from cloud storage using Unity Catalog
The following examples assume that the executing user owns the target tables and has the following grants on the configured storage locations:
Note
Azure Data Lake Storage is the only Azure storage type supported by Unity Catalog.
| Storage location | Grant |
|---|---|
| `abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data` | `READ FILES` |
| `abfss://dev-bucket@<storage-account>.dfs.core.windows.net` | `READ FILES`, `WRITE FILES`, `CREATE TABLE` |
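For example, `READ FILES` on the source external location might be granted as follows (a sketch: the external location name `autoloader_source` and the grantee group `data_engineers` are assumed, not defined by this page):

```sql
GRANT READ FILES ON EXTERNAL LOCATION autoloader_source TO `data_engineers`;
```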
Use Auto Loader to load to a Unity Catalog managed table
The following examples demonstrate how to use Auto Loader to ingest data to a Unity Catalog managed table.
Python

```python
checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load("abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data")
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable("dev_catalog.dev_database.dev_table"))
```
SQL

```sql
CREATE OR REFRESH STREAMING TABLE dev_catalog.dev_database.dev_table
AS SELECT * FROM STREAM read_files(
  'abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data',
  format => 'json'
);
```
When you use read_files in a CREATE STREAMING TABLE statement inside a Lakeflow Spark Declarative Pipelines pipeline, checkpoint and schema locations are managed automatically.