Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.

Note

In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either standard or dedicated access modes (formerly shared and single-user access modes).

Auto Loader uses directory listing mode by default. File notification mode is supported only on compute with dedicated access mode.
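File notification mode is opted into with the cloudFiles.useNotifications option. The following sketch assembles the reader options as a plain dict; the autoloader_options helper is hypothetical, but the cloudFiles.* option names are real Auto Loader options:

```python
def autoloader_options(fmt: str, schema_location: str,
                       use_notifications: bool = False) -> dict:
    """Build the option map for an Auto Loader (cloudFiles) reader."""
    opts = {
        "cloudFiles.format": fmt,
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        # File notification mode; requires compute with dedicated access mode.
        opts["cloudFiles.useNotifications"] = "true"
    return opts
```

The result can be passed in one call, for example spark.readStream.format("cloudFiles").options(**autoloader_options("json", checkpoint_path)).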

Specify locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload are managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow checkpoint files or schema inference and evolution files to be nested under the table directory.
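In concrete terms, a checkpoint path such as <table storage path>/_checkpoint is rejected; the checkpoint must live in its own location. A minimal sketch of the rule (the is_nested_under helper and the example paths are illustrative, not a Databricks API):

```python
def is_nested_under(candidate: str, table_dir: str) -> bool:
    """Return True if `candidate` sits at or below `table_dir`.

    Unity Catalog rejects checkpoint and schema locations nested under
    the table directory; this string check mirrors that rule.
    """
    # Normalize trailing slashes so ".../t" and ".../t/" compare equally.
    prefix = table_dir.rstrip("/") + "/"
    return (candidate.rstrip("/") + "/").startswith(prefix)

table_dir = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/tables/dev_table"
checkpoint = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"

assert not is_nested_under(checkpoint, table_dir)              # allowed layout
assert is_nested_under(table_dir + "/_checkpoint", table_dir)  # rejected layout
```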

Ingest data from cloud storage using Unity Catalog

The following examples assume that the executing user has the READ FILES privilege on the source external location, owner privileges on the target tables, and the grants listed in the following table.

Note

Azure Data Lake Storage is the only Azure storage type supported by Unity Catalog.

| Storage location | Grant |
| --- | --- |
| abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data | READ FILES |
| abfss://dev-bucket@<storage-account>.dfs.core.windows.net | READ FILES, WRITE FILES, CREATE TABLE |
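Grants like these are made on Unity Catalog external location objects rather than on raw storage paths. A sketch in SQL, assuming the external location names (autoloader_source_loc, dev_bucket_loc) and the data_engineers group are placeholders you have already created; note that on external locations the table-creation privilege is spelled CREATE EXTERNAL TABLE:

```sql
-- Placeholder external location names and principal.
GRANT READ FILES ON EXTERNAL LOCATION autoloader_source_loc TO `data_engineers`;

GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
  ON EXTERNAL LOCATION dev_bucket_loc TO `data_engineers`;
```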

Use Auto Loader to load to a Unity Catalog managed table

The following examples demonstrate how to use Auto Loader to ingest data to a Unity Catalog managed table.

Python

checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load("abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data")
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable("dev_catalog.dev_database.dev_table"))

SQL

CREATE OR REFRESH STREAMING TABLE dev_catalog.dev_database.dev_table
AS SELECT * FROM STREAM read_files(
  'abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data',
  format => 'json'
);

When you use read_files in a CREATE STREAMING TABLE statement in Lakeflow Spark Declarative Pipelines, checkpoint and schema locations are managed automatically.