Important
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.
:::note Compliance
The SharePoint connector supports use in workspaces with the Configure enhanced security and compliance settings enabled.
:::
You can ingest structured, semi-structured, and unstructured files from Microsoft SharePoint into Delta tables. The SharePoint connector supports incremental ingestion of SharePoint files using batch and streaming APIs, including Auto Loader, spark.read, and COPY INTO, all with Unity Catalog governance.
Choose your SharePoint connector
Lakeflow Connect offers two complementary SharePoint connectors. They both access data in SharePoint, but they support distinct goals.
| Consideration | Managed SharePoint connector | Standard SharePoint connector |
|---|---|---|
| Management and customization | A fully managed connector. Simple, low-maintenance connectors for enterprise applications that ingest data into Delta tables and keep them in sync with the source. See Managed connectors in Lakeflow Connect. | Build custom ingestion pipelines with SQL, PySpark, or Lakeflow Spark Declarative Pipelines using batch and streaming APIs such as read_files, spark.read, COPY INTO, and Auto Loader. Offers the flexibility to perform complex transformations during ingestion, while giving you greater responsibility for managing and maintaining your pipelines. |
| Output format | Uniform binary content table. Ingests each file in binary format (one file per row), along with file metadata in additional columns. | Structured Delta tables. Ingests structured files (like CSV and Excel) as Delta tables. Can also be used to ingest unstructured files in binary format. |
| Granularity, filtering, and selection | No subfolder or file-level selection today. No pattern-based filtering. Ingests all files in the specified SharePoint document library. | Granular and custom. URL-based selection to ingest from document libraries, subfolders, or individual files. Also supports pattern-based filtering using the pathGlobFilter option. |
Key features
The standard SharePoint connector offers:
- Ingestion of structured, semi-structured, and unstructured files
- Granular ingestion: Ingest a specific site, a sub-site, a document library, a folder, or a single file
- Batch and streaming ingestion using `spark.read`, Auto Loader, and `COPY INTO`
- Automatic schema inference and evolution for structured and semi-structured formats such as CSV and Excel
- Secure credential storage with a Unity Catalog connection
- File selection with pattern matching using `pathGlobFilter`
Requirements
To ingest files from SharePoint, you must have the following:
- A workspace with Unity Catalog enabled.
- `CREATE CONNECTION` privileges to create a SharePoint connection, or the appropriate privilege to use an existing one based on your cluster access mode:
  - Dedicated Access Mode: `MANAGE CONNECTION`.
  - Standard Access Mode: `USE CONNECTION`.
- Compute that uses Databricks Runtime version 17.3 LTS or later.
- OAuth authentication set up with the `Sites.Read.All` or `Sites.Selected` permission scope.
- The SharePoint Beta feature is enabled from the Previews page. See Manage Azure Databricks previews.
- Optional: Enable the Excel Beta feature for parsing Excel files. See Read Excel files.
Create the connection
Create a Unity Catalog connection to store your SharePoint credentials. The connection setup process is shared between both the standard and managed SharePoint connectors.
For complete connection setup instructions including OAuth authentication options, see Overview of SharePoint ingestion setup.
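As an illustrative sketch only: a Unity Catalog connection is typically defined with `CREATE CONNECTION`, but the connection type keyword and OAuth option names below are placeholders, not confirmed syntax for SharePoint. Follow the setup guide linked above for the supported DDL and authentication options.

```sql
-- Sketch only: SHAREPOINT as the connection type and the OPTIONS shape are
-- assumptions. See Overview of SharePoint ingestion setup for the exact,
-- supported syntax and the OAuth options your tenant requires.
CREATE CONNECTION IF NOT EXISTS my_sharepoint_conn
TYPE SHAREPOINT
OPTIONS (
  -- OAuth credentials for your Microsoft Entra ID application go here.
);
```

The connection name you choose here (`my_sharepoint_conn` in this article's examples) is what you later pass to the `databricks.connection` option.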
Read files from SharePoint
To read files, pass the connection you created using the databricks.connection option and a URL that points to the SharePoint resource you want to access. The URL you provide determines the scope of the ingestion.
The following path types are supported on Databricks Runtime 17.3 LTS and later:
| Path type | Description |
|---|---|
| Site | Copy the site URL from the address bar.<br>`https://mytenant.sharepoint.com/sites/test-site` |
| Subsite | Copy the subsite URL from the address bar.<br>`https://mytenant.sharepoint.com/sites/test-site/test-subsite` |
| Document library | Open the library from Site contents and copy the URL from the address bar.<br>`https://mytenant.sharepoint.com/sites/test-site/Shared%20Documents`<br>`https://mytenant.sharepoint.com/sites/test-site/custom-drive` |
| Folder | Open the folder from Site contents and copy the URL from the address bar. Alternatively, open the folder's Details pane in SharePoint and click the copy icon next to Path.<br>`https://mytenant.sharepoint.com/sites/test-site/Shared%20Documents/Forms/AllItems.aspx?id=%2Fsites...`<br>`https://mytenant.sharepoint.com/sites/test-site/custom-drive/test-folder` |
| File | Select the file, click the overflow menu (...), and select Preview. Copy the URL from the address bar. Alternatively, open the file's Details pane in SharePoint and click the copy icon next to Path.<br>`https://mytenant.sharepoint.com/sites/test-site/Shared%20Documents/Forms/AllItems.aspx?viewid=1a2b3c...`<br>`https://mytenant.sharepoint.com/sites/test-site/custom-drive/test-folder/test.csv` |
Databricks Runtime 18.3 and later adds support for the following path types:
| Path type | Description |
|---|---|
| Tenant | Copy the tenant root URL from the address bar.<br>`https://mytenant.sharepoint.com` |
| Nested subsite | Copy the subsite URL from the address bar.<br>`https://mytenant.sharepoint.com/sites/test-site/subsite/nested-subsite/nested-nested-subsite` |
| Share link | Select the file or folder, click the overflow menu (...), and select Copy link. Databricks recommends setting the share link to never expire.<br>`https://mytenant.sharepoint.com/:i:/s/test-site/1A2B3C4D5E6F7G8H9I` |
| Microsoft 365 for the web (formerly Office) | Open the file in Microsoft 365 for the web and copy the URL from the address bar.<br>`https://mytenant.sharepoint.com/:x:/r/sites/test-site/_layouts/15/Doc.aspx?sourcedoc=%1A2B...` |
Examples
There are a few ways to read files using the standard SharePoint connector.
Stream SharePoint files using Auto Loader
Auto Loader provides the most efficient way to incrementally ingest structured files from SharePoint. It automatically detects new files and processes them as they arrive. It can also ingest structured and semi-structured files such as CSV and JSON with automatic schema inference and evolution. For details about Auto Loader usage, see Common data loading patterns.
# Incrementally ingest new PDF files
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.option("databricks.connection", "my_sharepoint_conn")
.option("cloudFiles.schemaLocation", <path to a schema location>)
.option("pathGlobFilter", "*.pdf")
.load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
)
# Incrementally ingest CSV files with automatic schema inference and evolution
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("databricks.connection", "my_sharepoint_conn")
.option("pathGlobFilter", "*.csv")
.option("inferColumnTypes", True)
.option("header", True)
.load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs")
)
Read SharePoint files using Spark batch read
The following example shows how to ingest SharePoint files in Python using the spark.read function.
# Read unstructured data as binary files
df = (spark.read
.format("binaryFile")
.option("databricks.connection", "my_sharepoint_conn")
.option("recursiveFileLookup", True)
.option("pathGlobFilter", "*.pdf") # optional. Example: only ingest PDFs
.load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents"))
# Read a batch of CSV files, infer the schema, and load the data into a DataFrame
df = (spark.read
.format("csv")
.option("databricks.connection", "my_sharepoint_conn")
.option("pathGlobFilter", "*.csv")
.option("recursiveFileLookup", True)
.option("inferSchema", True)
.option("header", True)
.load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs"))
# Read a specific Excel file from SharePoint, infer the schema, and load the data into a DataFrame
df = (spark.read
.format("excel")
.option("databricks.connection", "my_sharepoint_conn")
.option("headerRows", 1) # optional
.option("dataAddress", "Sheet1!A1:M20") # optional
.load("https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx"))
Read SharePoint files using Spark SQL
The following example shows how to ingest SharePoint files in SQL using the read_files table-valued function. For details about read_files usage, see read_files table-valued function.
-- Read pdf files
CREATE TABLE my_table AS
SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
`databricks.connection` => "my_sharepoint_conn",
format => "binaryFile",
pathGlobFilter => "*.pdf", -- optional. Example: only ingest PDFs
schemaEvolutionMode => "none"
);
-- Read a specific Excel sheet and range
CREATE TABLE my_sheet_table AS
SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx",
`databricks.connection` => "my_sharepoint_conn",
format => "excel",
headerRows => 1, -- optional
dataAddress => "Sheet1!A2:D10", -- optional
schemaEvolutionMode => "none"
);
Incremental ingestion with COPY INTO
COPY INTO provides idempotent incremental loading of files into a Delta table. For details about COPY INTO usage, see Common data loading patterns using COPY INTO.
CREATE TABLE IF NOT EXISTS sharepoint_pdf_table;
CREATE TABLE IF NOT EXISTS sharepoint_csv_table;
CREATE TABLE IF NOT EXISTS sharepoint_excel_table;
-- Incrementally ingest new PDF files
COPY INTO sharepoint_pdf_table
FROM "https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents"
FILEFORMAT = BINARYFILE
PATTERN = '*.pdf'
FORMAT_OPTIONS ('databricks.connection' = 'my_sharepoint_conn')
COPY_OPTIONS ('mergeSchema' = 'true');
-- Incrementally ingest CSV files with automatic schema inference and evolution
COPY INTO sharepoint_csv_table
FROM "https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs"
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('databricks.connection' = 'my_sharepoint_conn', 'header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
-- Ingest a single Excel file
COPY INTO sharepoint_excel_table
FROM "https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx"
FILEFORMAT = EXCEL
FORMAT_OPTIONS ('databricks.connection' = 'my_sharepoint_conn', 'headerRows' = '1')
COPY_OPTIONS ('mergeSchema' = 'true');
Ingest SharePoint files in Lakeflow Spark Declarative Pipelines
Note
The SharePoint connector requires Databricks Runtime 17.3 or later. To use the connector, set "CHANNEL" = "PREVIEW" in your pipeline settings. For more information on previews, see Pipeline properties reference.
The following examples show how to read SharePoint files using Auto Loader in Lakeflow Spark Declarative Pipelines.
Python
from pyspark import pipelines as dp
# Incrementally ingest new PDF files
@dp.table
def sharepoint_pdf_table():
return (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.option("databricks.connection", "my_sharepoint_conn")
.option("pathGlobFilter", "*.pdf")
.load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
)
# Incrementally ingest CSV files with automatic schema inference and evolution
@dp.table
def sharepoint_csv_table():
return (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("databricks.connection", "my_sharepoint_conn")
.option("pathGlobFilter", "*.csv")
.option("inferColumnTypes", True)
.option("header", True)
.load("https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs")
)
# Read a specific Excel file from SharePoint in a materialized view
@dp.table
def sharepoint_excel_table():
return (spark.read.format("excel")
.option("databricks.connection", "my_sharepoint_conn")
.option("headerRows", 1) # optional
.option("inferColumnTypes", True) # optional
.option("dataAddress", "Sheet1!A1:M20") # optional
.load("https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx")
)
SQL
-- Incrementally ingest new PDF files
CREATE OR REFRESH STREAMING TABLE sharepoint_pdf_table
AS SELECT * FROM STREAM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
format => "binaryFile",
`databricks.connection` => "my_sharepoint_conn",
pathGlobFilter => "*.pdf");
-- Incrementally ingest CSV files with automatic schema inference and evolution
CREATE OR REFRESH STREAMING TABLE sharepoint_csv_table
AS SELECT * FROM STREAM read_files(
"https://mytenant.sharepoint.com/sites/Engineering/Data/IoT_Logs",
format => "csv",
`databricks.connection` => "my_sharepoint_conn",
pathGlobFilter => "*.csv",
header => "true");
-- Read a specific Excel file from SharePoint in a materialized view
CREATE OR REFRESH MATERIALIZED VIEW sharepoint_excel_table
AS SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Finance/Shared%20Documents/Monthly/Report-Oct.xlsx",
`databricks.connection` => "my_sharepoint_conn",
format => "excel",
headerRows => 1, -- optional
dataAddress => "Sheet1!A2:D10", -- optional
schemaEvolutionMode => "none"
);
Parse unstructured files
When ingesting unstructured files from SharePoint (such as PDFs, Word documents, or PowerPoint files) using the standard SharePoint connector with binaryFile format, the file contents are stored as raw binary data. To prepare these files for AI workloads—such as RAG, search, classification, or document understanding—you can parse the binary content into structured, queryable output using ai_parse_document.
The following example shows how to parse unstructured documents stored in a bronze Delta table named documents, adding a new column with parsed content:
CREATE TABLE documents AS
SELECT * FROM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
`databricks.connection` => "my_sharepoint_conn",
format => "binaryFile",
pathGlobFilter => "*.{pdf,docx}",
schemaEvolutionMode => "none"
);
SELECT *, ai_parse_document(content) AS parsed_content
FROM documents;
The parsed_content column contains extracted text, tables, layout information, and metadata that can be directly used for downstream AI pipelines.
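As a sketch of downstream querying: ai_parse_document returns a VARIANT, but the `document` and `pages` field names below are assumptions, not confirmed by this article; consult the ai_parse_document reference for the actual output schema before relying on a specific path.

```sql
-- Hypothetical field path: "document" and "pages" are assumed names for
-- illustration only. Check the ai_parse_document reference for the real
-- output schema of parsed_content.
SELECT
  path,
  parsed_content:document.pages[0].content::STRING AS first_page_text
FROM (
  SELECT *, ai_parse_document(content) AS parsed_content
  FROM documents
);
```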
Incremental parsing with Lakeflow Spark Declarative Pipelines
You can also use ai_parse_document within Lakeflow Spark Declarative Pipelines to enable incremental parsing. As new files stream in from SharePoint, they are automatically parsed as your pipeline updates.
For example, you can define a streaming table that incrementally ingests documents, plus a materialized view that parses newly ingested documents:
CREATE OR REFRESH STREAMING TABLE sharepoint_documents_table
AS SELECT * FROM STREAM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
format => "binaryFile",
`databricks.connection` => "my_sharepoint_conn",
pathGlobFilter => "*.{pdf,docx}");
CREATE OR REFRESH MATERIALIZED VIEW documents_parsed
AS
SELECT *, ai_parse_document(content) AS parsed_content
FROM sharepoint_documents_table;
This approach ensures that:
- Newly ingested SharePoint files are parsed automatically whenever the materialized view refreshes
- Parsed outputs stay in sync with incoming data
- Downstream AI pipelines always operate on up-to-date document representations
Learn more: See ai_parse_document for supported formats and advanced options.
SharePoint metadata column
Important
This feature is in Private Preview. To try it, reach out to your Azure Databricks contact.
The _sharepoint_metadata column is a hidden metadata column that provides access to SharePoint-specific properties of ingested files, sourced from the Microsoft Graph driveItem resource. It requires Databricks Runtime 18.1 or above and is available for all file formats when reading from SharePoint. To include the _sharepoint_metadata column in the returned DataFrame, you must explicitly select it in the read query.
If the data source contains a column named _sharepoint_metadata, the SharePoint metadata column is renamed to __sharepoint_metadata (with an extra leading underscore) to deduplicate. Additional underscores are added until the name is unique.
Common file metadata like the file path or size can be queried using the _metadata column. For more information, see File metadata column.
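For example, the two metadata columns can be combined in one query, reusing the connection and site URL from the earlier examples:

```sql
-- _metadata provides common file properties; _sharepoint_metadata adds
-- SharePoint-specific properties sourced from Microsoft Graph.
SELECT
  _metadata.file_path,
  _metadata.file_size,
  _sharepoint_metadata.web_url
FROM read_files(
  "https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
  `databricks.connection` => "my_sharepoint_conn",
  format => "binaryFile"
);
```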
Schema
The _sharepoint_metadata column is a STRUCT containing the following fields. All fields are nullable.
| Name | Type | Description | Example | Minimum Databricks Runtime release |
|---|---|---|---|---|
| item_id | STRING | The item's driveItem ID. | 01OMQ3MNLH42C5J675CBEI5CRK7SPKQUTZ | 18.1 |
| site_id | STRING | The ID of the SharePoint site that contains the item. | mytenant.sharepoint.com,69dc7b12-f92c-498d-9514-596b793a1f77,c6c1db8d-2b8d-48a1-a549-394b63d74725 | 18.1 |
| drive_id | STRING | The ID of the drive that contains the item. | b!EnvcaSz5jUmVFFlreTofd43bwcaNK6FIpUk5S2PXRyWTvQraaWQkSpwQEgThHDS- | 18.1 |
| drive_type | STRING | The type of drive, for example documentLibrary for SharePoint libraries or business for OneDrive for Business. | documentLibrary | 18.1 |
| parent_id | STRING | The driveItem ID of the parent folder. | 01OMQ3MNN6Y2GOVW7725BZO354PWSELRRZ | 18.1 |
| parent_name | STRING | The name of the parent folder. | Shared Documents | 18.1 |
| parent_path | STRING | The drive-relative path of the parent folder. | /drives/b!EnvcaSz5.../root: | 18.1 |
| web_url | STRING | The browser URL of the item on SharePoint. | https://mytenant.sharepoint.com/sites/TestSite/_layouts/15/Doc.aspx?sourcedoc=... | 18.1 |
| mime_type | STRING | The MIME type of the item. | application/vnd.ms-excel | 18.1 |
| created_by_email | STRING | The email of the user who created the item. | alice@example.onmicrosoft.com | 18.1 |
| created_by_name | STRING | The display name of the user who created the item. | Alice Example | 18.1 |
| created_timestamp | TIMESTAMP | The time the item was created. | 2025-12-03 13:33:12 | 18.1 |
| last_modified_by_email | STRING | The email of the user who last modified the item. | alice@example.onmicrosoft.com | 18.1 |
| last_modified_by_name | STRING | The display name of the user who last modified the item. | Alice Example | 18.1 |
| etag | STRING | The item's ETag. Changes when the item or any of its metadata changes. | "{D485E667-FDFB-4810-8E8A-2AFC9EA85279},1" | 18.1 |
| ctag | STRING | The item's change tag. Changes only when the item's content changes. | "c:{D485E667-FDFB-4810-8E8A-2AFC9EA85279},1" | 18.1 |
| description | STRING | The item description, if set. | Q4 financial report | 18.1 |
| additional_metadata | VARIANT | Any other driveItem fields returned by Microsoft Graph but not extracted above. | {"shared":{"scope":"users"},...} | 18.1 |
Note
The additional_metadata field is returned as VARIANT. See VARIANT type.
Examples
The following examples show how to include the _sharepoint_metadata column in a read query, select specific fields from the column, and extract values from the additional_metadata VARIANT field.
Python
df = (spark.read
.format("binaryFile")
.option("databricks.connection", "my_sharepoint_conn")
.load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
.select("*", "_metadata", "_sharepoint_metadata"))
SQL
SELECT *, _sharepoint_metadata
FROM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
`databricks.connection` => "my_sharepoint_conn",
format => "binaryFile"
);
Select specific fields from the _sharepoint_metadata struct:
df = (spark.read
.format("binaryFile")
.option("databricks.connection", "my_sharepoint_conn")
.load("https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents")
.select("_sharepoint_metadata.item_id", "_sharepoint_metadata.etag"))
Extract values from the additional_metadata VARIANT field using the :: cast operator:
SELECT
*,
  _sharepoint_metadata.additional_metadata:shared.scope::STRING AS shared_scope
FROM read_files(
"https://mytenant.sharepoint.com/sites/Marketing/Shared%20Documents",
`databricks.connection` => "my_sharepoint_conn",
format => "binaryFile"
);
Limitations
The standard SharePoint connector has the following limitations.
- No multi-site ingestion: You cannot ingest multiple sites with the same query. To ingest from two sites, you must write two separate queries.
- Filtering: You can use the `pathGlobFilter` option to filter files by name. Folder path-based filtering is not supported.
- Unsupported formats: SharePoint Lists and .aspx site pages are not supported. Only files in Document Libraries are supported.
- Writing back to a SharePoint server is not supported.
- Auto Loader `cleanSource` (deleting or archiving files at the source after ingestion) is not supported.
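As a workaround for the single-site limitation, you can run one read per site and combine the results; the site URLs below are illustrative:

```sql
-- Each read_files call targets a single site; UNION ALL combines the
-- binary-file rows from both sites into one Delta table.
CREATE TABLE combined_sites AS
SELECT * FROM read_files(
  "https://mytenant.sharepoint.com/sites/SiteA/Shared%20Documents",
  `databricks.connection` => "my_sharepoint_conn",
  format => "binaryFile"
)
UNION ALL
SELECT * FROM read_files(
  "https://mytenant.sharepoint.com/sites/SiteB/Shared%20Documents",
  `databricks.connection` => "my_sharepoint_conn",
  format => "binaryFile"
);
```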
Next steps
- Learn about Auto Loader for advanced streaming ingestion patterns
- Explore COPY INTO for idempotent incremental loads
- Compare with cloud object storage ingestion patterns
- Set up job scheduling to automate your ingestion workflows
- Use Lakeflow Spark Declarative Pipelines to build end-to-end data pipelines with transformations