Deploy Non-Notebook Workspace Files to Databricks Workspaces with PowerShell

Databricks recently announced general availability of their “New Files Experience for Databricks Workspace”. It allows you to store and interact with non-notebook files alongside your notebooks. This is a very powerful feature when it comes to code modularization and reuse as well as environment configuration. The catch is that getting these files into a non-git-enabled environment can be tricky.

When looking into the Databricks Workspace API 2.0, you see a section for importing files into a workspace. There are three different options for doing so: 1) import a base64-encoded string, 2) import a local notebook, and 3) import a local file. At first glance it seems as if the third option would be the choice here, but upon reading through the only example provided, it seems less straightforward…

curl --netrc --request POST \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import \
  --header 'Content-Type: multipart/form-data' \
  --form path=/Repos/me@example.com/MyRepo/my_file.py \
  --form format=AUTO \
  --form content=@non-notebook.py

https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/workspace#example-2
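
If you did want to drive that multipart example from PowerShell, the closest equivalent needs the -Form parameter, which only exists in PowerShell 6.1 and later — one reason this route is less convenient in a classic Windows PowerShell pipeline. A rough sketch (the $token variable, workspace URL, path, and file name below are placeholders):

# Rough PowerShell 6.1+ equivalent of the curl example above.
# -Form does not exist in Windows PowerShell 5.1.
$form = @{
    path    = '/Repos/me@example.com/MyRepo/my_file.py'
    format  = 'AUTO'
    content = Get-Item -Path './non-notebook.py'   # FileInfo values are sent as file parts
}
Invoke-RestMethod -Method Post `
    -Uri 'https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/import' `
    -Headers @{ Authorization = "Bearer $token" } `
    -Form $form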

I’ve used various Databricks APIs to do such things as run notebooks, deploy jobs, and upload files to DBFS. There’s even a DevOps utility for deploying notebooks to workspaces, but it does not support non-notebook files. So how do you accomplish this in a more straightforward manner?

First, you need to have files to import. If you work in an environment where only the development workspace is linked to git and you need to deploy source-controlled files to other environment workspaces, you will need a utility for this. Where you store and source your files is up to you. In this example, we include them in a DevOps build artifact and programmatically search for and deploy them, but that doesn’t mean you can’t pull them straight from git and deploy them that way.

Once you have a set of files to import, you will notice that non-notebook files exported from a Databricks workspace are automatically suffixed with an underscore (“_”). So a local “.py” file becomes a “.py_”, and a “.json” configuration file becomes a “.json_”, on export. This comes in handy when keeping these files alongside your notebook files while using the DevOps deploy-notebook utility, since files with these extensions are not picked up by it. We export our files from our development environment and save them as .py_ files in DevOps repos, so I am not sure how these file types are stored if you have a linked repo. It’ll be up to you to check on that.
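
A quick way to see which files fall into this bucket (a minimal sketch, assuming your exported files sit under a local $folderPath):

# List the underscore-suffixed exports and the workspace names they map back to
Get-ChildItem -Recurse -Include *.py_, *.json_, *.config_ -Path $folderPath |
    ForEach-Object { Write-Host "$($_.Name) -> $($_.Name.TrimEnd('_'))" }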

Once you have your files somewhere PowerShell can see them, they are retrieved using the Get-ChildItem cmdlet. This list of items is then looped over, and each file is deployed using the 2.0/workspace/import API. Seems straightforward? Sure, but Databricks wants a base64-encoded string with this API, and Windows-style line endings turn out to be a problem: once the source text is imported into a workspace file, all of the line breaks are missing. This is fine for a .json file, but for file types like .py, which depend on specific indentation and line breaks, it does not work.

To fix this, you replace the carriage-return-plus-line-feed (CRLF) sequences with just a line feed before encoding. Then, once encoded, the content can be sent over this API and imported into your workspace. Again… this isn’t documented anywhere.
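
In isolation, the fix looks like this (assuming $file is a FileInfo object returned by Get-ChildItem):

# Normalize CRLF line endings to LF so line breaks survive the import,
# then base64-encode the UTF-8 bytes for the API body
$fileContent = (Get-Content $file.FullName -Raw).Replace("`r`n", "`n")
$bytes = [System.Text.Encoding]::UTF8.GetBytes($fileContent)
$encodedContent = [System.Convert]::ToBase64String($bytes)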

You will also see that if you import a “.py_” file to a workspace, it remains a “.py_” file and does not change back to a “.py” file you can use alongside a notebook. To fix this, the script simply drops the last character of the file name before using it in the API body.
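
Concretely, that is a one-liner (again assuming $file comes from the Get-ChildItem call):

# ".py_" becomes ".py" again once the trailing underscore is dropped
$newFileName = $file.Name.Substring(0, $file.Name.Length - 1)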

The final JSON body for the API call looks like this:

$workspaceImportJson = '{ "path": "/Notebooks/Example.py", "content": "'+$encodedContent+'", "language": "AUTO", "overwrite": true, "format": "AUTO" }'

You can see that the path does not include the “/Workspace/” folder, because that is the root the API imports into, and that the language and format are both set to “AUTO”. This allows us to import any type of file besides a notebook file.
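
For reference, a one-off import using that body might look like this (a sketch, assuming $url, $token, and $encodedContent are already set):

# The body is a JSON document, so the Content-Type is application/json
$apiHeaders = @{
    "Authorization" = "Bearer $token"
    "Content-Type"  = "application/json"
}
Invoke-RestMethod -Uri "$url/api/2.0/workspace/import" -Method POST -Headers $apiHeaders -Body $workspaceImportJson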

And a call from DevOps would look something like:

-url '$(databricks-workspaceURL)' -token '$(databricks-APIKey)' -folderPath '$(System.DefaultWorkingDirectory)/_Databricks-CI/$(Release.Artifacts._Databricks-CI.BuildNumber)/Databricks/Notebooks/Demo/' -deployFolder '/Demo/'

The full script looks like this:

param(
    [Parameter(Mandatory=$True, Position=0, ValueFromPipeline=$false)]
    [System.String]
    $url,

    [Parameter(Mandatory=$False, Position=1, ValueFromPipeline=$false)]
    [System.String]
    $token,

    [Parameter(Mandatory=$False, Position=2, ValueFromPipeline=$false)]
    [System.String]
    $folderPath,

    [Parameter(Mandatory=$False, Position=3, ValueFromPipeline=$false)]
    [System.String]
    $deployFolder
)
################# Test Parameters ######################
# $url = "https://adb-################.##.azuredatabricks.net"
# $token = "dapiabc123def456ghi789jkl012lmn345-1"
# $folderPath = "C:\path\to\your\databricks\notebooks\files\"
# $deployFolder = "/workspace/path/"
########################################################
$dbrAPI = $url + "/api/"
$apiHeaders = @{
    "Authorization" = "Bearer $token"
    "Content-Type" = "Content-Type: multipart/form-data"
}

if([string]::IsNullOrEmpty($token)) {
    Write-Host "Databricks workspace not deployed yet or token not set in variables."
    Write-Host "Please run a deploy, create an Access Token, update the DevOps token variable with the access token, and rerun the deployment to import the files."
}
else {
    [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12

    Write-Host "Getting Files to be Deployed..."


    $files = Get-ChildItem -Recurse -Include *.json_, *.py_, *.config_ -Path $folderPath
    
    $workspaceImportUri = $dbrAPI + "2.0/workspace/import"

    foreach($file in $files) {

        # Convert the local directory separators to forward slashes, strip the
        # trailing "_" from the file name, and build the target workspace path
        # from the deploy folder onward.
        $newPath = $($file.Directory).ToString().Replace("\", "/")
        $newFileName = $($file.Name).Substring(0, ($($file.Name).Length - 1))
        $deployPath = $newPath.Substring($newPath.IndexOf($deployFolder)) + "/" + $newFileName

        Write-Host "Deploying: $($file.Directory)\$($file.Name) to $deployPath"

        $fileContent = Get-Content "$($file.Directory)\$($file.Name)" -Raw

        # Normalize CRLF line endings to LF so line breaks survive the import.
        $fileContent = $fileContent.Replace("`r`n","`n")

        # Base64-encode the UTF-8 bytes for the API body.
        $Bytes = [System.Text.Encoding]::UTF8.GetBytes($fileContent)
        $encodedContent = [System.Convert]::ToBase64String($Bytes)

        $workspaceImportJson = '{ "path": "'+$deployPath+'", "content": "'+$encodedContent+'", "language": "AUTO", "overwrite": true, "format": "AUTO" }'

        Invoke-RestMethod -Uri $workspaceImportUri -Method POST -Headers $apiHeaders -Body $workspaceImportJson -UseBasicParsing
    }
}

https://github.com/CharlesRinaldini/Databricks/blob/main/WorkspaceFileCreate.ps1
