As a Senior Software Engineer who has led the deployment of Databricks across multiple workspaces at Nike, I’ve gained deep insights into how Databricks operates on AWS. In this post, I’ll break down the architecture, share key learnings, and provide best practices for enterprise deployments.

Note: This post focuses on the “Classic” Databricks deployment model, where compute runs in your AWS account. Databricks has also launched Serverless offerings, which we will look at in a follow-up post.

Table of Contents

  1. Core Components
  2. Network Architecture
  3. Security and Access Control
  4. Storage Integration
  5. Best Practices
  6. Common Challenges and Solutions

Core Components

Control Plane vs Data Plane

Databricks on AWS follows a distinctive architecture that separates the control plane from the data plane:

Control Plane:

  • Managed by Databricks in their AWS account
  • Handles workspace management, cluster provisioning, and job scheduling
  • Operates in specific AWS regions (e.g., us-west-2, us-east-1)
  • Communicates with the workspace VPC over AWS PrivateLink (if configured); otherwise traffic stays on the AWS inter-account networking backbone

Data Plane:

  • Runs in your AWS account
  • Contains actual compute resources (EC2 instances)
  • Processes data within your VPC
  • Accesses data from your S3 buckets

Workspace Components

  1. Web Application

    • Provides notebook interface
    • Manages users and permissions
    • Handles job scheduling and monitoring
  2. Cluster Manager

    • Provisions and manages EC2 instances
    • Handles auto-scaling
    • Manages cluster lifecycle
  3. Metastore Service

    • Manages table metadata
    • Integrates with Unity Catalog
    • Handles schema evolution
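
Many of the Cluster Manager responsibilities above (provisioning, auto-scaling, lifecycle) surface as settings you declare when defining a cluster. Here is a minimal sketch using the Databricks Terraform provider; the runtime version, node type, and sizes are illustrative placeholders, so check them against what your workspace supports.

# Minimal cluster definition (illustrative values)
resource "databricks_cluster" "example" {
  cluster_name            = "demo-cluster"
  spark_version           = "15.4.x-scala2.12" # pick a runtime supported in your workspace
  node_type_id            = "m5.xlarge"        # any node type available in your region
  autotermination_minutes = 30                 # lifecycle: shut down when idle

  autoscale {
    min_workers = 1
    max_workers = 4 # the Cluster Manager scales workers within this range
  }
}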

Network Architecture

VPC Configuration

# Example Terraform configuration for Databricks VPC

provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      Environment = "Test"
      workspace   = "demo-workspace" # Adding workspace name as tag helps with identifying resources.
    }
  }
}

resource "aws_vpc" "databricks_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "databricks-vpc"
  }
}

# Private subnets for worker nodes
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.databricks_vpc.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "databricks-private-${count.index + 1}"
  }
}
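
A customer-managed VPC like the one above is handed to Databricks by registering its network configuration at the account level. The sketch below uses the provider's databricks_mws_networks resource, which is configured against the account-level Databricks provider; the account ID is a placeholder and the exact workflow should be verified against the provider documentation.

# Register the VPC, subnets, and security group with the Databricks account (sketch)
resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id # placeholder: your Databricks account ID
  network_name       = "demo-workspace-network"
  vpc_id             = aws_vpc.databricks_vpc.id
  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.workspace_sg.id]
}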

Networking Best Practices

  1. Use at least two private subnets in different availability zones (three for enterprise deployments)
  2. Implement NAT Gateway for outbound traffic
  3. Configure AWS PrivateLink for secure control plane communication, and add VPC endpoints (gateway endpoints work well) for S3 and DynamoDB (see the sketch after this list)
  4. Set up proper route tables and security groups
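
For the S3 and DynamoDB traffic mentioned in point 3, the usual pattern is VPC gateway endpoints attached to the private route tables. A sketch follows; the route table reference is a placeholder for whichever route tables serve the private subnets.

# Gateway endpoints keep S3 and DynamoDB traffic on the AWS network (sketch)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.databricks_vpc.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id] # placeholder route table
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.databricks_vpc.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id] # placeholder route table
}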

Security and Access Control

Identity Management

Databricks provides multiple options for identity management:

  1. Native Workspace Users

    • Basic authentication
    • Suitable for POCs and small teams
  2. SSO Integration

    • SAML 2.0 support
    • Integrates with major providers (Okta, Azure AD)
    • Recommended for enterprises

Unity Catalog Implementation

Unity Catalog provides fine-grained access control:

-- Example Unity Catalog permissions
GRANT CREATE SCHEMA ON CATALOG analytics TO data_engineers;
GRANT USE SCHEMA ON SCHEMA analytics.sales TO analysts;
GRANT SELECT ON TABLE analytics.sales.transactions TO reporting_role;
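
If you manage infrastructure with Terraform (as in the VPC example earlier), the same grants can be kept in code with the provider's databricks_grants resource. This is a sketch; the privilege strings and attribute shapes should be double-checked against the current provider documentation.

# Unity Catalog grants managed as code (sketch)
resource "databricks_grants" "analytics_catalog" {
  catalog = "analytics"
  grant {
    principal  = "data_engineers"
    privileges = ["USE_CATALOG", "CREATE_SCHEMA"]
  }
}

resource "databricks_grants" "sales_schema" {
  schema = "analytics.sales"
  grant {
    principal  = "analysts"
    privileges = ["USE_SCHEMA"]
  }
}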

Network Security

  1. VPC Security Groups

    resource "aws_security_group" "workspace_sg" {
      name_prefix = "databricks-workspace"
      vpc_id      = aws_vpc.databricks_vpc.id
    
      ingress {
        from_port = 443
        to_port   = 443
        protocol  = "tcp"
        self      = true
      }
    }
    
  2. AWS PrivateLink Configuration

    • Ensures secure communication
    • Avoids public internet exposure
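
At the Terraform level, PrivateLink connectivity to the control plane is a VPC interface endpoint in the workspace subnets. The endpoint service name is region-specific and published by Databricks, so it is left as a variable here; treat this as a sketch rather than a complete PrivateLink setup, which also involves registering the endpoint with your Databricks account.

# Interface endpoint toward the Databricks control plane (sketch)
resource "aws_vpc_endpoint" "databricks_workspace" {
  vpc_id              = aws_vpc.databricks_vpc.id
  service_name        = var.databricks_workspace_service_name # placeholder: region-specific value from Databricks docs
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.workspace_sg.id]
  private_dns_enabled = true
}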

Storage Integration

Root Storage (also known as DBFS, the Databricks File System)

  • Root bucket for workspace assets such as job metadata, notebook results, and so on
  • I don’t recommend using it for any user data or workloads.
  • Set it up once and forget that it exists.
  • Enable S3 Intelligent-Tiering on this bucket (see the sketch after this list).
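
One way to enable Intelligent-Tiering on the root bucket is an S3 lifecycle rule that transitions objects to the INTELLIGENT_TIERING storage class. A sketch follows; the bucket reference is a placeholder for your workspace root bucket.

# Transition root-bucket objects to Intelligent-Tiering (sketch)
resource "aws_s3_bucket_lifecycle_configuration" "root_bucket_tiering" {
  bucket = var.workspace_root_bucket # placeholder: the workspace root (DBFS) bucket name

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"

    transition {
      days          = 0 # transition as soon as possible
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}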

Unity Catalog Storage

Unity Catalog requires specific S3 buckets:

  1. Internal and Metadata Storage

    • Stores data and metadata for tables, volumes, and schemas
    • Managed by Databricks
      • As far as I know, enabling Intelligent-Tiering here is not recommended because it interferes with Delta data archival
  2. External Storage

    • Your data lake locations
    • Managed by your team
    • Create a dedicated IAM role for each group or use case to avoid unintended data access (a policy sketch follows this list).
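
A dedicated IAM role per group or use case mostly comes down to scoping the role's policy to that team's bucket and prefix. The sketch below shows only the S3 permissions policy; the role itself and its trust relationship (which Unity Catalog storage credentials require) should follow the Databricks documentation. The bucket and prefix names are placeholders.

# S3 permissions scoped to a single team's prefix (sketch)
resource "aws_iam_policy" "sales_team_data_access" {
  name = "databricks-sales-external-location"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::example-datalake/sales/*" # placeholder bucket/prefix
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::example-datalake"
        Condition = {
          StringLike = { "s3:prefix" = ["sales/*"] }
        }
      }
    ]
  })
}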

Best Practices

Workspace Organization

  1. Environment Separation

    • Separate workspaces for dev/staging/prod
    • Consistent naming conventions
    • Automated deployment pipelines
    • Create a “cookie cutter” Terraform module for workspace deployment
  2. Resource Management

    • Implement cluster policies
    • Use instance pools for faster startup
    • Configure auto-termination
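
Instance pools keep a set of warm instances around so clusters attach to them instead of waiting on EC2 provisioning. Here is a hedged sketch with the Databricks provider; the node type and pool sizes are illustrative.

# Warm instance pool for faster cluster startup (sketch)
resource "databricks_instance_pool" "warm_pool" {
  instance_pool_name                    = "demo-warm-pool"
  node_type_id                          = "m5.xlarge" # illustrative node type
  min_idle_instances                    = 2
  max_capacity                          = 20
  idle_instance_autotermination_minutes = 15
}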

Cost Optimization

  1. Cluster Configuration

    • Utilize auto-scaling
    • Set an auto-termination time for idle clusters
    • Use fleets for better pricing and availability
    • Always use cluster policies
  2. Job Optimization

    • Use job clusters
    • Schedule during off-peak hours
    • Implement proper auto-scaling
    • Always set a max retry limit and failure notifications (see the job sketch after this list).
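
A job definition ties several of these points together: a job cluster that exists only for the run, an off-peak schedule, retries, and failure notifications. The sketch below uses the Databricks provider; the notebook path, email address, runtime, and cron expression are placeholders.

# Scheduled job on an ephemeral job cluster, with retries and alerts (sketch)
resource "databricks_job" "nightly_etl" {
  name = "nightly-etl"

  task {
    task_key    = "etl"
    max_retries = 2 # retry a failed run a limited number of times

    new_cluster {
      spark_version = "15.4.x-scala2.12" # illustrative runtime
      node_type_id  = "m5.xlarge"
      num_workers   = 2
    }

    notebook_task {
      notebook_path = "/Jobs/nightly_etl" # placeholder notebook
    }
  }

  schedule {
    quartz_cron_expression = "0 0 2 * * ?" # 02:00 daily, off-peak
    timezone_id            = "UTC"
  }

  email_notifications {
    on_failure = ["data-alerts@example.com"] # placeholder address
  }
}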

Common Challenges and Solutions

Challenge 1: Network Connectivity

Problem: Slow cluster startup times due to network issues
Solution:

  • Use AWS PrivateLink
  • Implement proper subnet sizing
  • If you are using a customer-managed VPC, configure efficient route tables

Challenge 2: Permission Management

Problem: Complex permission requirements across teams
Solution:

  • Implement Unity Catalog
  • Use group-based access control
  • Regular access audits

Challenge 3: Cost Management

Problem: Unexpected cost spikes
Solution:

  • Set up budget alerts
  • Use cluster policies to cap DBU consumption via the dbus_per_hour policy attribute (a policy sketch follows this list).
  • You can also set a maximum number of compute resources per user to limit how many clusters each user can create.
  • Use spot instances where appropriate
  • Define a cost sharing or “chargeback” model for users
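
Here is a hedged sketch of a cluster policy that caps DBUs and enforces auto-termination, using the provider's databricks_cluster_policy resource. The limits are arbitrary examples, and the policy attribute names should be checked against the cluster policy reference.

# Cluster policy capping cost-related settings (sketch)
resource "databricks_cluster_policy" "cost_guardrails" {
  name = "cost-guardrails"
  definition = jsonencode({
    "dbus_per_hour" = {
      "type"     = "range"
      "maxValue" = 10 # illustrative DBU/hour ceiling
    }
    "autotermination_minutes" = {
      "type"         = "range"
      "maxValue"     = 60
      "defaultValue" = 30
    }
  })
}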

Conclusion

Databricks on AWS provides a powerful platform for data analytics and machine learning, but proper architecture and configuration are crucial for success. The key is to understand how Databricks components interact with AWS services, implement security best practices, and optimize for performance and cost.

Have questions about Databricks on AWS? Feel free to reach out to me on LinkedIn.