As a Senior Software Engineer who has led the deployment of Databricks across multiple workspaces at Nike, I’ve gained deep insights into how Databricks operates on AWS. In this post, I’ll break down the architecture, share key learnings, and provide best practices for enterprise deployments.

Note: This post focuses on the “Classic” Databricks deployment model, where compute runs in your AWS account. Databricks has also launched Serverless offerings, which we will look at in a follow-up post.

Table of Contents

  1. Core Components
  2. Network Architecture
  3. Security and Access Control
  4. Storage Integration
  5. Best Practices
  6. Common Challenges and Solutions

Core Components

Control Plane vs Data Plane

Databricks on AWS follows a distinctive architecture that separates the control plane from the data plane:

Control Plane:

  • Managed by Databricks in their AWS account
  • Handles workspace management, cluster provisioning, and job scheduling
  • Operates in specific AWS regions (e.g., us-west-2, us-east-1)
  • Communicates with the workspace VPC over AWS PrivateLink (if configured); otherwise traffic stays on the AWS inter-account networking backbone

Data Plane:

  • Runs in your AWS account
  • Contains actual compute resources (EC2 instances)
  • Processes data within your VPC
  • Accesses data from your S3 buckets

Workspace Components

  1. Web Application

    • Provides notebook interface
    • Manages users and permissions
    • Handles job scheduling and monitoring
  2. Cluster Manager

    • Provisions and manages EC2 instances
    • Handles auto-scaling
    • Manages cluster lifecycle
  3. Metastore Service

    • Manages table metadata
    • Integrates with Unity Catalog
    • Handles schema evolution
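
Many of the Cluster Manager responsibilities above (provisioning, auto-scaling, lifecycle) surface as settings you declare when defining a cluster. Here is a minimal sketch using the Databricks Terraform provider; the runtime version, node type, and sizes are illustrative placeholders, so check them against what your workspace supports.

# Minimal cluster definition (illustrative values)
resource "databricks_cluster" "example" {
  cluster_name            = "demo-cluster"
  spark_version           = "15.4.x-scala2.12" # pick a runtime supported in your workspace
  node_type_id            = "m5.xlarge"        # any node type available in your region
  autotermination_minutes = 30                 # lifecycle: shut down when idle

  autoscale {
    min_workers = 1
    max_workers = 4 # the Cluster Manager scales workers within this range
  }
}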

Network Architecture

VPC Configuration

# Example Terraform configuration for Databricks VPC

provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      Environment = "Test"
      workspace   = "demo-workspace" # Adding workspace name as tag helps with identifying resources.
    }
  }
}

resource "aws_vpc" "databricks_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "databricks-vpc"
  }
}

# Private subnets for worker nodes
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.databricks_vpc.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "databricks-private-${count.index + 1}"
  }
}
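
A customer-managed VPC like the one above is handed to Databricks by registering its network configuration at the account level. The sketch below uses the provider's databricks_mws_networks resource, which is configured against the account-level Databricks provider; the account ID is a placeholder and the exact workflow should be verified against the provider documentation.

# Register the VPC, subnets, and security group with the Databricks account (sketch)
resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id # placeholder: your Databricks account ID
  network_name       = "demo-workspace-network"
  vpc_id             = aws_vpc.databricks_vpc.id
  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.workspace_sg.id]
}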

Networking Best Practices

  1. Use at least two private subnets in different availability zones (three for enterprise deployments)
  2. Implement NAT Gateway for outbound traffic
  3. Configure AWS PrivateLink for secure control plane communication, and add VPC endpoints (gateway endpoints work well) for S3 and DynamoDB (see the sketch after this list)
  4. Set up proper route tables and security groups
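
For the S3 and DynamoDB traffic mentioned in point 3, the usual pattern is VPC gateway endpoints attached to the private route tables. A sketch follows; the route table reference is a placeholder for whichever route tables serve the private subnets.

# Gateway endpoints keep S3 and DynamoDB traffic on the AWS network (sketch)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.databricks_vpc.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id] # placeholder route table
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.databricks_vpc.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id] # placeholder route table
}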

Security and Access Control

Identity Management

Databricks provides multiple options for identity management:

  1. Native Workspace Users

    • Basic authentication
    • Suitable for POCs and small teams
  2. SSO Integration

    • SAML 2.0 support
    • Integrates with major providers (Okta, Azure AD)
    • Recommended for enterprises

Unity Catalog Implementation

Unity Catalog provides fine-grained access control:

-- Example Unity Catalog permissions
GRANT CREATE SCHEMA ON CATALOG analytics TO data_engineers;
GRANT USE SCHEMA ON SCHEMA analytics.sales TO analysts;
GRANT SELECT ON TABLE analytics.sales.transactions TO reporting_role;
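
If you manage infrastructure with Terraform (as in the VPC example earlier), the same grants can be kept in code with the provider's databricks_grants resource. This is a sketch; the privilege strings and attribute shapes should be double-checked against the current provider documentation.

# Unity Catalog grants managed as code (sketch)
resource "databricks_grants" "analytics_catalog" {
  catalog = "analytics"
  grant {
    principal  = "data_engineers"
    privileges = ["USE_CATALOG", "CREATE_SCHEMA"]
  }
}

resource "databricks_grants" "sales_schema" {
  schema = "analytics.sales"
  grant {
    principal  = "analysts"
    privileges = ["USE_SCHEMA"]
  }
}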

Network Security

  1. VPC Security Groups

    resource "aws_security_group" "workspace_sg" {
      name_prefix = "databricks-workspace"
      vpc_id      = aws_vpc.databricks_vpc.id
    
      ingress {
        from_port = 443
        to_port   = 443
        protocol  = "tcp"
        self      = true
      }
    }
    
  2. AWS PrivateLink Configuration

    • Ensures secure communication
    • Avoids public internet exposure
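
At the Terraform level, PrivateLink connectivity to the control plane is a VPC interface endpoint in the workspace subnets. The endpoint service name is region-specific and published by Databricks, so it is left as a variable here; treat this as a sketch rather than a complete PrivateLink setup, which also involves registering the endpoint with your Databricks account.

# Interface endpoint toward the Databricks control plane (sketch)
resource "aws_vpc_endpoint" "databricks_workspace" {
  vpc_id              = aws_vpc.databricks_vpc.id
  service_name        = var.databricks_workspace_service_name # placeholder: region-specific value from Databricks docs
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.workspace_sg.id]
  private_dns_enabled = true
}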

Storage Integration

Root Storage (also known as DBFS, the Databricks File System)

  • Root bucket for workspace assets such as job metadata, notebook results, and so on
  • I don’t recommend using it for any user data or workloads.
  • Set it up once and forget that it exists.
  • Enable S3 Intelligent-Tiering on this bucket (see the sketch after this list).
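
One way to enable Intelligent-Tiering on the root bucket is an S3 lifecycle rule that transitions objects to the INTELLIGENT_TIERING storage class. A sketch follows; the bucket reference is a placeholder for your workspace root bucket.

# Transition root-bucket objects to Intelligent-Tiering (sketch)
resource "aws_s3_bucket_lifecycle_configuration" "root_bucket_tiering" {
  bucket = var.workspace_root_bucket # placeholder: the workspace root (DBFS) bucket name

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"

    transition {
      days          = 0 # transition as soon as possible
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}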

Unity Catalog Storage

Unity Catalog requires specific S3 buckets:

  1. Internal and Metadata Storage

    • Stores data and metadata for tables, volumes, and schemas
    • Managed by Databricks
      • As far as I know, enabling Intelligent-Tiering here is not recommended because it interferes with Delta data archival
  2. External Storage

    • Your data lake locations
    • Managed by your team
    • Create a dedicated IAM role for each group or use case to avoid unintended data access (a policy sketch follows this list).
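
A dedicated IAM role per group or use case mostly comes down to scoping the role's policy to that team's bucket and prefix. The sketch below shows only the S3 permissions policy; the role itself and its trust relationship (which Unity Catalog storage credentials require) should follow the Databricks documentation. The bucket and prefix names are placeholders.

# S3 permissions scoped to a single team's prefix (sketch)
resource "aws_iam_policy" "sales_team_data_access" {
  name = "databricks-sales-external-location"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::example-datalake/sales/*" # placeholder bucket/prefix
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::example-datalake"
        Condition = {
          StringLike = { "s3:prefix" = ["sales/*"] }
        }
      }
    ]
  })
}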

Best Practices

Workspace Organization

  1. Environment Separation

    • Separate workspaces for dev/staging/prod
    • Consistent naming conventions
    • Automated deployment pipelines
    • Create a “cookie cutter” Terraform module for workspace deployment
  2. Resource Management

    • Implement cluster policies
    • Use instance pools for faster startup
    • Configure auto-termination
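
Instance pools keep a set of warm instances around so clusters attach to them instead of waiting on EC2 provisioning. Here is a hedged sketch with the Databricks provider; the node type and pool sizes are illustrative.

# Warm instance pool for faster cluster startup (sketch)
resource "databricks_instance_pool" "warm_pool" {
  instance_pool_name                    = "demo-warm-pool"
  node_type_id                          = "m5.xlarge" # illustrative node type
  min_idle_instances                    = 2
  max_capacity                          = 20
  idle_instance_autotermination_minutes = 15
}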

Cost Optimization

  1. Cluster Configuration

    • Utilize auto-scaling
    • Set an auto-termination time for idle clusters
    • Use fleets for better pricing and availability
    • Always use cluster policies
  2. Job Optimization

    • Use job clusters
    • Schedule during off-peak hours
    • Implement proper auto-scaling
    • Always set a max retry limit and failure notifications (see the job sketch after this list).
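
A job definition ties several of these points together: a job cluster that exists only for the run, an off-peak schedule, retries, and failure notifications. The sketch below uses the Databricks provider; the notebook path, email address, runtime, and cron expression are placeholders.

# Scheduled job on an ephemeral job cluster, with retries and alerts (sketch)
resource "databricks_job" "nightly_etl" {
  name = "nightly-etl"

  task {
    task_key    = "etl"
    max_retries = 2 # retry a failed run a limited number of times

    new_cluster {
      spark_version = "15.4.x-scala2.12" # illustrative runtime
      node_type_id  = "m5.xlarge"
      num_workers   = 2
    }

    notebook_task {
      notebook_path = "/Jobs/nightly_etl" # placeholder notebook
    }
  }

  schedule {
    quartz_cron_expression = "0 0 2 * * ?" # 02:00 daily, off-peak
    timezone_id            = "UTC"
  }

  email_notifications {
    on_failure = ["data-alerts@example.com"] # placeholder address
  }
}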

Common Challenges and Solutions

Challenge 1: Network Connectivity

Problem: Slow cluster startup times due to network issues
Solution:

  • Use AWS PrivateLink
  • Implement proper subnet sizing
  • If you are using a customer-managed VPC, configure efficient route tables

Challenge 2: Permission Management

Problem: Complex permission requirements across teams
Solution:

  • Implement Unity Catalog
  • Use group-based access control
  • Regular access audits

Challenge 3: Cost Management

Problem: Unexpected cost spikes
Solution:

  • Set up budget alerts
  • Use cluster policies to cap DBU consumption via the dbus_per_hour policy attribute (a policy sketch follows this list).
  • You can also set a maximum number of compute resources per user to limit how many clusters each user can create.
  • Use spot instances where appropriate
  • Define a cost sharing or “chargeback” model for users
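
Here is a hedged sketch of a cluster policy that caps DBUs and enforces auto-termination, using the provider's databricks_cluster_policy resource. The limits are arbitrary examples, and the policy attribute names should be checked against the cluster policy reference.

# Cluster policy capping cost-related settings (sketch)
resource "databricks_cluster_policy" "cost_guardrails" {
  name = "cost-guardrails"
  definition = jsonencode({
    "dbus_per_hour" = {
      "type"     = "range"
      "maxValue" = 10 # illustrative DBU/hour ceiling
    }
    "autotermination_minutes" = {
      "type"         = "range"
      "maxValue"     = 60
      "defaultValue" = 30
    }
  })
}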

Conclusion

Databricks on AWS provides a powerful platform for data analytics and machine learning, but proper architecture and configuration are crucial for success. The key is to understand how Databricks components interact with AWS services, implement security best practices, and optimize for performance and cost.

Have questions about Databricks on AWS? Feel free to reach out to me on LinkedIn.