As a Senior Software Engineer who has led the deployment of Databricks across multiple workspaces at Nike, I’ve gained deep insights into how Databricks operates on AWS. In this post, I’ll break down the architecture, share key learnings, and provide best practices for enterprise deployments.
Note: This post focuses on “Classic” Databricks, which deploys into your AWS account. Databricks has also launched Serverless offerings, which we’ll cover in a follow-up post.
Table of Contents
- Core Components
- Network Architecture
- Security and Access Control
- Storage Integration
- Best Practices
- Common Challenges and Solutions
Core Components
Control Plane vs Data Plane
Databricks on AWS follows a distinctive architecture that separates the control plane from the data plane:
Control Plane:
- Managed by Databricks in their AWS account
- Handles workspace management, cluster provisioning, and job scheduling
- Operates in specific AWS regions (e.g., us-west-2, us-east-1)
- Communicates with the workspace VPC through AWS PrivateLink (if configured); otherwise traffic flows over the AWS inter-account network backbone
Data Plane:
- Runs in your AWS account
- Contains actual compute resources (EC2 instances)
- Processes data within your VPC
- Accesses data from your S3 buckets
Workspace Components
Web Application
- Provides notebook interface
- Manages users and permissions
- Handles job scheduling and monitoring
Cluster Manager
- Provisions and manages EC2 instances
- Handles auto-scaling
- Manages cluster lifecycle
Metastore Service
- Manages table metadata
- Integrates with Unity Catalog
- Handles schema evolution
Network Architecture
VPC Configuration
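A minimal sketch of a customer-managed (data plane) VPC in Terraform; the CIDRs, names, and availability zones are illustrative assumptions, not a definitive layout:

```hcl
# Customer-managed VPC for the Databricks data plane.
# Databricks requires DNS support and DNS hostnames to be enabled.
resource "aws_vpc" "databricks_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "databricks-workspace-vpc" }
}

# Two private subnets in different AZs for cluster nodes.
resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.databricks_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-west-2a"
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.databricks_vpc.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-west-2b"
}
```

Outbound traffic from these subnets goes through a NAT Gateway in a public subnet (not shown), which ties into the best practices below.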
Networking Best Practices
- Use at least two private subnets in different AZs (three for enterprise deployments)
- Implement NAT Gateway for outbound traffic
- Configure AWS PrivateLink for secure control plane communication, and add VPC endpoints for S3 and DynamoDB so data traffic stays on the AWS network (see the endpoint sketch after this list)
- Set up proper route tables and security groups
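As an example, a gateway endpoint keeps S3 traffic off the public internet; a rough sketch where the route table reference is an assumption:

```hcl
# Gateway VPC endpoint so cluster traffic to S3 stays on the AWS network.
# The route table reference is an illustrative assumption.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.databricks_vpc.id
  service_name      = "com.amazonaws.us-west-2.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```

A similar gateway endpoint can be created for DynamoDB.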
Security and Access Control
Identity Management
Databricks provides multiple options for identity management:
Native Workspace Users
- Basic authentication
- Suitable for POCs and small teams
SSO Integration
- SAML 2.0 support
- Integrates with major providers (Okta, Azure AD)
- Recommended for enterprises
Unity Catalog Implementation
Unity Catalog provides fine-grained access control:
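For example, group-based grants can be expressed with the Databricks Terraform provider; a rough sketch where the catalog, schema, and group names are assumptions:

```hcl
# Grant a group read access to a schema; names are illustrative assumptions.
resource "databricks_grants" "sales_read" {
  schema = "main.sales"

  grant {
    principal  = "data-analysts"
    privileges = ["USE_SCHEMA", "SELECT"]
  }
}
```

The same permissions can also be issued as SQL GRANT statements from a notebook or the SQL editor.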
Network Security
VPC Security Groups
resource "aws_security_group" "workspace_sg" { name_prefix = "databricks-workspace" vpc_id = aws_vpc.databricks_vpc.id ingress { from_port = 443 to_port = 443 protocol = "tcp" self = true } }
AWS PrivateLink Configuration
- Ensures secure communication
- Avoids public internet exposure (see the endpoint sketch below)
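A rough sketch of one of the interface endpoints; the regional Databricks VPC endpoint service name varies, so it is left as a placeholder variable here:

```hcl
# Interface endpoint toward the Databricks control plane (REST API / relay).
# var.databricks_service_name is a placeholder; use the regional endpoint
# service names that Databricks publishes for PrivateLink.
resource "aws_vpc_endpoint" "databricks_backend" {
  vpc_id             = aws_vpc.databricks_vpc.id
  service_name       = var.databricks_service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  security_group_ids = [aws_security_group.workspace_sg.id]
}
```

The endpoint also has to be registered with your Databricks account (the Terraform provider exposes this as an account-level VPC endpoint resource) before the workspace can use it.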
Storage
Root Storage (also called DBFS - Databricks File System)
- Root bucket for workspace assets like Job metadata, notebook results…
- I don’t recommend using this bucket for any user data or workloads.
- Just set it and forget that it exists.
- Enable S3 Intelligent-Tiering on this bucket (see the sketch after this list).
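A rough sketch of enabling Intelligent-Tiering via a lifecycle rule; the bucket reference is an illustrative assumption:

```hcl
# Transition objects in the workspace root (DBFS) bucket to S3
# Intelligent-Tiering; the bucket reference is an assumption.
resource "aws_s3_bucket_lifecycle_configuration" "workspace_root" {
  bucket = aws_s3_bucket.workspace_root.id

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"
    filter {}

    transition {
      days          = 0
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}
```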
Unity Catalog Storage
Unity Catalog requires specific S3 buckets:
Internal and Metadata Storage
- Stores data and metadata for managed tables, volumes, and schemas
- Managed by Databricks
- As far as I know, enabling intelligent-tiering here is not recommended, as it can interfere with Delta Data Archival
External Storage
- Your data lake locations
- Managed by your team
- Create a dedicated IAM role for each group or use case to avoid unintended data access (see the sketch after this list)
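A rough sketch of wiring a per-use-case role into Unity Catalog with the Databricks Terraform provider; the role, bucket, and names are assumptions:

```hcl
# One storage credential + external location per use case / group.
# Role ARN, bucket path, and names are illustrative assumptions.
resource "databricks_storage_credential" "marketing" {
  name = "marketing-data-credential"

  aws_iam_role {
    role_arn = aws_iam_role.marketing_data_access.arn
  }
}

resource "databricks_external_location" "marketing_raw" {
  name            = "marketing-raw"
  url             = "s3://my-marketing-datalake/raw"
  credential_name = databricks_storage_credential.marketing.id
}
```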
Best Practices
Workspace Organization
Environment Separation
- Separate workspaces for dev/staging/prod
- Consistent naming conventions
- Automated deployment pipelines
- Create a “cookie cutter” Terraform module for workspace deployment (sketched below)
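For example, a hypothetical internal module invocation, so every environment is stamped out the same way (the module path and variables are assumptions about your own module’s interface):

```hcl
# Hypothetical "cookie cutter" module call; the path and variables are
# assumptions about your internal module.
module "databricks_workspace_dev" {
  source = "./modules/databricks-workspace"

  environment = "dev"
  aws_region  = "us-west-2"
  vpc_cidr    = "10.10.0.0/16"
  admin_group = "data-platform-admins"
}
```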
Resource Management
- Implement cluster policies
- Use instance pools for faster startup (see the pool sketch after this list)
- Configure auto-termination
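A rough sketch of an instance pool with idle auto-termination; the node type and capacities are assumptions:

```hcl
# Warm pool for faster cluster startup; idle instances terminate on their own.
# Node type and capacities are illustrative assumptions.
resource "databricks_instance_pool" "general_purpose" {
  instance_pool_name                    = "general-purpose-pool"
  node_type_id                          = "i3.xlarge"
  min_idle_instances                    = 0
  max_capacity                          = 20
  idle_instance_autotermination_minutes = 15
}
```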
Cost Optimization
Cluster Configuration
- Utilize auto scaling
- Set an idle auto-termination time
- Use fleets for better pricing and availability
- Always use cluster policies
Job Optimization
- Use job clusters
- Schedule during off-peak hours
- Implement proper auto-scaling
- Always set a max retry limit and failure notifications (see the job sketch after this list).
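A rough sketch of a job with a bounded retry count and failure emails; the notebook path, cluster spec, and address are assumptions:

```hcl
# Job with a retry limit and failure notifications.
# Notebook path, cluster spec, and email address are illustrative assumptions.
resource "databricks_job" "nightly_etl" {
  name = "nightly-etl"

  task {
    task_key    = "run-etl"
    max_retries = 2

    notebook_task {
      notebook_path = "/Repos/data-platform/etl/nightly"
    }

    new_cluster {
      spark_version = "14.3.x-scala2.12"
      node_type_id  = "i3.xlarge"
      num_workers   = 4
    }
  }

  email_notifications {
    on_failure = ["data-platform-oncall@example.com"]
  }
}
```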
Common Challenges and Solutions
Challenge 1: Network Connectivity
Problem: Slow cluster startup times due to network issues
Solution:
- Use AWS PrivateLink
- Implement proper subnet sizing
- If you are using a customer-managed VPC, configure efficient route tables
Challenge 2: Permission Management
Problem: Complex permission requirements across teams
Solution:
- Implement Unity Catalog
- Use group-based access control
- Regular access audits
Challenge 3: Cost Management
Problem: Unexpected cost spikes
Solution:
- Set up budget alerts
- Use cluster policies to control DBUs via the dbus_per_hour attribute (see the policy sketch after this list)
- Additionally, you can set Max compute resources per user to limit clusters per user
- Use spot instances where appropriate
- Define a cost sharing or “chargeback” model for users
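A rough sketch of such a policy; the limit values are assumptions to adjust per team:

```hcl
# Cluster policy with cost guardrails; limit values are illustrative assumptions.
resource "databricks_cluster_policy" "cost_guardrails" {
  name = "cost-guardrails"

  definition = jsonencode({
    dbus_per_hour = {
      type     = "range"
      maxValue = 10
    }
    autotermination_minutes = {
      type   = "fixed"
      value  = 60
      hidden = true
    }
  })

  # Corresponds to "Max compute resources per user" in the UI.
  max_clusters_per_user = 3
}
```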
Conclusion
Databricks on AWS provides a powerful platform for data analytics and machine learning, but proper architecture and configuration are crucial for success. The key is to understand how Databricks components interact with AWS services, implement security best practices, and optimize for performance and cost.
Additional Resources
Have questions about Databricks on AWS? Feel free to reach out to me on LinkedIn.