Zero-Copy Lakehouse Building Block¶
The Zero-Copy Lakehouse building block enables seamless querying across databases, warehouses, and cloud object stores without data duplication, reducing costs and latency.
What is Zero-Copy Lakehouse?¶
A Zero-Copy Lakehouse is a data architecture approach where multiple analytics, AI, and ML tools can access and process the same underlying data without duplicating or moving it across systems.
Instead of copying data between warehouses, lakes, and ML pipelines, a zero-copy approach enables shared access with governance and performance optimizations.
Why It Matters¶
Traditional setups involve ETL (Extract, Transform, Load) pipelines that duplicate data into multiple systems, leading to:
- Higher storage costs
- Governance risks
- Data inconsistencies
- Processing delays
A zero-copy lakehouse eliminates data silos by providing a single source of truth for BI, AI, ML, and analytics workloads.
Benefits¶
Key Advantages
- Cost Savings: No redundant storage costs
- Faster Insights: Avoids ETL delays
- Single Source of Truth: Reduces risk of inconsistent data
- Flexibility: Multiple engines/tools access the same data
- Governance: One layer controls access everywhere
IBM watsonx.data & Zero Copy¶
In the IBM watsonx.data lakehouse:
- Built on open table formats (Iceberg/Delta)
- Provides federated query capability (query S3, Db2, Cloud Object Storage, external warehouses, all in place)
- Ensures zero-copy data access with no need to ETL into another system
Architecture¶
```mermaid
flowchart LR
    subgraph IBM_Cloud["IBM watsonx.data"]
        Engine["Presto Engine"]
    end
    COS["IBM Cloud Object Storage (COS)<br/>cos_catalog"] --> Engine
    S3["Amazon S3<br/>s3_catalog"] --> Engine
    DB2["Db2 OLTP<br/>(db2_catalog)"] --> Engine
    Engine --> Queries["Federated SQL Queries"]
    Queries --> Users["Data Scientists / Analysts"]
```
Watsonx.data Setup Automation¶
The building block provides a Python script (watsonxdata_setup.py) that automates the setup of IBM watsonx.data resources using official APIs.
What Gets Configured¶
- IBM Cloud Object Storage (COS) bucket
- Amazon S3 bucket
- Db2 database connection (SaaS on IBM Cloud)
- Presto engine catalog associations
- Schemas for COS and S3
Prerequisites¶
Requirements
- Access to:
    - An IBM Cloud account and an AWS account
    - A watsonx.data SaaS instance on IBM Cloud
    - A Db2 database SaaS instance on IBM Cloud
- Python 3.8+ installed locally
- Install dependencies:

    ```shell
    pip install requests
    ```

- Set your IBM Cloud API key:

    ```shell
    export IBM_API_KEY="your-ibm-cloud-api-key"
    ```
Getting Started¶
Step 1: Clone the Repository¶
```shell
git clone https://github.com/ibm-self-serve-assets/building-blocks.git
cd building-blocks/data-for-ai/zero-copy-lakehouse/assets/setup-lakehouse
```
Step 2: Configure Settings¶
Edit `config.json` with your environment details:
| Configuration | Description | Example |
|---|---|---|
| `region` | IBM Cloud region | `us-south` |
| `auth_instance_id` | watsonx.data deployment ID | `crn:v1:bluemix:public:lakehouse:...` |
| **COS Configuration** | | |
| `bucket_display_name` | Display name for COS bucket | `cos_bucket` |
| `bucket_details.bucket_name` | COS bucket name | `watsonxdata-demo` |
| `bucket_details.endpoint` | COS endpoint | `https://s3.us-south.cloud-object-storage.appdomain.cloud` |
| `associated_catalog.catalog_name` | Catalog name | `cos_catalog` |
| **S3 Configuration** | | |
| `bucket_display_name` | Display name for S3 bucket | `amazon_S3` |
| `bucket_details.bucket_name` | AWS S3 bucket name | `watsonxdata` |
| `bucket_details.endpoint` | S3 endpoint | `https://s3.us-east-2.amazonaws.com` |
| `associated_catalog.catalog_name` | Catalog name | `s3_catalog` |
| **Db2 Configuration** | | |
| `database_name` | Database name | `bludb` |
| `hostname` | Db2 hostname | `87612426-7efe-...db2.ibmappdomain.cloud` |
| `port` | Db2 port | `31687` |
| `catalog_name` | Catalog name | `db2_catalog` |
Step 3: Run Setup Automation¶
```shell
python watsonxdata_setup.py
```
This will:
- Authenticate with IBM Cloud IAM
- Create a project
- Create a catalog
- Register storage buckets
- Configure database connections
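The first of these steps, authenticating with IBM Cloud IAM, exchanges your API key for a short-lived bearer token via IBM Cloud's standard IAM token endpoint. A minimal sketch using the `requests` library (the helper names are ours, not the script's):

```python
import os
import requests

# IBM Cloud's standard IAM token endpoint.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_token_request(api_key: str) -> dict:
    """Form payload for the IAM API-key grant."""
    return {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }


def get_iam_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for a bearer token."""
    resp = requests.post(
        IAM_TOKEN_URL,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        data=build_token_request(api_key),
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


if __name__ == "__main__" and os.environ.get("IBM_API_KEY"):
    token = get_iam_token(os.environ["IBM_API_KEY"])
    print("IAM token acquired, length:", len(token))
```

The returned token is then sent as an `Authorization: Bearer ...` header on subsequent watsonx.data API calls.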
Demo: Zero-Copy Lakehouse in Action¶
Load Data into Storage Sources¶
Load Data into Amazon S3¶
```shell
# Upload account.csv into a folder named 'account' in S3
aws s3 cp account.csv s3://<your-s3-bucket-name>/account/
```
Load Data into IBM COS¶
```shell
# Create a bucket and folder 'customer' in COS, then upload
ibmcloud cos upload --bucket <your-cos-bucket-name> --key customer/customer.csv --file customer.csv
```
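If you would rather script the COS upload than use the CLI, the `ibm-cos-sdk` package (`ibm_boto3`) offers a boto3-compatible client. This is a sketch under stated assumptions: the API key, service instance CRN, and endpoint are placeholders you would take from your COS service credentials.

```python
def object_key_for(folder: str, filename: str) -> str:
    """Build the object key so the file lands under the expected prefix."""
    return f"{folder}/{filename}"


def upload_to_cos(bucket: str, local_file: str, folder: str) -> None:
    # Imported lazily so the pure helper above works without the SDK installed.
    import ibm_boto3
    from ibm_botocore.client import Config

    cos = ibm_boto3.client(
        "s3",
        ibm_api_key_id="<your-ibm-api-key>",                # placeholder
        ibm_service_instance_id="<your-cos-instance-crn>",  # placeholder
        config=Config(signature_version="oauth"),
        endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
    )
    cos.upload_file(local_file, bucket, object_key_for(folder, local_file))
```

For example, `upload_to_cos("<your-cos-bucket-name>", "customer.csv", "customer")` mirrors the CLI command above.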
Load Data into Db2¶
```sql
CREATE TABLE customer_info.customer_summary (
    customer_id VARCHAR(50),
    total_spend DECIMAL(10,2),
    last_purchase_date DATE
);

IMPORT FROM 'customer_summary.csv' OF DEL
    INSERT INTO customer_info.customer_summary;
```
Define External Tables in watsonx.data¶
Log in to the watsonx.data console, open the Query Workspace, and run:
Create COS Table¶
```sql
CREATE TABLE "cos_catalog"."customer"."customer" (
    customer_id VARCHAR,
    customer_name VARCHAR,
    region VARCHAR
)
WITH (
    format = 'CSV',
    external_location = 's3a://watsonxdata-demo/customer/'
);
```
Create S3 Table¶
```sql
CREATE TABLE "s3_catalog"."account"."account" (
    account_id VARCHAR,
    balance VARCHAR,
    customer_id VARCHAR
)
WITH (
    format = 'CSV',
    external_location = 's3a://watsonxdata/account/'
);
```
Query Across All Three Data Sources¶
```sql
SELECT *
FROM "s3_catalog"."account"."account" a
JOIN "cos_catalog"."customer"."customer" c ON a.customer_id = c.customer_id
JOIN "db2_catalog"."customer_info"."customer_summary" cs ON cs.customer_id = a.customer_id;
```
Zero-Copy in Action
This query demonstrates accessing S3, COS, and Db2 data directly without duplication, enabling Zero-Copy Lakehouse insights.
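The same federated join can also be run programmatically. A sketch using the `presto-python-client` package (`prestodb`); the host, port, and user are placeholders for your watsonx.data Presto endpoint, and a real instance will additionally require authentication credentials:

```python
def federated_join_sql() -> str:
    """The cross-source join from the demo, as a single SQL string."""
    return """
    SELECT *
    FROM "s3_catalog"."account"."account" a
    JOIN "cos_catalog"."customer"."customer" c
      ON a.customer_id = c.customer_id
    JOIN "db2_catalog"."customer_info"."customer_summary" cs
      ON cs.customer_id = a.customer_id
    """


def run_federated_query() -> list:
    # Imported lazily so the SQL helper above works without the client installed.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="<your-presto-host>",  # placeholder
        port=443,
        user="<your-user>",         # placeholder
        http_scheme="https",
        catalog="s3_catalog",
        schema="account",
    )
    cur = conn.cursor()
    cur.execute(federated_join_sql())
    return cur.fetchall()
```

Each row returned combines columns from all three sources, with no copy of the S3, COS, or Db2 data ever materialized outside its home system.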
Use Cases¶
- Multi-Cloud Analytics: Query data across AWS, IBM Cloud, and on-premises databases
- Cost Optimization: Eliminate redundant data copies and storage costs
- Real-Time Insights: Access live data without ETL delays
- Data Governance: Centralized access control and compliance
- AI/ML Pipelines: Direct access to training data without movement
Best Practices¶
Implementation Guidelines
- Keep your API keys secure and never commit them to git
- Ensure your S3/COS buckets and Db2 tables exist before running the demo
- For larger workloads, consider optimizing with Iceberg/Delta formats
- Use appropriate access controls and encryption for sensitive data
- Monitor query performance and optimize catalog configurations
IBM Products Used¶
This building block leverages the following IBM products and services:
IBM watsonx.data¶
Open, hybrid, and governed data store built on an open lakehouse architecture.
- Purpose: Federated query engine (Presto) for zero-copy data access across multiple sources
- Documentation: IBM watsonx.data Documentation
- Getting Started: watsonx.data Getting Started
- Provisioning: Provisioning watsonx.data
IBM Cloud Object Storage (COS)¶
Scalable, secure object storage for unstructured data with S3-compatible API.
- Purpose: Data lake storage with Iceberg/Delta table formats
- Documentation: IBM COS Documentation
- Integration: COS with watsonx.data
- Getting Started: COS Getting Started
IBM Db2 Database¶
Enterprise-grade relational database for OLTP workloads.
- Purpose: Operational database for real-time transactional data
- Documentation: IBM Db2 Documentation
- SaaS Offering: Db2 on Cloud
- Provisioning: Provisioning Db2
Amazon S3¶
Cloud object storage service (external integration).
- Purpose: Multi-cloud data access without data movement
- Documentation: AWS S3 Documentation
- Integration: S3 with watsonx.data
Presto Query Engine¶
Distributed SQL query engine for big data analytics (included in watsonx.data).
- Purpose: Federated query execution across multiple data sources
- Documentation: Presto Documentation
- watsonx.data Integration: Presto in watsonx.data
Resources¶
- GitHub Repository
- IBM watsonx.data Documentation
- Adding Storage and Catalog Pair
- Adding Database and Catalog Pair
Support¶
For issues or questions, please refer to the GitHub repository or contact the IBM watsonx.data support team.