
Data Ingestion Building Block

Comprehensive data ingestion solutions for IBM watsonx.data covering unstructured and structured data sources.

Overview

The Data Ingestion building block provides a complete framework for ingesting various types of data into IBM watsonx.data. It supports multiple data formats and sources, enabling seamless integration of diverse data types into your data lakehouse.


IBM Products Used

This building block leverages the following IBM products and services:

  • IBM watsonx.data
  • IBM Cloud Object Storage (COS)
  • IBM UDI (Unstructured Data Ingestion)


Features

Unstructured Data Ingestion

  • Document processing (PDF, DOCX, TXT, HTML)
  • Image and media file handling
  • Email and messaging data extraction
  • Web scraping and crawling capabilities

Structured Data Ingestion

  • RDBMS connectors (DB2, PostgreSQL, MySQL, Oracle)
  • Data warehouse integration
  • CDC (Change Data Capture) pipelines
  • Batch and streaming ingestion modes

Components

IBM UDI (Unstructured Data Ingestion)

IBM UDI provides specialized capabilities for ingesting unstructured data from various sources:

  • Document Ingestion: Process documents in multiple formats
  • Media Processing: Handle images, videos, and audio files
  • Content Extraction: Extract text and metadata from unstructured sources
  • Format Conversion: Convert between different file formats

Repository Path: data-ingestion/assets/unstructured-data/
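To illustrate the content-extraction idea (not UDI's actual API), here is a minimal sketch using only Python's standard library: it pulls visible text and a title from an HTML document, one of the formats listed above. The `TextExtractor` class and `extract` function are hypothetical names for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and the <title> from an HTML document."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())

def extract(html: str) -> dict:
    """Return extracted metadata (title) and body text for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    return {"title": parser.title, "text": " ".join(parser.text_parts)}

doc = "<html><head><title>Q3 Report</title></head><body><p>Revenue grew 4%.</p></body></html>"
print(extract(doc))  # {'title': 'Q3 Report', 'text': 'Revenue grew 4%.'}
```

A production pipeline would hand the extracted text and metadata to the staging and indexing steps described in the patterns below; the sketch only shows the extraction step.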

Structured Data Ingestion

Connect to and ingest data from relational databases and data warehouses:

  • Database Connectors: Pre-built connectors for major RDBMS platforms
  • CDC Support: Real-time change data capture for incremental updates
  • Batch Processing: Efficient bulk data loading
  • Schema Mapping: Automatic schema detection and mapping

Repository Path: data-ingestion/assets/structured-data/
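Schema detection and mapping can be sketched with an in-memory SQLite database standing in for a source RDBMS. The `TYPE_MAP` table and target type names are illustrative assumptions, not the connector's real mapping rules:

```python
import sqlite3

def detect_schema(conn, table):
    """Return {column_name: declared_type} from SQLite's table metadata."""
    cur = conn.execute(f"PRAGMA table_info({table})")
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    return {row[1]: row[2] for row in cur.fetchall()}

# Hypothetical mapping from source types to lakehouse column types
TYPE_MAP = {"INTEGER": "bigint", "TEXT": "string", "REAL": "double"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
schema = detect_schema(conn, "orders")
mapped = {col: TYPE_MAP.get(t, "string") for col, t in schema.items()}
print(mapped)  # {'id': 'bigint', 'customer': 'string', 'total': 'double'}
```

The same detect-then-map step is what lets the connectors create target tables in watsonx.data without hand-written DDL; unknown source types fall back to a safe default here.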


Getting Started

Prerequisites

  1. IBM watsonx.data environment
  2. IBM Cloud Object Storage (COS) credentials
  3. Source system credentials (database, API keys, etc.)
  4. Python 3.12+ installed locally
  5. git installed locally

Installation

  1. Clone the repository:

    git clone https://github.com/ibm-self-serve-assets/building-blocks.git
    cd building-blocks/data-for-ai/data-ingestion/
    

  2. Choose your ingestion type and navigate to the appropriate directory:

     • For unstructured data: cd assets/unstructured-data/
     • For structured data: cd assets/structured-data/

  3. Follow the README instructions in the chosen directory for setup and configuration.


Use Cases

  • Data Lake Population: Ingest diverse data sources into watsonx.data
  • Real-time Data Pipelines: Stream data from operational systems
  • Document Processing: Extract and index document content
  • Database Migration: Move data from legacy systems to watsonx.data
  • API Data Integration: Pull data from external APIs and services
  • Log Analytics: Ingest and analyze application and system logs

Architecture Patterns

Batch Ingestion Pattern

Source System → Staging (COS) → Transformation → watsonx.data
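The batch pattern can be sketched end to end with in-memory stand-ins. The function names (`extract_batch`, `stage_to_cos`, `load_to_watsonx_data`) and the dict/list stand-ins are hypothetical, not the building block's actual API:

```python
def extract_batch(source):
    """Stand-in for reading one batch of records from the source system."""
    return list(source)

def stage_to_cos(records, bucket):
    """Stand-in for writing raw records to a COS staging bucket; returns the key."""
    bucket["staging/batch-0001.json"] = records
    return "staging/batch-0001.json"

def transform(records):
    """Example transformation: normalize keys to lowercase, drop empty rows."""
    return [{k.lower(): v for k, v in r.items()} for r in records if r]

def load_to_watsonx_data(records, table):
    """Stand-in for appending transformed rows to a watsonx.data table."""
    table.extend(records)

source = [{"ID": 1, "Name": "alpha"}, {}, {"ID": 2, "Name": "beta"}]
cos_bucket, lakehouse_table = {}, []
key = stage_to_cos(extract_batch(source), cos_bucket)
load_to_watsonx_data(transform(cos_bucket[key]), lakehouse_table)
print(lakehouse_table)  # [{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]
```

Staging to COS before transformation means a failed transform can be retried from the staged copy without re-reading the source system.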

Streaming Ingestion Pattern

Source System → CDC → Real-time Processing → watsonx.data
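The streaming pattern's core step, applying an ordered stream of change events to a target table, can be sketched as follows. The event shape (`op`, `row`, keyed by `id`) is an assumption for illustration; real CDC tools emit their own event formats:

```python
def apply_cdc_event(table, event):
    """Apply a single change event (insert/update/delete, keyed by 'id')."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        table[row["id"]] = row
    elif op == "delete":
        table.pop(row["id"], None)

events = [
    {"op": "insert", "row": {"id": 1, "status": "new"}},
    {"op": "update", "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "row": {"id": 2, "status": "new"}},
    {"op": "delete", "row": {"id": 2}},
]
table = {}
for e in events:
    apply_cdc_event(table, e)
print(table)  # {1: {'id': 1, 'status': 'shipped'}}
```

Because each event carries the operation type, the target converges to the source's current state without a full reload, which is what makes CDC suitable for incremental updates.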

Hybrid Pattern

Source System → Batch/Stream Router → Processing → watsonx.data
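The router at the heart of the hybrid pattern can be sketched as a simple predicate over each record; the `realtime` flag is a hypothetical routing criterion (real deployments might route by table, size, or latency SLA):

```python
def route(record):
    """Send latency-sensitive records to the stream path, the rest to batch."""
    return "stream" if record.get("realtime") else "batch"

queues = {"batch": [], "stream": []}
for r in [{"id": 1, "realtime": True}, {"id": 2}, {"id": 3, "realtime": False}]:
    queues[route(r)].append(r)
print({k: len(v) for k, v in queues.items()})  # {'batch': 2, 'stream': 1}
```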

Best Practices

Ingestion Best Practices

  • Data Quality: Implement validation checks at ingestion time
  • Error Handling: Design robust retry and error recovery mechanisms
  • Performance: Use parallel processing for large-scale ingestion
  • Monitoring: Track ingestion metrics and set up alerts
  • Security: Encrypt data in transit and at rest
  • Schema Evolution: Plan for schema changes in source systems
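The error-handling practice above usually takes the form of retries with exponential backoff. A minimal sketch, with `with_retries` and `flaky_ingest` as hypothetical names (the source connector's real retry policy may differ):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Retry fn with exponential backoff, re-raising after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...

calls = {"n": 0}
def flaky_ingest():
    """Simulated source read that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient source failure")
    return "ok"

print(with_retries(flaky_ingest))  # ok
```

Only transient failures benefit from retries; permanent errors (bad credentials, missing tables) should surface immediately, so production code would catch narrower exception types than `Exception`.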

Performance Considerations

  • Batch Size: Optimize batch sizes for your data volume
  • Parallelization: Use multiple workers for concurrent ingestion
  • Network Bandwidth: Consider network capacity for large data transfers
  • Resource Allocation: Allocate sufficient compute and memory resources
  • Incremental Loading: Use CDC for efficient incremental updates
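Batch sizing and parallelization combine naturally: split the record set into fixed-size batches and hand them to a worker pool. This sketch uses Python's standard `concurrent.futures`; `ingest_batch` is a stand-in for the real load call:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a record list into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def ingest_batch(batch):
    """Stand-in for loading one batch; returns the number of rows loaded."""
    return len(batch)

records = list(range(1000))
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = sum(pool.map(ingest_batch, chunked(records, 250)))
print(loaded)  # 1000
```

Tune `size` and `max_workers` against your data volume and the target's write throughput; too-small batches waste round trips, while too many workers can saturate the network or the source database.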

Support

For issues or questions, please refer to the GitHub repository or open an issue.