Vector Search Building Block¶
The Vector Search building block provides a modular framework for building GenAI pipelines that combine document parsing and extraction with vector databases for semantic search capabilities.
Overview¶
This building block offers an ingestion API that simplifies the process of chunking, embedding, and storing documents in vector databases. It's designed to save significant development and testing time by providing ready-to-use pipelines with extensible customization options.
IBM Products Used¶
This building block leverages the following IBM products and services:
- watsonx.ai: Foundation models and embedding services for document vectorization
- watsonx.data: Data lakehouse platform with integrated vector database support
- IBM Cloud Object Storage (COS): Scalable object storage for document repositories
- Milvus: Open-source vector database for semantic search (integrated with watsonx.data)
Features¶
- Ingestion Pipeline: Chunking, merging, and ingestion into vector databases
- Embedding Options: Dense, hybrid, or dual embeddings with selectable models
- Document Processing: Docling-based parsing with support for HTML, JSON, PDF, Markdown
- Flexible Chunking: Multiple chunking strategies (Docling hybrid, Markdown text splitter, recursive)
- REST API: Easy-to-use API with authentication
Supported Vector Databases¶
The building block provides integrations with multiple vector database platforms, each optimized for different use cases and deployment scenarios.
Available Integrations
- Milvus: High-performance vector database optimized for billion-scale vector search ✅ Available Now
- OpenSearch: Enterprise search with hybrid vector and keyword search capabilities 🔄 Planned
- DataStax Astra DB: Cloud-native vector database with global distribution 🔄 Planned
Key Capabilities¶
Document Loaders¶
- HTML documents
- JSON files
- PDF documents
- Markdown files
- Custom loaders
Embedding Models¶
- Dense embeddings: Traditional vector representations
- Hybrid embeddings: Combination of dense and sparse vectors
- Dual embeddings: Separate embeddings for different purposes
- Support for HuggingFace, watsonx.ai, and IBM models
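To illustrate what a hybrid embedding record looks like, here is a minimal, self-contained sketch. The `dense_embedding`, `sparse_embedding`, and `hybrid_record` functions are illustrative stand-ins, not the building block's API; a real pipeline would call an embedding model (e.g. from watsonx.ai or HuggingFace) for the dense vector and a learned sparse model for the sparse one.

```python
from collections import Counter

def sparse_embedding(text: str) -> dict[str, float]:
    # Toy sparse vector: normalized term frequencies keyed by token.
    # Real pipelines use models such as BM25 or SPLADE instead.
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def dense_embedding(text: str, dim: int = 4) -> list[float]:
    # Stand-in for a model call: hashes tokens into a fixed-size,
    # L2-normalized vector.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def hybrid_record(text: str) -> dict:
    # A hybrid record stores both representations side by side, so the
    # vector database can score dense and sparse matches together.
    return {
        "text": text,
        "dense": dense_embedding(text),
        "sparse": sparse_embedding(text),
    }

record = hybrid_record("vector search with hybrid embeddings")
```

Dual embeddings follow the same pattern, except both fields hold dense vectors produced by different models for different purposes (e.g. one tuned for retrieval, one for reranking).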
Document Processing¶
- Docling/Markdown processing
- Picture annotation
- Table cleanup
- Custom processing pipelines
Chunking Strategies¶
- Docling hybrid chunker: Intelligent chunking based on document structure
- Markdown text splitter: Preserves markdown formatting
- Recursive text splitter: Hierarchical text splitting
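The recursive strategy can be sketched in a few lines of pure Python. This is an illustrative simplification, not the building block's implementation: it tries the coarsest separator first, recurses with finer ones for oversized parts, then greedily merges adjacent pieces back up to the chunk size.

```python
def recursive_split(text, chunk_size=80, separators=("\n\n", "\n", " ")):
    """Hierarchical splitting: coarse separators first, finer ones on
    recursion, with a greedy merge pass so chunks fill chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part.strip():
            pieces.append(part)
    # Greedily merge adjacent pieces while the result still fits.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = ("Intro paragraph.\n\nA much longer body paragraph that will not "
       "fit in one chunk and must be split on finer separators.")
chunks = recursive_split(doc, chunk_size=40)
```

The Docling hybrid chunker replaces the character-level separators above with document-structure boundaries (sections, tables, lists), which is why it tends to produce more coherent chunks for parsed PDFs and HTML.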
Deployment Options¶
The Vector Search API can be deployed:
- Locally: For development and testing
- IBM Code Engine: Serverless container platform
- Red Hat OpenShift: Enterprise Kubernetes platform
- Docker: Containerized deployment
Getting Started¶
Prerequisites¶
Requirements
- watsonx.data environment with Milvus vector database
- Python 3.13 installed locally
- git installed locally
- IBM COS credentials
- Vector database credentials
Installation¶
1. Clone the repository:

   ```shell
   git clone https://github.com/ibm-self-serve-assets/building-blocks.git
   cd building-blocks/data-for-ai/vector-search/
   ```

2. Create a Python virtual environment and install dependencies:

   ```shell
   python3 -m venv virtual-env
   source virtual-env/bin/activate
   pip3 install -r requirements.txt
   ```

3. Configure environment variables:

   ```shell
   cp env .env
   ```

4. Update `.env` with your credentials:
   - Vector DB credentials: host, port, username, password
   - IBM COS credentials: API key, endpoint, service instance ID
   - REST_API_KEY: a unique value used for API authentication
Starting the Application¶
Start the application locally:

```shell
python3 main.py
```

Or using Uvicorn:

```shell
uvicorn app.main:app --host 127.0.0.1 --port 4050 --reload
```

Access the Swagger UI at: http://127.0.0.1:4050/docs
API Usage¶
Ingestion Endpoint¶
Endpoint: POST /ingest-files
Request Body:

```json
{
  "bucket_name": "<cos-bucket>",
  "collection_name": "<collection-name>",
  "chunk_type": "DOCLING_DOCS"
}
```
Parameters:
- bucket_name: Name of the S3/COS bucket containing the documents
- collection_name: Target collection to create or upsert into
- chunk_type: Chunking strategy (DOCLING_DOCS, MARKDOWN, RECURSIVE)
Headers:

```
REST_API_KEY: <your-secret>
Content-Type: application/json
```
Example using Python:

```python
import json
import requests

url = "http://127.0.0.1:4050/ingest-files"

payload = json.dumps({
    "bucket_name": "<cos-bucket>",
    "collection_name": "<collection-name>",
    "chunk_type": "DOCLING_DOCS"
})
headers = {
    "REST_API_KEY": "<your-secret>",
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```
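Since `chunk_type` accepts only the three values listed above, it is worth validating the payload client-side before sending the request. A minimal sketch follows; the `ChunkType` enum and `build_payload` helper are illustrative, not part of the API.

```python
from enum import Enum

class ChunkType(str, Enum):
    # The three strategies accepted by the /ingest-files endpoint.
    DOCLING_DOCS = "DOCLING_DOCS"
    MARKDOWN = "MARKDOWN"
    RECURSIVE = "RECURSIVE"

def build_payload(bucket_name: str, collection_name: str,
                  chunk_type: str) -> dict:
    # ChunkType(...) raises ValueError for an unknown strategy,
    # failing fast before any network call is made.
    return {
        "bucket_name": bucket_name,
        "collection_name": collection_name,
        "chunk_type": ChunkType(chunk_type).value,
    }

payload = build_payload("my-bucket", "my-collection", "RECURSIVE")
```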
Use Cases¶
- Semantic Search: Find documents based on meaning, not just keywords
- RAG Pipelines: Retrieval-augmented generation for LLMs
- Knowledge Bases: Build searchable knowledge repositories
- Document Discovery: Find similar documents across large collections
- Question Answering: Retrieve relevant context for Q&A systems
Customization¶
The API supports extensive customization:
- Collection Schema: Configurable via JSON templates
- Embedding Models: Choose from multiple providers and models
- Document Processing: Custom processing pipelines
- Chunking Strategies: Adjust chunk size and overlap
- Metadata Extraction: Custom metadata fields
Coming Soon¶
Upcoming Features
- Support for .png and .jpg images via vision-language models (VLMs)
- Additional docling processing functions (image annotation, table exports)
- Enhanced error logging with structured logs
- Performance optimization for large-scale ingestion
- Additional vector database integrations
Performance Considerations¶
Optimization Guidelines
- Batch Processing: Process multiple documents in parallel
- Chunk Size: Balance between context and retrieval precision
- Embedding Dimensions: Higher-dimensional vectors generally improve accuracy but increase storage and query latency
- Index Configuration: Optimize for your query patterns
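Because ingestion is dominated by I/O (object storage reads, embedding calls, vector upserts), per-document work overlaps well in a thread pool. The sketch below illustrates the batch-processing guideline; `ingest_document` is a placeholder for one document's parse, chunk, embed, and upsert round trip, not a function of this building block.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def ingest_document(doc_id: str) -> dict:
    # Placeholder for one document's parse -> chunk -> embed -> upsert
    # round trip; a real worker would call the ingestion API here.
    return {"doc_id": doc_id, "status": "ingested"}

def ingest_batch(doc_ids, max_workers=4):
    # Fan out I/O-bound document work across a bounded thread pool and
    # collect results as each future completes.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest_document, d): d for d in doc_ids}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

results = ingest_batch([f"doc-{i}" for i in range(8)])
```

Keeping `max_workers` bounded also protects the embedding service and the vector database from being flooded with concurrent requests.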
Resources¶
Team¶
Created and Architected By: Anand Das, Anindya Neogi, Joseph Kim, Shivam Solanki
Support¶
For issues or questions, please refer to the GitHub repository or open an issue.