Vector Search Building Block¶
The Vector Search building block provides a modular framework for building GenAI pipelines that combine document parsing and extraction with vector databases for semantic search capabilities.
Overview¶
This building block offers an ingestion API that simplifies the process of chunking, embedding, and storing documents in vector databases. It's designed to save significant development and testing time by providing ready-to-use pipelines with extensible customization options.
IBM Products Used¶
This building block leverages the following IBM products and services:
- watsonx.ai: Foundation models and embedding services for document vectorization
- watsonx.data: Data lakehouse platform with integrated vector database support
- IBM Cloud Object Storage (COS): Scalable object storage for document repositories
- Milvus: Open-source vector database for semantic search (integrated with watsonx.data)
Features¶
- Ingestion Pipeline: Chunking, merging, and ingestion into vector databases
- Embedding Options: Dense, hybrid, or dual embeddings with selectable models
- Document Processing: Docling-based parsing with support for HTML, JSON, PDF, Markdown
- Flexible Chunking: Multiple chunking strategies (Docling hybrid, Markdown text splitter, recursive)
- REST API: Easy-to-use API with authentication
Supported Vector Databases¶
The building block provides integrations with multiple vector database platforms, each optimized for different use cases and deployment scenarios.
Available Integrations
- Milvus: High-performance vector database optimized for billion-scale vector search ✅ Available Now
- OpenSearch: Enterprise search with hybrid vector and keyword search capabilities 🔄 Planned
- DataStax Astra DB: Cloud-native vector database with global distribution 🔄 Planned
Key Capabilities¶
Document Loaders¶
- HTML documents
- JSON files
- PDF documents
- Markdown files
- Custom loaders
Embedding Models¶
- Dense embeddings: Traditional vector representations
- Hybrid embeddings: Combination of dense and sparse vectors
- Dual embeddings: Separate embeddings for different purposes
- Support for HuggingFace, watsonx.ai, and IBM models
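To illustrate what a hybrid embedding record looks like, here is a minimal, self-contained sketch. The `dense_embedding`, `sparse_embedding`, and `hybrid_record` functions are illustrative stand-ins, not the building block's API; a real pipeline would call an embedding model (e.g. from watsonx.ai or HuggingFace) for the dense vector and a learned sparse model for the sparse one.

```python
from collections import Counter

def sparse_embedding(text: str) -> dict[str, float]:
    # Toy sparse vector: normalized term frequencies keyed by token.
    # Real pipelines use models such as BM25 or SPLADE instead.
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def dense_embedding(text: str, dim: int = 4) -> list[float]:
    # Stand-in for a model call: hashes tokens into a fixed-size,
    # L2-normalized vector.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def hybrid_record(text: str) -> dict:
    # A hybrid record stores both representations side by side, so the
    # vector database can score dense and sparse matches together.
    return {
        "text": text,
        "dense": dense_embedding(text),
        "sparse": sparse_embedding(text),
    }

record = hybrid_record("vector search with hybrid embeddings")
```

Dual embeddings follow the same pattern, except both fields hold dense vectors produced by different models for different purposes (e.g. one tuned for retrieval, one for reranking).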
Document Processing¶
- Docling/Markdown processing
- Picture annotation
- Table cleanup
- Custom processing pipelines
Chunking Strategies¶
- Docling hybrid chunker: Intelligent chunking based on document structure
- Markdown text splitter: Preserves markdown formatting
- Recursive text splitter: Hierarchical text splitting
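The recursive strategy can be sketched in a few lines of pure Python. This is an illustrative simplification, not the building block's implementation: it tries the coarsest separator first, recurses with finer ones for oversized parts, then greedily merges adjacent pieces back up to the chunk size.

```python
def recursive_split(text, chunk_size=80, separators=("\n\n", "\n", " ")):
    """Hierarchical splitting: coarse separators first, finer ones on
    recursion, with a greedy merge pass so chunks fill chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part.strip():
            pieces.append(part)
    # Greedily merge adjacent pieces while the result still fits.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = ("Intro paragraph.\n\nA much longer body paragraph that will not "
       "fit in one chunk and must be split on finer separators.")
chunks = recursive_split(doc, chunk_size=40)
```

The Docling hybrid chunker replaces the character-level separators above with document-structure boundaries (sections, tables, lists), which is why it tends to produce more coherent chunks for parsed PDFs and HTML.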
Deployment Options¶
The Vector Search API can be deployed:
- Locally: For development and testing
- IBM Code Engine: Serverless container platform
- Red Hat OpenShift: Enterprise Kubernetes platform
- Docker: Containerized deployment
Getting Started¶
Prerequisites¶
Requirements
- watsonx.data environment with Milvus vector database
- Python 3.13 installed locally
- git installed locally
- IBM COS credentials
- Vector database credentials
Installation¶
1. Clone the repository:

   ```shell
   git clone https://github.com/ibm-self-serve-assets/building-blocks.git
   cd building-blocks/data-for-ai/vector-search/
   ```

2. Create a Python virtual environment and install dependencies:

   ```shell
   python3 -m venv virtual-env
   source virtual-env/bin/activate
   pip3 install -r requirements.txt
   ```

3. Configure environment variables:

   ```shell
   cp env .env
   ```

4. Update `.env` with your credentials:
   - Vector DB credentials: host, port, username, password
   - IBM COS credentials: API key, endpoint, service instance ID
   - REST_API_KEY: a unique value used for API authentication
Starting the Application¶
Start the application locally:

```shell
python3 main.py
```

Or using Uvicorn:

```shell
uvicorn app.main:app --host 127.0.0.1 --port 4050 --reload
```

Access the Swagger UI at: http://127.0.0.1:4050/docs
API Usage¶
Ingestion Endpoint¶
Endpoint: POST /ingest-files
Request Body:

```json
{
  "bucket_name": "<cos-bucket>",
  "collection_name": "<collection-name>",
  "chunk_type": "DOCLING_DOCS"
}
```
Parameters:
- bucket_name: Name of the S3/COS bucket containing the documents
- collection_name: Target collection to create or upsert into
- chunk_type: Chunking strategy (DOCLING_DOCS, MARKDOWN, RECURSIVE)
Headers:

```
REST_API_KEY: <your-secret>
Content-Type: application/json
```
Example using Python:

```python
import json
import requests

url = "http://127.0.0.1:4050/ingest-files"

payload = json.dumps({
    "bucket_name": "<cos-bucket>",
    "collection_name": "<collection-name>",
    "chunk_type": "DOCLING_DOCS"
})
headers = {
    "REST_API_KEY": "<your-secret>",
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```
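Since `chunk_type` accepts only the three values listed above, it is worth validating the payload client-side before sending the request. A minimal sketch follows; the `ChunkType` enum and `build_payload` helper are illustrative, not part of the API.

```python
from enum import Enum

class ChunkType(str, Enum):
    # The three strategies accepted by the /ingest-files endpoint.
    DOCLING_DOCS = "DOCLING_DOCS"
    MARKDOWN = "MARKDOWN"
    RECURSIVE = "RECURSIVE"

def build_payload(bucket_name: str, collection_name: str,
                  chunk_type: str) -> dict:
    # ChunkType(...) raises ValueError for an unknown strategy,
    # failing fast before any network call is made.
    return {
        "bucket_name": bucket_name,
        "collection_name": collection_name,
        "chunk_type": ChunkType(chunk_type).value,
    }

payload = build_payload("my-bucket", "my-collection", "RECURSIVE")
```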
Use Cases¶
- Semantic Search: Find documents based on meaning, not just keywords
- RAG Pipelines: Retrieval-augmented generation for LLMs
- Knowledge Bases: Build searchable knowledge repositories
- Document Discovery: Find similar documents across large collections
- Question Answering: Retrieve relevant context for Q&A systems
Customization¶
The API supports extensive customization:
- Collection Schema: Configurable via JSON templates
- Embedding Models: Choose from multiple providers and models
- Document Processing: Custom processing pipelines
- Chunking Strategies: Adjust chunk size and overlap
- Metadata Extraction: Custom metadata fields
Coming Soon¶
Upcoming Features
- Support for .png and .jpg images via vision-language models (VLMs)
- Additional docling processing functions (image annotation, table exports)
- Enhanced error logging with structured logs
- Performance optimization for large-scale ingestion
- Additional vector database integrations
Performance Considerations¶
Optimization Guidelines
- Batch Processing: Process multiple documents in parallel
- Chunk Size: Balance between context and retrieval precision
- Embedding Dimensions: Higher-dimensional vectors generally improve accuracy but increase storage and query latency
- Index Configuration: Optimize for your query patterns
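Because ingestion is dominated by I/O (object storage reads, embedding calls, vector upserts), per-document work overlaps well in a thread pool. The sketch below illustrates the batch-processing guideline; `ingest_document` is a placeholder for one document's parse, chunk, embed, and upsert round trip, not a function of this building block.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def ingest_document(doc_id: str) -> dict:
    # Placeholder for one document's parse -> chunk -> embed -> upsert
    # round trip; a real worker would call the ingestion API here.
    return {"doc_id": doc_id, "status": "ingested"}

def ingest_batch(doc_ids, max_workers=4):
    # Fan out I/O-bound document work across a bounded thread pool and
    # collect results as each future completes.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest_document, d): d for d in doc_ids}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

results = ingest_batch([f"doc-{i}" for i in range(8)])
```

Keeping `max_workers` bounded also protects the embedding service and the vector database from being flooded with concurrent requests.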
Resources¶
Team¶
Created and Architected By: Anand Das, Anindya Neogi, Joseph Kim, Shivam Solanki
Support¶
For issues or questions, please refer to the GitHub repository or open an issue.