OpenChatBI Documentation

OpenChatBI

OpenChatBI is an open source, chat-based intelligent BI tool powered by large language models, designed to help users query, analyze, and visualize data through natural language conversations. Built on LangGraph and LangChain ecosystem, it provides chat agents and workflows that support natural language to SQL conversion and streamlined data analysis.

Join the Slack channel to discuss: https://join.slack.com/t/openchatbicommunity/shared_invite/zt-3jpzpx9mv-Sk88RxpO4Up0L~YTZYf4GQ

Core Features

Natural Language Interaction: Get data analysis results by asking questions in natural language
Automatic SQL Generation: Convert natural language queries into SQL statements using advanced text2sql workflows with schema linking and well organized prompt engineering
Data Visualization: Generate intuitive data visualizations (via plotly)
Data Catalog Management: Automatically discovers and indexes database table structures, supports flexible catalog storage backends with vector-based or BM25-based retrieval, and easily maintains business explanations for tables and columns as well as optimizes Prompts.
Time Series Forecasting: Forecasting models deployed in-house that can be called as tools
Code Execution: Execute Python code for data analysis and visualization
Interactive Problem-Solving: Proactively ask users for more context when information is incomplete
Persistent Memory: Conversation management and user characteristic memory based on LangGraph checkpointing
MCP Support: Integration with MCP tools by configuration
Knowledge Base Integration: Answer complex questions by combining catalog based knowledge retrival and external knowledge base retrival (via MCP tools)
Web UI Interface: Provide 2 sample UI: simple and streaming web interfaces using Gradio and Streamlit, easy to integrate with other web applications

Roadmap

Anomaly Detection Algorithm: Time series anomaly detection
Root Cause Analysis Algorithm: Multi-dimensional drill-down capabilities for anomaly investigation

Getting started

Installation & Setup

Prerequisites

Python 3.11 or higher
Access to a supported LLM provider (OpenAI, Anthropic, etc.)
Data Warehouse (Database) credentials (like Presto, PostgreSQL, MySQL, etc.)
(Optional) Embedding model for vector-based retrieval - if not available, BM25-based retrieval will be used
(Optional) Docker - required only for docker executor mode

Note on Chinese Text Segmentation: For better Chinese text retrieval, jieba is used for word segmentation. However, jieba is not compatible with Python 3.12+. On Python 3.12 and higher, the system automatically falls back to simple punctuation-based segmentation for Chinese text.

Installation

Using uv (recommended):

git clone git@github.com:zhongyu09/openchatbi
uv sync

Using pip:

pip install openchatbi

For development:

git clone git@github.com:zhongyu09/openchatbi
uv sync --group dev

Optional: If you want to use pysqlite3 (newer SQLite builds), you can install it manually. If build fails, install SQLite first:

On macOS, try to install sqlite using Homebrew:

brew install sqlite
brew info sqlite
export LDFLAGS="-L/opt/homebrew/opt/sqlite/lib"
export CPPFLAGS="-I/opt/homebrew/opt/sqlite/include"

On Amazon Linux / RHEL / CentOS:

sudo yum install sqlite-devel

On Ubuntu / Debian:

sudo apt-get update
sudo apt-get install libsqlite3-dev

Run Demo

Run demo using example dataset from spider dataset. You need to provide “YOUR OPENAI API KEY” or change config to use other LLM providers.

Note: The demo example includes embedding model configuration. If you want to run without an embedding model, you can remove the embedding_model section in the config - BM25 retrieval will be used automatically.

cp example/config.yaml openchatbi/config.yaml
sed -i 's/YOUR_API_KEY_HERE/[YOUR OPENAI API KEY]/g' openchatbi/config.yaml
python run_streamlit_ui.py

Configuration

Create configuration file

Copy the configuration template:

cp openchatbi/config.yaml.template openchatbi/config.yaml

Or create an empty YAML file.

Configure your LLMs:

# Select which provider to use
default_llm: openai

# Define one or more providers
llm_providers:
  openai:
    default_llm:
      class: langchain_openai.ChatOpenAI
      params:
        api_key: YOUR_API_KEY_HERE
        model: gpt-4.1
        temperature: 0.02
        max_tokens: 8192

    # Optional: Embedding model for vector-based retrieval and memory tools
    # If not configured, BM25-based retrieval will be used, and the memory tools will not work
    embedding_model:
      class: langchain_openai.OpenAIEmbeddings
      params:
        api_key: YOUR_API_KEY_HERE
        model: text-embedding-3-large
        chunk_size: 1024

Configure your data warehouse:

organization: Your Company
dialect: presto
data_warehouse_config:
  uri: "presto://user@host:8080/catalog/schema"
  include_tables:
    - your_table_name
  database_name: "catalog.schema"

Running the Application

Invoking LangGraph:

export CONFIG_FILE=YOUR_CONFIG_FILE_PATH

from openchatbi import get_default_graph

graph = get_default_graph()
graph.invoke({"messages": [{"role": "user", "content": "Show me ctr trends for the past 7 days"}]},
    config={"configurable": {"thread_id": "1"}})

# System-generated SQL
SELECT date, SUM(clicks)/SUM(impression) AS ctr
FROM ad_performance
WHERE date >= CURRENT_DATE - 7 DAYS
GROUP BY date
ORDER BY date;

Sample Web UI:

Streamlit based UI:

streamlit run sample_ui streamlit_ui.py

Run Gradio based UI:

python sample_ui/streaming_ui.py

Configuration Instructions

The configuration template is provided at config.yaml.template. Key configuration sections include:

Basic Settings

organization: Organization name (e.g., “Your Company”)
dialect: Database dialect (e.g., “presto”)
bi_config_file: Path to BI configuration file (e.g., “example/bi.yaml”)

Catalog Store Configuration

catalog_store: Configuration for data catalog storage
- store_type: Storage type (e.g., “file_system”)
- data_path: Path to catalog data stored by file system (e.g., “./example”)

Data Warehouse Configuration

data_warehouse_config: Database connection settings
- uri: Connection string for your database
- include_tables: List of tables to include in catalog, leave empty to include all tables
- database_name: Database name for catalog
- token_service: Token service URL (for data warehouse that need token authentication like Presto)
- user_name / password: Token service credentials

LLM Configuration

Various LLMs are supported based on LangChain, see LangChain API Document(https://python.langchain.com/api_reference/reference.html#integrations) for full list that support chat_models. You can configure different LLMs for different tasks:

default_llm: Primary language model for general tasks
embedding_model: (Optional) Model for embedding generation. If not configured, BM25-based text retrieval will be used as fallback, and the memory tools will not work
text2sql_llm: (Optional) Specialized model for SQL generation. If not configured, uses default_llm

Multiple providers (optional):

Configure multiple providers under llm_providers and select with default_llm: <provider_name>.
In sample_ui/streamlit_ui.py, a provider dropdown appears when llm_providers is configured.
In sample_api/async_api.py, pass provider in the /chat/stream request body.

Commonly used LLM providers and their corresponding classes and installation commands:

Anthropic: langchain_anthropic.ChatAnthropic, pip install langchain-anthropic
OpenAI: langchain_openai.ChatOpenAI, pip install langchain-openai
Azure OpenAI: langchain_openai.AzureChatOpenAI, pip install langchain-openai
Google Vertex AI: langchain_google_vertexai.ChatVertexAI, pip install langchain-google-vertexai
Bedrock: langchain_aws.ChatBedrock, pip install langchain-aws
Huggingface: langchain_huggingface.ChatHuggingFace, pip install langchain-huggingface
Deepseek: langchain_deepseek.ChatDeepSeek, pip install langchain-deepseek
Ollama: langchain_ollama.ChatOllama, pip install langchain-ollama

Advanced Configuration

OpenChatBI supports sophisticated customization through prompt engineering and catalog management features:

Prompt Engineering Configuration: Customize system prompts, business glossaries, and data warehouse introductions
Data Catalog Management: Configure table metadata, column descriptions, and SQL generation rules
Business Rules: Define table selection criteria and domain-specific SQL constraints
Forecasting Service: Configure the forecasting service url and prompt based on your own deployment

For detailed configuration options and examples, see the Advanced Features section.

Architecture Overview

OpenChatBI is built using a modular architecture with clear separation of concerns:

LangGraph Workflows: Core orchestration using state machines for complex multi-step processes
Catalog Management: Flexible data catalog system with intelligent retrieval (vector-based or BM25 fallback)
Text2SQL Pipeline: Advanced natural language to SQL conversion with schema linking
Code Execution: Sandboxed Python execution environment for data analysis
Tool Integration: Extensible tool system for human interaction and knowledge search
Persistent Memory: SQLite-based conversation state management

Technology Stack

Frameworks: LangGraph, LangChain, FastAPI, Gradio/Streamlit
Large Language Models: Azure OpenAI (GPT-4), Anthropic Claude, OpenAI GPT models
Text Retrieval: Vector-based (with embedding models) or BM25-based (fallback without embeddings)
Databases: Presto, Trino, MySQL with SQLAlchemy support
Code Execution: Local Python, RestrictedPython, Docker containerization
Development: Python 3.11+, with modern tooling (Black, Ruff, MyPy, Pytest)
Storage: SQLite for conversation checkpointing, file system catalog storage

Agent Graph

Text2SQL Graph

Project Structure

openchatbi/
├── README.md                    # Project documentation
├── pyproject.toml               # Modern Python project configuration
├── Dockerfile.python-executor  # Docker image for isolated code execution
├── run_tests.py                # Test runner script
├── run_streamlit_ui.py         # Streamlit UI launcher
├── openchatbi/                 # Core application code
│   ├── __init__.py             # Package initialization
│   ├── config.yaml.template    # Configuration template
│   ├── config_loader.py        # Configuration management
│   ├── constants.py            # Application constants
│   ├── agent_graph.py          # Main LangGraph workflow
│   ├── graph_state.py          # State definition for workflows
│   ├── context_config.py       # Context management configuration
│   ├── context_manager.py      # Context window and token management
│   ├── text_segmenter.py       # Text segmentation with jieba support
│   ├── utils.py                # Utility functions and SimpleStore (BM25-based retrieval)
│   ├── catalog/                # Data catalog management
│   │   ├── __init__.py         # Package initialization
│   │   ├── catalog_loader.py   # Catalog loading logic
│   │   ├── catalog_store.py    # Catalog storage interface
│   │   ├── factory.py          # Catalog factory patterns
│   │   ├── helper.py           # Catalog helper functions
│   │   ├── retrival_helper.py  # Retrieval helper utilities
│   │   ├── schema_retrival.py  # Schema retrieval logic
│   │   ├── token_service.py    # Token service integration
│   │   └── store/              # Catalog storage implementations
│   │       └── file_system.py  # File system-based catalog storage
│   ├── code/                   # Code execution framework
│   │   ├── __init__.py         # Package initialization
│   │   ├── executor_base.py    # Base executor interface
│   │   ├── local_executor.py   # Local Python execution
│   │   ├── restricted_local_executor.py # RestrictedPython execution
│   │   └── docker_executor.py  # Docker-based isolated execution
│   ├── llm/                    # LLM integration layer
│   │   ├── __init__.py         # Package initialization
│   │   └── llm.py              # LLM management and retry logic
│   ├── prompts/                # Prompt templates and engineering
│   │   ├── __init__.py         # Package initialization
│   │   ├── agent_prompt.md     # Main agent prompts
│   │   ├── extraction_prompt.md # Information extraction prompts
│   │   ├── system_prompt.py    # System prompt management
│   │   ├── summary_prompt.md   # Summary conversation prompts
│   │   ├── table_selection_prompt.md # Table selection prompts
│   │   ├── text2sql_prompt.md  # Text-to-SQL prompts
│   │   └── sql_dialect/        # SQL dialect-specific prompts
│   ├── text2sql/               # Text-to-SQL conversion pipeline
│   │   ├── __init__.py         # Package initialization
│   │   ├── data.py             # Data and retriever for Text-to-SQL
│   │   ├── extraction.py       # Information extraction
│   │   ├── generate_sql.py     # SQL generation and execution logic
│   │   ├── schema_linking.py   # Schema linking process
│   │   ├── sql_graph.py        # SQL generation LangGraph workflow
│   │   ├── text2sql_utils.py   # Text2SQL utilities
│   │   └── visualization.py    # Data visualization functions
│   └── tool/                   # LangGraph tools and functions
│       ├── ask_human.py        # Human-in-the-loop interactions
│       ├── memory.py           # Memory management tool
│       ├── mcp_tools.py        # MCP (Model Context Protocol) integration
│       ├── run_python_code.py  # Configurable Python code execution
│       ├── save_report.py      # Report saving functionality
│       ├── search_knowledge.py # Knowledge base search
│       └── timeseries_forecast.py # Time series forecasting tool
├── sample_api/                 # API implementations
│   └── async_api.py            # Asynchronous FastAPI example
├── sample_ui/                  # Web interface implementations
│   ├── memory_ui.py            # Memory-enhanced UI interface
│   ├── plotly_utils.py         # Plotly utilities and helpers
│   ├── simple_ui.py            # Simple non-streaming Gradio UI
│   ├── streaming_ui.py         # Streaming Gradio UI with real-time updates
│   ├── streamlit_ui.py         # Streaming Streamlit UI with enhanced features
│   └── style.py                # UI styling and CSS
├── example/                    # Example configurations and data
│   ├── bi.yaml                 # BI configuration example
│   ├── config.yaml             # Application config example
│   ├── table_info.yaml         # Table information
│   ├── table_columns.csv       # Table column registry
│   ├── common_columns.csv      # Common column definitions
│   ├── sql_example.yaml        # SQL examples for retrieval
│   ├── table_selection_example.csv # Table selection examples
│   └── tracking_orders.sqlite  # Sample SQLite database
├── timeseries_forecasting/     # Time series forecasting service
│   ├── README.md               # Forecasting service documentation
│   └── ...                     # Forecasting service implementation
├── tests/                      # Test suite
│   ├── __init__.py             # Package initialization
│   ├── conftest.py             # Test configuration
│   ├── test_*.py               # Test modules for various components
│   └── README.md               # Testing documentation
├── docs/                       # Documentation
│   ├── source/                 # Sphinx documentation source
│   ├── build/                  # Built documentation
│   ├── Makefile                # Documentation build scripts
│   └── make.bat                # Windows build script
└── .github/                    # GitHub workflows and templates
    └── workflows/              # CI/CD workflows

Advanced Features

Visualization configuration

You can choose rule-based or llm-based visualization or disable visualization.

# Options: "rule" (rule-based), "llm" (LLM-based), or null (skip visualization)
visualization_mode: llm

Prompt Engineering

Basic Knowledge & Glossary

You can define basic knowledge and glossary in example/bi.yaml, for example:

basic_knowledge_glossary: |
  # Basic Knowledge Introduction
    The basic knowledge about your company and its business, including key concepts, metrics, and processes.
  # Glossary
    Common terms and their definitions used in your business context.

Data Warehouse Introduction

You can provide a brief introduction of your data warehouse in example/bi.yaml, for example:

data_warehouse_introduction: |
  # Data Warehouse Introduction
    This data warehouse is built on Presto and contains various tables related to XXXXX.
    The main fact tables include XXXX metrics, while dimension tables include XXXXX.
    The data is updated hourly and is used for reporting and analysis purposes.

Table Selection Rules

You can configure table selection rules in example/bi.yaml, for example:

table_selection_extra_rule: |
  - All tables with is_valid can support both valid and invalid traffics

Custom SQL Rules

You can define your additional SQL Generation rules for tables in example/table_info.yaml, for example:

sql_rule: |
  ### SQL Rules
  - All event_date in the table are stored in **UTC**. If the user specifies a timezone (e.g., CET, PST), convert between timezones accordingly.

Catalog Management

Introduction

High-quality catalog data is essential for accurate Text2SQL generation and data analysis. OpenChatBI automatically discovers and indexes data warehouse table structures while providing flexible management for business metadata, column descriptions, and query optimization rules.

Catalog Structure

The catalog system organizes metadata in a hierarchical structure:

Database Level

Top-level container for all tables and schemas

Table Level

description: Business functionality and purpose of the table
selection_rule: Guidelines for when and how to use this table in queries
sql_rule: Specific SQL generation rules and constraints for this table

Column Level

Required Fields: Essential metadata for each column to enable effective Text2SQL generation
- column_name: Technical database column name
- display_name: Human-readable name for business users
- alias: Alternative names or abbreviations
- type: Data type (string, integer, date, etc.)
- category: Business category, dimension or metric
- tag: Additional labels for filtering and organization
- description: Detailed explanation of column purpose and usage
Two Types of Columns
- Common Columns: Columns with standardized business meanings shared across tables
- Table-Specific Columns: Columns with context-dependent meanings that vary between tables
Derived Metrics: Virtual metrics calculated from existing columns using SQL formulas
- Computed dynamically during query execution rather than stored as physical columns
- Examples: CTR (clicks/impressions), conversion rates, profit margins
- Enable complex business calculations without pre-computing values

Loading Catalog from Database

OpenChatBI can automatically discover and load table structures from your data warehouse:

Automatic Discovery: Connects to your configured data warehouse and scans table schemas
Metadata Extraction: Extracts column names, data types, and basic structural information
Incremental Updates: Supports updating catalog data as your database schema evolves

Configure automatic catalog loading in your config.yaml:

catalog_store:
  store_type: file_system
  data_path: ./catalog_data
data_warehouse_config:
  include_tables:
    - your_table_pattern
  # Leave empty to include all accessible tables

File System Catalog Store

The file system catalog store organizes metadata across multiple files for maintainability and version control:

Core Table Information

table_info.yaml: Comprehensive table metadata organized hierarchically (database → table → information)
- type: Table classification (e.g., “fact” for Fact Tables, “dimension” for Dimension Tables)
- description: Business functionality and purpose
- selection_rule: Usage guidelines in markdown list format (each line starts with -)
- sql_rule: SQL generation rules in markdown header format (each rule starts with ####)
- derived_metric: Virtual metrics with calculation formulas, organized by groups:
```
#### Derived Ratio Metrics
Click-through Rate (alias CTR): SUM(clicks) / SUM(impression)
Conversion Rate (alias CVR): SUM(conversions) / SUM(clicks)
```

Column Management

table_columns.csv: Basic column registry with schema db_name,table_name,column_name
table_spec_columns.csv: Table-specific column metadata with full schema: db_name,table_name,column_name,display_name,alias,type,category,tag,description
common_columns.csv: Shared column definitions across tables with schema: column_name,display_name,alias,type,category,tag,description

Query Examples and Training Data

table_selection_example.csv: Table selection training examples with schema question,selected_tables

sql_example.yaml: Query examples organized by database and table structure:

your_database:
  ad_performance: |
    Q: Show me CTR trends for the past 7 days
    A: SELECT date, SUM(clicks)/SUM(impressions) AS ctr
       FROM ad_performance
       WHERE date >= CURRENT_DATE - INTERVAL 7 DAY
       GROUP BY date
       ORDER BY date;

Time Series Forecasting Service Setup

OpenChatBI can integrate with a time series forecasting service for advanced predictive analytics. Follow these steps to set up the service:

1. Build and Run the Forecasting Service

See detailed instructions in timeseries_forecasting/README.md

Quick start:

cd timeseries_forecasting
./build_and_run.sh

2. Configure Tool Usage Rules

In your bi.yaml, add constraints for the timeseries_forecast tool, e.g. if you are using timer-base-84m model:

extra_tool_use_rule: |
  - timeseries_forecast tool requires at least 96 time points in input data. If no enough input data, set input_len to 96 to pad with zeros.

3. Configure Service URL

In your config.yaml:

# Time Series Forecasting Service Configuration
timeseries_forecasting_service_url: "http://localhost:8765"

Important: Adjust the URL based on your deployment scenario:

Local development (OpenChatBI on host, Forecasting service in Docker): http://localhost:8765
Remote service: http://your-service-host:8765

4. Verify Service Health

Test the service is accessible:

curl http://localhost:8765/health

Expected response:

{
  "status": "healthy",
  "model_initialized": true,
  "uptime_seconds": 123.45
}

Python Code Execution Configuration

OpenChatBI supports multiple execution environments for running Python code with different security and performance characteristics:

# Python Code Execution Configuration
python_executor: local  # Options: "local", "restricted_local", "docker"

Executor Types

local (Default)
- Performance: Fastest execution
- Security: Least secure (code runs in current Python process)
- Capabilities: Full Python capabilities and library access
- Use Case: Development environments, trusted code execution
restricted_local
- Performance: Moderate execution speed
- Security: Moderate security with RestrictedPython sandboxing
- Capabilities: Limited Python features (no imports, file access, etc.)
- Use Case: Semi-trusted environments with controlled execution
docker
- Performance: Slower due to container overhead
- Security: Highest security with complete process isolation
- Capabilities: Full Python capabilities within isolated container
- Use Case: Production environments, untrusted code execution
- Requirements: Docker must be installed and running

Docker Executor Setup

For production deployments or when running untrusted code, the Docker executor provides complete isolation:

Install Docker: Download and install Docker Desktop or Docker Engine
Configure executor: Set python_executor: docker in your config
Automatic setup: OpenChatBI will automatically build the required Docker image
Fallback behavior: If Docker is unavailable, automatically falls back to local executor

Docker Executor Features:

Pre-installed data science libraries (pandas, numpy, matplotlib, seaborn)
Network isolation for security
Automatic container cleanup
Resource isolation from host system

Development & Testing

Code Quality Tools

The project uses modern Python tooling for code quality:

# Format code
uv run black .

# Lint code
uv run ruff check .

# Type checking
uv run mypy openchatbi/

# Security scanning
uv run bandit -r openchatbi/

Testing

Run the test suite:

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=openchatbi --cov-report=html

# Run specific test files
uv run pytest test/test_generate_sql.py
uv run pytest test/test_agent_graph.py

Pre-commit Hooks

Install pre-commit hooks for automatic code quality checks:

uv run pre-commit install

Contribution Guidelines

Fork the repository
Create a feature branch (git checkout -b feature/fooBar)
Commit your changes (git commit -am 'Add some fooBar')
Push to the branch (git push origin feature/fooBar)
Create a new Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details

Contact & Support

Author: Yu Zhong (zhongyu8@gmail.com)
Repository: github.com/zhongyu09/openchatbi
Issues: Report bugs and feature requests

Documentation:

OpenChatBI Documentation