Build an AI Data Q&A Bot with DuckDB — Let Non-Technical Users 'Talk to Their Data'

Build an AI Data Q&A Bot with DuckDB — Let Non-Technical Users 'Talk to Their Data'

Build a fully functional AI-powered data Q&A system with DuckDB. Turn natural language questions into SQL queries, execute them safely, and get instant business insights — all in under 150 lines of Python.

Why “Talking to Data” Is a Real Business Opportunity

Picture this scenario: you’re talking to an e-commerce operations director about a potential partnership. He says, “We check sales data across 5 dimensions every day, but each time we have to wait for the data analyst to schedule a report.”

You reply: “Let me build you a system where you can just type ‘Which three products had the highest return rate in East China last week?’ into a chat box, and the system gives you the answer automatically.”

This is the core value of an AI Data Q&A bot — letting non-technical business users get data insights through natural language.

The traditional approach? Build a React frontend → write a FastAPI backend → integrate an LLM API → translate natural language to SQL → execute the query → return results. That’s at least 2,000 lines of code.

With DuckDB, the entire core logic is under 150 lines. Why? Because DuckDB is an in-process database — no separate service to deploy. Just import duckdb in Python, and it’s ready to use. It naturally fits into the reasoning loop of an AI Agent.

System Architecture: Four Steps from Natural Language to Data Results

User Question (natural language) → LLM generates SQL → DuckDB executes → Formatted result returned
       ↑                                                              ↓
       └──── Result feedback to LLM generates natural answer ←──────┘

The entire flow consists of four core steps:

Semantic Understanding: LLM translates natural language into SQL queries
Security Sandbox: Restrict DuckDB to only read designated tables, preventing malicious queries
Result Execution: DuckDB efficiently executes queries and returns structured data
Natural Language Reply: LLM converts data results into easy-to-understand business language

Complete Runnable Code

Step 1: Prepare Data and DuckDB Tables

import duckdb
import json

# Create an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Simulate e-commerce order data
con.execute("""
CREATE TABLE orders AS
SELECT * FROM read_csv_auto('https://github.com/duckdb/duckdb-data/raw/main/data/orders.csv');
""")

# View table structure
print(con.execute("DESCRIBE orders").fetchdf())

If your data is in local CSV/Parquet files, just read them directly:

# Read local CSV (auto-infer schema)
con.read_csv_auto('/path/to/sales_data.csv')

# Read Parquet files (columnar storage, faster)
con.read_parquet('/path/to/sales_data.parquet')

# Batch-read all Parquet files in a directory
con.read_parquet('/path/to/data/*.parquet')

Step 2: Build the SQL Generator

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

def get_table_schema(con):
    """Get schema information for all tables in DuckDB"""
    tables = con.execute("SHOW TABLES").fetchall()
    schema_info = {}
    for table_name, in tables:
        columns = con.execute(f"DESCRIBE {table_name}").fetchall()
        schema_info[table_name] = [
            {'column': col[0], 'type': col[1]} for col in columns
        ]
    return schema_info

def generate_sql(user_query, schema_info):
    """Translate natural language to SQL"""
    schema_str = json.dumps(schema_info, indent=2, ensure_ascii=False)
    
    messages = [
        {
            "role": "system",
            "content": """You are a SQL expert. The user will ask questions in natural language. 
            You need to convert them into DuckDB SQL queries.
            
            Important rules:
            1. Only use the tables listed below — do not create new tables
            2. Limit results to 100 rows
            3. For numerical calculations, use ROUND to 2 decimal places
            4. If the question cannot be answered with existing data, return an empty string
            5. Output only the SQL, no explanatory text"""
        },
        {
            "role": "user",
            "content": f"""Database table schema:
{schema_str}

User question: {user_query}

Please generate the SQL query (output only the SQL statement):"""
        }
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.1,
        max_tokens=500
    )
    
    sql = response.choices[0].message.content.strip()
    # Clean possible markdown code block markers
    if sql.startswith("```"):
        sql = sql.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return sql

Step 3: Secure Execution Engine

class SafeQueryEngine:
    """Secure DuckDB query executor"""
    
    def __init__(self, connection):
        self.con = connection
    
    def execute_query(self, sql, max_rows=100):
        """Execute SQL and return results"""
        # Security check: only allow SELECT
        sql_upper = sql.strip().upper()
        if not sql_upper.startswith("SELECT"):
            raise PermissionError("Only SELECT queries are allowed")
        
        # Check for dangerous functions
        dangerous_keywords = ['DROP', 'DELETE', 'INSERT', 'UPDATE', 'ALTER', 'CREATE', 'EXEC']
        for keyword in dangerous_keywords:
            if keyword in sql_upper:
                raise PermissionError(f"Queries containing {keyword} are not allowed")
        
        try:
            result = self.con.execute(sql)
            df = result.fetchdf()
            
            # Limit rows
            if len(df) > max_rows:
                df = df.head(max_rows)
            
            return {
                'success': True,
                'data': df.to_dict(orient='records'),
                'columns': list(df.columns),
                'row_count': len(df)
            }
        except Exception as e:
            return {
                'success': False,
                'error': str(e)
            }

Step 4: Generate Natural Language Answers

def generate_answer(user_query, query_result, schema_info):
    """Convert query results into natural language answers"""
    if not query_result['success']:
        return f"Sorry, the query failed: {query_result['error']}. Please try rephrasing your question."
    
    if query_result['row_count'] == 0:
        return "Based on the current data, no matching results were found."
    
    # Build a brief data summary
    data_summary = json.dumps(query_result['data'][:5], indent=2, ensure_ascii=False)
    
    messages = [
        {
            "role": "system",
            "content": """You are a data analyst. The user asked a question in natural language. 
            You have executed the SQL query and received results.
            Convert the results into an easy-to-understand answer. Requirements:
            1. Summarize the core finding (1-2 sentences)
            2. List key data points
            3. Provide brief business recommendations
            4. If the dataset is large, show only the first few rows"""
        },
        {
            "role": "user",
            "content": f"""User question: {user_query}

Query results (first 5 rows):
{data_summary}

Total: {query_result['row_count']} records

Please answer the user's question in natural language:"""
        }
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.3,
        max_tokens=1000
    )
    
    return response.choices[0].message.content

Step 5: Complete Conversation Loop

def chat_loop():
    """Interactive conversation loop"""
    print("=" * 60)
    print("🤖 DuckDB Intelligent Data Q&A Assistant")
    print("=" * 60)
    print("Type your question (type 'quit' to exit)")
    print()
    
    schema_info = get_table_schema(con)
    engine = SafeQueryEngine(con)
    
    while True:
        user_query = input("💬 You: ").strip()
        if user_query.lower() in ('quit', 'exit', 'q'):
            print("Goodbye! 👋")
            break
        
        # Step 1: Generate SQL
        print("\n⏳ Generating query...")
        sql = generate_sql(user_query, schema_info)
        print(f"   SQL: {sql}")
        
        if not sql:
            print("❌ Unable to understand your question. Please rephrase.")
            continue
        
        # Step 2: Execute query
        print("⏳ Executing query...")
        query_result = engine.execute_query(sql)
        
        if not query_result['success']:
            print(f"❌ {query_result['error']}")
            continue
        
        # Step 3: Generate natural language answer
        print("⏳ Generating answer...")
        answer = generate_answer(user_query, query_result, schema_info)
        print(f"\n🤖 Assistant: {answer}")
        print()
        print("-" * 60)

# Start the chat
if __name__ == "__main__":
    chat_loop()

Advanced: Wrap It as a Web Service

If you want to expose this capability to multiple users, just add a FastAPI layer:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DuckDB Data Q&A API")

class QueryRequest(BaseModel):
    question: str

@app.post("/ask")
def ask_question(req: QueryRequest):
    sql = generate_sql(req.question, schema_info)
    result = engine.execute_query(sql)
    answer = generate_answer(req.question, result, schema_info)
    return {"answer": answer}

Once started, any frontend (web page, WeChat mini-program, Feishu bot) can call this API.

Monetization Guide: How to Sell This System?

Scenario 1: SME Internal Data Dashboard

Target customers: E-commerce sellers with 100–1,000 daily orders, chain restaurant brands.

Pitch: “Let store managers check data themselves without waiting for analysts.”

Pricing: One-time delivery $1,100–2,100 + monthly maintenance $70–140.

Delivery timeline: 1–2 days (after data is prepared).

Scenario 2: Financial Research Automation

Target customers: Junior analysts at securities firms and fund companies.

Pitch: “Let AI handle repetitive data work so analysts focus on judgment and logic.”

Pricing: Enterprise license $4,300–11,000/year.

Scenario 3: SaaS Data Q&A Product

Pitch: For SMEs without data teams, charge per seat.

Pricing: $14–42/month/user.

Key Optimization Tips

1. Cache Frequent Queries

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_query(sql_hash, sql_result_json):
    pass

2. Use DuckDB’s Parallel Execution

# Enable multi-threading for acceleration
con.execute("SET threads TO 4")

3. Multi-Step Execution for Complex Queries

For complex multi-table JOIN queries, split them into simpler queries to reduce LLM error probability:

def complex_query_pipeline(user_query, schema_info):
    """Multi-step query decomposition"""
    # Step 1: Let LLM decide which tables and fields are needed
    plan = generate_query_plan(user_query, schema_info)
    # Step 2: Execute each sub-query step by step
    # Step 3: Combine results into final answer
    ...

Summary

The core advantages of building an AI data Q&A system with DuckDB:

Zero Operations: In-process database, no separate service to deploy
High Performance: Vectorized execution, millisecond response for millions of rows
Easy to Embed: Native bindings for Python/R/Node.js
Format Compatible: Directly reads CSV/Parquet/JSON/Excel without data cleaning

You can build an MVP of this system in one afternoon, then take it to meet clients. Turning data into a product has never been about technology — it’s about “can you translate data into language others understand?”

📖 The complete version of this article, including detailed troubleshooting guides and deployment templates, is published at duckdblab.org. Want to systematically learn DuckDB? duckdblab.org has a complete tutorial series covering everything from beginner to enterprise-level deployment.

💡 More DuckDB advanced techniques → duckdblab.org

Olap Studio · Focused on DuckDB Practical Skills · 2026-06-18

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.