Featured image of post Build an AI-Powered Data Q&A Bot with DuckDB: Let Business Users Query Data with Natural Language

Build an AI-Powered Data Q&A Bot with DuckDB: Let Business Users Query Data with Natural Language

Build a Text-to-SQL data Q&A bot using DuckDB + LLM. Non-technical users can query databases through natural language. Includes full runnable code, security engine, and three monetization scenarios.

Build an AI-Powered Data Q&A Bot with DuckDB: Let Business Users Query Data with Natural Language

πŸ’° Monetization Path: Build internal data Q&A systems for enterprises, charging $700-$2,800 per deployment


1. The Business Pain Point That Is Your Opportunity

Have you ever had this scenario: you’re talking to an e-commerce operations director, and they say, “We need to check sales data across 5 dimensions every day, but we always have to wait for the data analyst to schedule queries.”

You reply: “I’ll build you a system where you can just type ‘Which three products in East China had the highest return rate last week?’ and the system gives you the answer automatically.”

This is the core value of an AI data Q&A bot: letting non-technical business users get data insights through natural language.

What does the traditional approach look like? Frontend in React β†’ Backend in FastAPI β†’ Connect to LLM API β†’ Convert natural language to SQL β†’ Execute query β†’ Return results. This workflow requires at least 2000 lines of code.

With DuckDB, the entire core logic is under 150 lines. The reason is simple: DuckDB is an in-process database that requires no separate deployment. In Python, you just import duckdb, making it a perfect fit for embedding into an AI Agent’s inference pipeline.


2. System Architecture: Four Steps from Natural Language to Data Results

The core pipeline is remarkably straightforward:

User Question (Natural Language) β†’ LLM Generates SQL β†’ DuckDB Executes Query β†’ Format Results
       ↑                                                                  ↓
       └────── Result fed back to LLM for Natural Language Answer β†β”€β”€β”€β”€β”€β”€β”˜

Four key modules:

  1. Schema Awareness: Let the LLM understand the database structure
  2. SQL Generator: Convert natural language to DuckDB SQL
  3. Safe Execution Engine: Prevent malicious queries and protect data security
  4. Natural Language Answerer: Transform query results into business-readable analysis

3. Step One: Prepare the Data

DuckDB’s strength lies in its ability to directly read various file formats without any import process.

import duckdb
import json

# Create an in-memory DuckDB database (zero configuration)
con = duckdb.connect(':memory:')

# Directly read CSV with automatic schema inference
con.execute("""
CREATE TABLE orders AS
SELECT * FROM read_csv_auto('https://github.com/duckdb/duckdb-data/raw/main/data/orders.csv');
""")

# Inspect table structure and row count
print(con.execute("DESCRIBE orders").fetchall())
print(con.execute("SELECT COUNT(*) FROM orders").fetchone())

# Also supports Parquet, JSON, Excel, and many other formats
# con.read_parquet('/path/to/data/*.parquet')
# con.execute("CREATE TABLE events AS SELECT * FROM read_json_auto('data.json')")

Key advantages:

  • read_csv_auto automatically infers column types without manual specification
  • Supports glob pattern matching to read multiple files at once
  • In-memory database releases resources when the connection closes, ideal for multi-tenant SaaS scenarios

4. Step Two: Build the SQL Generator

This is the core of the entire system: letting the LLM understand database structure and generate correct SQL.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

def get_table_schema(connection):
    """Function to get database table structure"""
    tables = connection.execute(
        "SELECT table_name FROM information_schema.tables WHERE table_schema = 'main'"
    ).fetchall()
    
    schema_info = {}
    for (table_name,) in tables:
        columns = connection.execute(
            f"DESCRIBE {table_name}"
        ).fetchall()
        schema_info[table_name] = [
            {"column": col[0], "type": col[1]} for col in columns
        ]
    return schema_info

def generate_sql(user_query, schema_info):
    schema_str = json.dumps(schema_info, indent=2, ensure_ascii=False)
    
    messages = [
        {
            "role": "system",
            "content": """You are a SQL expert. Users will ask questions in natural language, and you need to convert them into DuckDB SQL queries.
Important rules:
1. Only use the provided tables, do not create new ones
2. Limit query results to 100 rows
3. Use ROUND to keep 2 decimal places for numerical calculations
4. Output only SQL, no explanatory text"""
        },
        {
            "role": "user",
            "content": f"""Database table structure:
{schema_str}

User question: {user_query}

Please generate the SQL query:"""
        }
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.1,  # Low temperature for stability
        max_tokens=500
    )
    
    sql = response.choices[0].message.content.strip()
    # Clean markdown code block markers
    if sql.startswith("```"):
        sql = sql.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return sql

Why gpt-4o-mini?

  • Low cost: approximately $0.6 per million tokens, less than a cent per Q&A request
  • Fast: response time typically under 500ms
  • Smart enough: Text-to-SQL tasks don’t require the strongest model

Advanced tip: If you have many data tables, use RAG to retrieve only relevant table structures for the LLM, reducing prompt length and cost.


5. Step Three: Safe Execution Engine

This is the easiest to overlook but most critical component. SQL generated by LLMs may contain malicious content, so strict security validation is essential.

class SafeQueryEngine:
    """Secure DuckDB query executor"""
    
    def __init__(self, connection):
        self.con = connection
    
    def execute_query(self, sql, max_rows=100):
        # Security check: only allow SELECT
        sql_upper = sql.strip().upper()
        if not sql_upper.startswith("SELECT"):
            raise PermissionError("Only SELECT queries are allowed")
        
        dangerous_keywords = ['DROP', 'DELETE', 'INSERT', 'UPDATE', 'ALTER', 'CREATE', 'EXEC']
        for keyword in dangerous_keywords:
            if keyword in sql_upper:
                raise PermissionError(f"Queries containing {keyword} are not allowed")
        
        try:
            result = self.con.execute(sql)
            df = result.fetchdf()
            if len(df) > max_rows:
                df = df.head(max_rows)
            return {
                'success': True,
                'data': df.to_dict(orient='records'),
                'columns': list(df.columns),
                'row_count': len(df)
            }
        except Exception as e:
            return {'success': False, 'error': str(e)}

Security strategy summary:

Risk TypeProtection MethodDescription
Write operationsAllow only SELECT prefixBlocks INSERT/UPDATE/DELETE
Schema modificationBlacklist keyword filteringBlocks DROP/ALTER/CREATE
Resource exhaustionLimit returned rowsPrevents large queries from overwhelming the system
SQL injectionParameterized queries + whitelistDuckDB doesn’t natively support parameterization, requires additional handling

6. Step Four: Generate Natural Language Answers

After the query results are returned, the LLM needs to transform them into text that business users can understand.

def generate_answer(user_query, query_result, schema_info):
    if not query_result['success']:
        return f"Sorry, the query failed: {query_result['error']}. Please try rephrasing your question."
    
    if query_result['row_count'] == 0:
        return "Based on the current data, no matching results were found."
    
    # Take only the first 5 rows as context to save tokens
    data_summary = json.dumps(query_result['data'][:5], indent=2, ensure_ascii=False)
    
    messages = [
        {
            "role": "system",
            "content": """You are a data analyst. Convert the query results into a Chinese answer that is easy for users to understand.
1. Summarize the core finding (1-2 sentences)
2. List key data points
3. Provide a brief business recommendation"""
        },
        {
            "role": "user",
            "content": f"""User question: {user_query}

Query results (first 5 rows):
{data_summary}

Total records: {query_result['row_count']}

Please answer the user's question in natural language:"""
        }
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.3,  # Slightly higher temperature for more natural responses
        max_tokens=1000
    )
    
    return response.choices[0].message.content

7. Complete Interactive Loop

Assembling all modules together creates a functional data Q&A system:

def chat_loop():
    print("=" * 60)
    print("πŸ€– DuckDB Smart Data Q&A Assistant")
    print("=" * 60)
    
    schema_info = get_table_schema(con)
    engine = SafeQueryEngine(con)
    
    while True:
        user_query = input("πŸ’¬ You: ").strip()
        if user_query.lower() in ('quit', 'exit', 'q'):
            break
        
        # 1. Generate SQL
        sql = generate_sql(user_query, schema_info)
        print(f"πŸ“ Generated SQL: {sql}")
        
        # 2. Execute query
        query_result = engine.execute_query(sql)
        
        # 3. Generate answer
        answer = generate_answer(user_query, query_result, schema_info)
        print(f"\nπŸ€– Assistant: {answer}")
        print("-" * 60)

# Start the loop
chat_loop()

Actual conversation example:

πŸ’¬ You: Which country had the most orders last month?
πŸ“ Generated SQL: SELECT ship_country, COUNT(*) as order_count 
                   FROM orders 
                   WHERE order_date >= DATE '2024-05-01' 
                     AND order_date < DATE '2024-06-01' 
                   GROUP BY ship_country 
                   ORDER BY order_count DESC LIMIT 10;
πŸ€– Assistant: Last month, China had the most orders with 127 transactions. The US followed with 98 orders, and Japan had 76. Consider increasing marketing investment in the Chinese market.

8. Advanced: Wrap as a Web Service

Wrap the above system as a FastAPI service, and any frontend can integrate with it:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DuckDB Data Q&A API")

class QueryRequest(BaseModel):
    question: str

@app.post("/ask")
def ask_question(req: QueryRequest):
    sql = generate_sql(req.question, schema_info)
    result = engine.execute_query(sql)
    answer = generate_answer(req.question, result, schema_info)
    return {"answer": answer}

Once started, any frontend (web page, WeChat Mini Program, Feishu Bot, Slack Bot) can call this API. This means you develop the backend once and cover multiple channels.


9. Key Optimization Techniques

1. Enable Parallel Execution

con.execute("SET threads TO 4")  # Fully utilize multi-core CPUs

2. Cache Frequent Queries

from functools import lru_cache

@lru_cache(maxsize=128)
def cached_ask(question_hash: str, question: str, schema_hash: str):
    """Cache answers for identical questions to avoid repeated LLM calls"""
    sql = generate_sql(question, schema_info)
    result = engine.execute_query(sql)
    return generate_answer(question, result, schema_info)

3. Split Complex Queries into Steps

When joining multiple tables, LLMs tend to make mistakes. Break complex queries into simpler steps:

  • Use simple queries to fetch base data
  • Use SQL aggregation for calculations
  • Have the LLM analyze the final results

4. Add Error Retry Mechanism

def robust_generate_sql(user_query, schema_info, max_retries=3):
    for attempt in range(max_retries):
        try:
            sql = generate_sql(user_query, schema_info)
            # Try executing to verify validity
            test_result = engine.execute_query(sql)
            if test_result['success']:
                return sql
        except:
            pass
    return None

10. Comparison with Traditional Approaches

DimensionTraditional (React + FastAPI + PostgreSQL)DuckDB Approach
Deployment ComplexityRequires 3 services (frontend, backend, DB)One Python process suffices
Lines of Code2000+150 core logic lines
Data LoadingRequires ETL import firstDirect CSV/Parquet reading
Startup TimeMinutesMilliseconds
Best Use CaseLarge-scale productionSmall-medium scale, embedded, prototyping

DuckDB’s core advantage: Embedded, zero-configuration, direct file reading. For data Q&A bot scenarios like this, there is no need for an independent database server, significantly reducing deployment and maintenance costs.


11. Monetization Guide: Three Scenarios, Three Pricing Models

Scenario 1: SMB Internal Data Dashboard

  • Target customers: E-commerce sellers with 100-1000 daily orders, chain restaurants
  • Pitch: “Let store managers check data themselves without waiting for analysts.”
  • Pricing: One-time delivery $1,100-$2,100 + monthly maintenance $70-$140
  • Delivery cycle: 1-2 days (when data volume is small)

Scenario 2: Financial Research Automation

  • Target customers: Securities research institutes, junior fund analysts
  • Pitch: “Let AI handle repetitive data organization so analysts focus on logic.”
  • Pricing: Enterprise license $4,200-$11,200/year
  • Delivery cycle: 1-2 weeks (requires integrating multiple data sources)

Scenario 3: SaaS Data Q&A Product

  • Pitch: Target SMEs without data teams, charge per seat
  • Pricing: $14-$41/month/user
  • Delivery cycle: Continuous iteration, MVP in one week

12. Summary

You can build an MVP of this system in one afternoon and then take it to clients for negotiations. Data productization has never been a technical problemβ€”it is about “can you turn data into language that others understand?”

DuckDB plays a crucial role in this scenario: it lets you skip database deployment and maintenance so you can focus on the real business questionβ€”how to help business users get better data insights.

When your client asks “Why DuckDB instead of MySQL?”, you can confidently answer: because your system should work like a mobile app out of the box, not like installing a printer driver after hours of troubleshooting.


πŸ“– The full version of this article is published on duckdblab.org, including detailed steps, troubleshooting guides, and deployment templates. Want to systematically learn DuckDB? duckdblab.org has a complete tutorial series covering everything from beginner to enterprise-level deployment.

πŸ“Ί Watch video tutorials β†’ DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy