
Why “Talking to Data” Is a Real Business Opportunity
Picture this scenario: you’re talking to an e-commerce operations director about a potential partnership. He says, “We check sales data across 5 dimensions every day, but each time we have to wait for the data analyst to schedule a report.”
You reply: “Let me build you a system where you can just type ‘Which three products had the highest return rate in East China last week?’ into a chat box, and the system gives you the answer automatically.”
This is the core value of an AI Data Q&A bot — letting non-technical business users get data insights through natural language.
The traditional approach? Build a React frontend → write a FastAPI backend → integrate an LLM API → translate natural language to SQL → execute the query → return results. That’s at least 2,000 lines of code.
With DuckDB, the entire core logic is under 150 lines. Why? Because DuckDB is an in-process database — no separate service to deploy. Just import duckdb in Python, and it’s ready to use. It naturally fits into the reasoning loop of an AI Agent.
System Architecture: Four Steps from Natural Language to Data Results
User Question (natural language) → LLM generates SQL → DuckDB executes → Formatted result returned
↑ ↓
└──── Result feedback to LLM generates natural answer ←──────┘
The entire flow consists of four core steps:
- Semantic Understanding: LLM translates natural language into SQL queries
- Security Sandbox: Restrict DuckDB to only read designated tables, preventing malicious queries
- Result Execution: DuckDB efficiently executes queries and returns structured data
- Natural Language Reply: LLM converts data results into easy-to-understand business language
Complete Runnable Code
Step 1: Prepare Data and DuckDB Tables
import duckdb
import json
# Create an in-memory DuckDB database
con = duckdb.connect(':memory:')
# Simulate e-commerce order data
con.execute("""
CREATE TABLE orders AS
SELECT * FROM read_csv_auto('https://github.com/duckdb/duckdb-data/raw/main/data/orders.csv');
""")
# View table structure
print(con.execute("DESCRIBE orders").fetchdf())
If your data is in local CSV/Parquet files, just read them directly:
# Read local CSV (auto-infer schema)
con.read_csv_auto('/path/to/sales_data.csv')
# Read Parquet files (columnar storage, faster)
con.read_parquet('/path/to/sales_data.parquet')
# Batch-read all Parquet files in a directory
con.read_parquet('/path/to/data/*.parquet')
Step 2: Build the SQL Generator
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
def get_table_schema(con):
"""Get schema information for all tables in DuckDB"""
tables = con.execute("SHOW TABLES").fetchall()
schema_info = {}
for table_name, in tables:
columns = con.execute(f"DESCRIBE {table_name}").fetchall()
schema_info[table_name] = [
{'column': col[0], 'type': col[1]} for col in columns
]
return schema_info
def generate_sql(user_query, schema_info):
"""Translate natural language to SQL"""
schema_str = json.dumps(schema_info, indent=2, ensure_ascii=False)
messages = [
{
"role": "system",
"content": """You are a SQL expert. The user will ask questions in natural language.
You need to convert them into DuckDB SQL queries.
Important rules:
1. Only use the tables listed below — do not create new tables
2. Limit results to 100 rows
3. For numerical calculations, use ROUND to 2 decimal places
4. If the question cannot be answered with existing data, return an empty string
5. Output only the SQL, no explanatory text"""
},
{
"role": "user",
"content": f"""Database table schema:
{schema_str}
User question: {user_query}
Please generate the SQL query (output only the SQL statement):"""
}
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.1,
max_tokens=500
)
sql = response.choices[0].message.content.strip()
# Clean possible markdown code block markers
if sql.startswith("```"):
sql = sql.split("\n", 1)[1].rsplit("```", 1)[0].strip()
return sql
Step 3: Secure Execution Engine
class SafeQueryEngine:
"""Secure DuckDB query executor"""
def __init__(self, connection):
self.con = connection
def execute_query(self, sql, max_rows=100):
"""Execute SQL and return results"""
# Security check: only allow SELECT
sql_upper = sql.strip().upper()
if not sql_upper.startswith("SELECT"):
raise PermissionError("Only SELECT queries are allowed")
# Check for dangerous functions
dangerous_keywords = ['DROP', 'DELETE', 'INSERT', 'UPDATE', 'ALTER', 'CREATE', 'EXEC']
for keyword in dangerous_keywords:
if keyword in sql_upper:
raise PermissionError(f"Queries containing {keyword} are not allowed")
try:
result = self.con.execute(sql)
df = result.fetchdf()
# Limit rows
if len(df) > max_rows:
df = df.head(max_rows)
return {
'success': True,
'data': df.to_dict(orient='records'),
'columns': list(df.columns),
'row_count': len(df)
}
except Exception as e:
return {
'success': False,
'error': str(e)
}
Step 4: Generate Natural Language Answers
def generate_answer(user_query, query_result, schema_info):
"""Convert query results into natural language answers"""
if not query_result['success']:
return f"Sorry, the query failed: {query_result['error']}. Please try rephrasing your question."
if query_result['row_count'] == 0:
return "Based on the current data, no matching results were found."
# Build a brief data summary
data_summary = json.dumps(query_result['data'][:5], indent=2, ensure_ascii=False)
messages = [
{
"role": "system",
"content": """You are a data analyst. The user asked a question in natural language.
You have executed the SQL query and received results.
Convert the results into an easy-to-understand answer. Requirements:
1. Summarize the core finding (1-2 sentences)
2. List key data points
3. Provide brief business recommendations
4. If the dataset is large, show only the first few rows"""
},
{
"role": "user",
"content": f"""User question: {user_query}
Query results (first 5 rows):
{data_summary}
Total: {query_result['row_count']} records
Please answer the user's question in natural language:"""
}
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
temperature=0.3,
max_tokens=1000
)
return response.choices[0].message.content
Step 5: Complete Conversation Loop
def chat_loop():
"""Interactive conversation loop"""
print("=" * 60)
print("🤖 DuckDB Intelligent Data Q&A Assistant")
print("=" * 60)
print("Type your question (type 'quit' to exit)")
print()
schema_info = get_table_schema(con)
engine = SafeQueryEngine(con)
while True:
user_query = input("💬 You: ").strip()
if user_query.lower() in ('quit', 'exit', 'q'):
print("Goodbye! 👋")
break
# Step 1: Generate SQL
print("\n⏳ Generating query...")
sql = generate_sql(user_query, schema_info)
print(f" SQL: {sql}")
if not sql:
print("❌ Unable to understand your question. Please rephrase.")
continue
# Step 2: Execute query
print("⏳ Executing query...")
query_result = engine.execute_query(sql)
if not query_result['success']:
print(f"❌ {query_result['error']}")
continue
# Step 3: Generate natural language answer
print("⏳ Generating answer...")
answer = generate_answer(user_query, query_result, schema_info)
print(f"\n🤖 Assistant: {answer}")
print()
print("-" * 60)
# Start the chat
if __name__ == "__main__":
chat_loop()
Advanced: Wrap It as a Web Service
If you want to expose this capability to multiple users, just add a FastAPI layer:
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="DuckDB Data Q&A API")
class QueryRequest(BaseModel):
question: str
@app.post("/ask")
def ask_question(req: QueryRequest):
sql = generate_sql(req.question, schema_info)
result = engine.execute_query(sql)
answer = generate_answer(req.question, result, schema_info)
return {"answer": answer}
Once started, any frontend (web page, WeChat mini-program, Feishu bot) can call this API.
Monetization Guide: How to Sell This System?
Scenario 1: SME Internal Data Dashboard
Target customers: E-commerce sellers with 100–1,000 daily orders, chain restaurant brands.
Pitch: “Let store managers check data themselves without waiting for analysts.”
Pricing: One-time delivery $1,100–2,100 + monthly maintenance $70–140.
Delivery timeline: 1–2 days (after data is prepared).
Scenario 2: Financial Research Automation
Target customers: Junior analysts at securities firms and fund companies.
Pitch: “Let AI handle repetitive data work so analysts focus on judgment and logic.”
Pricing: Enterprise license $4,300–11,000/year.
Scenario 3: SaaS Data Q&A Product
Pitch: For SMEs without data teams, charge per seat.
Pricing: $14–42/month/user.
Key Optimization Tips
1. Cache Frequent Queries
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_query(sql_hash, sql_result_json):
pass
2. Use DuckDB’s Parallel Execution
# Enable multi-threading for acceleration
con.execute("SET threads TO 4")
3. Multi-Step Execution for Complex Queries
For complex multi-table JOIN queries, split them into simpler queries to reduce LLM error probability:
def complex_query_pipeline(user_query, schema_info):
"""Multi-step query decomposition"""
# Step 1: Let LLM decide which tables and fields are needed
plan = generate_query_plan(user_query, schema_info)
# Step 2: Execute each sub-query step by step
# Step 3: Combine results into final answer
...
Summary
The core advantages of building an AI data Q&A system with DuckDB:
- Zero Operations: In-process database, no separate service to deploy
- High Performance: Vectorized execution, millisecond response for millions of rows
- Easy to Embed: Native bindings for Python/R/Node.js
- Format Compatible: Directly reads CSV/Parquet/JSON/Excel without data cleaning
You can build an MVP of this system in one afternoon, then take it to meet clients. Turning data into a product has never been about technology — it’s about “can you translate data into language others understand?”
📖 The complete version of this article, including detailed troubleshooting guides and deployment templates, is published at duckdblab.org. Want to systematically learn DuckDB? duckdblab.org has a complete tutorial series covering everything from beginner to enterprise-level deployment.
💡 More DuckDB advanced techniques → duckdblab.org
Olap Studio · Focused on DuckDB Practical Skills · 2026-06-18