Building a Stock Market Analysis Pipeline with DuckDB: From Data Import to Automated Reports
Difficulty: ⭐⭐⭐ | Estimated Setup Time: 2 hours, then 30 seconds/day for automated reports
Why Use DuckDB for Stock Analysis?
Many independent investors, financial advisors, and small investment research teams share a common pain point: they need daily market overviews and portfolio performance data, but manual data organization is slow, commercial analytics tools (Tableau, Power BI) are too heavy, and writing Python + Pandas feels unnecessarily complex.
DuckDB fills this gap perfectly — it offers SQLite-like simplicity with aggregation performance 10x faster than Pandas. All you need is SQL to build a complete pipeline from data import to report generation.
In this guide, you’ll build a daily auto-generated stock portfolio analysis report system that you can:
- Sell to financial advisors who need daily market briefings
- Wrap into an API for small hedge funds as a backend service
- Turn into a paid Newsletter to attract subscribers
System Architecture
The system consists of three core components:
Data Source (CSV/API) → DuckDB (.duckdb) → SQL Analysis → Text/CSV Report → Push (Telegram/Email)
- Data Layer: Import stock data from CSV/API → DuckDB database file
- Analysis Layer: Perform aggregation, ranking, and trend analysis using SQL
- Output Layer: Generate text reports and push them to Telegram or email automatically
Complete Implementation
Step 1: Create Database and Bulk Import Data
import duckdb
# Connect (file mode: persistent data)
conn = duckdb.connect('portfolio.db')
# Create table structure
conn.execute("""
CREATE TABLE IF NOT EXISTS stock_prices (
date DATE,
ticker VARCHAR,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE,
volume BIGINT
);
""")
# Bulk import from CSV (100x faster than INSERT)
conn.execute("""
COPY stock_prices FROM 'stock_data.csv'
(FORMAT CSV, HEADER, DELIMITER ',');
""")
# Create indexes for faster queries
conn.execute("CREATE INDEX IF NOT EXISTS idx_date ON stock_prices(date);")
conn.execute("CREATE INDEX IF NOT EXISTS idx_ticker ON stock_prices(ticker);")
Key Point: COPY FROM bulk import is 100x faster than row-by-row INSERT. For million-row daily K-line datasets, this difference becomes dramatic.
Step 2: Portfolio Overview Analysis
SELECT
ticker,
ROUND(AVG(close), 2) AS avg_price,
ROUND(MAX(close), 2) AS peak_price,
ROUND(MIN(close), 2) AS bottom_price,
ROUND(LAST(close), 2) AS latest_price,
ROUND((LAST(close) - FIRST(close)) / FIRST(close) * 100, 2) AS total_return_pct
FROM stock_prices
GROUP BY ticker
ORDER BY total_return_pct DESC;
Sample Output:
📈 AMZN — Avg $184.91 Peak $191.30 Bottom $179.66 Latest $184.20 Total Return +2.18% 📈 AAPL — Avg $174.97 Peak $182.91 Bottom $168.97 Latest $176.34 Total Return +0.97% 📉 GOOG — Avg $174.81 Peak $182.09 Bottom $167.61 Latest $173.87 Total Return +0.71%
Here we use LAST() and FIRST() aggregate functions, which are DuckDB built-ins that directly retrieve the last and first row values within a group — extremely concise.
Step 3: Volatility Analysis — Finding Market Hotspots
Volatility is one of the most important indicators for professional investors. The stocks moving the most often contain the most opportunity:
SELECT
ticker,
ROUND(STDDEV(close), 2) AS vol_abs,
ROUND(STDDEV(close) / AVG(close) * 100, 2) AS vol_pct,
ROUND(AVG(volume) / 1000000, 1) AS avg_vol_m
FROM stock_prices
GROUP BY ticker
ORDER BY vol_pct DESC;
Sample Output:
- AAPL — Volatility 1.46%, Avg Volume 25.3M ← High volatility + high liquidity = day-trading favorite
- AMZN — Volatility 1.39%, Avg Volume 24.9M
- MSFT — Volatility 0.60%, Avg Volume 25.6M ← Low volatility = defensive allocation
Step 4: Monthly Return Ranking — The Power of Window Functions
WITH monthly AS (
SELECT
ticker,
FIRST_VALUE(close) OVER w AS month_start,
LAST_VALUE(close) OVER w AS month_end
FROM stock_prices
WHERE EXTRACT(MONTH FROM date) = 6
WINDOW w AS (
PARTITION BY ticker
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
)
SELECT DISTINCT ticker,
ROUND((month_end - month_start) / month_start * 100, 2) AS return_pct
FROM monthly
ORDER BY return_pct DESC;
Sample Output:
- TSLA — June Return +2.47% ← Strongest this month
- AMZN — June Return +0.02% ← Basically flat
- AAPL — June Return -0.39% ← Minor pullback
- GOOG — June Return -0.46%
- MSFT — June Return -0.63% ← Weakest this month
Key Point: LAST_VALUE + WINDOW is DuckDB’s killer feature. In one statement, you get monthly return rankings without needing Pandas’ pivot or groupby operations. The ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause ensures LAST_VALUE captures the last row in the window (without it, LAST_VALUE defaults to the current row only).
Step 5: Performance Verification — Why DuckDB?
import time
# Timing 100 aggregation queries
start = time.time()
for _ in range(100):
conn.execute("""
SELECT AVG(close), STDDEV(close)
FROM stock_prices
WHERE date BETWEEN '2026-01-01' AND '2026-06-28'
""").fetchall()
elapsed_ms = (time.time() - start) * 1000
print(f"100 queries total: {elapsed_ms:.1f}ms (avg {elapsed_ms/100:.2f}ms/query)")
Actual Result: 840 rows, 100 queries completed in 39.6ms (average 0.40ms per query)
Even when data scales to millions of rows, DuckDB’s vectorized execution engine maintains sub-second response times. This is the core advantage in production scenarios — your users never have to wait.
Performance Comparison with Traditional Tools
| Tool | 1M Rows Aggregation Query | Memory Usage | Learning Curve | Deployment Difficulty |
|---|---|---|---|---|
| DuckDB | ~50ms | ~50MB | SQL only | Zero dependencies |
| Pandas | ~2000ms | ~500MB | Python programming | Requires library install |
| PostgreSQL | ~200ms | ~100MB | SQL + operations | Requires service install |
| Excel | ~30000ms | ~1GB | Formulas | Zero deps but slow |
DuckDB leads in query speed and memory efficiency across the board, and requires zero database service processes — this is what “zero operations” means in practice.
Automated Deployment
Main Python Script
import duckdb
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def generate_report(db_path='portfolio.db'):
conn = duckdb.connect(db_path)
# Generate report text
lines = []
lines.append(f"📊 Daily Market Brief — {datetime.now().strftime('%Y-%m-%d')}")
lines.append("")
# 1. Portfolio Overview
lines.append("═══ Portfolio Overview ═══")
for row in conn.execute("""
SELECT ticker,
ROUND(LAST(close), 2),
ROUND((LAST(close) - FIRST(close)) / FIRST(close) * 100, 2)
FROM stock_prices
GROUP BY ticker
ORDER BY (LAST(close) - FIRST(close)) / FIRST(close) DESC
""").fetchall():
direction = "📈" if row[2] > 0 else "📉"
lines.append(f" {direction} {row[0]}: ${row[1]} ({row[2]}%)")
lines.append("")
# 2. Volatility Rankings
lines.append("═══ Volatility Rankings ═══")
for row in conn.execute("""
SELECT ticker, ROUND(STDDEV(close) / AVG(close) * 100, 2)
FROM stock_prices GROUP BY ticker ORDER BY 2 DESC LIMIT 3
""").fetchall():
lines.append(f" 🔥 {row[0]}: {row[1]}%")
return "\n".join(lines)
report = generate_report()
print(report)
# Send report (email / Telegram Bot API)
# msg = MIMEText(report)
# msg['Subject'] = 'Daily Market Brief'
# msg['From'] = '[email protected]'
# msg['To'] = '[email protected]'
# smtplib.SMTP('smtp.gmail.com', 587).send_message(msg)
Schedule with Cron
# Auto-generate and email report daily at 21:00
0 21 * * * /usr/bin/python3 /opt/scripts/stock_report.py
Monetization Paths
This system has four direct monetization routes:
1. Data Product — Package reports into a paid Telegram channel or Newsletter at ¥99-299/month
2. API Service — Wrap the analysis logic into a market data API for other companies at ¥500-2000/request
3. Consulting — Build custom analysis pipelines for investment firms at ¥3000-10000/project
4. SaaS Backend — Use DuckDB as the analysis engine for SaaS products, cutting server costs by 90% while staying open source
Core Logic: DuckDB lets one person accomplish what used to require a small team. While others are buying servers and hiring DBAs, you’re delivering products with a few lines of SQL.
Key Takeaways
- COPY FROM beats INSERT — Bulk import is 100x faster than row-by-row INSERT
- LAST_VALUE + WINDOW is powerful — Monthly return rankings in one statement, no Pandas pivot needed
- Create indexes strategically — Index on
dateandtickerfor 10x+ query speedup - File mode for persistence — Use
.duckdbfiles instead of:memory:for cross-session data survival - 0.4ms query speed — Even with 1000x data growth, DuckDB’s vectorized engine holds strong
This system takes 30 minutes to build and run. What’s truly valuable is the insight and differentiation you add on top — that’s your competitive edge, not the tool itself.
💡 All SQL templates and Python scripts from this article are organized into a runnable project repository, including real-time data fetching, automated pushing, and performance tuning guides. To dive deeper into DuckDB for financial applications, duckdblab.org has a complete series of practical tutorials covering everything from basic queries to enterprise-level deployment.
