1. The Excel Data Nightmare Every Professional Knows
You work in finance, operations, or sales support. Monday morning arrives, and your boss messages:
“I need a Q1 sales summary from all regions; we have a meeting this afternoon.”
Sounds simple? But when you open your inbox to find 10 attachments, each with a different Excel format, your heart sinks:
- Some sheets are named “Sheet1”, others “Sales Data” or “销售数据”
- Some columns are called “revenue”, others “amount” or “销售额”
- Some files are .xlsx, others are ancient .xls
- Each file ranges from 10MB to 50MB
Traditionally, you have a few options:
- Manual Copy-Paste: Open 10 files, copy and paste one by one; takes 30-60 minutes and is error-prone
- Python Pandas: Write read_excel() + concat() + groupby(), but it eats 4-8GB of RAM, and large files cause OOM crashes
- VBA Macros: High maintenance cost, breaks when switching computers
- Paid BI Tools: Tableau / Power BI can do it, but licenses cost $70-100/user/month
Is there a simpler way?
The answer: DuckDB’s excel extension, which lets you read, transform, merge, and write Excel files with a single SQL query. No Python required, no Office installation needed, and memory usage is roughly 1/40th of Pandas.
2. What Is the DuckDB Excel Extension?
The community-developed excel extension enables DuckDB to directly read and write Microsoft Excel (.xlsx) files as if they were CSV or Parquet files.
-- Install and load the extension
INSTALL excel;
LOAD excel;
-- Query an Excel file directly
SELECT * FROM 'sales_report.xlsx';
That’s it. DuckDB treats Excel files as standard tables. You can:
- Read with SELECT
- Write with INSERT / CREATE TABLE AS
- Export with COPY ... TO
- JOIN across multiple files
- Use in any DuckDB environment (CLI, Python, R, Node.js)
3. Practical Scenario: Consolidating 10 Excel Files
Scenario Description
You’re an operations analyst at an e-commerce company. The sales team sent 10 regional reports (East, West, North, South, Central, Northeast, Northwest, Hong Kong/Macau/Taiwan, International, Online) for Q1 2026.
Each file has slightly different column names but the same three columns:
| Column (may vary) | Meaning |
|---|---|
| region / area / 区域 | Region name |
| sales_person / name / 姓名 | Salesperson name |
| revenue / amount / 销售额 | Revenue (CNY) |
The Pandas Era Approach
import pandas as pd

files = ['east.xlsx', 'west.xlsx', 'north.xlsx', ...]  # 10 files

# Read one by one, handling format differences
dfs = []
for f in files:
    df = pd.read_excel(f)
    # Standardize column names
    df.columns = ['region', 'sales_person', 'revenue']
    # Handle separators and nulls
    df['revenue'] = df['revenue'].replace(',', '', regex=True).astype(float)
    dfs.append(df)

# Merge
result = pd.concat(dfs, ignore_index=True)

# Aggregate
summary = result.groupby('region')['revenue'].agg(['sum', 'count', 'mean'])

# Export
summary.to_excel('q1_sales_summary.xlsx')
This code looks reasonable, but in practice:
- 10 × 30MB Excel files → 4-8GB of RAM (Pandas loads everything at once)
- Files over 50MB → near-certain OOM crash
- Execution time: 3-8 minutes
- Code: 30+ lines
The DuckDB Solution
-- 1. Load the excel extension
INSTALL excel;
LOAD excel;
-- 2. Read and aggregate in one shot
CREATE TABLE q1_summary AS
SELECT
region,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT sales_person) AS salesperson_count,
AVG(revenue) AS avg_per_person
FROM (
SELECT * FROM 'east.xlsx'
UNION ALL
SELECT * FROM 'west.xlsx'
UNION ALL
SELECT * FROM 'north.xlsx'
UNION ALL
SELECT * FROM 'south.xlsx'
UNION ALL
SELECT * FROM 'central.xlsx'
UNION ALL
SELECT * FROM 'northeast.xlsx'
UNION ALL
SELECT * FROM 'northwest.xlsx'
UNION ALL
SELECT * FROM 'hkmacau.xlsx'
UNION ALL
SELECT * FROM 'international.xlsx'
UNION ALL
SELECT * FROM 'online.xlsx'
)
GROUP BY region
ORDER BY total_revenue DESC;
-- 3. Export back to Excel
COPY q1_summary TO 'q1_sales_summary.xlsx' (FORMAT xlsx);
Pro tip: Use glob patterns when filenames follow a convention
-- Match all regional files with a wildcard
CREATE TABLE all_sales AS
SELECT * FROM read_xlsx('region_*.xlsx');
-- One-step aggregation
SELECT region, SUM(revenue) AS total_revenue
FROM all_sales
GROUP BY region
ORDER BY total_revenue DESC;
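Glob support in Excel reads can vary between DuckDB versions; as a fallback, you can generate the UNION ALL statement in Python and hand it to DuckDB. A minimal sketch (union_all_sql is my helper name, not a DuckDB API):

```python
import glob

def union_all_sql(files):
    """Build one UNION ALL query over the given Excel files."""
    if not files:
        raise ValueError("no input files")
    return "\nUNION ALL\n".join(f"SELECT * FROM '{f}'" for f in files)

# Collect the regional files, then hand the generated SQL to DuckDB
# (e.g. duckdb.sql(union_all_sql(files)) in Python, or paste it into the CLI).
files = sorted(glob.glob("region_*.xlsx"))
```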
Handling inconsistent column names is straightforward:
-- Standardize column names before UNION
SELECT region, sales_person, revenue FROM (
SELECT region, sales_person, revenue FROM 'east.xlsx'
UNION ALL
SELECT area AS region, name AS sales_person, amount AS revenue FROM 'west.xlsx'
-- ... normalize each file as needed
);
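With ten differently named files, it is easier to keep the rename rules in one mapping and generate each per-file SELECT from it. A small Python sketch (the alias table below is illustrative; extend it to match your real files):

```python
# Map every column alias seen in the source files to its canonical name.
CANONICAL = {
    "region": "region", "area": "region",
    "sales_person": "sales_person", "name": "sales_person",
    "revenue": "revenue", "amount": "revenue",
}

def normalized_select(filename, columns):
    """Build a SELECT that renames a file's columns to the canonical schema."""
    exprs = [
        col if CANONICAL[col] == col else f"{col} AS {CANONICAL[col]}"
        for col in columns
    ]
    return f"SELECT {', '.join(exprs)} FROM '{filename}'"
```

Joining the generated statements with UNION ALL yields the normalized query without hand-editing SQL for each file.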
Performance Comparison
| Dimension | Pandas (Traditional) | DuckDB Excel Extension |
|---|---|---|
| Memory Usage | 4-8 GB | ~100 MB |
| Execution Time | 3-8 minutes | 8-15 seconds |
| Lines of Code | 30+ lines | 3-5 lines SQL |
| File Size Limit | < 100MB (OOM beyond) | GB-scale files |
| Excel Version Support | Depends on openpyxl/xlrd | Native .xlsx support |
| Dependencies | pandas + openpyxl + xlrd | DuckDB + excel extension |
| Streaming | ❌ Full load in memory | ✅ Vectorized streaming |
| Cross-file JOIN | Merge then process | Native SQL JOIN |
| Export Formats | Excel/CSV | Excel/CSV/Parquet/JSON |
4. More Practical Scenarios
Scenario 2: Cross-File VLOOKUP
You have an orders.xlsx file and a customer_tiers.xlsx file. You need to tag each order with the customer’s tier:
SELECT
o.order_id,
o.customer_id,
o.amount,
c.customer_tier,
CASE
WHEN c.customer_tier = 'VIP' THEN o.amount * 0.85
WHEN c.customer_tier = 'Gold' THEN o.amount * 0.90
ELSE o.amount * 0.95
END AS discounted_amount
FROM 'orders.xlsx' o
JOIN 'customer_tiers.xlsx' c ON o.customer_id = c.customer_id
ORDER BY o.amount DESC;
Before: VLOOKUP formulas across 100K rows would freeze Excel. After: a DuckDB SQL JOIN finishes in about 0.5 seconds.
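The CASE expression above is just a tier-to-discount lookup with a default; if you ever post-process results in Python, the same rule is one dictionary access. A sketch mirroring the query's rates:

```python
# Discount rates from the CASE expression: VIP 15% off, Gold 10% off, else 5% off.
DISCOUNT_RATE = {"VIP": 0.85, "Gold": 0.90}
DEFAULT_RATE = 0.95

def discounted_amount(amount, tier):
    """Apply the customer's tier discount, falling back to the default rate."""
    return amount * DISCOUNT_RATE.get(tier, DEFAULT_RATE)
```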
Scenario 3: Excel Data Cleaning + Database Write
Marketing receives daily Excel reports from ad channels and needs to clean and write them to PostgreSQL:
-- Read Excel, clean, write to PostgreSQL
INSTALL postgres_scanner;
LOAD postgres_scanner;
ATTACH 'host=db.example.com dbname=marketing' AS pg_db (TYPE postgres);
CREATE TABLE pg_db.daily_report AS
SELECT
    date,
    channel,
    UPPER(TRIM(channel_name)) AS channel_name,         -- normalize case
    CAST(REPLACE(spend, ',', '') AS DOUBLE) AS spend,  -- clean numbers
    CAST(REPLACE(impressions, ',', '') AS INTEGER) AS impressions,
    CAST(REPLACE(clicks, ',', '') AS INTEGER) AS clicks,
    CAST(REPLACE(spend, ',', '') AS DOUBLE)
        / NULLIF(CAST(REPLACE(clicks, ',', '') AS INTEGER), 0) AS cpc
FROM 'daily_ad_report_2026-05-11.xlsx'
WHERE date IS NOT NULL;  -- filter empty rows
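The REPLACE/CAST/NULLIF pattern above translates directly if you ever need to pre-clean values in Python instead; a minimal sketch (the helper names are mine, not DuckDB's):

```python
def clean_number(raw):
    """Strip thousands separators; empty or None becomes None (a SQL-style NULL)."""
    if raw is None or str(raw).strip() == "":
        return None
    return float(str(raw).replace(",", ""))

def cost_per_click(spend, clicks):
    """spend / clicks, returning None when clicks is 0 (mirrors NULLIF(clicks, 0))."""
    return None if not clicks else spend / clicks
```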
Scenario 4: Excel to Parquet Migration
If your team is transitioning to more efficient data formats:
-- Excel → Parquet (5-10x compression, 100x faster queries)
COPY (
SELECT * FROM 'historical_data.xlsx'
) TO 'historical_data.parquet' (FORMAT parquet, COMPRESSION zstd);
-- Future queries use Parquet
SELECT year, SUM(revenue)
FROM 'historical_data.parquet'
GROUP BY year;
5. Deep Comparison with Traditional Excel Tools
| Tool | Best For | Memory Efficiency | Learning Curve | Speed (10 files × 30MB) | Large File Support | Automation | Cost |
|---|---|---|---|---|---|---|---|
| Excel Itself | Ad-hoc analysis, manual ops | Poor (>100K rows chokes) | Zero | 5-10 min (manual) | ❌ | Needs VBA | $159/yr |
| Pandas | Python data science ecosystem | Poor (full load) | Moderate | 3-8 min | Limited (<100MB) | ✅ Python | Free |
| openpyxl | Fine-grained Excel ops | Very poor | Moderate | 5-15 min | ❌ <50MB | ✅ Python | Free |
| VBA Macros | In-Excel automation | Good (row-by-row) | High | 2-10 min | ✅ Large | ✅ In-Excel | Built-in |
| Power Query | Power BI ecosystem | Moderate | Moderate | 1-3 min | ✅ | ✅ | Free+ |
| DuckDB | Analytical data processing | ✅ Excellent | Low (SQL) | 8-15 sec | ✅ GB-scale | ✅ Any language | Free |
| Tableau Prep | Enterprise data prep | Good | High | 1-2 min | ✅ | ✅ | $70/mo |
6. Advanced Techniques
6.1 Using DuckDB with Excel in Python
You don’t have to abandon Python; instead, use DuckDB as the computation engine:
import duckdb
import matplotlib.pyplot as plt

# DuckDB reads Excel directly and returns a Pandas DataFrame
conn = duckdb.connect()
conn.execute("INSTALL excel; LOAD excel;")
result = conn.execute("""
    SELECT
        region,
        SUM(revenue) AS total_revenue,
        COUNT(*) AS order_count
    FROM 'sales.xlsx'
    GROUP BY region
""").fetchdf()

# The result is a Pandas DataFrame; visualize it
result.plot.bar(x='region', y='total_revenue')
plt.show()
6.2 Command-Line One-Liners
# Install DuckDB CLI
curl -fsSL https://install.duckdb.org | sh
# One-liner: summarize Excel
duckdb -c "
LOAD excel;
SELECT region, SUM(revenue) FROM 'sales.xlsx' GROUP BY region;
"
# One-liner: convert Excel to Parquet
duckdb -c "
LOAD excel;
COPY (SELECT * FROM 'data.xlsx') TO 'data.parquet' (FORMAT parquet);
"
6.3 Scheduled Automation
# crontab -e
# Auto-consolidate Excel reports every Monday at 8:00 AM.
# Note: cron treats each entry as a single line (and expands unescaped "%"
# to a newline), so the whole command must stay on one line.
0 8 * * 1 cd /home/reports && duckdb -c "LOAD excel; COPY (SELECT region, SUM(revenue) AS total FROM read_xlsx('region_*.xlsx') GROUP BY region) TO 'weekly_summary.xlsx' (FORMAT xlsx);"
7. Monetization Strategies
Mastering DuckDB + Excel processing opens up several revenue streams:
💰 Strategy 1: Excel Report Automation Service ($49-199/month)
SMBs (especially finance and operations teams) manually consolidate Excel files every week. You can:
- Build DuckDB-powered automated report pipelines for clients
- Clients drop Excel files into a folder → DuckDB auto-processes → clean output delivered
- Pricing: $49/mo (basic) / $99/mo (cross-system integration) / $199/mo (with visualization dashboard)
- Acquisition: Share Excel automation tutorials on LinkedIn/Medium → convert readers to clients via DMs
💰 Strategy 2: Excel → Data Warehouse Migration ($500-2,000/project)
Many companies want to migrate from Excel to proper data warehouses but are put off by ETL costs.
- Use DuckDB as the intermediary engine: Excel → Parquet → DuckDB → BI Tool
- Pricing: $500 (single migration) / $2,000 (includes training + automated pipeline)
- Target: SMBs, startups, IT departments at traditional companies
💰 Strategy 3: SaaS - Excel Data Cleaning Platform
Build a SaaS product:
- Users upload Excel → DuckDB processes in the cloud → outputs clean CSV/Parquet/Excel
- Features: dedup, format standardization, cross-file JOIN, anomaly detection
- Pricing: 7-day free trial, then $29/mo (200MB) / $79/mo (2GB)
- Tech stack: DuckDB (engine) + Streamlit/Gradio (frontend) + simple payment integration
💰 Strategy 4: Corporate Training
Many data analysts want to learn DuckDB but don’t know where to start.
- Course: “DuckDB in Practice: From Excel to Data Warehouse”
- Content: Excel processing + cross-database JOIN + performance optimization + real-world cases
- Pricing: $99 (recorded online course) / $1,000-3,000/day (corporate training)
- Platform: Udemy / LinkedIn Learning / your own site
💰 Strategy 5: Premium Templates & Scripts
- DuckDB + Excel automation templates (SQL scripts + cron configurations)
- Pricing: $29/set
- Content: 10 ready-to-use SQL scripts covering common scenarios
- Distribution: GitHub Sponsors / Gumroad
8. Summary
The DuckDB excel extension lets you read and write .xlsx files with SQL: 10 files consolidated in seconds, roughly 40x less memory than Pandas, and 90% less code. Switching from Pandas to DuckDB for Excel processing is one of the highest-ROI skill upgrades for data analysts in 2026.
Try it now:
# macOS / Linux
curl -fsSL https://install.duckdb.org | sh
# Launch DuckDB CLI
duckdb
# Inside the CLI
INSTALL excel;
LOAD excel;
SELECT * FROM 'your_file.xlsx';
Subscribe to DuckDB Lab (duckdblab.org) for weekly DuckDB tutorials, performance optimization tips, and monetization strategies.