1. The Data Silo Nightmare Every Analyst Knows
You work at an e-commerce company. The boss asks: “Among the top 100 products by last month’s sales, what’s the repeat customer rate?”
Where is the data?
- Orders: in the MySQL transaction database, 7 tables sharded by month
- Customer tags: in the PostgreSQL analytics database, marking new vs. returning customers
- Product info: in an Excel/CSV file the operations team updates weekly (with changing column names)
The traditional approach? A painstaking five-step ordeal:
Step 1: Export sales from MySQL → run a query → save CSV (5 minutes)
Step 2: Export customer tags from PostgreSQL → run a query → save CSV (5 minutes)
Step 3: Merge three CSVs in Excel with VLOOKUP → drag the formula, wait, pray (10 minutes, Excel may crash)
Step 4: Realize data is missing → re-export → re-VLOOKUP (double the pain)
Step 5: Boss says "add another dimension" → start from scratch (total meltdown)
Total time: 30 minutes to 1 hour. With large datasets, Excel crashes. And none of it is reusable — change the date range and you start over.
2. The DuckDB Solution: ATTACH + Cross-Database JOIN
DuckDB has a severely underrated feature: the ATTACH statement lets you mount external databases like mounting drives, then JOIN across data sources with plain SQL.
What does this mean? One SQL query across all three data sources — no exports, no merges, no VLOOKUP.
2.1 How ATTACH Works
-- Mount a SQLite database (simple example)
ATTACH 'path/to/file.db' AS my_db (TYPE SQLITE);
-- Now cross-database query is possible
SELECT *
FROM my_db.some_table AS a
JOIN main.public.another_table AS b ON a.id = b.id;
DuckDB supports ATTACH for these data sources:
| Data Source | ATTACH Syntax | Type Identifier |
|---|---|---|
| SQLite | ATTACH 'file.db' (TYPE SQLITE) | SQLITE |
| MySQL | ATTACH 'host=... user=... database=...' (TYPE MYSQL) | MYSQL |
| PostgreSQL | ATTACH 'host=... dbname=...' (TYPE POSTGRES) | POSTGRES |
| DuckDB native | ATTACH 'data.duckdb' | DUCKDB |
| Delta Lake | ATTACH './delta_dir' (TYPE DELTA) | DELTA |
Note: Connecting to MySQL and PostgreSQL requires installing the corresponding extensions — named mysql and postgres in current DuckDB releases (older docs call them mysql_scanner / postgres_scanner):
INSTALL mysql; LOAD mysql; INSTALL postgres; LOAD postgres;
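If you drive DuckDB from Python, the same mount-and-query flow takes just a few lines. A minimal sketch — the SQLite path is a placeholder:

import duckdb

con = duckdb.connect()  # in-memory DuckDB instance
con.execute("INSTALL sqlite; LOAD sqlite;")  # one-time; recent builds often autoload this
con.execute("ATTACH 'path/to/file.db' AS my_db (TYPE SQLITE)")
print(con.execute("SHOW DATABASES").fetchall())  # confirm the mount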
2.2 Full Walkthrough: E-Commerce Cross-Database Query
The script below simulates the three data sources with CSV files mounted as DuckDB views — no real databases needed. Just copy and run.
#!/usr/bin/env python3
"""
DuckDB Cross-Database JOIN Demo
Scenario: E-commerce data across three sources, one SQL to unite them all
Prerequisites: pip install duckdb pandas
"""
import duckdb
import os
# ====== Step 1: Create Mock Data ======
# Mock MySQL orders table (CSV file)
orders_csv = """order_id,customer_id,product_id,amount,order_date
1001,201,5001,299.00,2026-05-01
1002,202,5002,159.00,2026-05-01
1003,201,5003,899.00,2026-05-02
1004,203,5001,299.00,2026-05-02
1005,204,5004,459.00,2026-05-03
1006,202,5002,159.00,2026-05-03
1007,205,5005,1299.00,2026-05-04
1008,203,5003,899.00,2026-05-04
1009,206,5001,299.00,2026-05-05
1010,201,5004,459.00,2026-05-05
"""
# Mock PostgreSQL users table (CSV file)
users_csv = """customer_id,name,city,member_level,register_date
201,Alice,Beijing,Gold,2025-01-15
202,Bob,Shanghai,Silver,2025-03-20
203,Carol,Guangzhou,Gold,2025-02-01
204,Dave,Shenzhen,Regular,2025-06-10
205,Eve,Hangzhou,Silver,2025-04-05
206,Frank,Chengdu,Regular,2025-08-15
"""
# Mock product catalog
products_csv = """product_id,product_name,category,unit_price,cost_price
5001,Wireless Earbuds,Electronics,299.00,180.00
5002,Insulated Mug,Home,159.00,80.00
5003,Smart Watch,Electronics,899.00,550.00
5004,Running Shoes,Apparel,459.00,280.00
5005,Tablet Stand,Electronics,1299.00,800.00
"""
# Write temp files
os.makedirs("day07_data", exist_ok=True)
with open("day07_data/orders.csv", "w") as f:
f.write(orders_csv)
with open("day07_data/users.csv", "w") as f:
f.write(users_csv)
with open("day07_data/products.csv", "w") as f:
f.write(products_csv)
# ====== Step 2: Cross-Source JOIN with DuckDB ======
con = duckdb.connect()
# Mount CSV files as views — simulating "different data sources"
con.execute("""
CREATE VIEW orders AS
SELECT * FROM read_csv_auto('day07_data/orders.csv')
""")
con.execute("""
CREATE VIEW users AS
SELECT * FROM read_csv_auto('day07_data/users.csv')
""")
con.execute("""
CREATE VIEW products AS
SELECT * FROM read_csv_auto('day07_data/products.csv')
""")
print("=" * 60)
print("📊 Cross-Source: What did Gold members buy?")
print("=" * 60)
# One SQL across three "data sources"
result = con.execute("""
SELECT
u.name AS customer_name,
u.city,
u.member_level,
p.product_name,
p.category,
o.amount AS order_amount,
(o.amount - p.cost_price) AS gross_margin
FROM orders o
JOIN users u ON o.customer_id = u.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE u.member_level IN ('Gold', 'Silver')
ORDER BY o.amount DESC
""").fetchdf()
print(result.to_string(index=False))
print("\n" + "=" * 60)
print("💰 Daily Sales Summary (with Margins)")
print("=" * 60)
result2 = con.execute("""
SELECT
o.order_date AS date,
COUNT(DISTINCT o.order_id) AS orders,
SUM(o.amount) AS revenue,
SUM(o.amount - p.cost_price) AS gross_profit,
ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit_per_order
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY o.order_date
ORDER BY o.order_date
""").fetchdf()
print(result2.to_string(index=False))
print("\n" + "=" * 60)
print("🏆 Category Profitability Analysis")
print("=" * 60)
result3 = con.execute("""
SELECT
p.category,
COUNT(*) AS units_sold,
SUM(o.amount) AS revenue,
SUM(o.amount - p.cost_price) AS gross_profit,
ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit,
ROUND(SUM(o.amount - p.cost_price) / SUM(o.amount) * 100, 1) AS margin_pct
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY p.category
ORDER BY gross_profit DESC
""").fetchdf()
print(result3.to_string(index=False))
# ====== Step 3: Export to Excel ======
try:
    # Native XLSX export via the excel extension (DuckDB 1.2+)
    con.execute("INSTALL excel; LOAD excel;")
    con.execute("""
        COPY (
            SELECT
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.xlsx'
        WITH (FORMAT xlsx, HEADER true);
    """)
    print("\n✅ Report exported: day07_data/cross_source_report.xlsx")
except Exception as e:
    print(f"\n⚠️ XLSX export needs the excel extension: {e}")
    print("Falling back to CSV:")
    con.execute("""
        COPY (
            SELECT
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.csv' (HEADER, DELIMITER ',');
    """)
    print("✅ Report exported: day07_data/cross_source_report.csv")
# ====== Step 4: Validation ======
print("\n" + "=" * 60)
print("📈 Validation: Are all orders matched to users and products?")
print("=" * 60)
validation = con.execute("""
SELECT
'Total orders' AS metric, CAST(COUNT(*) AS VARCHAR) AS value FROM orders
UNION ALL
SELECT 'Matched to users', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN users u ON o.customer_id = u.customer_id
UNION ALL
SELECT 'Matched to products', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN products p ON o.product_id = p.product_id
UNION ALL
SELECT 'Fully matched', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN users u ON o.customer_id = u.customer_id
JOIN products p ON o.product_id = p.product_id
""").fetchdf()
print(validation.to_string(index=False))
con.close()
print("\n🎉 Cross-database JOIN demo complete!")
2.3 Real Environment: Connecting to MySQL + PostgreSQL
When you have actual MySQL and PostgreSQL databases, the script becomes:
-- Install extensions (one-time)
INSTALL mysql;
LOAD mysql;
INSTALL postgres;
LOAD postgres;
-- Mount MySQL orders database
ATTACH 'host=localhost port=3306 database=orders_db user=analyst password=xxx'
AS mysql_db (TYPE MYSQL);
-- Mount PostgreSQL user database
ATTACH 'host=localhost port=5432 dbname=users_db user=analyst password=xxx'
AS pg_db (TYPE POSTGRES);
-- Mount local product CSV
CREATE VIEW products AS
SELECT * FROM read_csv_auto('products.csv');
-- One SQL across all three
SELECT
u.name,
u.city,
p.product_name,
SUM(o.amount) AS total_spent
FROM mysql_db.orders AS o
JOIN pg_db.public.customers AS u ON o.customer_id = u.customer_id
JOIN products AS p ON o.product_id = p.product_id
WHERE o.order_date >= '2026-04-01'
GROUP BY u.name, u.city, p.product_name
ORDER BY total_spent DESC;
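The same thing from Python, with passwords pulled from environment variables instead of hard-coded. A sketch — the hosts, database names, and the MYSQL_PWD / PG_PWD variable names are placeholders:

import duckdb
import os

con = duckdb.connect()
con.execute("INSTALL mysql; LOAD mysql; INSTALL postgres; LOAD postgres;")

# Build connection strings without embedding secrets in the script
mysql_conn = f"host=localhost port=3306 database=orders_db user=analyst password={os.environ['MYSQL_PWD']}"
pg_conn = f"host=localhost port=5432 dbname=users_db user=analyst password={os.environ['PG_PWD']}"

con.execute(f"ATTACH '{mysql_conn}' AS mysql_db (TYPE MYSQL)")
con.execute(f"ATTACH '{pg_conn}' AS pg_db (TYPE POSTGRES)")
con.execute("CREATE VIEW products AS SELECT * FROM read_csv_auto('products.csv')")

# Cross-source query straight into a DataFrame
df = con.execute("""
    SELECT u.name, SUM(o.amount) AS total_spent
    FROM mysql_db.orders o
    JOIN pg_db.public.customers u ON o.customer_id = u.customer_id
    GROUP BY u.name
    ORDER BY total_spent DESC
""").fetchdf()
print(df.head())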
3. Performance Comparison: Traditional vs. DuckDB
| Scenario | Traditional (Export + VLOOKUP) | DuckDB ATTACH |
|---|---|---|
| Cross MySQL + PG + CSV query | 30 min ~ 1 hour | 10 ~ 30 seconds |
| Opening 500K rows in Excel | Freeze/crash | Sub-second results |
| Changing analysis dimensions | Re-export + re-VLOOKUP | Edit one line of SQL |
| Scheduled report generation | Manual repeat every time | One-click script |
| 10GB+ dataset | Excel/Pandas OOM | Streaming, no pressure |
| Learning curve | Know VLOOKUP | Know standard SQL |
| Dependencies | Excel + multiple database clients | Just DuckDB |
| Reusability | Essentially zero | SQL script forever reusable |
Quantified impact:
A real case — an e-commerce company needing daily cross-source operations reports:
- Traditional: Data analyst spends 40 minutes daily exporting, merging, checking
- DuckDB: Write SQL once, daily execution 15 seconds
- Monthly time saved: 40 min × 22 workdays = 880 minutes (14.7 hours)
- Cost savings: At ¥50/hour, that’s ¥735/month per person
4. How ATTACH Works Under the Hood
Understanding ATTACH’s internals helps you design and optimize cross-database queries.
4.1 ATTACH Is NOT ETL
ATTACH doesn’t copy data into DuckDB — it creates an external table reference. When you query, DuckDB pushes query predicates down to the source database, only pulling back what’s needed.
-- The WHERE filter below is pushed down into the MySQL scan;
-- only matching rows cross the network, and DuckDB aggregates them locally
SELECT customer_id, COUNT(*)
FROM mysql_db.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;
DuckDB’s optimizer automatically pushes WHERE filters and column projections down to the source database, so only the rows and columns you actually need cross the network; aggregations then run locally on that reduced stream.
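To see what actually gets pushed down, inspect the plan with EXPLAIN — a pushed-down filter appears inside the remote scan node rather than as a separate filter above it. A quick sketch, reusing a connection where mysql_db is already attached as in section 2.3:

# Print DuckDB's physical plan for the cross-database query
plan = con.execute("""
    EXPLAIN
    SELECT customer_id, COUNT(*)
    FROM mysql_db.orders
    WHERE order_date >= '2026-01-01'
    GROUP BY customer_id
""").fetchall()
for _, tree in plan:
    print(tree)  # look for the date filter inside the table scan node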
4.2 Key Performance Factors
| Factor | Description | Optimization |
|---|---|---|
| Network latency | Cross-db queries travel the network | Deploy DuckDB close to your databases |
| Pushdown optimization | Filters and projections execute remotely | Use WHERE to cut data volume |
| Index utilization | Source-database indexes still serve pushed-down filters | Index your JOIN and filter columns |
| Data volume | DuckDB doesn’t cache external tables between queries | Use CREATE TABLE AS for repeated queries (see the sketch below) |
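For that last row — repeated queries over the same remote data — a local snapshot pays for itself after the first query. A sketch, again assuming mysql_db is attached:

# One network round trip to materialize the remote rows locally
con.execute("""
    CREATE OR REPLACE TABLE orders_local AS
    SELECT * FROM mysql_db.orders
    WHERE order_date >= '2026-04-01'
""")
# Every query from here on runs at local, columnar DuckDB speed
print(con.execute("SELECT COUNT(*) FROM orders_local").fetchone())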
4.3 Performance Tips
-- ❌ BAD: Pull everything, filter later
SELECT * FROM mysql_db.orders; -- Could be millions of rows
-- ✅ GOOD: Push down filter + limit
SELECT * FROM mysql_db.orders
WHERE order_date >= '2026-05-01'
LIMIT 1000;
-- ✅ BEST: Aggregate first, then JOIN
WITH daily_stats AS (
-- The date filter is pushed down to MySQL; DuckDB aggregates the rows that come back
SELECT customer_id, DATE(order_date) AS day, SUM(amount) AS daily_total
FROM mysql_db.orders
WHERE order_date >= '2026-04-01'
GROUP BY customer_id, DATE(order_date)
)
-- Then JOIN with local data
SELECT u.name, d.day, d.daily_total
FROM daily_stats d
JOIN pg_db.public.customers u ON d.customer_id = u.customer_id
ORDER BY d.daily_total DESC
LIMIT 20;
5. Advanced Use Cases
5.1 Data Migration: Cross-Database Copy
-- MySQL → DuckDB local table (one-time snapshot)
CREATE TABLE local_orders AS
SELECT * FROM mysql_db.orders
WHERE order_date >= '2026-01-01';
-- DuckDB → PostgreSQL (write back)
CREATE TABLE pg_db.public.report AS
SELECT * FROM local_analytics;
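When several tables need to move, a small loop over CREATE TABLE AS keeps it tidy. A sketch — the table list and connection string are placeholders:

import duckdb

con = duckdb.connect("warehouse.duckdb")  # persistent local DuckDB file
con.execute("INSTALL mysql; LOAD mysql;")
con.execute("ATTACH 'host=localhost port=3306 database=orders_db user=analyst password=xxx' AS mysql_db (TYPE MYSQL)")

for table in ["orders", "refunds", "shipments"]:  # hypothetical table names
    con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM mysql_db.{table}")
    print(f"Snapshotted {table}")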
5.2 Multi-Environment Comparison
-- Mount production and staging simultaneously
ATTACH 'prod_conn' AS prod (TYPE POSTGRES);
ATTACH 'staging_conn' AS staging (TYPE POSTGRES);
-- Compare data differences
SELECT
COALESCE(p.order_id, s.order_id) AS order_id,
p.amount AS prod_amount,
s.amount AS staging_amount,
(p.amount - s.amount) AS diff
FROM prod.public.orders p
FULL OUTER JOIN staging.public.orders s
ON p.order_id = s.order_id
WHERE p.amount IS DISTINCT FROM s.amount;
5.3 Automated Scheduled Reports
# Cron-based daily report generation
import duckdb
con = duckdb.connect()
# ATTACH data sources
con.execute("ATTACH '...' AS mysql_db (TYPE MYSQL)")
con.execute("ATTACH '...' AS pg_db (TYPE POSTGRES)")
# Generate daily report
con.execute("""
COPY (
-- cross-source query...
) TO '/tmp/daily_report.csv' (HEADER, DELIMITER ',');
""")
# Send email via SMTP (pseudo-code)
# send_email(to='boss@company.com', attachment='/tmp/daily_report.csv')
print("✅ Daily report generated")
6. Connection Guide & Troubleshooting
6.1 MySQL Extension Setup
# The prebuilt extension is self-contained on most platforms.
# If you hit missing-client-library errors, install them:
apt-get install -y default-libmysqlclient-dev # Ubuntu/Debian
# or
brew install mysql-client # macOS
Then in DuckDB:
INSTALL mysql;
LOAD mysql;
6.2 Connection String Formats
| Source | Connection String Example |
|---|---|
| MySQL | host=localhost port=3306 database=test user=root password=secret |
| PostgreSQL | host=localhost port=5432 dbname=test user=postgres password=secret |
| SQLite | ./data.db (just the file path) |
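Recent DuckDB releases also support secrets, which keep passwords out of connection strings entirely. A sketch of that workflow — verify the option names against the docs for your version:

import duckdb

con = duckdb.connect()
con.execute("INSTALL mysql; LOAD mysql;")
# Register credentials once as a secret...
con.execute("""
    CREATE SECRET mysql_creds (
        TYPE MYSQL,
        HOST 'localhost',
        PORT 3306,
        DATABASE 'orders_db',
        USER 'analyst',
        PASSWORD 'xxx'
    )
""")
# ...then ATTACH with an empty connection string
con.execute("ATTACH '' AS mysql_db (TYPE MYSQL)")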
6.3 Common Pitfalls
- MySQL 8.0 auth: Use mysql_native_password, or update DuckDB to v1.5+ for caching_sha2_password support
- PostgreSQL SSL: Add the sslmode=require parameter to the connection string
- Large table JOINs: If both sides are large, pull the smaller table into DuckDB locally first (see the sketch below)
- Character encoding: DuckDB assumes UTF-8; MySQL latin1 data may come through garbled
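For the large-JOIN pitfall, the pattern looks like this — a sketch reusing the con from section 2.3 with both databases mounted:

# Pull the smaller side (customers) across the network once
con.execute("""
    CREATE OR REPLACE TABLE users_local AS
    SELECT customer_id, name, city FROM pg_db.public.customers
""")
# JOIN the big remote table against the local copy; the WHERE
# filter still pushes down to MySQL to limit what's transferred
df = con.execute("""
    SELECT u.city, SUM(o.amount) AS revenue
    FROM mysql_db.orders o
    JOIN users_local u ON o.customer_id = u.customer_id
    WHERE o.order_date >= '2026-05-01'
    GROUP BY u.city
""").fetchdf()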
7. Monetization Strategy
This skill has significant market value because almost every company struggles with data silos.
Target Customers
- SMBs: Data scattered across multiple systems, no dedicated data team
- E-commerce: Order system + CRM + finance — all independent
- Retail chains: Each store + HQ + supply chain — different data sources
- Traditional enterprises in transition: Legacy databases coexisting with new systems
Pricing
| Service | Price | Deliverables | Timeline |
|---|---|---|---|
| One-time data integration | ¥2,000-5,000 ($280-700) | Cross-source query scripts + Excel report template | 1-3 days |
| Monthly report automation | ¥500-1,500/month ($70-210) | Scheduled cross-source business reports | Monthly |
| Data warehouse setup | ¥5,000-15,000 ($700-2,100) | Complete ETL pipeline + analytics dashboard | 1-2 weeks |
| Data integration training | ¥1,500-3,000/session ($210-420) | Teach team to use DuckDB themselves | Half day |
Competitive Landscape
| Solution | Price | Strength | Weakness |
|---|---|---|---|
| Traditional ETL (Kettle/DataX) | Free but needs ops | Full-featured | Complex configuration, steep learning curve |
| Commercial BI (Tableau/Power BI) | ¥500-2,000/month | Great visualization | Expensive, weak cross-source capability |
| Hiring a manual analyst | ¥300-500/month | No thinking required | Unreliable, churn risk |
| DuckDB Solution (You) | ¥2,000-5,000 | One-time build, permanent use | Requires basic technical client |
Client Acquisition
- Freelance platforms (Upwork, Fiverr): Search “data integration,” “cross-database query,” “report automation” — pitch DuckDB solutions
- Industry communities: Join e-commerce, retail, or operations groups — ask “how do you generate your reports?”
- This blog post: Share it as proof of expertise. Every time someone reads it, they know you can solve their data silo problem.
Sales Pitch Template
“I see your company has data spread across different systems — you probably spend hours manually merging each report. I have a solution that connects all your data sources with one SQL query. After setup, you click once and get a complete report. Integration costs ¥3,000, then ¥800/month for automated monthly reports. Interested in a free data assessment first?”
8. Conclusion
DuckDB’s ATTACH + cross-database JOIN capability is this database’s most underrated killer feature. It frees data analysts from the primitive “export → merge → VLOOKUP” workflow, compressing cross-source query time from hours to seconds.
More importantly, enterprise data silos are a universal pain point, and DuckDB provides a low-cost, easy-to-learn, immediately effective solution. Master this skill, and you can solve real-world integration problems — at a clear, billable price.
What to do today:
- Run the demo script above to understand ATTACH syntax
- Find the most fragmented data scenario in your company
- Connect it with DuckDB and show the result to your boss
- Use this case study to win external clients
Cleanup:
rm -rf day07_data/