One SQL to Query MySQL, PostgreSQL and CSV: DuckDB Cross-Database JOINs in Action

Company data scattered across a MySQL order database, a PostgreSQL user database, and Excel/CSV product catalogs? DuckDB's ATTACH syntax lets you JOIN all three data sources in a single SQL query, eliminating the export-merge-VLOOKUP nightmare and turning half-hour manual merges into seconds.

1. The Data Silo Nightmare Every Analyst Knows

You work at an e-commerce company. The boss asks: “Among the top 100 products by last month’s sales, what’s the repeat customer rate?”

Where is the data?

  • Orders: in the MySQL transaction database, 7 tables sharded by month
  • Customer tags: in the PostgreSQL analytics database, marking new vs. returning customers
  • Product info: in an Excel/CSV file the operations team updates weekly (with changing column names)

The traditional approach? A painstaking five-step ordeal:

Step 1: Export sales from MySQL → run a query → save CSV (5 minutes)
Step 2: Export customer tags from PostgreSQL → run a query → save CSV (5 minutes)
Step 3: Merge three CSVs in Excel with VLOOKUP → hover, wait, pray (10 minutes, likely crash)
Step 4: Realize data is missing → re-export → re-VLOOKUP (double the pain)
Step 5: Boss says "add another dimension" → start from scratch (total meltdown)

Total time: 30 minutes to 1 hour. With large datasets, Excel crashes. And none of it is reusable — change the date range and you start over.

2. The DuckDB Solution: ATTACH + Cross-Database JOIN

DuckDB has a severely underrated feature: the ATTACH statement lets you mount external databases like mounting drives, then JOIN across data sources with plain SQL.

What does this mean? One SQL query across all three data sources — no exports, no merges, no VLOOKUP.

2.1 How ATTACH Works

-- Mount a SQLite database (simple example)
ATTACH 'path/to/file.db' AS my_db (TYPE SQLITE);

-- Now cross-database queries are possible
SELECT *
FROM my_db.some_table AS a
JOIN main.another_table AS b ON a.id = b.id;

DuckDB supports ATTACH for these data sources:

| Data Source | ATTACH Syntax | Type Identifier |
| --- | --- | --- |
| SQLite | ATTACH 'file.db' (TYPE SQLITE) | SQLITE |
| MySQL | ATTACH 'mysql_conn_str' (TYPE MYSQL) | MYSQL |
| PostgreSQL | ATTACH 'pg_conn_str' (TYPE POSTGRES) | POSTGRES |
| DuckDB native | ATTACH 'data.duckdb' | DUCKDB |
| Delta Lake | SELECT * FROM delta_scan('./delta_dir') — a table function, not ATTACH | DELTA |
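
Attached databases behave like additional catalogs in the current session. A few standard DuckDB housekeeping statements worth knowing:

SHOW DATABASES;   -- list everything currently attached
USE my_db;        -- make an attached database the default
DETACH my_db;     -- unmount when you're done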

Note: Connecting to MySQL and PostgreSQL requires installing the corresponding extensions (named mysql_scanner and postgres_scanner in older DuckDB releases):

INSTALL mysql; LOAD mysql;
INSTALL postgres; LOAD postgres;

2.2 Full Walkthrough: E-Commerce Cross-Database Query

The script below simulates three data sources using DuckDB in-memory tables and CSV files — no real databases needed. Just copy and run.

#!/usr/bin/env python3
"""
DuckDB Cross-Database JOIN Demo
Scenario: E-commerce data across three sources, one SQL to unite them all

Prerequisites: pip install duckdb
"""

import duckdb
import os

# ====== Step 1: Create Mock Data ======

# Mock MySQL orders table (CSV file)
orders_csv = """order_id,customer_id,product_id,amount,order_date
1001,201,5001,299.00,2026-05-01
1002,202,5002,159.00,2026-05-01
1003,201,5003,899.00,2026-05-02
1004,203,5001,299.00,2026-05-02
1005,204,5004,459.00,2026-05-03
1006,202,5002,159.00,2026-05-03
1007,205,5005,1299.00,2026-05-04
1008,203,5003,899.00,2026-05-04
1009,206,5001,299.00,2026-05-05
1010,201,5004,459.00,2026-05-05
"""

# Mock PostgreSQL users table (CSV file)
users_csv = """customer_id,name,city,member_level,register_date
201,Alice,Beijing,Gold,2025-01-15
202,Bob,Shanghai,Silver,2025-03-20
203,Carol,Guangzhou,Gold,2025-02-01
204,Dave,Shenzhen,Regular,2025-06-10
205,Eve,Hangzhou,Silver,2025-04-05
206,Frank,Chengdu,Regular,2025-08-15
"""

# Mock product catalog
products_csv = """product_id,product_name,category,unit_price,cost_price
5001,Wireless Earbuds,Electronics,299.00,180.00
5002,Insulated Mug,Home,159.00,80.00
5003,Smart Watch,Electronics,899.00,550.00
5004,Running Shoes,Apparel,459.00,280.00
5005,Tablet Stand,Electronics,1299.00,800.00
"""

# Write temp files
os.makedirs("day07_data", exist_ok=True)
with open("day07_data/orders.csv", "w") as f:
    f.write(orders_csv)
with open("day07_data/users.csv", "w") as f:
    f.write(users_csv)
with open("day07_data/products.csv", "w") as f:
    f.write(products_csv)

# ====== Step 2: Cross-Source JOIN with DuckDB ======

con = duckdb.connect()

# Mount CSV files as views — simulating "different data sources"
con.execute("""
    CREATE VIEW orders AS 
    SELECT * FROM read_csv_auto('day07_data/orders.csv')
""")
con.execute("""
    CREATE VIEW users AS 
    SELECT * FROM read_csv_auto('day07_data/users.csv')
""")
con.execute("""
    CREATE VIEW products AS 
    SELECT * FROM read_csv_auto('day07_data/products.csv')
""")

print("=" * 60)
print("📊 Cross-Source: What did Gold members buy?")
print("=" * 60)

# One SQL across three "data sources"
result = con.execute("""
    SELECT 
        u.name AS customer_name,
        u.city,
        u.member_level,
        p.product_name,
        p.category,
        o.amount AS order_amount,
        (o.amount - p.cost_price) AS gross_margin
    FROM orders o
    JOIN users u ON o.customer_id = u.customer_id
    JOIN products p ON o.product_id = p.product_id
    WHERE u.member_level IN ('Gold', 'Silver')
    ORDER BY o.amount DESC
""").fetchdf()

print(result.to_string(index=False))

print("\n" + "=" * 60)
print("💰 Daily Sales Summary (with Margins)")
print("=" * 60)

result2 = con.execute("""
    SELECT 
        o.order_date AS date,
        COUNT(DISTINCT o.order_id) AS orders,
        SUM(o.amount) AS revenue,
        SUM(o.amount - p.cost_price) AS gross_profit,
        ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit_per_order
    FROM orders o
    JOIN products p ON o.product_id = p.product_id
    GROUP BY o.order_date
    ORDER BY o.order_date
""").fetchdf()

print(result2.to_string(index=False))

print("\n" + "=" * 60)
print("🏆 Category Profitability Analysis")
print("=" * 60)

result3 = con.execute("""
    SELECT 
        p.category,
        COUNT(*) AS units_sold,
        SUM(o.amount) AS revenue,
        SUM(o.amount - p.cost_price) AS gross_profit,
        ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit,
        ROUND(SUM(o.amount - p.cost_price) / SUM(o.amount) * 100, 1) AS margin_pct
    FROM orders o
    JOIN products p ON o.product_id = p.product_id
    GROUP BY p.category
    ORDER BY gross_profit DESC
""").fetchdf()

print(result3.to_string(index=False))

# ====== Step 3: Export to Excel ======

try:
    con.execute("INSTALL spatial; LOAD spatial;")
    con.execute("""
        COPY (
            SELECT 
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.xlsx' 
        WITH (FORMAT GDAL, DRIVER 'xlsx');
    """)
    print("\n✅ Report exported: day07_data/cross_source_report.xlsx")
except Exception as e:
    print(f"\n⚠️ XLSX export needs spatial extension: {e}")
    print("Falling back to CSV:")
    con.execute("""
        COPY (
            SELECT 
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.csv' (HEADER, DELIMITER ',');
    """)
    print("✅ Report exported: day07_data/cross_source_report.csv")

# ====== Step 4: Validation ======
print("\n" + "=" * 60)
print("📈 Validation: Are all orders matched to users and products?")
print("=" * 60)

validation = con.execute("""
    SELECT 
        'Total orders' AS metric, CAST(COUNT(*) AS VARCHAR) AS value FROM orders
    UNION ALL
    SELECT 'Matched to users', CAST(COUNT(*) AS VARCHAR) FROM orders o 
        JOIN users u ON o.customer_id = u.customer_id
    UNION ALL
    SELECT 'Matched to products', CAST(COUNT(*) AS VARCHAR) FROM orders o 
        JOIN products p ON o.product_id = p.product_id
    UNION ALL
    SELECT 'Fully matched', CAST(COUNT(*) AS VARCHAR) FROM orders o 
        JOIN users u ON o.customer_id = u.customer_id
        JOIN products p ON o.product_id = p.product_id
""").fetchdf()

print(validation.to_string(index=False))

con.close()
print("\n🎉 Cross-database JOIN demo complete!")

2.3 Real Environment: Connecting to MySQL + PostgreSQL

When you have actual MySQL and PostgreSQL databases, the script becomes:

-- Install extensions (one-time)
INSTALL mysql;
LOAD mysql;
INSTALL postgres;
LOAD postgres;

-- Mount MySQL orders database
ATTACH 'host=localhost port=3306 database=orders_db user=analyst password=xxx' 
    AS mysql_db (TYPE MYSQL);

-- Mount PostgreSQL user database
ATTACH 'host=localhost port=5432 dbname=users_db user=analyst password=xxx' 
    AS pg_db (TYPE POSTGRES);

-- Mount local product CSV
CREATE VIEW products AS 
SELECT * FROM read_csv_auto('products.csv');

-- One SQL across all three
SELECT 
    u.name,
    u.city,
    p.product_name,
    SUM(o.amount) AS total_spent
FROM mysql_db.orders AS o
JOIN pg_db.public.customers AS u ON o.customer_id = u.customer_id
JOIN products AS p ON o.product_id = p.product_id
WHERE o.order_date >= '2026-04-01'
GROUP BY u.name, u.city, p.product_name
ORDER BY total_spent DESC;
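
In production you'd rather not hard-code passwords in a script. Here's a minimal Python sketch of the same ATTACH calls, pulling connection strings from environment variables (MYSQL_DSN and PG_DSN are names invented for this example):

import os

import duckdb

con = duckdb.connect()
con.execute("INSTALL mysql; LOAD mysql;")
con.execute("INSTALL postgres; LOAD postgres;")

# Connection strings come from the environment, e.g.
# MYSQL_DSN='host=localhost port=3306 database=orders_db user=analyst password=...'
con.execute(f"ATTACH '{os.environ['MYSQL_DSN']}' AS mysql_db (TYPE MYSQL)")
con.execute(f"ATTACH '{os.environ['PG_DSN']}' AS pg_db (TYPE POSTGRES)")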

3. Performance Comparison: Traditional vs. DuckDB

| Scenario | Traditional (Export + VLOOKUP) | DuckDB ATTACH |
| --- | --- | --- |
| Cross MySQL + PG + CSV query | 30 min ~ 1 hour | 10 ~ 30 seconds |
| Opening 500K rows in Excel | Freeze/crash | Sub-second results |
| Changing analysis dimensions | Re-export + re-VLOOKUP | Edit one line of SQL |
| Scheduled report generation | Manual repeat every time | One-click script |
| 10GB+ dataset | Excel/Pandas OOM | Streaming, no pressure |
| Learning curve | Know VLOOKUP | Know standard SQL |
| Dependencies | Excel + multiple database clients | Just DuckDB |
| Reusability | Essentially zero | SQL script, forever reusable |

Quantified impact:

A real case — an e-commerce company needing daily cross-source operations reports:

  • Traditional: Data analyst spends 40 minutes daily exporting, merging, checking
  • DuckDB: Write SQL once, daily execution 15 seconds
  • Monthly time saved: 40 min × 22 workdays = 880 minutes (14.7 hours)
  • Cost savings: At ¥50/hour, that’s ¥735/month per person

4. How ATTACH Works Under the Hood

Understanding ATTACH’s internals helps you design and optimize cross-database queries.

4.1 ATTACH Is NOT ETL

ATTACH doesn’t copy data into DuckDB — it creates an external table reference. When you query, DuckDB pushes filter predicates and column projections down to the source database, pulling back only the rows and columns it needs.

-- The WHERE filter is pushed down to MySQL, so only matching rows
-- are transferred; DuckDB then runs the GROUP BY on that subset
SELECT customer_id, COUNT(*) 
FROM mysql_db.orders 
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;

DuckDB’s optimizer automatically pushes WHERE filters and column projections down to the source database, minimizing data transfer; aggregations and joins then run in DuckDB’s own vectorized engine.
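
To see what actually gets pushed down on your DuckDB version, EXPLAIN the query and inspect the scan node (plan output format varies across versions):

EXPLAIN
SELECT customer_id, COUNT(*)
FROM mysql_db.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;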

4.2 Key Performance Factors

| Factor | Description | Optimization |
| --- | --- | --- |
| Network latency | Cross-db queries depend on the network | Deploy DuckDB near your databases |
| Pushdown optimization | Filters/projections execute remotely | Use WHERE to reduce data volume |
| Index utilization | Source indexes still work | Index your JOIN columns |
| Data volume | DuckDB doesn't cache external tables | Use CREATE TABLE AS for repeated queries |

4.3 Performance Tips

-- ❌ BAD: Pull everything, filter later
SELECT * FROM mysql_db.orders;  -- Could be millions of rows

-- ✅ GOOD: Push down filter + limit
SELECT * FROM mysql_db.orders 
WHERE order_date >= '2026-05-01' 
LIMIT 1000;

-- ✅ BEST: Aggregate first, then JOIN
WITH daily_stats AS (
    -- The date filter is pushed down to MySQL; DuckDB aggregates the filtered rows
    SELECT customer_id, DATE(order_date) AS day, SUM(amount) AS daily_total
    FROM mysql_db.orders
    WHERE order_date >= '2026-04-01'
    GROUP BY customer_id, DATE(order_date)
)
-- Then JOIN with local data
SELECT u.name, d.day, d.daily_total
FROM daily_stats d
JOIN pg_db.public.customers u ON d.customer_id = u.customer_id
ORDER BY d.daily_total DESC
LIMIT 20;

5. Advanced Use Cases

5.1 Data Migration: Cross-Database Copy

-- MySQL → DuckDB local table (one-time snapshot)
CREATE TABLE local_orders AS 
SELECT * FROM mysql_db.orders 
WHERE order_date >= '2026-01-01';

-- DuckDB → PostgreSQL (write back; assumes local_analytics is an existing local table)
CREATE TABLE pg_db.public.report AS 
SELECT * FROM local_analytics;

5.2 Multi-Environment Comparison

-- Mount production and staging simultaneously
ATTACH 'prod_conn' AS prod (TYPE POSTGRES);
ATTACH 'staging_conn' AS staging (TYPE POSTGRES);

-- Compare data differences
SELECT 
    COALESCE(p.order_id, s.order_id) AS order_id,
    p.amount AS prod_amount,
    s.amount AS staging_amount,
    (p.amount - s.amount) AS diff
FROM prod.public.orders p
FULL OUTER JOIN staging.public.orders s 
    ON p.order_id = s.order_id
WHERE p.amount IS DISTINCT FROM s.amount;

5.3 Automated Scheduled Reports

# Cron-based daily report generation
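# Example crontab entry (path is illustrative):
#   0 8 * * * python3 /opt/reports/daily_report.py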
import duckdb

con = duckdb.connect()

# ATTACH data sources
con.execute("ATTACH '...' AS mysql_db (TYPE MYSQL)")
con.execute("ATTACH '...' AS pg_db (TYPE POSTGRES)")

# Generate daily report
con.execute("""
    COPY (
        -- cross-source query...
    ) TO '/tmp/daily_report.csv' (HEADER, DELIMITER ',');
""")

# Send email via SMTP (pseudo-code)
# send_email(to='boss@company.com', attachment='/tmp/daily_report.csv')
print("✅ Daily report generated")

6. Connection Guide & Troubleshooting

6.1 MySQL Extension Setup

# Ensure MySQL client libraries are installed
apt-get install -y default-libmysqlclient-dev  # Ubuntu/Debian
# or
brew install mysql-client  # macOS

Then in DuckDB:

INSTALL mysql;
LOAD mysql;
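
To verify the extensions are installed and loaded, query DuckDB's extension catalog:

SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('mysql', 'postgres');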

6.2 Connection String Formats

| Source | Connection String Example |
| --- | --- |
| MySQL | host=localhost port=3306 database=test user=root password=secret |
| PostgreSQL | host=localhost port=5432 dbname=test user=postgres password=secret |
| SQLite | ./data.db (just the file path) |
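
When a job only reads from a source, attaching it READ_ONLY is a cheap safeguard against accidental writes:

ATTACH 'host=localhost port=5432 dbname=test user=postgres password=secret'
    AS pg_db (TYPE POSTGRES, READ_ONLY);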

6.3 Common Pitfalls

  1. MySQL 8.0 auth: Use mysql_native_password, or update DuckDB to v1.5+ for caching_sha2_password support
  2. PostgreSQL SSL: Add the sslmode=require parameter to the connection string
  3. Large table JOINs: If both sides are large, pull the smaller table into DuckDB locally first (see the sketch below)
  4. Character encoding: DuckDB assumes UTF-8; MySQL latin1 data may come through garbled
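
A minimal sketch of pitfall 3 — snapshot the smaller side locally, then JOIN (table and column names are illustrative):

-- Snapshot the small dimension table into DuckDB once per session
CREATE TEMP TABLE dim_customers AS
SELECT customer_id, name, city FROM pg_db.public.customers;

-- The large fact table stays remote; the WHERE filter is pushed down
SELECT d.city, SUM(o.amount) AS revenue
FROM mysql_db.orders o
JOIN dim_customers d ON o.customer_id = d.customer_id
WHERE o.order_date >= '2026-05-01'
GROUP BY d.city;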

7. Monetization Strategy

This skill has significant market value because nearly every company struggles with data silos.

Target Customers

  • SMBs: Data scattered across multiple systems, no dedicated data team
  • E-commerce: Order system + CRM + finance — all independent
  • Retail chains: Each store + HQ + supply chain — different data sources
  • Traditional enterprises in transition: Legacy databases coexisting with new systems

Pricing

| Service | Price | Deliverables | Timeline |
| --- | --- | --- | --- |
| One-time data integration | ¥2,000-5,000 ($280-700) | Cross-source query scripts + Excel report template | 1-3 days |
| Monthly report automation | ¥500-1,500/month ($70-210) | Scheduled cross-source business reports | Monthly |
| Data warehouse setup | ¥5,000-15,000 ($700-2,100) | Complete ETL pipeline + analytics dashboard | 1-2 weeks |
| Data integration training | ¥1,500-3,000/session ($210-420) | Teach the team to use DuckDB themselves | Half day |

Competitive Landscape

| Solution | Price | Strength | Weakness |
| --- | --- | --- | --- |
| Traditional ETL (Kettle/DataX) | Free, but needs ops | Full-featured | Complex configuration, steep learning curve |
| Commercial BI (Tableau/Power BI) | ¥500-2,000/month | Great visualization | Expensive, weak cross-source capability |
| Hiring a manual analyst | ¥300-500/month | No technical setup | Unreliable, churn risk |
| DuckDB solution (you) | ¥2,000-5,000 | One-time build, permanent use | Client needs basic technical comfort |

Client Acquisition

  1. Freelance platforms (Upwork, Fiverr): Search “data integration,” “cross-database query,” “report automation” — pitch DuckDB solutions
  2. Industry communities: Join e-commerce, retail, or operations groups — ask “how do you generate your reports?”
  3. This blog post: Share it as proof of expertise. Every time someone reads it, they know you can solve their data silo problem.

Sales Pitch Template

“I see your company has data spread across different systems — you probably spend hours manually merging each report. I have a solution that connects all your data sources with one SQL query. After setup, you click once and get a complete report. Integration costs ¥3,000, then ¥800/month for automated monthly reports. Interested in a free data assessment first?”

8. Conclusion

DuckDB’s ATTACH + cross-database JOIN capability is arguably its most underrated killer feature. It frees data analysts from the primitive “export → merge → VLOOKUP” workflow, compressing cross-source query time from hours to seconds.

More importantly, enterprise data silos are a universal pain point, and DuckDB provides a low-cost, easy-to-learn, immediately effective solution. Master this skill, and you can solve real-world integration problems — at a clear, billable price.

What to do today:

  1. Run the demo script above to understand ATTACH syntax
  2. Find the most fragmented data scenario in your company
  3. Connect it with DuckDB and show the result to your boss
  4. Use this case study to win external clients

Cleanup:

rm -rf day07_data/