1. The Data Silo Nightmare Every Analyst Knows
You work at an e-commerce company. The boss asks: “Among the top 100 products by last month’s sales, what’s the repeat customer rate?”
Where is the data?
- Orders: in the MySQL transaction database, 7 tables sharded by month
- Customer tags: in the PostgreSQL analytics database, marking new vs. returning customers
- Product info: in an Excel/CSV file the operations team updates weekly (with changing column names)
The traditional approach? A painstaking five-step ordeal:
Step 1: Export sales from MySQL → run a query → save CSV (5 minutes)
Step 2: Export customer tags from PostgreSQL → run a query → save CSV (5 minutes)
Step 3: Merge three CSVs in Excel with VLOOKUP → drag the formula, wait, pray (10 minutes, Excel may crash)
Step 4: Realize data is missing → re-export → re-VLOOKUP (double the pain)
Step 5: Boss says "add another dimension" → start from scratch (total meltdown)
Total time: 30 minutes to 1 hour. With large datasets, Excel crashes. And none of it is reusable — change the date range and you start over.
2. The DuckDB Solution: ATTACH + Cross-Database JOIN
DuckDB has a severely underrated feature: the ATTACH statement lets you mount external databases like mounting drives, then JOIN across data sources with plain SQL.
What does this mean? One SQL query across all three data sources — no exports, no merges, no VLOOKUP.
2.1 How ATTACH Works
-- Mount a SQLite database (simple example)
ATTACH 'path/to/file.db' AS my_db (TYPE SQLITE);
-- Now cross-database query is possible
SELECT *
FROM my_db.some_table AS a
JOIN main.public.another_table AS b ON a.id = b.id;
DuckDB supports ATTACH for these data sources:
| Data Source | ATTACH Syntax | Type Identifier |
|---|---|---|
| SQLite | ATTACH 'file.db' (TYPE SQLITE) | SQLITE |
| MySQL | ATTACH 'host=... user=... database=...' (TYPE MYSQL) | MYSQL |
| PostgreSQL | ATTACH 'host=... dbname=...' (TYPE POSTGRES) | POSTGRES |
| DuckDB native | ATTACH 'data.duckdb' | DUCKDB |
| Delta Lake | ATTACH './delta_dir' (TYPE DELTA) | DELTA |
Note: Connecting to MySQL and PostgreSQL requires installing the corresponding extensions — named mysql and postgres in current DuckDB releases (older docs call them mysql_scanner / postgres_scanner):
INSTALL mysql; LOAD mysql; INSTALL postgres; LOAD postgres;
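If you drive DuckDB from Python, the same mount-and-query flow takes just a few lines. A minimal sketch — the SQLite path is a placeholder:

import duckdb

con = duckdb.connect()  # in-memory DuckDB instance
con.execute("INSTALL sqlite; LOAD sqlite;")  # one-time; recent builds often autoload this
con.execute("ATTACH 'path/to/file.db' AS my_db (TYPE SQLITE)")
print(con.execute("SHOW DATABASES").fetchall())  # confirm the mount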
2.2 Full Walkthrough: E-Commerce Cross-Database Query
The script below simulates the three data sources with CSV files mounted as DuckDB views — no real databases needed. Just copy and run.
#!/usr/bin/env python3
"""
DuckDB Cross-Database JOIN Demo
Scenario: E-commerce data across three sources, one SQL to unite them all
Prerequisites: pip install duckdb pandas
"""
import duckdb
import os
# ====== Step 1: Create Mock Data ======
# Mock MySQL orders table (CSV file)
orders_csv = """order_id,customer_id,product_id,amount,order_date
1001,201,5001,299.00,2026-05-01
1002,202,5002,159.00,2026-05-01
1003,201,5003,899.00,2026-05-02
1004,203,5001,299.00,2026-05-02
1005,204,5004,459.00,2026-05-03
1006,202,5002,159.00,2026-05-03
1007,205,5005,1299.00,2026-05-04
1008,203,5003,899.00,2026-05-04
1009,206,5001,299.00,2026-05-05
1010,201,5004,459.00,2026-05-05
"""
# Mock PostgreSQL users table (CSV file)
users_csv = """customer_id,name,city,member_level,register_date
201,Alice,Beijing,Gold,2025-01-15
202,Bob,Shanghai,Silver,2025-03-20
203,Carol,Guangzhou,Gold,2025-02-01
204,Dave,Shenzhen,Regular,2025-06-10
205,Eve,Hangzhou,Silver,2025-04-05
206,Frank,Chengdu,Regular,2025-08-15
"""
# Mock product catalog
products_csv = """product_id,product_name,category,unit_price,cost_price
5001,Wireless Earbuds,Electronics,299.00,180.00
5002,Insulated Mug,Home,159.00,80.00
5003,Smart Watch,Electronics,899.00,550.00
5004,Running Shoes,Apparel,459.00,280.00
5005,Tablet Stand,Electronics,1299.00,800.00
"""
# Write temp files
os.makedirs("day07_data", exist_ok=True)
with open("day07_data/orders.csv", "w") as f:
f.write(orders_csv)
with open("day07_data/users.csv", "w") as f:
f.write(users_csv)
with open("day07_data/products.csv", "w") as f:
f.write(products_csv)
# ====== Step 2: Cross-Source JOIN with DuckDB ======
con = duckdb.connect()
# Mount CSV files as views — simulating "different data sources"
con.execute("""
CREATE VIEW orders AS
SELECT * FROM read_csv_auto('day07_data/orders.csv')
""")
con.execute("""
CREATE VIEW users AS
SELECT * FROM read_csv_auto('day07_data/users.csv')
""")
con.execute("""
CREATE VIEW products AS
SELECT * FROM read_csv_auto('day07_data/products.csv')
""")
print("=" * 60)
print("📊 Cross-Source: What did Gold members buy?")
print("=" * 60)
# One SQL across three "data sources"
result = con.execute("""
SELECT
u.name AS customer_name,
u.city,
u.member_level,
p.product_name,
p.category,
o.amount AS order_amount,
(o.amount - p.cost_price) AS gross_margin
FROM orders o
JOIN users u ON o.customer_id = u.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE u.member_level IN ('Gold', 'Silver')
ORDER BY o.amount DESC
""").fetchdf()
print(result.to_string(index=False))
print("\n" + "=" * 60)
print("💰 Daily Sales Summary (with Margins)")
print("=" * 60)
result2 = con.execute("""
SELECT
o.order_date AS date,
COUNT(DISTINCT o.order_id) AS orders,
SUM(o.amount) AS revenue,
SUM(o.amount - p.cost_price) AS gross_profit,
ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit_per_order
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY o.order_date
ORDER BY o.order_date
""").fetchdf()
print(result2.to_string(index=False))
print("\n" + "=" * 60)
print("🏆 Category Profitability Analysis")
print("=" * 60)
result3 = con.execute("""
SELECT
p.category,
COUNT(*) AS units_sold,
SUM(o.amount) AS revenue,
SUM(o.amount - p.cost_price) AS gross_profit,
ROUND(AVG(o.amount - p.cost_price), 2) AS avg_profit,
ROUND(SUM(o.amount - p.cost_price) / SUM(o.amount) * 100, 1) AS margin_pct
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY p.category
ORDER BY gross_profit DESC
""").fetchdf()
print(result3.to_string(index=False))
# ====== Step 3: Export to Excel ======
try:
    # Native XLSX export via the excel extension (DuckDB 1.2+)
    con.execute("INSTALL excel; LOAD excel;")
    con.execute("""
        COPY (
            SELECT
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.xlsx'
        WITH (FORMAT xlsx, HEADER true);
    """)
    print("\n✅ Report exported: day07_data/cross_source_report.xlsx")
except Exception as e:
    print(f"\n⚠️ XLSX export needs the excel extension: {e}")
    print("Falling back to CSV:")
    con.execute("""
        COPY (
            SELECT
                o.order_id,
                u.name,
                u.city,
                u.member_level,
                p.product_name,
                p.category,
                o.amount,
                (o.amount - p.cost_price) AS profit
            FROM orders o
            JOIN users u ON o.customer_id = u.customer_id
            JOIN products p ON o.product_id = p.product_id
        ) TO 'day07_data/cross_source_report.csv' (HEADER, DELIMITER ',');
    """)
    print("✅ Report exported: day07_data/cross_source_report.csv")
# ====== Step 4: Validation ======
print("\n" + "=" * 60)
print("📈 Validation: Are all orders matched to users and products?")
print("=" * 60)
validation = con.execute("""
SELECT
'Total orders' AS metric, CAST(COUNT(*) AS VARCHAR) AS value FROM orders
UNION ALL
SELECT 'Matched to users', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN users u ON o.customer_id = u.customer_id
UNION ALL
SELECT 'Matched to products', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN products p ON o.product_id = p.product_id
UNION ALL
SELECT 'Fully matched', CAST(COUNT(*) AS VARCHAR) FROM orders o
JOIN users u ON o.customer_id = u.customer_id
JOIN products p ON o.product_id = p.product_id
""").fetchdf()
print(validation.to_string(index=False))
con.close()
print("\n🎉 Cross-database JOIN demo complete!")
2.3 Real Environment: Connecting to MySQL + PostgreSQL
When you have actual MySQL and PostgreSQL databases, the script becomes:
-- Install extensions (one-time)
INSTALL mysql;
LOAD mysql;
INSTALL postgres;
LOAD postgres;
-- Mount MySQL orders database
ATTACH 'host=localhost port=3306 database=orders_db user=analyst password=xxx'
AS mysql_db (TYPE MYSQL);
-- Mount PostgreSQL user database
ATTACH 'host=localhost port=5432 dbname=users_db user=analyst password=xxx'
AS pg_db (TYPE POSTGRES);
-- Mount local product CSV
CREATE VIEW products AS
SELECT * FROM read_csv_auto('products.csv');
-- One SQL across all three
SELECT
u.name,
u.city,
p.product_name,
SUM(o.amount) AS total_spent
FROM mysql_db.orders AS o
JOIN pg_db.public.customers AS u ON o.customer_id = u.customer_id
JOIN products AS p ON o.product_id = p.product_id
WHERE o.order_date >= '2026-04-01'
GROUP BY u.name, u.city, p.product_name
ORDER BY total_spent DESC;
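The same thing from Python, with passwords pulled from environment variables instead of hard-coded. A sketch — the hosts, database names, and the MYSQL_PWD / PG_PWD variable names are placeholders:

import duckdb
import os

con = duckdb.connect()
con.execute("INSTALL mysql; LOAD mysql; INSTALL postgres; LOAD postgres;")

# Build connection strings without embedding secrets in the script
mysql_conn = f"host=localhost port=3306 database=orders_db user=analyst password={os.environ['MYSQL_PWD']}"
pg_conn = f"host=localhost port=5432 dbname=users_db user=analyst password={os.environ['PG_PWD']}"

con.execute(f"ATTACH '{mysql_conn}' AS mysql_db (TYPE MYSQL)")
con.execute(f"ATTACH '{pg_conn}' AS pg_db (TYPE POSTGRES)")
con.execute("CREATE VIEW products AS SELECT * FROM read_csv_auto('products.csv')")

# Cross-source query straight into a DataFrame
df = con.execute("""
    SELECT u.name, SUM(o.amount) AS total_spent
    FROM mysql_db.orders o
    JOIN pg_db.public.customers u ON o.customer_id = u.customer_id
    GROUP BY u.name
    ORDER BY total_spent DESC
""").fetchdf()
print(df.head())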
3. Performance Comparison: Traditional vs. DuckDB
| Scenario | Traditional (Export + VLOOKUP) | DuckDB ATTACH |
|---|---|---|
| Cross MySQL + PG + CSV query | 30 min ~ 1 hour | 10 ~ 30 seconds |
| Opening 500K rows in Excel | Freeze/crash | Sub-second results |
| Changing analysis dimensions | Re-export + re-VLOOKUP | Edit one line of SQL |
| Scheduled report generation | Manual repeat every time | One-click script |
| 10GB+ dataset | Excel/Pandas OOM | Streaming, no pressure |
| Learning curve | Know VLOOKUP | Know standard SQL |
| Dependencies | Excel + multiple database clients | Just DuckDB |
| Reusability | Essentially zero | SQL script forever reusable |
Quantified impact:
A real case — an e-commerce company needing daily cross-source operations reports:
- Traditional: Data analyst spends 40 minutes daily exporting, merging, checking
- DuckDB: Write SQL once, daily execution 15 seconds
- Monthly time saved: 40 min × 22 workdays = 880 minutes (14.7 hours)
- Cost savings: At ¥50/hour, that’s ¥735/month per person
4. How ATTACH Works Under the Hood
Understanding ATTACH’s internals helps you design and optimize cross-database queries.
4.1 ATTACH Is NOT ETL
ATTACH doesn’t copy data into DuckDB — it creates an external table reference. When you query, DuckDB pushes query predicates down to the source database, only pulling back what’s needed.
-- The WHERE filter below is pushed down into the MySQL scan;
-- only matching rows cross the network, and DuckDB aggregates them locally
SELECT customer_id, COUNT(*)
FROM mysql_db.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;
DuckDB’s optimizer automatically pushes WHERE filters and column projections down to the source database, so only the rows and columns you actually need cross the network; aggregations then run locally on that reduced stream.
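To see what actually gets pushed down, inspect the plan with EXPLAIN — a pushed-down filter appears inside the remote scan node rather than as a separate filter above it. A quick sketch, reusing a connection where mysql_db is already attached as in section 2.3:

# Print DuckDB's physical plan for the cross-database query
plan = con.execute("""
    EXPLAIN
    SELECT customer_id, COUNT(*)
    FROM mysql_db.orders
    WHERE order_date >= '2026-01-01'
    GROUP BY customer_id
""").fetchall()
for _, tree in plan:
    print(tree)  # look for the date filter inside the table scan node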
4.2 Key Performance Factors
| Factor | Description | Optimization |
|---|---|---|
| Network latency | Cross-db queries travel the network | Deploy DuckDB close to your databases |
| Pushdown optimization | Filters and projections execute remotely | Use WHERE to cut data volume |
| Index utilization | Source-database indexes still serve pushed-down filters | Index your JOIN and filter columns |
| Data volume | DuckDB doesn’t cache external tables between queries | Use CREATE TABLE AS for repeated queries (see the sketch below) |
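For that last row — repeated queries over the same remote data — a local snapshot pays for itself after the first query. A sketch, again assuming mysql_db is attached:

# One network round trip to materialize the remote rows locally
con.execute("""
    CREATE OR REPLACE TABLE orders_local AS
    SELECT * FROM mysql_db.orders
    WHERE order_date >= '2026-04-01'
""")
# Every query from here on runs at local, columnar DuckDB speed
print(con.execute("SELECT COUNT(*) FROM orders_local").fetchone())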
4.3 Performance Tips
-- ❌ BAD: Pull everything, filter later
SELECT * FROM mysql_db.orders; -- Could be millions of rows
-- ✅ GOOD: Push down filter + limit
SELECT * FROM mysql_db.orders
WHERE order_date >= '2026-05-01'
LIMIT 1000;
-- ✅ BEST: Aggregate first, then JOIN
WITH daily_stats AS (
-- The date filter is pushed down to MySQL; DuckDB aggregates the rows that come back
SELECT customer_id, DATE(order_date) AS day, SUM(amount) AS daily_total
FROM mysql_db.orders
WHERE order_date >= '2026-04-01'
GROUP BY customer_id, DATE(order_date)
)
-- Then JOIN with local data
SELECT u.name, d.day, d.daily_total
FROM daily_stats d
JOIN pg_db.public.customers u ON d.customer_id = u.customer_id
ORDER BY d.daily_total DESC
LIMIT 20;
5. Advanced Use Cases
5.1 Data Migration: Cross-Database Copy
-- MySQL → DuckDB local table (one-time snapshot)
CREATE TABLE local_orders AS
SELECT * FROM mysql_db.orders
WHERE order_date >= '2026-01-01';
-- DuckDB → PostgreSQL (write back)
CREATE TABLE pg_db.public.report AS
SELECT * FROM local_analytics;
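When several tables need to move, a small loop over CREATE TABLE AS keeps it tidy. A sketch — the table list and connection string are placeholders:

import duckdb

con = duckdb.connect("warehouse.duckdb")  # persistent local DuckDB file
con.execute("INSTALL mysql; LOAD mysql;")
con.execute("ATTACH 'host=localhost port=3306 database=orders_db user=analyst password=xxx' AS mysql_db (TYPE MYSQL)")

for table in ["orders", "refunds", "shipments"]:  # hypothetical table names
    con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM mysql_db.{table}")
    print(f"Snapshotted {table}")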
5.2 Multi-Environment Comparison
-- Mount production and staging simultaneously
ATTACH 'prod_conn' AS prod (TYPE POSTGRES);
ATTACH 'staging_conn' AS staging (TYPE POSTGRES);
-- Compare data differences
SELECT
COALESCE(p.order_id, s.order_id) AS order_id,
p.amount AS prod_amount,
s.amount AS staging_amount,
(p.amount - s.amount) AS diff
FROM prod.public.orders p
FULL OUTER JOIN staging.public.orders s
ON p.order_id = s.order_id
WHERE p.amount IS DISTINCT FROM s.amount;
5.3 Automated Scheduled Reports
# Cron-based daily report generation
import duckdb
con = duckdb.connect()
# ATTACH data sources
con.execute("ATTACH '...' AS mysql_db (TYPE MYSQL)")
con.execute("ATTACH '...' AS pg_db (TYPE POSTGRES)")
# Generate daily report
con.execute("""
COPY (
-- cross-source query...
) TO '/tmp/daily_report.csv' (HEADER, DELIMITER ',');
""")
# Send email via SMTP (pseudo-code)
# send_email(to='boss@company.com', attachment='/tmp/daily_report.csv')
print("✅ Daily report generated")
6. Connection Guide & Troubleshooting
6.1 MySQL Extension Setup
# The prebuilt extension is self-contained on most platforms.
# If you hit missing-client-library errors, install them:
apt-get install -y default-libmysqlclient-dev # Ubuntu/Debian
# or
brew install mysql-client # macOS
Then in DuckDB:
INSTALL mysql;
LOAD mysql;
6.2 Connection String Formats
| Source | Connection String Example |
|---|---|
| MySQL | host=localhost port=3306 database=test user=root password=secret |
| PostgreSQL | host=localhost port=5432 dbname=test user=postgres password=secret |
| SQLite | ./data.db (just the file path) |
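Recent DuckDB releases also support secrets, which keep passwords out of connection strings entirely. A sketch of that workflow — verify the option names against the docs for your version:

import duckdb

con = duckdb.connect()
con.execute("INSTALL mysql; LOAD mysql;")
# Register credentials once as a secret...
con.execute("""
    CREATE SECRET mysql_creds (
        TYPE MYSQL,
        HOST 'localhost',
        PORT 3306,
        DATABASE 'orders_db',
        USER 'analyst',
        PASSWORD 'xxx'
    )
""")
# ...then ATTACH with an empty connection string
con.execute("ATTACH '' AS mysql_db (TYPE MYSQL)")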
6.3 Common Pitfalls
- MySQL 8.0 auth: Use mysql_native_password, or update DuckDB to v1.5+ for caching_sha2_password support
- PostgreSQL SSL: Add the sslmode=require parameter to the connection string
- Large table JOINs: If both sides are large, pull the smaller table into DuckDB locally first (see the sketch below)
- Character encoding: DuckDB assumes UTF-8; MySQL latin1 data may come through garbled
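For the large-JOIN pitfall, the pattern looks like this — a sketch reusing the con from section 2.3 with both databases mounted:

# Pull the smaller side (customers) across the network once
con.execute("""
    CREATE OR REPLACE TABLE users_local AS
    SELECT customer_id, name, city FROM pg_db.public.customers
""")
# JOIN the big remote table against the local copy; the WHERE
# filter still pushes down to MySQL to limit what's transferred
df = con.execute("""
    SELECT u.city, SUM(o.amount) AS revenue
    FROM mysql_db.orders o
    JOIN users_local u ON o.customer_id = u.customer_id
    WHERE o.order_date >= '2026-05-01'
    GROUP BY u.city
""").fetchdf()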
7. Monetization Strategy
This skill has significant market value because almost every company struggles with data silos.
Target Customers
- SMBs: Data scattered across multiple systems, no dedicated data team
- E-commerce: Order system + CRM + finance — all independent
- Retail chains: Each store + HQ + supply chain — different data sources
- Traditional enterprises in transition: Legacy databases coexisting with new systems
Pricing
| Service | Price | Deliverables | Timeline |
|---|---|---|---|
| One-time data integration | ¥2,000-5,000 ($280-700) | Cross-source query scripts + Excel report template | 1-3 days |
| Monthly report automation | ¥500-1,500/month ($70-210) | Scheduled cross-source business reports | Monthly |
| Data warehouse setup | ¥5,000-15,000 ($700-2,100) | Complete ETL pipeline + analytics dashboard | 1-2 weeks |
| Data integration training | ¥1,500-3,000/session ($210-420) | Teach team to use DuckDB themselves | Half day |
Competitive Landscape
| Solution | Price | Strength | Weakness |
|---|---|---|---|
| Traditional ETL (Kettle/DataX) | Free but needs ops | Full-featured | Complex configuration, steep learning curve |
| Commercial BI (Tableau/Power BI) | ¥500-2,000/month | Great visualization | Expensive, weak cross-source capability |
| Hiring a manual analyst | ¥300-500/month | No thinking required | Unreliable, churn risk |
| DuckDB Solution (You) | ¥2,000-5,000 | One-time build, permanent use | Requires basic technical client |
Client Acquisition
- Freelance platforms (Upwork, Fiverr): Search “data integration,” “cross-database query,” “report automation” — pitch DuckDB solutions
- Industry communities: Join e-commerce, retail, or operations groups — ask “how do you generate your reports?”
- This blog post: Share it as proof of expertise. Every time someone reads it, they know you can solve their data silo problem.
Sales Pitch Template
“I see your company has data spread across different systems — you probably spend hours manually merging each report. I have a solution that connects all your data sources with one SQL query. After setup, you click once and get a complete report. Integration costs ¥3,000, then ¥800/month for automated monthly reports. Interested in a free data assessment first?”
8. Conclusion
DuckDB’s ATTACH + cross-database JOIN capability is this database’s most underrated killer feature. It frees data analysts from the primitive “export → merge → VLOOKUP” workflow, compressing cross-source query time from hours to seconds.
More importantly, enterprise data silos are a universal pain point, and DuckDB provides a low-cost, easy-to-learn, immediately effective solution. Master this skill, and you can solve real-world integration problems — at a clear, billable price.
What to do today:
- Run the demo script above to understand ATTACH syntax
- Find the most fragmented data scenario in your company
- Connect it with DuckDB and show the result to your boss
- Use this case study to win external clients
Cleanup:
rm -rf day07_data/