Introduction
CSV is the most common data format in daily analytics work. It’s simple, universal, and almost every system can export it. But getting CSV imported into a database correctly and efficiently is never as trivial as it seems.
You’ve probably hit these issues:
- DuckDB auto-detects data types, but dates end up as VARCHAR
- A few bad rows crash the entire import
- Million-row files take forever to import
- Chinese column names or GBK encoding produce garbled text
- Non-standard delimiters — tabs, semicolons, or pipes instead of commas
I’ve hit every single one. This guide covers 10 real-world CSV import scenarios with production-ready SQL and best practices.
1. read_csv_auto: Your Default Choice
DuckDB’s most convenient CSV reader is read_csv_auto. It auto-detects delimiter, header, data types, and encoding:
-- One-liner CSV read
SELECT * FROM read_csv_auto('data/sales.csv');
-- Create a table directly
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/sales.csv');
What read_csv_auto detects automatically:
| Aspect | Description | Reliability |
|---|---|---|
| Delimiter | comma, tab, pipe, etc. | Excellent |
| Header | first row vs data rows | High |
| Data types | scans up to 20,000 rows | Moderate |
| Encoding | UTF-8/ASCII | Low (issues with CJK) |
| Quote char | double/single quotes | High |
Pro tip:
read_csv_autoworks perfectly in 80% of cases. When it fails, don’t fight it — switch toread_csvwith explicit parameters.
2. read_csv: When You Need Full Control
When auto-detection fails, use read_csv to specify format parameters:
-- Explicit delimiter, encoding, header
SELECT * FROM read_csv(
'data/sales.csv',
delim='|',
header=true,
encoding='utf-8',
quote='"',
escape='"'
);
Key parameters:
| Parameter | Default | Description |
|---|---|---|
delim | , | Field delimiter |
header | true | Is first row column names |
encoding | utf-8 | File encoding |
quote | " | Quote character |
escape | " | Escape character |
nullstr | '' | Null string (e.g. NULL, NA) |
skip | 0 | Lines to skip at start |
sample_size | -1 | Type inference sample rows |
3. Handling Bad Rows: Don’t Lose Data
Real-world data is never clean. An extra comma, embedded newlines in a field, text sneaking into a numeric column — any of these can crash read_csv_auto.
DuckDB provides ignore_errors=true:
-- Skip bad rows silently
SELECT * FROM read_csv(
'data/messy_sales.csv',
header=true,
ignore_errors=true
);
But blindly skipping bad rows is a bad habit — you might lose 10% of your data without noticing. Better practice: inspect the error rows separately:
-- Capture errors to a separate file (DuckDB v1.5+)
CREATE TABLE clean_sales AS
SELECT * FROM read_csv(
'data/messy_sales.csv',
header=true,
reject_errors=true,
file_errors='errors.csv'
);
4. Specifying Data Types: Stop Guessing
The most common pitfall: DuckDB reads numbers as VARCHAR, or dates become strings. Usually because the first few thousand rows look clean, but later rows contain anomalies.
Declare the schema manually:
SELECT * FROM read_csv(
'data/sales.csv',
header=true,
columns={
'order_id': 'INTEGER',
'product': 'VARCHAR',
'amount': 'DECIMAL(10,2)',
'quantity': 'INTEGER',
'order_date': 'DATE',
'city': 'VARCHAR'
}
);
Benefits of explicit schema:
- Zero surprises — types are fully under your control
- Faster — no need to scan rows for type inference
- Cleaner null handling —
"N/A"in a DECIMAL column becomes NULL
Date Parsing
DuckDB natively supports ISO dates (2026-01-15). For custom formats:
-- Read as VARCHAR, then parse
SELECT *,
strptime(order_date, '%m/%d/%Y') AS parsed_date
FROM read_csv(
'data/sales.csv',
columns={'order_id': 'INTEGER', 'order_date': 'VARCHAR', 'amount': 'DECIMAL(10,2)'}
);
5. Common CSV Pitfalls
5.1 Comments or copyright headers
-- Skip first 3 lines
SELECT * FROM read_csv('data/sales.csv', skip=3);
5.2 Custom null values
SELECT * FROM read_csv(
'data/sales.csv',
nullstr=['NULL', 'NA', '-', '']
);
5.3 No header row
SELECT * FROM read_csv('data/no_header.csv', header=false,
columns={'name': 'VARCHAR', 'age': 'INTEGER'});
6. Reading from a URL
One of DuckDB’s most powerful features — load CSV directly from a remote URL:
-- From GitHub raw
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/.../data.csv');
-- From S3 (requires httpfs extension)
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv_auto('s3://my-bucket/data/sales.csv');
7. Performance Tips for Large Files
7.1 Disable auto-detection
Auto-detection scans rows (default 20,000) for type inference. For million-row files, this adds seconds:
-- Fast: explicit schema, no inference
SELECT * FROM read_csv(
'data/large_file.csv',
columns={'id': 'INTEGER', 'name': 'VARCHAR'},
auto_detect=false
);
Benchmark (1M rows, 50MB CSV):
| Method | Time |
|---|---|
read_csv_auto default | 3.2s |
read_csv with schema | 1.1s |
7.2 Import directly to a table
CREATE TABLE large_data AS
SELECT * FROM read_csv(
'data/large_file.csv',
columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);
8. Convert to Parquet Immediately
CSV is a transport format, not an analysis format. Once imported, export to Parquet immediately:
-- Import CSV
CREATE TABLE raw_data AS
SELECT * FROM read_csv('data/large.csv', columns={...});
-- Export to Parquet
COPY raw_data TO 'data/large.parquet' (FORMAT PARQUET);
Parquet is 5-10x smaller and 10x faster to query.
FAQ
Q1: Out of memory with large files?
SET memory_limit = '2GB';
SET temp_directory = '/tmp/duckdb_tmp';
Q2: GBK-encoded Chinese CSV?
SELECT * FROM read_csv_auto('data/cn_sales.csv', encoding='gbk');
Or convert via CLI: iconv -f GBK -t UTF-8 input.csv > output_utf8.csv
Q3: Preview first N rows?
SELECT * FROM read_csv_auto('data/sales.csv') LIMIT 100;
Summary
DuckDB’s CSV import capabilities range from “quick preview” to “production ETL”:
- 80% cases →
read_csv_auto— zero config - 15% cases →
read_csv— explicit delimiter/encoding/schema - 5% cases →
ignore_errors+reject_errors— dirty data - Always convert to Parquet — 10x faster queries later
One golden rule: when encountering a new CSV, always preview with LIMIT 100 first. This prevents 90% of import issues.
