The Complete Guide to Importing CSV in DuckDB: Best Practices for 10 Scenarios

The Complete Guide to Importing CSV in DuckDB: Best Practices for 10 Scenarios

DuckDB CSV import guide covering 10 scenarios: auto-detection, manual schema, delimiters, encoding, large files, remote URLs, error handling, date parsing, and Parquet conversion.

Introduction

CSV is the most common data format in daily analytics work. It’s simple, universal, and almost every system can export it. But getting CSV imported into a database correctly and efficiently is never as trivial as it seems.

You’ve probably hit these issues:

DuckDB auto-detects data types, but dates end up as VARCHAR
A few bad rows crash the entire import
Million-row files take forever to import
Chinese column names or GBK encoding produce garbled text
Non-standard delimiters — tabs, semicolons, or pipes instead of commas

I’ve hit every single one. This guide covers 10 real-world CSV import scenarios with production-ready SQL and best practices.

1. read_csv_auto: Your Default Choice

DuckDB’s most convenient CSV reader is read_csv_auto. It auto-detects delimiter, header, data types, and encoding:

-- One-liner CSV read
SELECT * FROM read_csv_auto('data/sales.csv');

-- Create a table directly
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/sales.csv');

What read_csv_auto detects automatically:

Aspect	Description	Reliability
Delimiter	comma, tab, pipe, etc.	Excellent
Header	first row vs data rows	High
Data types	scans up to 20,000 rows	Moderate
Encoding	UTF-8/ASCII	Low (issues with CJK)
Quote char	double/single quotes	High

Pro tip: read_csv_auto works perfectly in 80% of cases. When it fails, don’t fight it — switch to read_csv with explicit parameters.

2. read_csv: When You Need Full Control

When auto-detection fails, use read_csv to specify format parameters:

-- Explicit delimiter, encoding, header
SELECT * FROM read_csv(
  'data/sales.csv',
  delim='|',
  header=true,
  encoding='utf-8',
  quote='"',
  escape='"'
);

Key parameters:

Parameter	Default	Description
`delim`	`,`	Field delimiter
`header`	`true`	Is first row column names
`encoding`	`utf-8`	File encoding
`quote`	`"`	Quote character
`escape`	`"`	Escape character
`nullstr`	`''`	Null string (e.g. `NULL`, `NA`)
`skip`	`0`	Lines to skip at start
`sample_size`	`-1`	Type inference sample rows

3. Handling Bad Rows: Don’t Lose Data

Real-world data is never clean. An extra comma, embedded newlines in a field, text sneaking into a numeric column — any of these can crash read_csv_auto.

DuckDB provides ignore_errors=true:

-- Skip bad rows silently
SELECT * FROM read_csv(
  'data/messy_sales.csv',
  header=true,
  ignore_errors=true
);

But blindly skipping bad rows is a bad habit — you might lose 10% of your data without noticing. Better practice: inspect the error rows separately:

-- Capture errors to a separate file (DuckDB v1.5+)
CREATE TABLE clean_sales AS
SELECT * FROM read_csv(
  'data/messy_sales.csv',
  header=true,
  reject_errors=true,
  file_errors='errors.csv'
);

4. Specifying Data Types: Stop Guessing

The most common pitfall: DuckDB reads numbers as VARCHAR, or dates become strings. Usually because the first few thousand rows look clean, but later rows contain anomalies.

Declare the schema manually:

SELECT * FROM read_csv(
  'data/sales.csv',
  header=true,
  columns={
    'order_id': 'INTEGER',
    'product': 'VARCHAR',
    'amount': 'DECIMAL(10,2)',
    'quantity': 'INTEGER',
    'order_date': 'DATE',
    'city': 'VARCHAR'
  }
);

Benefits of explicit schema:

Zero surprises — types are fully under your control
Faster — no need to scan rows for type inference
Cleaner null handling — "N/A" in a DECIMAL column becomes NULL

Date Parsing

DuckDB natively supports ISO dates (2026-01-15). For custom formats:

-- Read as VARCHAR, then parse
SELECT *,
  strptime(order_date, '%m/%d/%Y') AS parsed_date
FROM read_csv(
  'data/sales.csv',
  columns={'order_id': 'INTEGER', 'order_date': 'VARCHAR', 'amount': 'DECIMAL(10,2)'}
);

5. Common CSV Pitfalls

5.1 Comments or copyright headers

-- Skip first 3 lines
SELECT * FROM read_csv('data/sales.csv', skip=3);

5.2 Custom null values

SELECT * FROM read_csv(
  'data/sales.csv',
  nullstr=['NULL', 'NA', '-', '']
);

5.3 No header row

SELECT * FROM read_csv('data/no_header.csv', header=false,
  columns={'name': 'VARCHAR', 'age': 'INTEGER'});

6. Reading from a URL

One of DuckDB’s most powerful features — load CSV directly from a remote URL:

-- From GitHub raw
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/.../data.csv');

-- From S3 (requires httpfs extension)
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv_auto('s3://my-bucket/data/sales.csv');

7. Performance Tips for Large Files

7.1 Disable auto-detection

Auto-detection scans rows (default 20,000) for type inference. For million-row files, this adds seconds:

-- Fast: explicit schema, no inference
SELECT * FROM read_csv(
  'data/large_file.csv',
  columns={'id': 'INTEGER', 'name': 'VARCHAR'},
  auto_detect=false
);

Benchmark (1M rows, 50MB CSV):

Method	Time
`read_csv_auto` default	3.2s
`read_csv` with schema	1.1s

7.2 Import directly to a table

CREATE TABLE large_data AS
SELECT * FROM read_csv(
  'data/large_file.csv',
  columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

8. Convert to Parquet Immediately

CSV is a transport format, not an analysis format. Once imported, export to Parquet immediately:

-- Import CSV
CREATE TABLE raw_data AS
SELECT * FROM read_csv('data/large.csv', columns={...});

-- Export to Parquet
COPY raw_data TO 'data/large.parquet' (FORMAT PARQUET);

Parquet is 5-10x smaller and 10x faster to query.

FAQ

Q1: Out of memory with large files?

SET memory_limit = '2GB';
SET temp_directory = '/tmp/duckdb_tmp';

Q2: GBK-encoded Chinese CSV?

SELECT * FROM read_csv_auto('data/cn_sales.csv', encoding='gbk');

Or convert via CLI: iconv -f GBK -t UTF-8 input.csv > output_utf8.csv

Q3: Preview first N rows?

SELECT * FROM read_csv_auto('data/sales.csv') LIMIT 100;

Summary

DuckDB’s CSV import capabilities range from “quick preview” to “production ETL”:

80% cases → read_csv_auto — zero config
15% cases → read_csv — explicit delimiter/encoding/schema
5% cases → ignore_errors + reject_errors — dirty data
Always convert to Parquet — 10x faster queries later

One golden rule: when encountering a new CSV, always preview with LIMIT 100 first. This prevents 90% of import issues.

📺 Watch video tutorials → Olap Studio YouTube

Subscribe for more DuckDB & AI automation tutorials

⚠️ This site is an independent community project, not affiliated with, endorsed by, or sponsored by the DuckDB Foundation or official DuckDB project.

"DuckDB" is a registered trademark of the DuckDB Foundation. This site uses the name solely for factual description purposes.

All content is for educational and community promotion purposes only and does not constitute any commercial service.