Featured image of post The Complete Guide to Importing CSV in DuckDB: Best Practices for 10 Scenarios

The Complete Guide to Importing CSV in DuckDB: Best Practices for 10 Scenarios

DuckDB CSV import guide covering 10 scenarios: auto-detection, manual schema, delimiters, encoding, large files, remote URLs, error handling, date parsing, and Parquet conversion.

Introduction

CSV is the most common data format in daily analytics work. It’s simple, universal, and almost every system can export it. But getting CSV imported into a database correctly and efficiently is never as trivial as it seems.

You’ve probably hit these issues:

  • DuckDB auto-detects data types, but dates end up as VARCHAR
  • A few bad rows crash the entire import
  • Million-row files take forever to import
  • Chinese column names or GBK encoding produce garbled text
  • Non-standard delimiters — tabs, semicolons, or pipes instead of commas

I’ve hit every single one. This guide covers 10 real-world CSV import scenarios with production-ready SQL and best practices.

1. read_csv_auto: Your Default Choice

DuckDB’s most convenient CSV reader is read_csv_auto. It auto-detects delimiter, header, data types, and encoding:

-- One-liner CSV read
SELECT * FROM read_csv_auto('data/sales.csv');

-- Create a table directly
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('data/sales.csv');

What read_csv_auto detects automatically:

AspectDescriptionReliability
Delimitercomma, tab, pipe, etc.Excellent
Headerfirst row vs data rowsHigh
Data typesscans up to 20,000 rowsModerate
EncodingUTF-8/ASCIILow (issues with CJK)
Quote chardouble/single quotesHigh

Pro tip: read_csv_auto works perfectly in 80% of cases. When it fails, don’t fight it — switch to read_csv with explicit parameters.

2. read_csv: When You Need Full Control

When auto-detection fails, use read_csv to specify format parameters:

-- Explicit delimiter, encoding, header
SELECT * FROM read_csv(
  'data/sales.csv',
  delim='|',
  header=true,
  encoding='utf-8',
  quote='"',
  escape='"'
);

Key parameters:

ParameterDefaultDescription
delim,Field delimiter
headertrueIs first row column names
encodingutf-8File encoding
quote"Quote character
escape"Escape character
nullstr''Null string (e.g. NULL, NA)
skip0Lines to skip at start
sample_size-1Type inference sample rows

3. Handling Bad Rows: Don’t Lose Data

Real-world data is never clean. An extra comma, embedded newlines in a field, text sneaking into a numeric column — any of these can crash read_csv_auto.

DuckDB provides ignore_errors=true:

-- Skip bad rows silently
SELECT * FROM read_csv(
  'data/messy_sales.csv',
  header=true,
  ignore_errors=true
);

But blindly skipping bad rows is a bad habit — you might lose 10% of your data without noticing. Better practice: inspect the error rows separately:

-- Capture errors to a separate file (DuckDB v1.5+)
CREATE TABLE clean_sales AS
SELECT * FROM read_csv(
  'data/messy_sales.csv',
  header=true,
  reject_errors=true,
  file_errors='errors.csv'
);

4. Specifying Data Types: Stop Guessing

The most common pitfall: DuckDB reads numbers as VARCHAR, or dates become strings. Usually because the first few thousand rows look clean, but later rows contain anomalies.

Declare the schema manually:

SELECT * FROM read_csv(
  'data/sales.csv',
  header=true,
  columns={
    'order_id': 'INTEGER',
    'product': 'VARCHAR',
    'amount': 'DECIMAL(10,2)',
    'quantity': 'INTEGER',
    'order_date': 'DATE',
    'city': 'VARCHAR'
  }
);

Benefits of explicit schema:

  1. Zero surprises — types are fully under your control
  2. Faster — no need to scan rows for type inference
  3. Cleaner null handling"N/A" in a DECIMAL column becomes NULL

Date Parsing

DuckDB natively supports ISO dates (2026-01-15). For custom formats:

-- Read as VARCHAR, then parse
SELECT *,
  strptime(order_date, '%m/%d/%Y') AS parsed_date
FROM read_csv(
  'data/sales.csv',
  columns={'order_id': 'INTEGER', 'order_date': 'VARCHAR', 'amount': 'DECIMAL(10,2)'}
);

5. Common CSV Pitfalls

-- Skip first 3 lines
SELECT * FROM read_csv('data/sales.csv', skip=3);

5.2 Custom null values

SELECT * FROM read_csv(
  'data/sales.csv',
  nullstr=['NULL', 'NA', '-', '']
);

5.3 No header row

SELECT * FROM read_csv('data/no_header.csv', header=false,
  columns={'name': 'VARCHAR', 'age': 'INTEGER'});

6. Reading from a URL

One of DuckDB’s most powerful features — load CSV directly from a remote URL:

-- From GitHub raw
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/.../data.csv');

-- From S3 (requires httpfs extension)
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv_auto('s3://my-bucket/data/sales.csv');

7. Performance Tips for Large Files

7.1 Disable auto-detection

Auto-detection scans rows (default 20,000) for type inference. For million-row files, this adds seconds:

-- Fast: explicit schema, no inference
SELECT * FROM read_csv(
  'data/large_file.csv',
  columns={'id': 'INTEGER', 'name': 'VARCHAR'},
  auto_detect=false
);

Benchmark (1M rows, 50MB CSV):

MethodTime
read_csv_auto default3.2s
read_csv with schema1.1s

7.2 Import directly to a table

CREATE TABLE large_data AS
SELECT * FROM read_csv(
  'data/large_file.csv',
  columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

8. Convert to Parquet Immediately

CSV is a transport format, not an analysis format. Once imported, export to Parquet immediately:

-- Import CSV
CREATE TABLE raw_data AS
SELECT * FROM read_csv('data/large.csv', columns={...});

-- Export to Parquet
COPY raw_data TO 'data/large.parquet' (FORMAT PARQUET);

Parquet is 5-10x smaller and 10x faster to query.

FAQ

Q1: Out of memory with large files?

SET memory_limit = '2GB';
SET temp_directory = '/tmp/duckdb_tmp';

Q2: GBK-encoded Chinese CSV?

SELECT * FROM read_csv_auto('data/cn_sales.csv', encoding='gbk');

Or convert via CLI: iconv -f GBK -t UTF-8 input.csv > output_utf8.csv

Q3: Preview first N rows?

SELECT * FROM read_csv_auto('data/sales.csv') LIMIT 100;

Summary

DuckDB’s CSV import capabilities range from “quick preview” to “production ETL”:

  1. 80% cases → read_csv_auto — zero config
  2. 15% cases → read_csv — explicit delimiter/encoding/schema
  3. 5% cases → ignore_errors + reject_errors — dirty data
  4. Always convert to Parquet — 10x faster queries later

One golden rule: when encountering a new CSV, always preview with LIMIT 100 first. This prevents 90% of import issues.

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy