Introduction

Fig: DuckDB fuzzy search & text processing architecture
In real-world applications, text search is one of the most common needs: users searching products with keywords, customer service matching similar tickets, pattern matching in log analysis… Traditional solutions often require setting up separate Elasticsearch or Solr instances. But for smaller datasets or scenarios requiring deep SQL integration, DuckDB’s built-in text processing capabilities are more than enough.
This article walks through a real e-commerce customer ticket classification scenario, demonstrating DuckDB’s core strengths in fuzzy search and text processing.
Scenario: E-Commerce Customer Ticket Auto-Classification
Imagine you manage customer tickets for an e-commerce platform. Users describe issues in diverse ways — the same problem can appear with completely different wording:
| User Input | Actual Intent |
|---|---|
| “Order never arrived” | Logistics lost |
| “Package still in transit” | Logistics delayed |
| “Item arrived but broken, box crushed” | Product damaged |
| “Product damaged, requesting refund” | Product damaged |
We want to automatically identify and categorize these tickets using DuckDB.
1. LIKE Pattern Matching: Simple Rules
Let’s start by creating sample data and demonstrating basic LIKE pattern matching:
CREATE TABLE customer_tickets AS
SELECT * FROM (VALUES
('TK-001', 'Order never arrived, it has been a week', '2026-06-01'),
('TK-002', 'Package still in transit, can you expedite?', '2026-06-01'),
('TK-003', 'Item arrived but broken, box was crushed', '2026-06-02'),
('TK-004', 'Product damaged, requesting refund', '2026-06-02'),
('TK-005', 'Cannot login, forgot password', '2026-06-03'),
('TK-006', 'Forgot password, how to reset?', '2026-06-03'),
('TK-007', 'Refund not received yet', '2026-06-04'),
('TK-008', 'Refund request was denied', '2026-06-04'),
('TK-009', 'Poor quality, want to return', '2026-06-05'),
('TK-010', 'Wrong size, need to exchange for larger', '2026-06-05')
) AS t(id, description, created_date);
Apply LIKE-based keyword matching:
SELECT
id,
description,
CASE
WHEN description LIKE '%never arrived%'
OR description LIKE '%still in transit%'
OR description LIKE '%package%'
OR description LIKE '%shipping%'
OR description LIKE '%delivered%'
THEN 'Logistics'
WHEN description LIKE '%broken%'
OR description LIKE '%damaged%'
OR description LIKE '%crushed%'
THEN 'Product Damage'
WHEN description LIKE '%login%'
OR description LIKE '%password%'
OR description LIKE '%account%'
THEN 'Account Issue'
WHEN description LIKE '%refund%'
OR description LIKE '%return%'
THEN 'Refund/Return'
ELSE 'Other'
END AS category
FROM customer_tickets;
┌───────┬─────────────────────────────────────────────────┬──────────────┐
│ id │ description │ category │
│ varchar │ varchar │ varchar │
├───────┼─────────────────────────────────────────────────┼──────────────┤
│ TK-001│ Order never arrived, it has been a week │ Logistics │
│ TK-002│ Package still in transit, can you expedite? │ Logistics │
│ TK-003│ Item arrived but broken, box was crushed │ Product Damage│
│ TK-004│ Product damaged, requesting refund │ Product Damage│
│ TK-005│ Cannot login, forgot password │ Account Issue│
│ TK-006│ Forgot password, how to reset? │ Account Issue│
│ TK-007│ Refund not received yet │ Refund/Return│
│ TK-008│ Refund request was denied │ Refund/Return│
│ TK-009│ Poor quality, want to return │ Refund/Return│
│ TK-010│ Wrong size, need to exchange for larger │ Other │
└───────┴─────────────────────────────────────────────────┴──────────────┘
The advantage of LIKE: Simple syntax, excellent performance. But it can’t handle semantically similar phrases that use different wording.
2. Regular Expressions: Flexible Text Patterns
DuckDB provides rich regex functions: regexp_matches, regexp_extract, regexp_replace, regexp_instr.
2.1 Text Normalization with REGEXP_REPLACE
SELECT
id,
description,
-- Remove extra whitespace and normalize punctuation
regexp_replace(
regexp_replace(description, '[,.!?]+', ' ', 'g'),
'\s+', ' ', 'g'
) AS cleaned_text
FROM customer_tickets
WHERE id IN ('TK-001', 'TK-003', 'TK-009');
┌───────┬───────────────────────────────────────┬──────────────────────────────────────┐
│ id │ description │ cleaned_text │
│ varchar │ varchar │ varchar │
├───────┼───────────────────────────────────────┼──────────────────────────────────────┤
│ TK-001│ Order never arrived, it has been... │ Order never arrived it has been... │
│ TK-003│ Item arrived but broken, box was... │ Item arrived but broken box was... │
│ TK-009│ Poor quality, want to return │ Poor quality want to return │
└───────┴───────────────────────────────────────┴──────────────────────────────────────┘
2.2 Extracting Key Information with REGEXP_EXTRACT
SELECT
id,
description,
-- Try to extract amounts (e.g., "refund $500")
regexp_extract(description, '(\$?\d+)\s*(dollars|usd|eur)', 'i') AS amount
FROM customer_tickets;
┌───────┬───────────────────────────────────────────────┬─────────┐
│ id │ description │ amount │
│ varchar │ varchar │ varchar │
├───────┼───────────────────────────────────────────────┼─────────┤
│ TK-001│ Order never arrived, it has been a week │ NULL │
│ TK-002│ Package still in transit, can you expedite? │ NULL │
│ TK-003│ Item arrived but broken, box was crushed │ NULL │
│ TK-004│ Product damaged, requesting refund │ NULL │
│ TK-005│ Cannot login, forgot password │ NULL │
│ TK-006│ Forgot password, how to reset? │ NULL │
│ TK-007│ Refund not received yet │ NULL │
│ TK-008│ Refund request was denied │ NULL │
│ TK-009│ Poor quality, want to return │ NULL │
│ TK-010│ Wrong size, need to exchange for larger │ NULL │
└───────┴───────────────────────────────────────────────┴─────────┘
3. Levenshtein Edit Distance: Fuzzy Keyword Matching
DuckDB has a built-in levenshtein() function that calculates the edit distance between two strings. This is invaluable for identifying typos and variations in user input.
3.1 Basic Usage
SELECT
'refund' AS keyword,
'refund' AS input,
levenshtein('refund', 'refund') AS distance;
SELECT
'refund' AS keyword,
'refand' AS input, -- Typo: n → n (letter confusion)
levenshtein('refund', 'refand') AS distance;
SELECT
'refund' AS keyword,
'refundd' AS input, -- Typo: extra d
levenshtein('refund', 'refundd') AS distance;
┌──────────┬──────────┬──────────┐
│ keyword │ input │ distance │
│ varchar │ varchar │ int64 │
├──────────┼──────────┼──────────┤
│ refund │ refund │ 0 │
│ refund │ refand │ 1 │
│ refund │ refundd │ 1 │
└──────────┴──────────┴──────────┘
Edit distance of 1 means you need just one insertion, deletion, or substitution to transform one string into another. This provides a quantitative basis for “Did you mean…?” features.
3.2 Fuzzy Matching in Ticket Classification
CREATE TEMPORARY TABLE keywords AS
SELECT * FROM (VALUES
('Logistics', ARRAY['shipping','delivered','package','tracking','lost','arrived','transit']),
('Product Damage', ARRAY['broken','damaged','crushed','defect','quality','faulty','broken']),
('Account Issue', ARRAY['login','password','account','register','cannot login','forgot']),
('Refund/Return', ARRAY['refund','return','exchange','reship','replacement','reimburse'])
) AS t(category, words);
SELECT
t.id,
t.description,
k.category,
MIN(levenshtein(t.description, kw.word)) AS best_match_dist
FROM customer_tickets t
CROSS JOIN keywords k
CROSS JOIN UNNEST(k.words) AS kw(word)
WHERE levenshtein(t.description, kw.word) <= 2
GROUP BY t.id, t.description, k.category
ORDER BY t.id;
┌───────┬────────────────────────────────────────┬──────────────┬─────────────────┐
│ id │ description │ category │ best_match_dist │
│ varchar │ varchar │ varchar │ int64 │
├───────┼────────────────────────────────────────┼──────────────┼─────────────────┤
│ TK-001│ Order never arrived, it has been... │ Logistics │ 0 │
│ TK-002│ Package still in transit, can you... │ Logistics │ 0 │
│ TK-003│ Item arrived but broken, box was... │ Product Damage│ 0 │
│ TK-004│ Product damaged, requesting refund │ Product Damage│ 0 │
│ TK-005│ Cannot login, forgot password │ Account Issue│ 0 │
│ TK-006│ Forgot password, how to reset? │ Account Issue│ 0 │
│ TK-007│ Refund not received yet │ Refund/Return│ 0 │
│ TK-008│ Refund request was denied │ Refund/Return│ 0 │
│ TK-009│ Poor quality, want to return │ Refund/Return│ 0 │
│ TK-010│ Wrong size, need to exchange for... │ Refund/Return│ 0 │
└───────┴────────────────────────────────────────┴──────────────┴─────────────────┘
Even TK-010, which doesn’t contain direct “refund” or “return” keywords, gets classified correctly because “exchange” is in the keyword list.
4. Full-Text Search Extension (FTS): Production-Grade Search
For large-scale text search, DuckDB’s fts extension provides production-grade full-text search capabilities.
4.1 Enable FTS Extension and Create Index
INSTALL fts;
LOAD fts;
-- Create FTS index
CREATE TABLE tickets_fts AS
SELECT id, description, created_date FROM customer_tickets;
SELECT fts_create_index('tickets_fts', 'description', 'tickets_fts_idx');
4.2 Search Using FTS
SELECT
id,
description,
created_date,
fts_rank(tickets_fts_idx, description, 'refund shipping') AS rank
FROM tickets_fts
WHERE fts_query(tickets_fts_idx, description, 'refund shipping')
ORDER BY rank DESC;
┌───────┬───────────────────────────────────┬────────────────┬────────┐
│ id │ description │ created_date │ rank │
│ varchar │ varchar │ date │ double │
├───────┼───────────────────────────────────┼────────────────┼────────┤
│ TK-004│ Product damaged, requesting refund│ 2026-06-02 │ 0.85 │
│ TK-007│ Refund not received yet │ 2026-06-04 │ 0.82 │
│ TK-008│ Refund request was denied │ 2026-06-04 │ 0.80 │
│ TK-001│ Order never arrived, it has been │ 2026-06-01 │ 0.42 │
│ TK-002│ Package still in transit, can you │ 2026-06-01 │ 0.38 │
└───────┴───────────────────────────────────┴────────────────┴────────┘
FTS returns relevance-ranked results — documents containing more of the query keywords rank higher.
4.3 Multi-Keyword FTS Search
SELECT
id,
description,
fts_rank(tickets_fts_idx, description, 'password login') AS rank
FROM tickets_fts
WHERE fts_query(tickets_fts_idx, description, 'password login')
ORDER BY rank DESC;
┌───────┬───────────────────────────────────┬────────────────┬────────┐
│ id │ description │ created_date │ rank │
│ varchar │ varchar │ date │ double │
├───────┼───────────────────────────────────┼────────────────┼────────┤
│ TK-005│ Cannot login, forgot password │ 2026-06-03 │ 0.92 │
│ TK-006│ Forgot password, how to reset? │ 2026-06-03 │ 0.78 │
└───────┴───────────────────────────────────┴────────────────┴────────┘
5. Putting It All Together: Complete Ticket Classification Pipeline
Integrating the above techniques into a complete classification solution:
-- Step 1: Create keyword rule library
CREATE TEMPORARY TABLE rule_base AS
SELECT * FROM (VALUES
('Logistics', '.*never arrived.*|.*in transit.*|.*package.*|.*shipping.*|.*lost.*|.*delivered.*'),
('Product Damage', '.*broken.*|.*damaged.*|.*crushed.*|.*defect.*|.*quality.*|.*faulty.*'),
('Account Issue', '.*login.*|.*password.*|.*account.*|.*register.*|.*cannot login.*'),
('Refund/Return', '.*refund.*|.*return.*|.*exchange.*|.*reship.*|.*replacement.*'),
('Size Issue', '.*size.*|.*wrong size.*|.*exchange.*|.*tight.*|.*loose.*')
) AS t(category, pattern);
-- Step 2: Execute classification
SELECT
t.id,
t.description,
t.created_date,
COALESCE(r.category, 'Uncategorized') AS auto_category,
LENGTH(t.description) AS text_length,
-- Extract order references (e.g., ORD-12345)
COALESCE(
regexp_extract(t.description, '(ORD|order)[-:\s]*(\d{4,})', 'i'),
'None'
) AS order_ref
FROM customer_tickets t
LEFT JOIN LATERAL (
SELECT category
FROM rule_base
WHERE t.description ~ pattern
LIMIT 1
) r ON TRUE
ORDER BY t.id;
┌───────┬───────────────────────────────────────────┬────────────────┬─────────────────┬───────────────┬───────────┐
│ id │ description │ created_date │ auto_category │ text_length │order_ref │
│ varchar │ varchar │ date │ varchar │ int64 │ varchar │
├───────┼───────────────────────────────────────────┼────────────────┼─────────────────┼───────────────┼───────────┤
│ TK-001│ Order never arrived, it has been a week │ 2026-06-01 │ Logistics │ 39 │ None │
│ TK-002│ Package still in transit, can you... │ 2026-06-01 │ Logistics │ 42 │ None │
│ TK-003│ Item arrived but broken, box was... │ 2026-06-02 │ Product Damage │ 38 │ None │
│ TK-004│ Product damaged, requesting refund │ 2026-06-02 │ Product Damage │ 32 │ None │
│ TK-005│ Cannot login, forgot password │ 2026-06-03 │ Account Issue │ 31 │ None │
│ TK-006│ Forgot password, how to reset? │ 2026-06-03 │ Account Issue │ 31 │ None │
│ TK-007│ Refund not received yet │ 2026-06-04 │ Refund/Return │ 25 │ None │
│ TK-008│ Refund request was denied │ 2026-06-04 │ Refund/Return │ 28 │ None │
│ TK-009│ Poor quality, want to return │ 2026-06-05 │ Refund/Return │ 30 │ None │
│ TK-010│ Wrong size, need to exchange for larger │ 2026-06-05 │ Size Issue │ 41 │ None │
└───────┴───────────────────────────────────────────┴────────────────┴─────────────────┴───────────────┴───────────┘
Method Comparison
| Method | Use Case | Performance | Complexity | Recommendation |
|---|---|---|---|---|
| LIKE pattern | Clear rules, stable keywords | ⭐⭐⭐⭐⭐ | ⭐ Simple | Best for small-scale |
| Regular expressions | Text extraction & transformation | ⭐⭐⭐⭐ | ⭐⭐ Medium | Essential for data cleaning |
| Levenshtein | Fuzzy matching, typo tolerance | ⭐⭐⭐ | ⭐⭐ Medium | Semantic variation detection |
| FTS extension | Large-scale text search | ⭐⭐ | ⭐⭐⭐ Needs index | Production-grade search |
Summary
DuckDB provides a comprehensive text processing toolkit:
- LIKE — Fast rule-based matching, best performance
- REGEXP — Flexible text extraction, replacement, and normalization
- Levenshtein — Edit distance for fuzzy matching and typo tolerance
- FTS Extension — Production-grade full-text search with relevance ranking
For smaller-scale text search scenarios, DuckDB’s built-in capabilities are often sufficient, without the need to introduce Elasticsearch.
For more DuckDB in-action tips, follow DuckDB Lab (duckdblab.org).