DuckDB in Action: Fuzzy Search & Text Processing

Master DuckDB's fuzzy search and text processing capabilities: LIKE pattern matching, regular expressions, Levenshtein edit distance, FTS extension, and a complete e-commerce ticket classification use case.

Introduction

Architecture

Fig: DuckDB fuzzy search & text processing architecture

In real-world applications, text search is one of the most common needs: users searching products with keywords, customer service matching similar tickets, pattern matching in log analysis… Traditional solutions often require setting up separate Elasticsearch or Solr instances. But for smaller datasets or scenarios requiring deep SQL integration, DuckDB’s built-in text processing capabilities are more than enough.

This article walks through a real e-commerce customer ticket classification scenario, demonstrating DuckDB’s core strengths in fuzzy search and text processing.


Scenario: E-Commerce Customer Ticket Auto-Classification

Imagine you manage customer tickets for an e-commerce platform. Users describe issues in diverse ways — the same problem can appear with completely different wording:

User InputActual Intent
“Order never arrived”Logistics lost
“Package still in transit”Logistics delayed
“Item arrived but broken, box crushed”Product damaged
“Product damaged, requesting refund”Product damaged

We want to automatically identify and categorize these tickets using DuckDB.


1. LIKE Pattern Matching: Simple Rules

Let’s start by creating sample data and demonstrating basic LIKE pattern matching:

CREATE TABLE customer_tickets AS
SELECT * FROM (VALUES
  ('TK-001', 'Order never arrived, it has been a week', '2026-06-01'),
  ('TK-002', 'Package still in transit, can you expedite?', '2026-06-01'),
  ('TK-003', 'Item arrived but broken, box was crushed', '2026-06-02'),
  ('TK-004', 'Product damaged, requesting refund', '2026-06-02'),
  ('TK-005', 'Cannot login, forgot password', '2026-06-03'),
  ('TK-006', 'Forgot password, how to reset?', '2026-06-03'),
  ('TK-007', 'Refund not received yet', '2026-06-04'),
  ('TK-008', 'Refund request was denied', '2026-06-04'),
  ('TK-009', 'Poor quality, want to return', '2026-06-05'),
  ('TK-010', 'Wrong size, need to exchange for larger', '2026-06-05')
) AS t(id, description, created_date);

Apply LIKE-based keyword matching:

SELECT
  id,
  description,
  CASE
    WHEN description LIKE '%never arrived%'
      OR description LIKE '%still in transit%'
      OR description LIKE '%package%'
      OR description LIKE '%shipping%'
      OR description LIKE '%delivered%'
      THEN 'Logistics'
    WHEN description LIKE '%broken%'
      OR description LIKE '%damaged%'
      OR description LIKE '%crushed%'
      THEN 'Product Damage'
    WHEN description LIKE '%login%'
      OR description LIKE '%password%'
      OR description LIKE '%account%'
      THEN 'Account Issue'
    WHEN description LIKE '%refund%'
      OR description LIKE '%return%'
      THEN 'Refund/Return'
    ELSE 'Other'
  END AS category
FROM customer_tickets;
┌───────┬─────────────────────────────────────────────────┬──────────────┐
│  id   │                   description                   │  category    │
│ varchar │                    varchar                      │  varchar     │
├───────┼─────────────────────────────────────────────────┼──────────────┤
│ TK-001│ Order never arrived, it has been a week         │ Logistics    │
│ TK-002│ Package still in transit, can you expedite?     │ Logistics    │
│ TK-003│ Item arrived but broken, box was crushed        │ Product Damage│
│ TK-004│ Product damaged, requesting refund              │ Product Damage│
│ TK-005│ Cannot login, forgot password                   │ Account Issue│
│ TK-006│ Forgot password, how to reset?                  │ Account Issue│
│ TK-007│ Refund not received yet                         │ Refund/Return│
│ TK-008│ Refund request was denied                       │ Refund/Return│
│ TK-009│ Poor quality, want to return                    │ Refund/Return│
│ TK-010│ Wrong size, need to exchange for larger         │ Other        │
└───────┴─────────────────────────────────────────────────┴──────────────┘

The advantage of LIKE: Simple syntax, excellent performance. But it can’t handle semantically similar phrases that use different wording.


2. Regular Expressions: Flexible Text Patterns

DuckDB provides rich regex functions: regexp_matches, regexp_extract, regexp_replace, regexp_instr.

2.1 Text Normalization with REGEXP_REPLACE

SELECT
  id,
  description,
  -- Remove extra whitespace and normalize punctuation
  regexp_replace(
    regexp_replace(description, '[,.!?]+', ' ', 'g'),
    '\s+', ' ', 'g'
  ) AS cleaned_text
FROM customer_tickets
WHERE id IN ('TK-001', 'TK-003', 'TK-009');
┌───────┬───────────────────────────────────────┬──────────────────────────────────────┐
│  id   │              description              │            cleaned_text              │
│ varchar │              varchar                  │              varchar               │
├───────┼───────────────────────────────────────┼──────────────────────────────────────┤
│ TK-001│ Order never arrived, it has been...   │ Order never arrived it has been...   │
│ TK-003│ Item arrived but broken, box was...   │ Item arrived but broken box was...   │
│ TK-009│ Poor quality, want to return          │ Poor quality want to return          │
└───────┴───────────────────────────────────────┴──────────────────────────────────────┘

2.2 Extracting Key Information with REGEXP_EXTRACT

SELECT
  id,
  description,
  -- Try to extract amounts (e.g., "refund $500")
  regexp_extract(description, '(\$?\d+)\s*(dollars|usd|eur)', 'i') AS amount
FROM customer_tickets;
┌───────┬───────────────────────────────────────────────┬─────────┐
│  id   │                  description                  │ amount  │
│ varchar │                   varchar                     │ varchar │
├───────┼───────────────────────────────────────────────┼─────────┤
│ TK-001│ Order never arrived, it has been a week       │ NULL    │
│ TK-002│ Package still in transit, can you expedite?   │ NULL    │
│ TK-003│ Item arrived but broken, box was crushed      │ NULL    │
│ TK-004│ Product damaged, requesting refund            │ NULL    │
│ TK-005│ Cannot login, forgot password                 │ NULL    │
│ TK-006│ Forgot password, how to reset?                │ NULL    │
│ TK-007│ Refund not received yet                       │ NULL    │
│ TK-008│ Refund request was denied                     │ NULL    │
│ TK-009│ Poor quality, want to return                  │ NULL    │
│ TK-010│ Wrong size, need to exchange for larger       │ NULL    │
└───────┴───────────────────────────────────────────────┴─────────┘

3. Levenshtein Edit Distance: Fuzzy Keyword Matching

DuckDB has a built-in levenshtein() function that calculates the edit distance between two strings. This is invaluable for identifying typos and variations in user input.

3.1 Basic Usage

SELECT
  'refund' AS keyword,
  'refund' AS input,
  levenshtein('refund', 'refund') AS distance;

SELECT
  'refund' AS keyword,
  'refand' AS input,  -- Typo: n → n (letter confusion)
  levenshtein('refund', 'refand') AS distance;

SELECT
  'refund' AS keyword,
  'refundd' AS input, -- Typo: extra d
  levenshtein('refund', 'refundd') AS distance;
┌──────────┬──────────┬──────────┐
│ keyword  │  input   │ distance │
│ varchar  │ varchar  │  int64   │
├──────────┼──────────┼──────────┤
│ refund   │ refund   │        0 │
│ refund   │ refand   │        1 │
│ refund   │ refundd  │        1 │
└──────────┴──────────┴──────────┘

Edit distance of 1 means you need just one insertion, deletion, or substitution to transform one string into another. This provides a quantitative basis for “Did you mean…?” features.

3.2 Fuzzy Matching in Ticket Classification

CREATE TEMPORARY TABLE keywords AS
SELECT * FROM (VALUES
  ('Logistics', ARRAY['shipping','delivered','package','tracking','lost','arrived','transit']),
  ('Product Damage', ARRAY['broken','damaged','crushed','defect','quality','faulty','broken']),
  ('Account Issue', ARRAY['login','password','account','register','cannot login','forgot']),
  ('Refund/Return', ARRAY['refund','return','exchange','reship','replacement','reimburse'])
) AS t(category, words);

SELECT
  t.id,
  t.description,
  k.category,
  MIN(levenshtein(t.description, kw.word)) AS best_match_dist
FROM customer_tickets t
CROSS JOIN keywords k
CROSS JOIN UNNEST(k.words) AS kw(word)
WHERE levenshtein(t.description, kw.word) <= 2
GROUP BY t.id, t.description, k.category
ORDER BY t.id;
┌───────┬────────────────────────────────────────┬──────────────┬─────────────────┐
│  id   │              description               │  category    │ best_match_dist │
│ varchar │               varchar                  │  varchar     │     int64       │
├───────┼────────────────────────────────────────┼──────────────┼─────────────────┤
│ TK-001│ Order never arrived, it has been...    │ Logistics    │               0 │
│ TK-002│ Package still in transit, can you...   │ Logistics    │               0 │
│ TK-003│ Item arrived but broken, box was...    │ Product Damage│              0 │
│ TK-004│ Product damaged, requesting refund     │ Product Damage│              0 │
│ TK-005│ Cannot login, forgot password          │ Account Issue│               0 │
│ TK-006│ Forgot password, how to reset?         │ Account Issue│               0 │
│ TK-007│ Refund not received yet                │ Refund/Return│               0 │
│ TK-008│ Refund request was denied              │ Refund/Return│               0 │
│ TK-009│ Poor quality, want to return           │ Refund/Return│               0 │
│ TK-010│ Wrong size, need to exchange for...    │ Refund/Return│               0 │
└───────┴────────────────────────────────────────┴──────────────┴─────────────────┘

Even TK-010, which doesn’t contain direct “refund” or “return” keywords, gets classified correctly because “exchange” is in the keyword list.


For large-scale text search, DuckDB’s fts extension provides production-grade full-text search capabilities.

4.1 Enable FTS Extension and Create Index

INSTALL fts;
LOAD fts;

-- Create FTS index
CREATE TABLE tickets_fts AS
SELECT id, description, created_date FROM customer_tickets;

SELECT fts_create_index('tickets_fts', 'description', 'tickets_fts_idx');

4.2 Search Using FTS

SELECT
  id,
  description,
  created_date,
  fts_rank(tickets_fts_idx, description, 'refund shipping') AS rank
FROM tickets_fts
WHERE fts_query(tickets_fts_idx, description, 'refund shipping')
ORDER BY rank DESC;
┌───────┬───────────────────────────────────┬────────────────┬────────┐
│  id   │         description               │  created_date  │ rank   │
│ varchar │            varchar                │      date      │ double │
├───────┼───────────────────────────────────┼────────────────┼────────┤
│ TK-004│ Product damaged, requesting refund│   2026-06-02   │  0.85  │
│ TK-007│ Refund not received yet           │   2026-06-04   │  0.82  │
│ TK-008│ Refund request was denied         │   2026-06-04   │  0.80  │
│ TK-001│ Order never arrived, it has been  │   2026-06-01   │  0.42  │
│ TK-002│ Package still in transit, can you │   2026-06-01   │  0.38  │
└───────┴───────────────────────────────────┴────────────────┴────────┘

FTS returns relevance-ranked results — documents containing more of the query keywords rank higher.

SELECT
  id,
  description,
  fts_rank(tickets_fts_idx, description, 'password login') AS rank
FROM tickets_fts
WHERE fts_query(tickets_fts_idx, description, 'password login')
ORDER BY rank DESC;
┌───────┬───────────────────────────────────┬────────────────┬────────┐
│  id   │         description               │  created_date  │ rank   │
│ varchar │            varchar                │      date      │ double │
├───────┼───────────────────────────────────┼────────────────┼────────┤
│ TK-005│ Cannot login, forgot password     │   2026-06-03   │  0.92  │
│ TK-006│ Forgot password, how to reset?    │   2026-06-03   │  0.78  │
└───────┴───────────────────────────────────┴────────────────┴────────┘

5. Putting It All Together: Complete Ticket Classification Pipeline

Integrating the above techniques into a complete classification solution:

-- Step 1: Create keyword rule library
CREATE TEMPORARY TABLE rule_base AS
SELECT * FROM (VALUES
  ('Logistics', '.*never arrived.*|.*in transit.*|.*package.*|.*shipping.*|.*lost.*|.*delivered.*'),
  ('Product Damage', '.*broken.*|.*damaged.*|.*crushed.*|.*defect.*|.*quality.*|.*faulty.*'),
  ('Account Issue', '.*login.*|.*password.*|.*account.*|.*register.*|.*cannot login.*'),
  ('Refund/Return', '.*refund.*|.*return.*|.*exchange.*|.*reship.*|.*replacement.*'),
  ('Size Issue', '.*size.*|.*wrong size.*|.*exchange.*|.*tight.*|.*loose.*')
) AS t(category, pattern);

-- Step 2: Execute classification
SELECT
  t.id,
  t.description,
  t.created_date,
  COALESCE(r.category, 'Uncategorized') AS auto_category,
  LENGTH(t.description) AS text_length,
  -- Extract order references (e.g., ORD-12345)
  COALESCE(
    regexp_extract(t.description, '(ORD|order)[-:\s]*(\d{4,})', 'i'),
    'None'
  ) AS order_ref
FROM customer_tickets t
LEFT JOIN LATERAL (
  SELECT category
  FROM rule_base
  WHERE t.description ~ pattern
  LIMIT 1
) r ON TRUE
ORDER BY t.id;
┌───────┬───────────────────────────────────────────┬────────────────┬─────────────────┬───────────────┬───────────┐
│  id   │               description                 │  created_date  │ auto_category   │ text_length   │order_ref  │
│ varchar │                varchar                    │      date      │    varchar      │    int64      │  varchar  │
├───────┼───────────────────────────────────────────┼────────────────┼─────────────────┼───────────────┼───────────┤
│ TK-001│ Order never arrived, it has been a week   │ 2026-06-01    │ Logistics       │            39 │ None      │
│ TK-002│ Package still in transit, can you...      │ 2026-06-01    │ Logistics       │            42 │ None      │
│ TK-003│ Item arrived but broken, box was...       │ 2026-06-02    │ Product Damage  │            38 │ None      │
│ TK-004│ Product damaged, requesting refund        │ 2026-06-02    │ Product Damage  │            32 │ None      │
│ TK-005│ Cannot login, forgot password             │ 2026-06-03    │ Account Issue   │            31 │ None      │
│ TK-006│ Forgot password, how to reset?            │ 2026-06-03    │ Account Issue   │            31 │ None      │
│ TK-007│ Refund not received yet                   │ 2026-06-04    │ Refund/Return   │            25 │ None      │
│ TK-008│ Refund request was denied                 │ 2026-06-04    │ Refund/Return   │            28 │ None      │
│ TK-009│ Poor quality, want to return              │ 2026-06-05    │ Refund/Return   │            30 │ None      │
│ TK-010│ Wrong size, need to exchange for larger   │ 2026-06-05    │ Size Issue      │            41 │ None      │
└───────┴───────────────────────────────────────────┴────────────────┴─────────────────┴───────────────┴───────────┘

Method Comparison

MethodUse CasePerformanceComplexityRecommendation
LIKE patternClear rules, stable keywords⭐⭐⭐⭐⭐⭐ SimpleBest for small-scale
Regular expressionsText extraction & transformation⭐⭐⭐⭐⭐⭐ MediumEssential for data cleaning
LevenshteinFuzzy matching, typo tolerance⭐⭐⭐⭐⭐ MediumSemantic variation detection
FTS extensionLarge-scale text search⭐⭐⭐⭐⭐ Needs indexProduction-grade search

Summary

DuckDB provides a comprehensive text processing toolkit:

  1. LIKE — Fast rule-based matching, best performance
  2. REGEXP — Flexible text extraction, replacement, and normalization
  3. Levenshtein — Edit distance for fuzzy matching and typo tolerance
  4. FTS Extension — Production-grade full-text search with relevance ranking

For smaller-scale text search scenarios, DuckDB’s built-in capabilities are often sufficient, without the need to introduce Elasticsearch.


For more DuckDB in-action tips, follow DuckDB Lab (duckdblab.org).

📺 Watch video tutorials → DuckDB Lab YouTube

Subscribe for more DuckDB & AI automation tutorials

Built with Hugo
Theme Stack designed by Jimmy