1. The Pain: Still Using grep and awk at 2 AM?
Your phone buzzes. Production alert.
You SSH into the server, tail -n 1000 access.log, grep 500, awk '{print $7}', manually count which API is failing most. Fifteen minutes gone. If the log file is multi-GB, grep pegs the CPU and slows down production traffic.
The traditional grep + awk workflow has fundamental problems:
| Scenario | Pain Point | Consequence |
|---|---|---|
| GB-sized logs | grep maxes out CPU | Affects production |
| Multi-dimension analysis | Pipe multiple awk/sed commands | Takes 30 min to compose |
| Trend analysis | Manual cross-file comparison | Misses patterns |
| Reporting | None by default | Re-discover every time |
| Team collaboration | Screenshots + chat messages | Inefficient, error-prone |
The DuckDB solution: Treat your logs like a database table.
Parse raw Nginx logs into structured fields using DuckDB’s regexp_extract, then run SQL aggregations — status code distribution, slowest API endpoints, time-series trends, top error-causing users. One SQL query does what 10 shell commands used to.
2. Parsing Nginx Logs with DuckDB
2.1 Nginx Log Format
A typical Nginx access log (combined format):
192.168.1.1 - - [13/May/2026:10:15:30 +0800] "GET /api/users HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
192.168.1.2 - - [13/May/2026:10:15:31 +0800] "POST /api/orders HTTP/1.1" 500 56 "-" "curl/7.68"
192.168.1.1 - - [13/May/2026:10:15:32 +0800] "GET /api/products HTTP/1.1" 200 8901 "-" "Mozilla/5.0"
192.168.1.3 - - [13/May/2026:10:15:33 +0800] "POST /api/orders HTTP/1.1" 502 0 "-" "python-requests/2.25"
2.2 Parse with DuckDB’s regexp_extract
DuckDB’s built-in regexp_extract lets you extract fields without leaving SQL. Two details matter: read_text returns each file as a single content value, so split it into lines first, and regexp_extract defaults to group 0 (the whole match), so pass an explicit group index of 1:
WITH lines AS (
    SELECT unnest(string_split(content, chr(10))) AS log_line
    FROM read_text('access.log')
),
parsed AS (
    SELECT
        regexp_extract(log_line, '^([^ ]+)', 1) AS ip,
        regexp_extract(log_line, '\[([^\]]+)\]', 1) AS timestamp_raw,
        regexp_extract(log_line, '"([^"]+)"', 1) AS request,
        regexp_extract(log_line, ' (\d{3}) ', 1)::INT AS status_code,
        regexp_extract(log_line, ' (\d+) "', 1)::INT AS body_bytes,
        regexp_extract(log_line, '"([^"]*)"$', 1) AS user_agent
    FROM lines
    WHERE log_line <> ''
)
SELECT status_code, count(*) AS cnt
FROM parsed
GROUP BY status_code
ORDER BY cnt DESC;
Sample output:
| status_code | cnt |
|---|---|
| 200 | 8452 |
| 404 | 123 |
| 500 | 45 |
| 502 | 12 |
Better than grep ' 500 ' access.log | wc -l? You get the full distribution in one query, and you can chain more analysis on top.
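Before trusting the regexes inside a long SQL statement, you can sanity-check the same capture groups with Python's re module. A small sketch against one synthetic log line (the line itself is made up, matching the sample format above):

```python
import re

# One synthetic log line in Nginx combined format
line = ('192.168.1.2 - - [13/May/2026:10:15:31 +0800] '
        '"POST /api/orders HTTP/1.1" 500 56 "-" "curl/7.68"')

# The same capture groups the SQL uses, applied with re
ip          = re.match(r'^([^ ]+)', line).group(1)
timestamp   = re.search(r'\[([^\]]+)\]', line).group(1)
request     = re.search(r'"([^"]+)"', line).group(1)
status_code = int(re.search(r' (\d{3}) ', line).group(1))

print(ip, status_code, request)
# 192.168.1.2 500 POST /api/orders HTTP/1.1
```

If a pattern misbehaves here, it will misbehave identically inside regexp_extract, which makes this a cheap debugging loop.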
2.3 Advanced: Parse HTTP Method and Path
Nginx request format: "GET /api/users HTTP/1.1". Let’s split it:
WITH lines AS (
    SELECT unnest(string_split(content, chr(10))) AS log_line
    FROM read_text('access.log')
),
parsed AS (
    SELECT
        regexp_extract(log_line, '"([^"]+)"', 1) AS request,
        regexp_extract(log_line, ' (\d{3}) ', 1)::INT AS status_code
    FROM lines
    WHERE log_line <> ''
)
SELECT
    regexp_extract(request, '^([^ ]+)', 1) AS http_method,
    regexp_extract(request, ' ([^ ]+) ', 1) AS path,
    status_code,
    count(*) AS cnt
FROM parsed
GROUP BY http_method, path, status_code
ORDER BY cnt DESC
LIMIT 10;
Now you can see at a glance: POST /api/orders has 23 responses with status 500, while GET /api/users is totally clean.
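The same method/path split can be replicated outside SQL. Here's a plain-Python sketch that tallies server errors per endpoint over the four sample lines from section 2.1, useful when DuckDB isn't installed on a box:

```python
import re
from collections import Counter

# The four sample lines from section 2.1
lines = [
    '192.168.1.1 - - [13/May/2026:10:15:30 +0800] "GET /api/users HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '192.168.1.2 - - [13/May/2026:10:15:31 +0800] "POST /api/orders HTTP/1.1" 500 56 "-" "curl/7.68"',
    '192.168.1.1 - - [13/May/2026:10:15:32 +0800] "GET /api/products HTTP/1.1" 200 8901 "-" "Mozilla/5.0"',
    '192.168.1.3 - - [13/May/2026:10:15:33 +0800] "POST /api/orders HTTP/1.1" 502 0 "-" "python-requests/2.25"',
]

server_errors = Counter()
for line in lines:
    # method and path from the quoted request, plus the trailing status code
    m = re.search(r'"([A-Z]+) ([^ ]+) [^"]*" (\d{3}) ', line)
    method, path, status = m.group(1), m.group(2), int(m.group(3))
    if status >= 500:
        server_errors[(method, path)] += 1

print(server_errors.most_common())
# [(('POST', '/api/orders'), 2)]
```

Both 5xx lines hit POST /api/orders, matching what the SQL aggregation surfaces.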
3. Full Project: Log Anomaly Detection Dashboard
Below is a complete Python script that generates mock logs, analyzes them with DuckDB, and provides both CLI output and a Streamlit interactive dashboard.
Prerequisites
pip install duckdb streamlit pandas openpyxl
Complete Code
#!/usr/bin/env python3
"""
DuckDB + Streamlit Log Anomaly Detection Dashboard
Generates mock Nginx logs → DuckDB analysis → Streamlit dashboard → Excel export
"""
import duckdb
import pandas as pd
import random
import datetime
import os

# ============================================================
# Step 1: Generate Mock Nginx Access Logs
# ============================================================
def generate_nginx_logs(num_lines=10000, output_file="nginx_access.log"):
    """Generate mock Nginx access logs with some anomalies"""
    ips = [f"192.168.1.{i}" for i in range(1, 21)]
    paths = [
        "/api/users", "/api/products", "/api/orders",
        "/api/payments", "/api/auth/login", "/api/auth/logout",
        "/api/search", "/api/recommend", "/api/cart", "/api/checkout"
    ]
    methods = ["GET", "POST", "PUT", "DELETE"]
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "curl/7.68.0",
        "python-requests/2.25.1",
        "PostmanRuntime/7.28.4"
    ]
    base_time = datetime.datetime(2026, 5, 13, 0, 0, 0)
    with open(output_file, "w") as f:
        for _ in range(num_lines):
            base_time += datetime.timedelta(seconds=random.uniform(0.1, 5))
            timestamp = base_time.strftime("%d/%b/%Y:%H:%M:%S +0800")
            ip = random.choice(ips)
            method = random.choice(methods)
            path = random.choice(paths)
            # Inject ~5% server errors...
            if random.random() < 0.05:
                status = random.choice([500, 502, 503, 504])
                bytes_sent = random.randint(0, 200)
                response_time = random.uniform(3, 15)
            # ...and roughly 10% client errors
            elif random.random() < 0.10:
                status = random.choice([400, 401, 403, 404, 429])
                bytes_sent = random.randint(50, 500)
                response_time = random.uniform(0.1, 2)
            else:
                status = random.choice([200, 201, 204, 301, 302])
                bytes_sent = random.randint(200, 15000)
                response_time = random.uniform(0.01, 1.5)
            ua = random.choice(user_agents)
            log_line = (
                f'{ip} - - [{timestamp}] '
                f'"{method} {path} HTTP/1.1" {status} {bytes_sent} '
                f'"{random.choice(["-", "https://example.com"])}" '
                f'"{ua}" {response_time:.3f}\n'
            )
            f.write(log_line)
    print(f"✅ Generated {num_lines} mock log lines → {output_file}")
    return output_file
# ============================================================
# Step 2: DuckDB Log Analysis Engine
# ============================================================
class LogAnalyzer:
    """DuckDB-powered log analysis engine"""

    def __init__(self, log_file="nginx_access.log"):
        self.con = duckdb.connect()
        self.log_file = log_file
        self._load_and_parse()

    def _load_and_parse(self):
        """Load the log file and parse it into structured fields"""
        # read_text returns one row per file with a single `content` column,
        # so split it into lines first. regexp_extract defaults to group 0
        # (the whole match), so pass an explicit group index of 1.
        self.con.execute(f"""
            CREATE TABLE logs AS
            WITH lines AS (
                SELECT unnest(string_split(content, chr(10))) AS line
                FROM read_text('{self.log_file}')
            ),
            parsed AS (
                SELECT
                    regexp_extract(line, '^([^ ]+)', 1) AS ip,
                    regexp_extract(line, '\\[([^\\]]+)\\]', 1) AS timestamp_raw,
                    regexp_extract(line, '"([^"]+)"', 1) AS request,
                    regexp_extract(line, ' (\\d{{3}}) ', 1)::INT AS status_code,
                    regexp_extract(line, ' (\\d+) "', 1)::INT AS body_bytes,
                    regexp_extract(line, '"([^"]*)" [0-9.]+$', 1) AS user_agent,
                    regexp_extract(line, ' (\\d+\\.\\d+)$', 1)::DOUBLE AS response_time
                FROM lines
                WHERE line <> ''
            )
            SELECT *,
                regexp_extract(request, '^([^ ]+)', 1) AS http_method,
                regexp_extract(request, ' ([^ ]+) ', 1) AS path,
                -- drop the timezone offset, then parse the combined-format timestamp
                strptime(split_part(timestamp_raw, ' ', 1),
                         '%d/%b/%Y:%H:%M:%S') AS request_time
            FROM parsed
        """)
        row_count = self.con.execute("SELECT count(*) FROM logs").fetchone()[0]
        print(f"✅ DuckDB parsed {row_count} log entries")

    def status_distribution(self):
        """Analysis 1: Status code distribution"""
        return self.con.execute("""
            SELECT status_code, count(*) AS cnt,
                   round(count(*) * 100.0 / sum(count(*)) OVER (), 2) AS pct
            FROM logs
            GROUP BY status_code
            ORDER BY cnt DESC
        """).fetchdf()

    def error_paths(self, top_n=10):
        """Analysis 2: Most error-prone API paths"""
        return self.con.execute(f"""
            SELECT path, http_method,
                   count(*) AS total_requests,
                   sum(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS server_errors,
                   sum(CASE WHEN status_code BETWEEN 400 AND 499 THEN 1 ELSE 0 END) AS client_errors,
                   round(AVG(response_time), 3) AS avg_response_time,
                   round(MAX(response_time), 3) AS max_response_time
            FROM logs
            GROUP BY path, http_method
            HAVING server_errors > 0 OR client_errors > 0
            ORDER BY server_errors DESC
            LIMIT {top_n}
        """).fetchdf()

    def slowest_apis(self, top_n=10):
        """Analysis 3: Slowest API endpoints by P95 latency"""
        # response_time is logged in seconds; convert to ms for readability
        return self.con.execute(f"""
            SELECT path, http_method,
                   count(*) AS cnt,
                   round(AVG(response_time) * 1000, 1) AS avg_ms,
                   round(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time) * 1000, 1) AS p95_ms,
                   round(MAX(response_time) * 1000, 1) AS max_ms
            FROM logs
            GROUP BY path, http_method
            HAVING cnt > 5
            ORDER BY p95_ms DESC
            LIMIT {top_n}
        """).fetchdf()

    def time_series(self, interval='5 minutes'):
        """Analysis 4: Time series trend"""
        # date_trunc only accepts fixed parts like 'minute';
        # time_bucket handles arbitrary widths such as 5 minutes
        return self.con.execute(f"""
            SELECT time_bucket(INTERVAL '{interval}', request_time) AS bucket,
                   count(*) AS total_requests,
                   sum(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS errors,
                   round(AVG(response_time), 3) AS avg_response_time
            FROM logs
            GROUP BY bucket
            ORDER BY bucket
        """).fetchdf()

    def top_error_users(self, top_n=5):
        """Analysis 5: Top error-causing IPs"""
        return self.con.execute(f"""
            SELECT ip,
                   count(*) AS total_requests,
                   sum(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS server_errors,
                   round(AVG(response_time), 3) AS avg_response_time
            FROM logs
            GROUP BY ip
            HAVING server_errors > 0
            ORDER BY server_errors DESC
            LIMIT {top_n}
        """).fetchdf()

    def export_excel(self, output_file="log_analysis_report.xlsx"):
        """Export the complete report to Excel"""
        with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
            self.status_distribution().to_excel(writer, sheet_name='Status Codes', index=False)
            self.error_paths().to_excel(writer, sheet_name='Error Paths', index=False)
            self.slowest_apis().to_excel(writer, sheet_name='Slow APIs', index=False)
            self.time_series().to_excel(writer, sheet_name='Time Trends', index=False)
            self.top_error_users().to_excel(writer, sheet_name='Problem Users', index=False)
        print(f"✅ Report exported → {output_file}")
        return output_file
# ============================================================
# Step 3: Streamlit Dashboard
# ============================================================
def run_dashboard():
    """Launch the Streamlit interactive dashboard"""
    import streamlit as st

    st.set_page_config(
        page_title="Log Anomaly Detection Dashboard",
        page_icon="📊",
        layout="wide"
    )
    st.title("📊 Log Anomaly Detection Dashboard")
    st.markdown("DuckDB-powered Nginx access log analysis engine")

    # Initialize
    log_file = "nginx_access.log"
    if not os.path.exists(log_file):
        st.info("Generating mock log data...")
        generate_nginx_logs(10000, log_file)
    analyzer = LogAnalyzer(log_file)

    # ---- Key Metrics ----
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        total = analyzer.con.execute("SELECT count(*) FROM logs").fetchone()[0]
        st.metric("Total Requests", f"{total:,}")
    with col2:
        errors = analyzer.con.execute(
            "SELECT count(*) FROM logs WHERE status_code >= 500"
        ).fetchone()[0]
        st.metric("Server Errors", errors)
    with col3:
        client_errors = analyzer.con.execute(
            "SELECT count(*) FROM logs WHERE status_code >= 400 AND status_code < 500"
        ).fetchone()[0]
        st.metric("Client Errors", client_errors)
    with col4:
        avg_resp = analyzer.con.execute(
            "SELECT round(AVG(response_time), 3) FROM logs"
        ).fetchone()[0]
        st.metric("Avg Response Time", f"{avg_resp:.2f}s")

    # ---- Tabs ----
    tab1, tab2, tab3, tab4, tab5 = st.tabs([
        "🔴 Error Analysis", "🐢 Slow APIs", "📈 Trends", "👤 Users", "📋 Status Codes"
    ])
    with tab1:
        st.subheader("Most Error-Prone API Paths")
        df_errors = analyzer.error_paths(15)
        st.dataframe(df_errors, use_container_width=True)
        st.bar_chart(df_errors.set_index("path")["server_errors"])
    with tab2:
        st.subheader("Slowest APIs (P95 Latency)")
        df_slow = analyzer.slowest_apis(15)
        st.dataframe(df_slow, use_container_width=True)
        st.bar_chart(df_slow.set_index("path")["p95_ms"])
    with tab3:
        st.subheader("Request Volume & Error Trend")
        df_ts = analyzer.time_series()
        st.line_chart(df_ts.set_index("bucket")[["total_requests", "errors"]])
    with tab4:
        st.subheader("Top Error-Causing Client IPs")
        df_users = analyzer.top_error_users(10)
        st.dataframe(df_users, use_container_width=True)
    with tab5:
        st.subheader("HTTP Status Code Distribution")
        df_status = analyzer.status_distribution()
        st.dataframe(df_status, use_container_width=True)
        st.bar_chart(df_status.set_index("status_code")["cnt"])

    # ---- Export ----
    if st.button("📥 Export Full Report (Excel)"):
        filepath = analyzer.export_excel()
        with open(filepath, "rb") as f:
            st.download_button(
                "Download Excel Report",
                f,
                file_name="log_analysis_report.xlsx",
                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
            )
    st.markdown("---")
    st.caption("Powered by DuckDB 🦆 + Streamlit")
# ============================================================
# Entry Point
# ============================================================
def _under_streamlit():
    """True when the script was launched via `streamlit run`"""
    try:
        from streamlit import runtime
        return runtime.exists()
    except ImportError:
        return False

if __name__ == "__main__":
    if _under_streamlit():
        run_dashboard()
    else:
        # CLI mode
        print("=" * 50)
        print("🦆 DuckDB Log Analyzer (CLI Mode)")
        print("=" * 50)
        log_file = generate_nginx_logs(10000)
        analyzer = LogAnalyzer(log_file)
        print("\n📊 Status Code Distribution:")
        print(analyzer.status_distribution().to_string(index=False))
        print("\n🔴 Most Error-Prone APIs:")
        print(analyzer.error_paths().to_string(index=False))
        print("\n🐢 Slowest APIs:")
        print(analyzer.slowest_apis().to_string(index=False))
        print("\n👤 Problem Users:")
        print(analyzer.top_error_users().to_string(index=False))
        analyzer.export_excel()
        print("\n✅ Analysis complete!")
        print("💡 Tip: Run `streamlit run log_analyzer.py` for the interactive dashboard")
How to Run
CLI mode (quick analysis + Excel export):
python3 log_analyzer.py
Dashboard mode (Streamlit web UI):
streamlit run log_analyzer.py
4. Performance Comparison
| Dimension | Traditional (grep/awk) | DuckDB + Streamlit | Improvement |
|---|---|---|---|
| 1GB log analysis | 3-5 min, CPU 100% | 5-10 seconds | 30x |
| Multi-dimension analysis | Complex pipe chains | Single SQL query | ∞ |
| Interactive exploration | Not supported | Real-time filtering | New |
| Reporting | Manual screenshot assembly | One-click Excel export | Save 30 min |
| Historical trends | Manual cross-file compare | Aggregated time series | New |
| Team sharing | Screenshots + chat | Dashboard URL | New |
5. Monetization Strategy
Target Customers
| Customer Type | Pain Point | Price |
|---|---|---|
| Startups (10-50 people) | No log system, all SSH-based | $400-800/setup |
| Mid-size e-commerce | Large Nginx logs, needs periodic analysis | $800-1,200/setup |
| Mobile app/WeChat teams | Need API quality monitoring | $600-900/setup |
| Cloud service resellers | Offer log analysis to downstream clients | $1,200-2,500/project |
Delivery Checklist
- Docker deployment script (one-command startup)
- Nginx log format adapter (supports custom log_format)
- Analysis dashboard (5 core dimensions)
- Scheduled report (daily auto-send via email/webhook)
- Alert configuration (threshold-based notifications)
Comparison with Alternatives
| Solution | Price | Complexity | Best For |
|---|---|---|---|
| ELK Stack (Elasticsearch + Logstash + Kibana) | Free, but the ops cost is real | ⭐⭐⭐⭐⭐ | Large-scale log platform |
| Datadog / New Relic | $15-30/host/month | ⭐⭐ | Cloud-native teams |
| Self-hosted Grafana + Loki | Free, needs K8s | ⭐⭐⭐⭐ | Teams with ops talent |
| DuckDB + Streamlit | Completely free | ⭐ | Small teams, indie devs |
Upsell Opportunities
- Multi-server aggregation: Collect logs from N servers via SCP/rsync
- Real-time alerts: Integrate Slack/DingTalk/WeChat Webhook for error notifications
- Custom dashboards: Let clients configure dimensions via a YAML/JSON file
- Historical archiving: Weekly/monthly rollups for trend comparison
6. Summary
DuckDB’s advantages for log analysis:
- Zero ops overhead: no Elasticsearch/Logstash/Kibana stack needed, just pip install duckdb streamlit
- SQL superpowers: regexp_extract + date_trunc + PERCENTILE_CONT turn log parsing from string hacking into real data analysis
- Performance: the vectorized engine handles GB-sized logs in seconds
- Deliverable: Streamlit dashboard + Excel export, so clients don’t need to learn any tool
Bottom line: What used to take 30 minutes of grep/awk troubleshooting now takes 5 minutes with a full diagnostic report. Sell this skill to startups for $400+ per setup.