# RCA Report: nginx charset_map utf-8 Source Charset Invalid-Pointer Dereference Segfault

## Summary

A `charset_map` directive with `utf-8` in the first column (source charset) causes nginx to allocate single-byte conversion tables where the request-time code expects the UTF-8 table format. When a subsequent HTTP request reaches the charset filter's `recode_from_utf8()` path, the 256-byte single-byte table is cast to `u_char **` and indexed as a pointer array (`table[n >> 8]`), so table bytes are interpreted as a memory address and the worker process crashes with SIGSEGV (signal 11). The upstream fix (commit `29c23ad846787e8baa1390b2edca479eb63ea8d7`) adds configuration-time validation that rejects `charset_map` with `utf-8` in the first column, so the invalid configuration can no longer be loaded.

## Impact

- **Package/Component affected:** nginx `src/http/modules/ngx_http_charset_filter_module.c` (the `ngx_http_charset_filter_module`)
- **Affected versions:** nginx versions prior to commit `29c23ad846787e8baa1390b2edca479eb63ea8d7` (tested on nginx/1.31.3 at parent commit `8f3465ac7f02b0ae86304e1be4ed319abb9d2edb`)
- **Risk level:** High. Any client who can send an HTTP request to a location configured with the affected `charset_map` directive can crash the worker process (denial of service). In this reproduction the crash occurred on every request whose response body contained multi-byte UTF-8 characters.
- **Consequences:** Repeated requests cause continuous worker crashes and respawns, degrading server availability. The crash is deterministic and needs only a single HTTP GET request.

## Impact Parity

- **Disclosed/claimed maximum impact:** Denial of Service (DoS) via NULL pointer dereference / segfault in nginx worker process when processing requests with the misconfigured `charset_map`.
- **Reproduced impact from this run:** DoS confirmed. The nginx worker process crashes with SIGSEGV (signal 11, core dumped) on each HTTP request to the affected location. The worker dies as soon as it processes response body data containing non-ASCII bytes through the `recode_from_utf8()` code path.
- **Parity:** `full`. The reproduced segfault/DoS matches the claimed impact. The faulting pointer observed here is a non-NULL garbage value rather than NULL, which does not change the outcome.
- **Not demonstrated:** No code execution or privilege escalation was claimed or observed; the impact is purely a DoS crash.

## Root Cause

The charset filter module supports two table formats:
1. **Single-byte tables** (256 bytes): used when neither charset in a `charset_map` is UTF-8. Each byte maps directly: `table[src_byte] = dst_byte`.
2. **UTF-8 multi-byte tables** (256 × `NGX_UTF_LEN` = 1024 bytes for `src2dst`, and an array of `u_char *` pointers for `dst2src`): used when the *destination* charset (second column) is UTF-8.

The bug occurs because `ngx_http_charset_map_block()` decides which table format to allocate based solely on whether `value[2]` (the **destination/second** column) is `"utf-8"`. When `utf-8` appears in `value[1]` (the **source/first** column) and the destination is a single-byte charset (e.g., `windows-1251`), the code takes the `else` branch and allocates 256-byte single-byte tables for both `src2dst` and `dst2src`.
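
The mismatch described above can be modelled in a few lines. This is a simplified sketch, not nginx source: the enum and the function names (`choose_format_before_fix`, `needs_utf8_format`, `format_mismatch`) are hypothetical, and it assumes the request-time code expects the UTF-8 format whenever either column is `utf-8`.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical model of the two table formats; not nginx's own types. */
enum table_format { FORMAT_SINGLE_BYTE, FORMAT_UTF8 };

/* Pre-fix decision as described in this report: only the second
 * column (destination) is inspected when choosing the format. */
enum table_format choose_format_before_fix(const char *src, const char *dst)
{
    (void) src;   /* the first column is never consulted */
    return strcasecmp(dst, "utf-8") == 0 ? FORMAT_UTF8 : FORMAT_SINGLE_BYTE;
}

/* Assumption: request-time recoding treats the table as UTF-8 format
 * whenever either side of the map is utf-8. */
int needs_utf8_format(const char *src, const char *dst)
{
    return strcasecmp(src, "utf-8") == 0 || strcasecmp(dst, "utf-8") == 0;
}

/* Returns 1 when the allocated format and the expected format disagree. */
int format_mismatch(const char *src, const char *dst)
{
    return needs_utf8_format(src, dst)
           && choose_format_before_fix(src, dst) != FORMAT_UTF8;
}
```

Under these assumptions, `format_mismatch("utf-8", "windows-1251")` returns 1, while the supported ordering `format_mismatch("windows-1251", "utf-8")` returns 0.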

During request processing, the charset filter's body filter calls `ngx_http_charset_recode_from_utf8()` when `ctx->from_utf8` is true (i.e., the source charset is UTF-8). This function casts `ctx->table` (the 256-byte buffer) to `u_char **table` and dereferences `table[n >> 8]` as a pointer:

```c
table = (u_char **) ctx->table;   // 256-byte buffer cast to pointer array
...
n = ngx_utf8_decode(&src, len);   // decode UTF-8 sequence to codepoint
if (n < 0x10000) {
    p = table[n >> 8];            // reads 8 bytes at offset (n>>8)*8 as a pointer
    if (p) {
        c = p[n & 0xff];          // dereferences the garbage pointer → SIGSEGV
```

For example, with Cyrillic `а` (U+0430, encoded as `0xD0 0xB0`), `ngx_utf8_decode` returns `n = 0x0430`, so `n >> 8 = 4`. On a 64-bit build, `table[4]` reads bytes 32–39 of the 256-byte buffer (values `0x20` through `0x27`), which on a little-endian machine form the garbage pointer `0x2726252423222120`. Since this is non-NULL, `p[n & 0xff]` dereferences `0x2726252423222150`, an address that is not mapped (and is non-canonical on x86-64), causing SIGSEGV.
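
The arithmetic in this example can be checked without touching nginx. The sketch below is illustrative only: the helper names are hypothetical, the table is filled with byte `i` at position `i` for the low half (the `'?'` fill for the upper half is an assumption and does not affect this example), and the 8 bytes are assembled explicitly in little-endian order so that no invalid pointer is ever created or dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-byte single-byte table: identity for 0..127.
 * The '?' fill for 128..255 is an assumption made for this sketch. */
static void fill_single_byte_table(unsigned char table[256])
{
    size_t i;

    for (i = 0; i < 128; i++) {
        table[i] = (unsigned char) i;
    }
    for (; i < 256; i++) {
        table[i] = '?';
    }
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point. */
uint32_t decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    return ((uint32_t) (b0 & 0x1f) << 6) | (uint32_t) (b1 & 0x3f);
}

/* Value a little-endian 64-bit machine would load from slot (n >> 8)
 * if the byte table were viewed as an array of 8-byte pointers.
 * Returns 0 when the slot lies outside the 256-byte buffer. */
uint64_t slot_as_pointer(uint32_t n)
{
    unsigned char table[256];
    size_t        offset = (size_t) (n >> 8) * 8;
    uint64_t      value = 0;
    int           i;

    fill_single_byte_table(table);

    if (offset + 8 > sizeof(table)) {
        return 0;
    }

    /* Most significant byte is at the highest address (little-endian). */
    for (i = 7; i >= 0; i--) {
        value = (value << 8) | table[offset + i];
    }
    return value;
}
```

For `n = decode_utf8_2byte(0xD0, 0xB0)`, the slot value is `0x2726252423222120`, and adding `n & 0xff` (`0x30`) gives the faulting address `0x2726252423222150`.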

**Fix commit:** `29c23ad846787e8baa1390b2edca479eb63ea8d7` — "Charset: disabled charset_map with utf-8 in the first column". The fix adds a check in `ngx_http_charset_map_block()` that rejects the configuration at parse time:

```c
if (ngx_strcasecmp(value[1].data, (u_char *) "utf-8") == 0) {
    ngx_conf_log_error(NGX_LOG_EMERG, cf, 0,
                       "\"charset_map\" with \"utf-8\" charset "
                       "should be given in the second column");
    return NGX_CONF_ERROR;
}
```

## Reproduction Steps

1. **Script:** `bundle/repro/reproduction_steps.sh`
2. **What the script does:**
   - Locates pre-built nginx binaries from the project cache (vulnerable build at commit `8f3465ac7` and fixed build at commit `29c23ad84`), with a fallback to clone-and-build from source.
   - Creates an HTML file containing real UTF-8 multi-byte characters (Cyrillic `а`, `б`, `в` — bytes `0xD0 0xB0`, etc.) to trigger the non-ASCII code path.
   - **Vulnerable test (×2):** Writes an nginx config with `charset_map utf-8 windows-1251 { }` + `charset windows-1251` + `source_charset utf-8`, starts nginx as a real TCP listener, sends an HTTP GET request via curl, and checks the error log for `exited on signal 11` (SIGSEGV).
   - **Fixed test (×2):** Writes the same config and runs `nginx -t` to verify the config is rejected with the patch's error message.
   - **Config acceptance contrast:** Verifies the vulnerable binary accepts the config (exit 0) while the fixed binary rejects it.
   - Writes `bundle/repro/runtime_manifest.json` with proof artifacts.
3. **Expected evidence:** Two vulnerable attempts showing `worker process N exited on signal 11 (core dumped)` in the error log, and two fixed attempts showing `"charset_map" with "utf-8" charset should be given in the second column`.

## Evidence

### Log file locations
- `bundle/logs/vuln_error_1.log` — Vulnerable attempt 1 error log (segfault)
- `bundle/logs/vuln_error_2.log` — Vulnerable attempt 2 error log (segfault)
- `bundle/logs/vuln_conf_1.conf` / `vuln_conf_2.conf` — Vulnerable nginx configs
- `bundle/logs/fixed_test_1.log` / `fixed_test_2.log` — Fixed version config rejection
- `bundle/logs/vuln_config_accept.log` — Vulnerable config acceptance
- `bundle/repro/runtime_manifest.json` — Runtime evidence manifest

### Key excerpts

**Vulnerable worker segfault (attempt 1):**
```
2026/07/04 18:20:50 [alert] 30827#0: worker process 30829 exited on signal 11 (core dumped)
```

**Vulnerable worker segfault (attempt 2):**
```
2026/07/04 18:20:57 [alert] 30847#0: worker process 30849 exited on signal 11 (core dumped)
```

**Fixed version config rejection:**
```
nginx: [emerg] "charset_map" with "utf-8" charset should be given in the second column
nginx: configuration file ... test failed (exit 1)
```

**Vulnerable version config acceptance:**
```
nginx: the configuration file ... syntax is ok
nginx: configuration file ... test is successful (exit 0)
```

### Environment
- nginx/1.31.3 built with `--without-http_rewrite_module --without-http_gzip_module --with-cc-opt='-g -O0'`
- Vulnerable commit: `8f3465ac7f02b0ae86304e1be4ed319abb9d2edb` (parent of fix)
- Fixed commit: `29c23ad846787e8baa1390b2edca479eb63ea8d7`
- gcc 15.2.0, Linux x86_64

## Recommendations / Next Steps

1. **Apply the upstream fix** (commit `29c23ad846787e8baa1390b2edca479eb63ea8d7`) to reject `charset_map` with `utf-8` in the first column at configuration parse time.
2. **Audit existing configurations** for any `charset_map` directives using `utf-8` as the source charset and remove or correct them.
3. **Add a regression test** that verifies `nginx -t` fails when `charset_map utf-8 <charset> { }` is present.
4. **Consider defensive coding** in `recode_from_utf8()` to validate table format before casting, as defense-in-depth against similar misconfigurations.
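
One possible shape for the defense-in-depth idea in item 4 is to tag each table with its format and refuse the pointer-array view unless the tag matches. This is a minimal sketch under assumptions: `tagged_table`, `table_kind`, `as_pointer_table`, and `lookup_from_utf8` are hypothetical names and do not correspond to nginx's actual structures or functions.

```c
#include <stddef.h>

/* Hypothetical tagged table; nginx's real structures differ. */
typedef enum { TABLE_SINGLE_BYTE, TABLE_UTF8_POINTERS } table_kind;

typedef struct {
    table_kind  kind;   /* format recorded when the table is allocated */
    void       *data;   /* 256-byte buffer or array of page pointers */
} tagged_table;

/* Return the pointer-array view only when the table is tagged as such;
 * otherwise return NULL instead of reinterpreting raw bytes. */
unsigned char **as_pointer_table(const tagged_table *t)
{
    if (t == NULL || t->kind != TABLE_UTF8_POINTERS) {
        return NULL;
    }
    return (unsigned char **) t->data;
}

/* Look up code point n. Returns 0 when the table has the wrong format,
 * n is outside the BMP, or the page for n is absent. */
unsigned char lookup_from_utf8(const tagged_table *t, unsigned long n)
{
    unsigned char **pages = as_pointer_table(t);
    unsigned char  *page;

    if (pages == NULL || n >= 0x10000) {
        return 0;
    }

    page = pages[n >> 8];
    return page ? page[n & 0xff] : 0;
}
```

With this layout a single-byte table passed to the UTF-8 path yields "no mapping" rather than a wild dereference. The cost is one extra field per table and one comparison per lookup.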

## Additional Notes

- **Idempotency:** The script uses randomized port bases to avoid TCP TIME_WAIT conflicts between consecutive runs. Verified to pass twice consecutively with exit code 0.
- **Ticket config note:** The ticket's exact map entry `D0B0 E0` (a 2-byte hex value) is rejected even in the vulnerable version because the single-byte parsing path (`else` branch in `ngx_http_charset_map()`) requires values ≤ 255. The vulnerability is triggered with any valid single-byte map entry (e.g., `C0 E0`) or even an empty `charset_map` block (`charset_map utf-8 windows-1251 { }`), since the table format mismatch occurs regardless of the entries.
- **Two crash paths:** The `charset_map utf-8 <non-utf8>` misconfiguration affects two request-time code paths:
  - **`recode_to_utf8`** (when `charset utf-8; source_charset <non-utf8>;`): performs an out-of-bounds read at `table[*src * NGX_UTF_LEN]` on the 256-byte buffer, causing response corruption ("zero size buf" alert) and connection failure.
  - **`recode_from_utf8`** (when `charset <non-utf8>; source_charset utf-8;`): casts the 256-byte buffer to `u_char **` and dereferences `table[n >> 8]` as a pointer, causing a reliable SIGSEGV.
  - The reproduction uses the `recode_from_utf8` path for its deterministic crash behavior. Both paths are eliminated by the same fix.
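
The out-of-bounds read on the `recode_to_utf8` path follows from the index arithmetic alone. The sketch below uses hypothetical names, and `UTF_LEN` mirrors `NGX_UTF_LEN`, taken to be 4 from the 256 × 4 = 1024 figure earlier in this report.

```c
#include <stddef.h>

#define UTF_LEN           4     /* mirrors NGX_UTF_LEN (assumed 4) */
#define SINGLE_TABLE_LEN  256   /* size of the single-byte table */

/* Offset used by the UTF-8-format lookup for a given source byte. */
size_t to_utf8_offset(unsigned char src)
{
    return (size_t) src * UTF_LEN;
}

/* Returns 1 when that offset still falls inside a 256-byte table. */
int offset_in_bounds(unsigned char src)
{
    return to_utf8_offset(src) < SINGLE_TABLE_LEN;
}
```

Any source byte of `0x40` or above produces an offset of 256 or more, so every non-ASCII byte (for example `0xC0`, offset 768) indexes past the end of the 256-byte buffer.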
