Why Your PowerShell Script Produces Mojibake (And How to Fix It)

If you’ve ever written a PowerShell script that produces a file containing non-ASCII characters, like → (an arrow), ≈ (approximately equal), or characters from a non-English alphabet, you may have run into one of the more confusing encoding bugs in the Windows ecosystem. The bug shows up most often when the script was generated by an LLM or agentic coding tool, but it’s been around far longer than those tools have.

The script looks correct. The file is written with UTF8Encoding. The output viewer reads it as UTF-8. And yet, the result is mojibake: â€” instead of an em-dash, â†’ instead of →.

In this post, we’ll dive into why this happens. The cause sits one layer earlier than most developers think, and once you see it, the fix becomes obvious.


The setup

Suppose we want to write a Markdown file from PowerShell that contains a Unicode arrow character:

    $content = "Build → Test → Deploy"
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    [System.IO.File]::WriteAllText("README.md", $content, $utf8NoBom)

This looks correct. We’ve explicitly chosen UTF-8 without BOM. We’re using the .NET WriteAllText method, which is well-behaved. So when we open README.md in a UTF-8-aware editor or push it to GitHub, we should see:

Build → Test → Deploy

But on many Windows systems, what you actually see is:

Build â†’ Test â†’ Deploy

That â†’ is the classic signature of a UTF-8 byte sequence being mis-decoded as Windows-1252. The arrow character (→, Unicode U+2192) takes three bytes in UTF-8: 0xE2 0x86 0x92. If something interprets those three bytes as Windows-1252 instead, it sees three separate characters: â (0xE2), † (0x86), and ’ (0x92), which is exactly the mojibake we’re getting.
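
You can reproduce the mis-decoding in isolation with a few lines of PowerShell. A minimal sketch, assuming the Windows-1252 encoding is available in your session (it is in Windows PowerShell, and PowerShell 7 on Windows registers the legacy code pages):

    # Encode the arrow as UTF-8, then deliberately decode those bytes as Windows-1252.
    $arrow     = [string][char]0x2192                                 # "→"
    $utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($arrow)
    ($utf8Bytes | ForEach-Object { "0x{0:X2}" -f $_ }) -join " "      # 0xE2 0x86 0x92
    [System.Text.Encoding]::GetEncoding(1252).GetString($utf8Bytes)   # "â†’"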

The question is where in the pipeline that misinterpretation happens.


Where the encoding gets lost

There are three layers where encoding could go wrong:

  1. The PowerShell source-file reader. When PowerShell opens a .ps1 file, it has to decide how to interpret the bytes.
  2. The PowerShell string subsystem. Strings in PowerShell are .NET strings (UTF-16 internally), but how they’re built matters.
  3. The output writer. WriteAllText writes whatever bytes the encoder produces.

The third layer is the most-suspected and least-likely culprit. WriteAllText is straightforward; if you give it a correct .NET string and a UTF-8 encoder, it produces correct UTF-8 bytes.

The real problem is layer 1.

When Windows PowerShell (5.1 and earlier) reads a .ps1 script with no byte-order mark (BOM), it falls back to the system’s “ANSI” code page, which on most English-locale Windows machines is Windows-1252. (PowerShell 7+ assumes UTF-8 for BOM-less files, which is why the same script can behave differently from one machine to the next.) If the script file was actually saved as UTF-8 (which is increasingly the default in modern editors like VS Code), every multi-byte UTF-8 sequence in the script is mis-decoded at parse time.

In other words: by the time PowerShell starts executing your code, the → literal in your source has already been corrupted to â†’ in memory. Your WriteAllText call faithfully writes what’s now in the string variable, and the corruption ends up on disk.

The output writer is innocent. The string subsystem is innocent. The damage was done before your script ran.
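
To see the whole chain end to end, here’s a minimal repro sketch. The file names are placeholders; the mojibake only appears when the inner script is executed by Windows PowerShell 5.1 on a machine whose ANSI code page is Windows-1252, while PowerShell 7+ reads the BOM-less file as UTF-8 and the output comes out clean:

    # Build an inner script that contains a literal arrow, using a placeholder
    # so this outer script itself stays pure ASCII.
    $arrow = [char]0x2192
    $inner = @'
    $s = "A ARROW B"
    [System.IO.File]::WriteAllText("$PSScriptRoot\out.txt", $s, (New-Object System.Text.UTF8Encoding $false))
    '@
    $inner = $inner.Replace("ARROW", [string]$arrow)

    # Save the inner script as UTF-8 *without* a BOM - the problematic case.
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    [System.IO.File]::WriteAllText("$PWD\repro.ps1", $inner, $utf8NoBom)

    # Run it with Windows PowerShell 5.1 and inspect the result.
    powershell.exe -NoProfile -File "$PWD\repro.ps1"
    Get-Content "$PWD\out.txt" -Encoding UTF8    # "A â†’ B" instead of "A → B"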


How to recognise it

Two patterns make this bug easy to spot:

  • The output contains â sequences. That’s the UTF-8-as-Windows-1252 signature. â€” is a mis-decoded em-dash; â†’ is a mis-decoded arrow; â€™ is a mis-decoded right single quote (’); â€œ and â€ are mis-decoded smart double quotes (the second often ends in an unprintable character, because 0x9D has no glyph in Windows-1252).
  • The bug appears even when you’ve explicitly used UTF8Encoding. If you’ve audited your write-time encoding and it’s UTF-8 but the output is still wrong, the corruption almost certainly happened earlier.

A quick diagnostic: open the script in VS Code and look at the encoding indicator in the bottom-right of the status bar. If it says “UTF-8” rather than “UTF-8 with BOM” and the script contains non-ASCII characters, you’re in the trap zone unless you’ve taken explicit steps.
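
You can also check from the command line by inspecting the file’s first bytes. A small sketch; the path is a placeholder:

    # Does the script start with the UTF-8 BOM (0xEF 0xBB 0xBF)?
    $bytes = [System.IO.File]::ReadAllBytes("C:\scripts\build.ps1")
    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        "UTF-8 BOM present - PowerShell will read this file as UTF-8"
    } else {
        "No UTF-8 BOM - Windows PowerShell 5.1 will fall back to the ANSI code page"
    }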


Fix 1: save the script as UTF-8 with BOM

The simplest fix is to give PowerShell an unambiguous signal about your script’s encoding. A UTF-8 byte-order mark (the three bytes 0xEF 0xBB 0xBF at the start of the file) tells PowerShell, “this is UTF-8, read it as UTF-8.”

In VS Code:

  1. Open the script file.
  2. Click the encoding indicator in the bottom-right (it’ll say “UTF-8”).
  3. Choose “Save with Encoding”, then “UTF-8 with BOM”.

The script is now self-describing. Every PowerShell version reads it correctly, and any literal Unicode characters in your strings survive unchanged.

The only minor downside is that the BOM bytes are sometimes visible in tools that aren’t BOM-aware (rare in modern toolchains, but worth knowing). For most PowerShell scripts, UTF-8 with BOM is the right default.
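
If you’d rather do this from PowerShell itself, say for a batch of scripts, re-writing the file with a BOM-emitting UTF8Encoding works too. A sketch with a placeholder path; it relies on ReadAllText’s default UTF-8 decoding, so it assumes the file is already valid UTF-8:

    # Re-save an existing script as UTF-8 with BOM.
    $path = "C:\scripts\build.ps1"
    $text = [System.IO.File]::ReadAllText($path)           # ReadAllText defaults to UTF-8
    $utf8WithBom = New-Object System.Text.UTF8Encoding $true
    [System.IO.File]::WriteAllText($path, $text, $utf8WithBom)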


Fix 2: avoid non-ASCII characters in script source

If you can’t control how the script file is saved, for example, if the script is generated by another tool, or distributed across machines with different defaults, there’s a more bulletproof approach: don’t put non-ASCII characters in the script source at all. Construct them at runtime from numeric Unicode code points.

PowerShell can cast an integer to a [char]:

    $rightArrow = [char]0x2192   # →
    $approxEq   = [char]0x2248   # ≈
    $bullet     = [char]0x2022   # •

These constructions can’t be mangled by the source-file reader, because nothing non-ASCII appears in the script file. The Unicode characters are produced at runtime as proper .NET char values, regardless of how the source bytes were interpreted.

You can then build your output strings from these:

    $title = "Build $rightArrow Test $rightArrow Deploy"
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    [System.IO.File]::WriteAllText("README.md", $title, $utf8NoBom)

This produces correct UTF-8 every time, on every system, regardless of code page. The trade-off is readability; [char]0x2192 is uglier than a literal →, but in scripts that need to be portable, it’s a reasonable price.

Common code points worth knowing:

  • 0x2013 and 0x2014: en-dash and em-dash
  • 0x2018 and 0x2019: left and right single quotes
  • 0x201C and 0x201D: left and right double quotes
  • 0x2026: ellipsis
  • 0x2192: right arrow
  • 0x2248: approximately equal
  • 0x251C, 0x2500, 0x2502, 0x2514: box-drawing characters
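
One caveat: the [char] cast only covers code points up to U+FFFF. For anything above that (emoji, for example), .NET’s [char]::ConvertFromUtf32 returns the character as a string containing the surrogate pair:

    $check  = [string][char]0x2713               # ✓ fits in a single UTF-16 code unit
    $rocket = [char]::ConvertFromUtf32(0x1F680)  # 🚀 is above U+FFFF, so it comes back as a two-char string
    "$check Deployed $rocket"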


Which fix should you use?

For most cases, save the script as UTF-8 with BOM. It’s readable, it’s a one-time setting in your editor, and it eliminates the problem at the source.

Reach for [char]0x… constructions only when the script file’s encoding is outside your control. For example, when generating scripts programmatically, distributing them across mixed environments, or working in editors that don’t reliably preserve UTF-8 BOMs.


Why LLM-generated scripts hit this more often

This bug isn’t new. It’s existed for as long as PowerShell and UTF-8 have coexisted on Windows. But it’s becoming more frequent as developers lean on LLMs and agentic coding tools to generate scripts.

There are a few reasons for that.

First, LLMs reach for Unicode characters more readily than humans typing in a hurry. A model writing a script will produce arrows (→), approximately-equal symbols (≈), smart quotes, and box-drawing characters because that output reads better. A developer typing the same script by hand would usually stick to ASCII equivalents (->, ~=, straight quotes) without thinking about it. The richer the output, the more likely it is to contain characters that get mangled by a code-page mismatch.

Second, agentic flows often bypass the editor entirely. When you copy text from an LLM into VS Code yourself, you control the file’s saved encoding and can save it as UTF-8 with BOM. When an agent writes the file directly to disk through some tooling layer, the saved encoding is whatever the tool defaulted to, frequently UTF-8 without a BOM, which is exactly the case Windows PowerShell handles badly.

Third, the LLM has no visibility into your local environment. It doesn’t know whether you’re on Windows PowerShell 5.1 with a Windows-1252 code page or PowerShell 7+ with UTF-8 defaults. The same script that runs cleanly in one environment produces mojibake in the other, and there’s no warning at generation time.

Finally, when an LLM hands you a long script, you tend to skim and run rather than read line by line. You’ll only notice the encoding problem when the output is wrong, and at that point, the natural reaction is to blame the LLM (“it’s outputting weird characters”) rather than the toolchain that mangled the LLM’s correct output between generation and execution.

The practical consequence: if you’re using LLM-assisted or agent-written PowerShell scripts on Windows, expect to hit this bug more often than you used to. The fixes in this post are the same; the frequency is higher. The first thing to check when you see â sequences in your LLM-generated output isn’t the prompt. It’s the script file’s encoding on disk.

To make this concrete: if you’ve ever asked Claude, ChatGPT, or Copilot to “write a PowerShell script that creates a README with a folder tree,” the model will produce output containing box-drawing characters like ├──, │, and └──. That output is correct: those are real Unicode box-drawing characters that look great in any UTF-8-aware viewer. The bug is that PowerShell, on Windows, can’t reliably read its own script file containing those characters unless the file is saved with a BOM. Same story for arrows, smart quotes, accented letters, or any other “richer than ASCII” content the model emits naturally.

To be clear: LLMs don’t cause this bug. The bug is the encoding mismatch between PowerShell’s default source-file reading and the editors and tools that produce UTF-8 without a BOM. LLMs just make the bug more visible, by producing more of the kind of content that triggers it. The mismatch was always there.


Conclusion

The PowerShell encoding bug is a good example of how layered tooling can silently corrupt data. The output writer was correct, the encoder was correct, the string was internally well-formed in .NET. But the first read of the source file mangled the literal characters before any of that mattered.

Two practical takeaways:

  1. Trust BOMs more than you’d expect to. A three-byte signature on a file isn’t ugly clutter. It’s a self-description that prevents an entire class of encoding bugs.
  2. When in doubt, push non-ASCII through numeric code points. It’s verbose, but it bypasses every layer that could mis-decode it.

PowerShell isn’t unique in this. Python 2 had similar issues, and any language whose string handling depends on the host’s code page can hit comparable bugs. But on modern Windows, where most editors default to UTF-8 and most tools expect UTF-8, the mismatch with PowerShell’s default-ANSI source reading is one of the most common quiet failures you’ll encounter.

Now you know where to look. Adios amigos 😊
