Convert encoding for multiple files recursively

Posted on Thu 14 January 2016 in Notes

If you have a large corpus of text files in, say, EUC-JP encoding, they can be quite difficult to work with, since most command-line tools on modern systems expect UTF-8 files.

iconv converts files from one known encoding to another. One problem on OS X is that the -o option doesn't work, so you have to use the redirect operator > instead. Moreover, you can't redirect onto the file you are reading, because the shell truncates it before iconv gets a chance to read it. So if you have a large, complex directory structure that you need to traverse recursively to change the encoding of each file, it becomes problematic.
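For a single file, the usual workaround is to write to a temporary file and then move it over the original. Here is a minimal sketch of that pattern; the function name to_utf8 and the .utf8 suffix are my own choices for illustration, and EUC-JP is assumed as the source encoding.

```shell
# to_utf8 is a hypothetical helper name; EUC-JP is the assumed source encoding.
to_utf8() {
  # Redirecting straight back onto "$1" would leave an empty file, so the
  # output goes to a temporary sibling file first.
  if iconv -f EUC-JP -t UTF-8 "$1" > "$1.utf8"; then
    # Conversion succeeded: replace the original with the converted copy.
    mv "$1.utf8" "$1"
  else
    # Conversion failed: keep the original and drop the partial output.
    rm -f "$1.utf8"
    return 1
  fi
}

# Usage: to_utf8 notes.txt
```

The original is only replaced when iconv exits successfully, so a file that is not valid EUC-JP is left untouched.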

I've found the following to work very well:

find . -type f -exec sh -c "iconv -f eucjp -t UTF-8 {} > {}.utf8"  \; -exec mv "{}".utf8 "{}" \;
  • find finds all files and directories recursively
  • . denotes the starting directory: in this case the current directory, and thus everything below it as well.
  • -type f limits the search to files only (so no directories will be returned)
  • -exec executes a command for each search result
  • sh -c starts a shell (sh, which is not necessarily bash) and executes the string following -c
  • iconv -f eucjp -t UTF-8 converts encoding -f(rom) euc-jp to utf-8
  • {} denotes the search result (filename)
  • > is the redirect operator. Redirection is handled by the shell, and -exec runs its command directly without one, so we run this part via sh -c to get it to work (what a mess!)
  • {}.utf8 saves the output to a new file with “utf8” as the extension
  • "  \; closes the shell command string and the first -exec command.
  • -exec runs another command with the search result, but only if the first -exec succeeded
  • mv "{}".utf8 "{}" move the new file to the old filename, thus overwriting the original file
  • \; close the second -exec command.
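One caveat with the command above: {} is pasted directly into the shell command string, so filenames containing spaces or quotes can break it. A possible variant, sketched below, passes the filename to sh as a positional argument instead. The function name convert_tree and its arguments are hypothetical, and EUC-JP is assumed as the default source encoding.

```shell
# convert_tree is a hypothetical helper name.
# Arguments: directory to traverse, and optionally the source encoding
# (EUC-JP is assumed when none is given).
convert_tree() {
  find "$1" -type f -exec sh -c '
    # Inside this inner shell: $1 = source encoding, $2 = file found by find.
    # Replace the file only if iconv succeeds; otherwise drop the partial output.
    iconv -f "$1" -t UTF-8 "$2" > "$2.utf8" && mv "$2.utf8" "$2" || rm -f "$2.utf8"
  ' sh "${2:-EUC-JP}" {} \;
}

# Usage: convert_tree . EUC-JP
```

Because the filename arrives as "$2" rather than as part of the command text, the inner shell never re-parses it. Since every file is overwritten in place, it is worth trying this on a copy of the corpus first.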