John Siu Blog

Tech - Business Tool, Personal Toys

Text File Encode/Charset Conversion


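From time to time we get text files that display as garbage characters, usually because they are stored in an encoding other than the one the terminal expects.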

Garbage Text

Take the following example:

$ cat test.txt
���ؐ��Γy�����p���@�␼�k

iconv

There is a tool called iconv that can fix that:

$ iconv --help
Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.

 Input/Output format specification:
  -f, --from-code=NAME       encoding of original text
  -t, --to-code=NAME         encoding for output

 Information:
  -l, --list                 list all known coded character sets

 Output control:
  -c                         omit invalid characters from output
  -o, --output=FILE          output file
  -s, --silent               suppress warnings
      --verbose              print progress information

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

For bug reporting instructions, please see:
<https://bugs.launchpad.net/ubuntu/+source/glibc/+bugs>.

At a minimum we have to supply -f (--from-code=NAME) and -t (--to-code=NAME). The obvious choice for -t is UTF8. But what about -f?

uchardet

uchardet is the “Universal Charset Detector”. Given a file, it prints its best guess of the file's encoding:

uchardet <file.txt>

Using it on the example above:

$ uchardet test.txt
SHIFT_JIS

This tells us the content of test.txt is encoded in SHIFT_JIS, a common encoding for Japanese text files and websites.
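
Before handing the detected name to iconv, it may be worth checking that iconv actually lists it, since the two tools do not necessarily share the same set of names. Below is a small sketch of such a check. The function name iconv_knows is made up for this example, and the parsing assumes that `iconv -l` separates names with commas, slashes, spaces or newlines, which appears to hold for glibc and GNU libiconv but may not hold everywhere.

```shell
#!/bin/sh
# iconv_knows NAME: succeed if NAME appears in the list printed by `iconv -l`.
# Hypothetical helper for illustration; not part of iconv or uchardet.
iconv_knows() {
    # Split the listing into one name per line, then look for an exact,
    # case-insensitive, fixed-string match of the requested name.
    iconv -l | tr ',/ ' '\n\n\n' | grep -qixF -- "$1"
}

if iconv_knows SHIFT_JIS; then
    echo "iconv lists SHIFT_JIS"
else
    echo "iconv does not list SHIFT_JIS"
fi
```

If the name is not listed, an alias such as SJIS may still be, so the check is only a first hint.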

Combine

Putting everything together:

$ uchardet test.txt
SHIFT_JIS
$ iconv -f SHIFT_JIS -t UTF8 test.txt
金木水火土中日英美法俄西北
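
To experiment without a Shift_JIS file at hand, one can be produced by running iconv in the opposite direction. The following is a minimal sketch, assuming the script itself is saved as UTF-8 and that the local iconv build includes SHIFT_JIS. The file name sample_sjis.txt is made up for the example.

```shell
#!/bin/sh
# Convert a short UTF-8 string to Shift_JIS and save it as a sample file.
printf '金木水火土\n' | iconv -f UTF-8 -t SHIFT_JIS > sample_sjis.txt

# Show the size in bytes. Each of these kanji takes 2 bytes in Shift_JIS
# (3 bytes in UTF-8), so the file is smaller than the UTF-8 original.
wc -c < sample_sjis.txt

# Convert the file back to UTF-8 to confirm the round trip.
iconv -f SHIFT_JIS -t UTF-8 sample_sjis.txt
```

Running cat on sample_sjis.txt in a UTF-8 terminal should show garbage similar to the first example.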

We can put the above in a script:

iconv_cat:

#!/bin/sh
# Detect the encoding of the file given as $1, then print it as UTF-8.
iconv -f "$(uchardet "$1")" -t UTF8 "$1"

Example:

$ iconv_cat test.txt
金木水火土中日英美法俄西北
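
The script above trusts whatever uchardet prints. A slightly more defensive variant might look like the sketch below. The function name to_utf8 and its optional second argument are made up for this example, and the check for "unknown" assumes that this is what uchardet prints when it cannot decide.

```shell
#!/bin/sh
# to_utf8 FILE [ENCODING]
# Print FILE as UTF-8. If ENCODING is omitted, ask uchardet to guess it.
# Hypothetical helper for illustration; not part of iconv or uchardet.
to_utf8() {
    file=$1
    enc=${2:-}

    if [ ! -r "$file" ]; then
        echo "to_utf8: cannot read $file" >&2
        return 1
    fi

    # Only call uchardet when no encoding was given explicitly.
    if [ -z "$enc" ]; then
        enc=$(uchardet "$file") || return 1
    fi

    # Assumption: uchardet reports "unknown" when detection fails.
    if [ -z "$enc" ] || [ "$enc" = "unknown" ]; then
        echo "to_utf8: could not detect encoding of $file" >&2
        return 1
    fi

    iconv -f "$enc" -t UTF-8 "$file"
}
```

With the function defined, `to_utf8 test.txt` relies on detection, while `to_utf8 test.txt SHIFT_JIS` skips uchardet and is useful when the guess turns out to be wrong.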

John Siu

Update: 2020-08-20