Unzipping a zip file with specially encoded filenames inside

Category: linux


Encoding is always hard to get right. Judging by the many programs and programmers I have seen, it is not a trivial task at all. If we only had to deal with the 26 letters of English plus a few digits and symbols, a lot of problems would simply disappear. But ever since the fall of the Tower of Babel we have had many languages, and my mother tongue, Chinese, is one of the most distinctive and ancient of them, so I read and work with Chinese text a lot.

There are several popular encodings for the Chinese character set, such as the Guobiao (GB) family and Big5. (Since Chinese, Japanese and Korean share a lot of culture, many Chinese characters appear in Japanese and Korean as well, so encodings like Shift-JIS also cover Chinese characters.) Besides these encodings designed specifically for Chinese, there is the Unicode standard, which aims to include every written character, so Chinese can be represented in Unicode too. The most famous Unicode encoding is UTF-8 (such a great design!).
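To make the difference between these encodings concrete, here is a small sketch that prints the raw bytes of one character under two of them. The helper name hex_bytes is my own invention, and the sketch assumes the script itself is saved as UTF-8 and that the installed iconv knows GBK (glibc's does).

```shell
# hex_bytes TEXT ENCODING: print the bytes of TEXT (given as UTF-8)
# after converting it to ENCODING, as one lowercase hex string.
# hex_bytes is a hypothetical helper name, not a standard tool.
hex_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | od -An -tx1 | tr -d ' \n'
}

hex_bytes '中' UTF-8; echo   # three bytes: e4b8ad
hex_bytes '中' GBK; echo     # two bytes:   d6d0
```

The same character is three bytes in UTF-8 and two in GBK, so a program that guesses the wrong encoding will slice the byte stream at the wrong places and show garbage.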

So much for the background on Chinese encodings. Encoding is a big topic, but with a few basics we can understand why garbage shows up in some files downloaded from the Internet. In this post I just want to talk about unzip on Linux (with Ubuntu as the example, because I work with it most of the time).

So I downloaded a file (a collection of novels) from Baidu Pan. It is a zip file, and when I tried to view its contents in Archive Manager, all I got was garbage folder and file names. I could not even extract them because of the bad encoding. So I switched to the command line and saw this output:

➜  Downloads unzip -l 东野圭吾.zip
Archive:  东野圭吾.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2012-12-29 12:47   ╢л╥░╣ч╬с/
    49230  2012-12-29 12:44   ╢л╥░╣ч╬с/╥╘╒г╤█╕╔▒н.txt
    40890  2012-12-29 12:23   ╢л╥░╣ч╬с/╥┴╢╣┬├╣▌╡─╔ё├╪░╕.txt
   158806  2012-12-29 12:45   ╢л╥░╣ч╬с/┘д└√┬╘╡─┐р─╒.txt
   312084  2012-12-29 12:22   ╢л╥░╣ч╬с/╩╣├№╙ы╨─╡─╝л╧▐.txt
   223452  2012-12-29 12:36   ╢л╥░╣ч╬с/╒ь╠╜┘д└√┬╘.txt
   309774  2012-12-29 12:22   ╢л╥░╣ч╬с/╨┼.txt
   380742  2012-12-29 12:21   ╢л╥░╣ч╬с/╖╓╔э.txt
   211955  2012-12-29 12:22   ╢л╥░╣ч╬с/╩о╥╗╫╓╔▒╚╦.txt
   483969  2012-12-29 12:20   ╢л╥░╣ч╬с/╡е┴╡.txt
   271727  2012-12-29 12:19   ╢л╥░╣ч╬с/▒ф╔э.txt
   276327  2012-12-29 12:22   ╢л╥░╣ч╬с/├√╒ь╠╜╡─╣ц╠ї.txt
   186715  2012-12-29 12:47   ╢л╥░╣ч╬с/├√╒ь╠╜╡─╫ч╓ф.txt
   190776  2012-12-29 12:21   ╢л╥░╣ч╬с/╗╪└╚═д╔▒╚╦╩┬╝■.txt
   346432  2012-12-29 12:22   ╢л╥░╣ч╬с/╩е┼о╡─╛╚╝├.txt
   174262  2012-12-29 12:47   ╢л╥░╣ч╬с/╠ь╩╣╓о╢·.txt
   295093  2012-12-29 12:22   ╢л╥░╣ч╬с/╧╙╥╔╖╕X╡─╧╫╔э.txt
   279797  2012-12-29 12:23   ╢л╥░╣ч╬с/╦▐├№.txt
   113275  2012-12-29 12:22   ╢л╥░╣ч╬с/╔┘┼о╬п═╨╚╦.txt
   128836  2012-12-29 12:20   ╢л╥░╣ч╬с/▓╝╣╚─ё╡─╡░╩╟╦н╡─.txt
   281490  2012-12-29 12:46   ╢л╥░╣ч╬с/▓╝┬│╠╪╦╣╡─╨─╘р.txt
   570664  2012-12-29 12:21   ╢л╥░╣ч╬с/╗├╥╣.txt
   175549  2012-12-29 12:46   ╢л╥░╣ч╬с/╣╓╚╦├╟.txt
   185335  2012-12-29 12:29   ╢л╥░╣ч╬с/╣╓╨ж╨б╦╡.txt
   248421  2012-12-29 12:20   ╢л╥░╣ч╬с/╢ё╥т.txt
   237937  2012-12-29 12:21   ╢л╥░╣ч╬с/╖┼╤з║є.txt
   243242  2012-12-29 12:32   ╢л╥░╣ч╬с/╨┬▓╬╒▀.txt
   550260  2012-12-29 12:22   ╢л╥░╣ч╬с/╔▒╚╦╓о├┼.txt
   231383  2012-12-29 12:29   ╢л╥░╣ч╬с/╢╛╨ж╨б╦╡.txt
   217015  2012-12-29 12:19   ╢л╥░╣ч╬с/▒╧╥╡╟░╔▒╚╦╙╬╧╖.txt
   252810  2012-12-29 12:46   ╢л╥░╣ч╬с/│┴╦п╡─╔н┴╓.txt
   235399  2012-12-29 12:38   ╢л╥░╣ч╬с/├╗╙╨╨╫╩╓╡─╔▒╚╦╥╣.txt
   383936  2012-12-29 12:21   ╢л╥░╣ч╬с/┴ў╨╟╓о░э.txt
   192384  2012-12-29 12:21   ╢л╥░╣ч╬с/║■▒▀╨╫╔▒░╕.txt
   308932  2012-12-29 12:20   ╢л╥░╣ч╬с/▒Ї╦└╓о╤█.txt
   267500  2012-12-29 12:45   ╢л╥░╣ч╬с/░╫┬э╔╜╫п╔▒╚╦╩┬╝■.txt
   366169  2012-12-29 12:39   ╢л╥░╣ч╬с/├╪├▄.txt
   226259  2012-12-29 12:21   ╢л╥░╣ч╬с/║ь╩╓╓╕.txt
   243301  2012-12-29 12:19   ╢л╥░╣ч╬с/░є╝▄╙╬╧╖.txt
   190787  2012-12-29 12:20   ╢л╥░╣ч╬с/│мбд╔▒╚╦╩┬╝■.txt
   192098  2012-12-29 12:23   ╢л╥░╣ч╬с/╤й╡╪╔▒╗·.txt
   182246  2012-12-29 12:23   ╢л╥░╣ч╬с/╘д╓к├╬.txt
   204722  2012-12-29 12:21   ╢л╥░╣ч╬с/║┌╨ж╨б╦╡.txt
---------                     -------
 10621981                     43 files

This is not any real language, and definitely not Chinese. So let's guess that the file names in this zip are encoded in GBK. I know the content is simplified Chinese, so the first guess is Guobiao, abbreviated GB; the next guess is naturally GBK, which Microsoft popularized as code page 936, and most users are on Microsoft systems.
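A guess like this can be sanity-checked without touching the archive, by going the other way: take the name I expect, encode it as GBK, and misread those bytes as a legacy code page. The sketch below assumes that code page is CP866. That is only my assumption, suggested by the Cyrillic letters mixed in with the box-drawing characters; nothing in the unzip output says which code page it fell back to. The function name garble is made up for this example.

```shell
# garble TEXT REAL WRONG: encode TEXT (given as UTF-8) in the REAL encoding,
# then decode the resulting bytes as if they were in the WRONG encoding.
# garble is a hypothetical helper; CP866 below is an assumption, see above.
garble() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | iconv -f "$3" -t UTF-8
}

garble '东野圭吾' GBK CP866; echo
```

If the guess is right, this should print the same ╢л╥░╣ч╬с that appears as the folder name in the listing above.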

➜  Downloads unzip -O GBK -l 东野圭吾.zip
Archive:  东野圭吾.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2012-12-29 12:47   东野圭吾/
    49230  2012-12-29 12:44   东野圭吾/以眨眼干杯.txt
    40890  2012-12-29 12:23   东野圭吾/伊豆旅馆的神秘案.txt
   158806  2012-12-29 12:45   东野圭吾/伽利略的苦恼.txt
   312084  2012-12-29 12:22   东野圭吾/使命与心的极限.txt
   223452  2012-12-29 12:36   东野圭吾/侦探伽利略.txt
   309774  2012-12-29 12:22   东野圭吾/信.txt
   380742  2012-12-29 12:21   东野圭吾/分身.txt
   211955  2012-12-29 12:22   东野圭吾/十一字杀人.txt
   483969  2012-12-29 12:20   东野圭吾/单恋.txt
   271727  2012-12-29 12:19   东野圭吾/变身.txt
   276327  2012-12-29 12:22   东野圭吾/名侦探的规条.txt
   186715  2012-12-29 12:47   东野圭吾/名侦探的诅咒.txt
   190776  2012-12-29 12:21   东野圭吾/回廊亭杀人事件.txt
   346432  2012-12-29 12:22   东野圭吾/圣女的救济.txt
   174262  2012-12-29 12:47   东野圭吾/天使之耳.txt
   295093  2012-12-29 12:22   东野圭吾/嫌疑犯X的献身.txt
   279797  2012-12-29 12:23   东野圭吾/宿命.txt
   113275  2012-12-29 12:22   东野圭吾/少女委托人.txt
   128836  2012-12-29 12:20   东野圭吾/布谷鸟的蛋是谁的.txt
   281490  2012-12-29 12:46   东野圭吾/布鲁特斯的心脏.txt
   570664  2012-12-29 12:21   东野圭吾/幻夜.txt
   175549  2012-12-29 12:46   东野圭吾/怪人们.txt
   185335  2012-12-29 12:29   东野圭吾/怪笑小说.txt
   248421  2012-12-29 12:20   东野圭吾/恶意.txt
   237937  2012-12-29 12:21   东野圭吾/放学后.txt
   243242  2012-12-29 12:32   东野圭吾/新参者.txt
   550260  2012-12-29 12:22   东野圭吾/杀人之门.txt
   231383  2012-12-29 12:29   东野圭吾/毒笑小说.txt
   217015  2012-12-29 12:19   东野圭吾/毕业前杀人游戏.txt
   252810  2012-12-29 12:46   东野圭吾/沉睡的森林.txt
   235399  2012-12-29 12:38   东野圭吾/没有凶手的杀人夜.txt
   383936  2012-12-29 12:21   东野圭吾/流星之绊.txt
   192384  2012-12-29 12:21   东野圭吾/湖边凶杀案.txt
   308932  2012-12-29 12:20   东野圭吾/濒死之眼.txt
   267500  2012-12-29 12:45   东野圭吾/白马山庄杀人事件.txt
   366169  2012-12-29 12:39   东野圭吾/秘密.txt
   226259  2012-12-29 12:21   东野圭吾/红手指.txt
   243301  2012-12-29 12:19   东野圭吾/绑架游戏.txt
   190787  2012-12-29 12:20   东野圭吾/超·杀人事件.txt
   192098  2012-12-29 12:23   东野圭吾/雪地杀机.txt
   182246  2012-12-29 12:23   东野圭吾/预知梦.txt
   204722  2012-12-29 12:21   东野圭吾/黑笑小说.txt
---------                     -------
 10621981                     43 files

This time the output looks right. So I unzipped the file with unzip -O GBK 东野圭吾.zip. Then I opened one novel in vim and started reading; well, the content is most likely GBK as well. So let's convert the files to UTF-8 so that vim and other programs like cat and less can handle them (my system locale is UTF-8). From inside the extracted directory:

# run inside the extracted 东野圭吾/ directory; quoting keeps names with spaces intact
for f in *.txt; do iconv -f GBK -t UTF-8 "$f" > "encoded$f" && mv "encoded$f" "$f"; done
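One catch with a loop like this is that running it twice would mangle files that are already UTF-8. A slightly more defensive variant, as a sketch only: the function name to_utf8 is my own, and it relies on iconv exiting with a non-zero status on invalid input (glibc's iconv does), so a UTF-8 to UTF-8 pass doubles as a validity check.

```shell
# to_utf8 FILE: convert FILE from GBK to UTF-8 in place,
# unless FILE already decodes cleanly as UTF-8.
# to_utf8 is a hypothetical helper name; GBK is the guessed source encoding.
to_utf8() {
    f=$1
    # A file that survives a UTF-8 -> UTF-8 pass is already UTF-8 (or plain ASCII).
    if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    if iconv -f GBK -t UTF-8 "$f" > "$tmp"; then
        # Overwrite the contents instead of mv, so the file keeps its permissions.
        cat "$tmp" > "$f"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage, from inside the extracted directory:
#   for f in *.txt; do to_utf8 "$f"; done
```

The check is a heuristic: a very short GBK file could happen to be valid UTF-8 by accident, but for whole novels that is unlikely.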

Tada! Now I can enjoy the novels!
