How to accurately identify the character set of Python source code [Python 3]

soyeomul · March 11, 2020, 2:38 AM

# -*- coding: utf-8 -*-

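# Python 3 assumes that source files are encoded in UTF-8 by default.
# A source file saved in another encoding (without a matching coding
# declaration) therefore fails with an error when it is run.
# This script helps identify exactly which character set a Python source
# file uses when programming in a Korean-language environment.

# Tested on Ubuntu 18.04 LTS with Python 3.6.9.
# For the test, files in various encodings (cp949, ascii, utf-8, etc.) were placed in the working directory.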

import io
import subprocess

"""
References:
[1] https://docs.python.org/ko/3/library/codecs.html#standard-encodings
[2] https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text
"""

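# Encodings relevant to a Korean-language environment, tried in order.
# utf_8 comes before the legacy Korean codecs because UTF-8 text can
# sometimes decode as cp949 without error, and euc_kr comes before cp949
# because cp949 is a superset of euc_kr.
# iso2022_kr is 7-bit, so such files already pass the ascii check.
encodings = [
    "ascii",
    "utf_8",
    "euc_kr",
    "cp949",
    "iso2022_kr",
]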

def coding_id(file):
    for e in encodings:
        try:
            # Decode the whole file; a wrong codec raises UnicodeDecodeError.
            # The with block closes the handle even when decoding fails.
            with io.open(file, "r", encoding=e) as fh:
                fh.read()
        except UnicodeDecodeError:
            print("got unicode error with %s, trying different encoding" % e)
        else:
            print("opening the file [%s] with encoding: [%s] \n" % (file, e))
            break

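search_py = "ls *.py" # find only Python source files in the working directory
cmd_ls = subprocess.Popen(search_py, stdout=subprocess.PIPE, shell=True)
output = cmd_ls.communicate()[0]
# ls prints one name per line when piped; splitlines() keeps names with spaces intact
file_list = output.decode("utf-8").splitlines()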

for file in file_list:
    coding_id(file)

# EOF
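
As a variation on the script above, the following minimal sketch reads each file once as bytes and tries the codecs in memory, and it uses glob instead of calling ls. The names CANDIDATES and detect_encoding are hypothetical, and the trial order is an assumption: UTF-8 is tried before the legacy Korean codecs because bytes that are valid cp949 are rarely valid UTF-8, while UTF-8 text can occasionally decode as cp949 without error.

```python
import glob

# Hypothetical trial order: strictest codecs first.
CANDIDATES = ["ascii", "utf_8", "euc_kr", "cp949"]


def detect_encoding(data, candidates=CANDIDATES):
    """Return the first codec in candidates that decodes data, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    # Report a best guess for every Python file in the working directory.
    for path in sorted(glob.glob("*.py")):
        with open(path, "rb") as fh:
            print("%s: %s" % (path, detect_encoding(fh.read())))
```

The result is only a best guess: a successful decode shows that the bytes are valid in that codec, not that the file was actually written in it.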

[Written in the Firefox browser on Ubuntu 18.04]

soyeomul · March 11, 2020, 10:33 AM

Here is the code that converts the character set.

# -*- coding: utf-8 -*-

# When programming in Python 3 in a Korean-language environment,
# a source file whose character set is not UTF-8 causes an error.
# This helper script converts such Python source files to UTF-8.

# Tested on Ubuntu 18.04 LTS with Python 3.6.9.

import io
import subprocess

"""
References:
[1] https://docs.python.org/ko/3/library/codecs.html#standard-encodings
[2] https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text
[3] https://stackoverflow.com/questions/19591458/python-reading-from-a-file-and-saving-to-utf-8/19591815
"""

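# Encodings relevant to a Korean-language environment, tried in order.
# utf_8 is tried before the legacy codecs so that a file that is already
# UTF-8 is not mistaken for cp949 and rewritten as garbage.
encodings = [
    "ascii",
    "utf_8",
    "euc_kr",
    "cp949",
    "iso2022_kr",
]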

def coding_work(file):
    for e in encodings:
        try:
            # Read the whole file; a wrong codec raises UnicodeDecodeError.
            with io.open(file, "r", encoding=e) as fh:
                file_all_text = fh.read()
        except UnicodeDecodeError:
            pass
        else:
            print("opening the file [%s] with encoding: [%s]" % (file, e))
            # Only legacy Korean encodings are rewritten; ascii is already valid UTF-8.
            if e in ("cp949", "euc_kr", "iso2022_kr"):
                with io.open(file, "w", encoding="utf_8") as f:
                    f.write(file_all_text)
                print("\t%s: %s ===> %s" % (file, e, "utf_8"))
            break

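search_py = "ls *.py" # find only Python source files in the working directory
cmd_ls = subprocess.Popen(search_py, stdout=subprocess.PIPE, shell=True)
output = cmd_ls.communicate()[0]
# ls prints one name per line when piped; splitlines() keeps names with spaces intact
file_list = output.decode("utf-8").splitlines()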

for file in file_list:
    coding_work(file)

# EOF
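
Because the script above overwrites files in place, a more cautious variation might keep a backup copy first. The sketch below is one such approach under stated assumptions: the helper name convert_to_utf8, the .bak suffix, and the LEGACY trial order are all hypothetical. It does not touch any coding declaration inside the file, which would still need to be edited by hand after conversion.

```python
import shutil

# Hypothetical list of legacy codecs to try, narrowest first.
LEGACY = ["euc_kr", "cp949"]


def convert_to_utf8(path, legacy=LEGACY):
    """Rewrite path as UTF-8 if needed; return the codec it was read with."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf_8")  # ASCII is a subset of UTF-8, so it passes too
        return "utf_8"        # already fine, nothing to rewrite
    except UnicodeDecodeError:
        pass
    for name in legacy:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        shutil.copy2(path, path + ".bak")  # keep the original bytes
        with open(path, "wb") as fh:
            fh.write(text.encode("utf_8"))
        return name
    return None  # unknown encoding: leave the file untouched
```

Working on bytes rather than text mode also leaves the line endings of the file as they were.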

[Attachment: Screenshot from 2020-03-11 19-32-13.png]