ch2 ch3: UnicodeDecodeError; garbled text; output differs from the book

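Good evening, Mr. He and everyone in the group! Sorry to bother you. While following the textbook I hit errors in the places listed below and could not work out the cause. After a lot of searching I found no relevant answers either, so I would be grateful for any pointers from Mr. He and the group. Thank you very much! 2020-04-08T16:00:00Z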

1. Chapter 2, p. 87: running demo_stopwords.py as-is raises an error:
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch02/demo_stopwords.py

Traceback (most recent call last):
  File "", line 58, in
    trie = load_from_file(HanLP.Config.CoreStopWordDictionaryPath)
  File "", line 14, in load_from_file
    for word in src:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x90 in position 2519: illegal multibyte sequence
———————————————————————————————————————–————
2. Chapter 3, p. 99: running sighan05_statistics.py as-is raises an error:
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/sighan05_statistics.py

Traceback (most recent call last):
  File "", line 38, in
    (data.upper(),) + count_corpus(train_path, test_path)))
  File "", line 10, in count_corpus
    train_counter, train_freq, train_chars = count_word_freq(train_path)
  File "", line 22, in count_word_freq
    for line in src:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x84 in position 26: illegal multibyte sequence

————————————————————————————————————————————
3. Chapter 3, p. 101: running demo_corpus_loader.py as-is shows garbled Chinese text
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/demo_corpus_loader.py

————————————————————————————————————————————
4. Chapter 3, p. 103: the Chinese text in my_cws_model.txt is garbled:
…\static\data\test\my_cws_model.txt
————————————————————————————————————————————
5. Chapter 3, p. 114: running ngram_segment.py gives a result that differs from the book:
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/ngram_segment.py

My output:
[' ', '商品和服', '务', ' ']

The book shows:
[' ', '商品', '和', '服务', ' ']


——————————————————————————————————
6. Chapter 3, p. 117: the output of demo_custom_dict.py differs from the book:
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/demo_custom_dict.py

My output:
No dictionary mounted: [社会摇摆简称社会摇/n]
Low-priority dictionary: [社会摇摆简称社会摇/n]
High-priority dictionary: [社会摇/nz, 摆简称/n, 社会摇/nz]

Book output:
No dictionary mounted: [社会/n, 摇摆/v, 简称/v, 社会/n, 摇/n]
Low-priority dictionary: [社会/n, 摇摆/v, 简称/v, 社会摇/nz]
High-priority dictionary: [社会摇/nz, 摆简称/n, 社会摇/nz]
————————————————————————————————————

For questions 3 and 4: in demo_corpus_loader.py I changed with open(corpus_path, 'w') to with open(corpus_path, 'w', encoding='utf-8'), then reran demo_corpus_loader.py and ngram_segment.py. The Chinese text now displays correctly.

https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/demo_corpus_loader.py
https://github.com/hankcs/pyhanlp/blob/master/tests/book/ch03/ngram_segment.py
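
A minimal sketch of that change, for anyone who wants to try it outside the book's scripts. The file name sample_corpus.txt, the helper names write_corpus and read_corpus, and the sample sentences are made up for illustration and are not taken from demo_corpus_loader.py. The point is only that passing encoding='utf-8' on both the write and the read keeps the Chinese text intact whatever the platform default is.

```python
import os
import tempfile


def write_corpus(corpus_path, sentences):
    """Write one sentence per line, always encoded as UTF-8."""
    with open(corpus_path, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(sentence + '\n')


def read_corpus(corpus_path):
    """Read the file back with the same explicit encoding."""
    with open(corpus_path, encoding='utf-8') as src:
        return [line.rstrip('\n') for line in src]


# Hypothetical path and data, used only for this demonstration.
corpus_path = os.path.join(tempfile.mkdtemp(), 'sample_corpus.txt')
sentences = ['商品 和 服务', '商品 和服 物美价廉', '服务 和 货币']

write_corpus(corpus_path, sentences)
round_trip = read_corpus(corpus_path)
print(round_trip)
```

If the write omitted the encoding argument, the file would be saved in the platform default encoding, and any tool that later reads it as UTF-8 could show garbled text.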


Thanks for the feedback. Please check whether the following patch fixes the problem:

You may also need to delete the custom dictionary's bin file.

1. Questions 1-4: with the patch Mr. He provided (adding encoding='utf-8' to the open(...) calls in the code), everything runs correctly.

2. Questions 5-6: after deleting the custom dictionary's bin file, the output matches the book.
…pyhanlp/static/data/test/my_cws_model.txt.bin
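
For reference, a small sketch of how that cleanup could be scripted. It assumes the cache sits next to the text file and is named by appending .bin, as in the path above. The helper remove_stale_cache is hypothetical and not part of pyhanlp, and the demo runs in a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path


def remove_stale_cache(text_path):
    """Delete '<text_path>.bin' if it exists; return True if a file was removed.

    Assumption: the binary cache is the text file's name plus '.bin'.
    """
    cache = Path(str(text_path) + '.bin')
    if cache.exists():
        cache.unlink()
        return True
    return False


# Demonstration in a throwaway directory with placeholder files.
workdir = Path(tempfile.mkdtemp())
model = workdir / 'my_cws_model.txt'
model.write_text('placeholder', encoding='utf-8')
Path(str(model) + '.bin').write_bytes(b'\x00')

removed_first = remove_stale_cache(model)   # cache present, gets deleted
removed_second = remove_stale_cache(model)  # nothing left to delete
print(removed_first, removed_second)
```

The text file itself is left alone. Only the cached binary is removed, so it can be rebuilt from the current text on the next run.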

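Many thanks, Mr. He! My takeaways:
1. The next time Chinese text fails to decode or comes out garbled when read from a file, consider adding encoding='utf-8' to the open(...) call.
2. When the output differs, think through the logic the code follows and check the code involved, especially any custom resources it loads.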
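
To illustrate the first takeaway, here is a rough sketch that reproduces the same kind of failure on any machine. The file name sample.txt and the helper can_read are made up for the demo. It writes one Chinese character as UTF-8, then tries to read the file back as GBK and as UTF-8. The first line printed is the encoding that open() falls back to when none is given, which is presumably where the 'gbk' in the tracebacks above came from.

```python
import locale
import os
import tempfile

# The encoding open() uses when no encoding argument is passed.
default_encoding = locale.getpreferredencoding(False)
print('default encoding:', default_encoding)

# Hypothetical sample file containing a single UTF-8 encoded character.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('中')


def can_read(file_path, encoding):
    """Return True if the whole file decodes with the given encoding."""
    try:
        with open(file_path, encoding=encoding) as src:
            src.read()
        return True
    except UnicodeDecodeError:
        return False


reads_as_gbk = can_read(path, 'gbk')
reads_as_utf8 = can_read(path, 'utf-8')
print('gbk:', reads_as_gbk, 'utf-8:', reads_as_utf8)
```

Longer UTF-8 text does not always raise an error under GBK. Sometimes it decodes into the wrong characters instead, which would explain why some scripts crashed while others only showed garbled text.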