Question about exporting bag-of-words vectors

EricLiu1994 · July 27, 2020, 10:51am

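A question for the author: on page 299, in the text clustering walkthrough, is there a way to print the resulting bag-of-words vector table (Table 10-1) to the terminal? I looked through the ClusterAnalyzer source, and it seems addDocument returns an object that cannot be printed or displayed in the terminal directly.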

hankcs · August 1, 2020, 12:53am

addDocument returns a Document, and all of its members are protected. You need to subclass ClusterAnalyzer and implement this functionality yourself.

# Compile your own Java class before importing pyhanlp; the output goes into pyhanlp/static automatically
import os

from pyhanlp.static import STATIC_ROOT, HANLP_JAR_PATH

java_code_path = os.path.join(STATIC_ROOT, 'MyClusterAnalyzer.java')
with open(java_code_path, 'w') as out:
    java_code = """
import com.hankcs.hanlp.mining.cluster.ClusterAnalyzer;
import com.hankcs.hanlp.mining.cluster.SparseVector;

public class MyClusterAnalyzer<K> extends ClusterAnalyzer<K>
{
    public SparseVector toVector(String document)
    {
        return toVector(preprocess(document));
    }
}
"""
    out.write(java_code)
os.system('javac -cp {} {} -d {}'.format(HANLP_JAR_PATH, java_code_path, STATIC_ROOT))
# Only start hanlp after compilation has finished
from pyhanlp import *

ClusterAnalyzer = JClass('MyClusterAnalyzer')

if __name__ == '__main__':
    analyzer = ClusterAnalyzer()
    vec = analyzer.toVector("古典, 古典, 古典, 古典, 古典, 古典, 古典, 古典, 摇滚")
    print(vec)
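
For intuition about what `print(vec)` is meant to show, here is a minimal pure-Python sketch of a bag-of-words vector: each distinct word gets an integer id, and the vector maps id to count. It is an illustration only, not HanLP's implementation. The helper name `bag_of_words`, the comma-based tokenization, and the first-seen id assignment are all assumptions made for this example.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map text to a sparse {word_id: count} dict, growing vocabulary as needed.

    Hypothetical helper: tokens are split on commas here, whereas HanLP
    runs a real segmenter and may filter or weight terms differently.
    """
    tokens = [t.strip() for t in text.split(',') if t.strip()]
    for token in tokens:
        # Assign the next free integer id to each unseen word.
        vocabulary.setdefault(token, len(vocabulary))
    counts = Counter(vocabulary[t] for t in tokens)
    return dict(sorted(counts.items()))


if __name__ == '__main__':
    vocab = {}
    vec = bag_of_words("古典, 古典, 古典, 古典, 古典, 古典, 古典, 古典, 摇滚", vocab)
    print(vocab)  # {'古典': 0, '摇滚': 1}
    print(vec)    # {0: 8, 1: 1}
```

The ids and values printed by the Java `SparseVector` may not match this sketch, since they depend on HanLP's own vocabulary and preprocessing.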

EricLiu1994 · August 14, 2020, 3:58am

Dear author, when I ran the code above I got the error java.lang.NoClassDefFoundError: MyClusterAnalyzer. How should I deal with this?

EricLiu1994 · August 14, 2020, 4:00am

I can already see the MyClusterAnalyzer java file in my local pyhanlp/static folder, but the error still occurs.
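
The thread does not record a resolution. One plausible cause, though not confirmed here, is that the `.java` source was written but `javac` never produced `MyClusterAnalyzer.class`: `os.system` ignores a failed compile, a missing `javac`, and paths containing spaces. The sketch below makes those failures visible. The helper names `class_file_path` and `compile_java` are hypothetical, and it assumes, as the original script does, that pyhanlp puts `STATIC_ROOT` on the JVM classpath.

```python
import os
import shutil
import subprocess


def class_file_path(out_dir, class_name):
    """Path where `javac -d out_dir` places a top-level class in the default package."""
    return os.path.join(out_dir, class_name + '.class')


def compile_java(source_path, classpath, out_dir, class_name):
    """Compile source_path and raise a readable error if no .class file results.

    Hypothetical helper, not part of pyhanlp.
    """
    javac = shutil.which('javac')
    if javac is None:
        raise RuntimeError('javac not found on PATH; a JDK (not only a JRE) is needed')
    # Passing arguments as a list avoids quoting problems with spaces in paths.
    result = subprocess.run(
        [javac, '-cp', classpath, '-d', out_dir, source_path],
        capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError('javac failed:\n' + result.stderr)
    target = class_file_path(out_dir, class_name)
    if not os.path.isfile(target):
        raise RuntimeError('expected class file is missing: ' + target)
    return target
```

In the script above, this would replace the `os.system(...)` line with `compile_java(java_code_path, HANLP_JAR_PATH, STATIC_ROOT, 'MyClusterAnalyzer')`. If it raises, the message should indicate whether the JDK is missing or the Java source failed to compile.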