Applying HanLP to Model Industry Codes from Business Scope Descriptions, Part 1: Project Proposal

jean2020 · March 19, 2024, 7:25am

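This study uses business registration data together with HanLP's word segmentation, part-of-speech (POS) tagging, constituency parsing, and semantic role labeling to build a classification model that maps a company's registered business scope to its industry classification. This first article lays out the research approach and completes the data preprocessing, and I am sharing it here.
With large language models such as ChatGPT booming, traditional NLP tools like HanLP still have scenarios where they fit. ChatGPT-style models suit natural-language interfaces for open-ended tasks, such as chat, AI customer service, and human-computer interaction. Traditional NLP techniques suit specific natural-language analysis tasks. In this project, for example, a company's business scope consists of highly structured phrases, and a specific task like this calls for combining NLP with other techniques.
Article: "Applying HanLP to Model Industry Codes from Business Scope Descriptions, Part 1: Project Proposal" (《应用HanLP从企业经营范围到行业代码建模之一开题》).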

Hiki · March 19, 2024, 2:46pm

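Great work! I picked up some ideas from your earlier study on invoice goods-and-services names. I am currently looking into using HanLP for the semantic-understanding part of question answering over our company's internal business data, and I hope to learn more from you.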

jean2020 · March 20, 2024, 3:01am

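Let's keep in touch and learn from each other!
I saw that the boom in ChatGPT-style large models would hit the traditional NLP industry hard, so whenever I find a possible real-world application of traditional NLP, I share it. As I see it, traditional techniques such as word segmentation and POS tagging are the foundation of the large models, which build on them at a higher level by training very large deep neural networks (Transformers), so the two should complement each other.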

Hiki · March 21, 2024, 2:35am

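You are too modest. I am a beginner and still have a lot to learn from you. One small question: have you ever fine-tuned a word segmentation model? Our private domain has so many specialized terms that endlessly adding entries to a custom dictionary does not feel like a reliable path, so I would like to get domain-specific segmentation through model training.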

jean2020 · March 21, 2024, 3:48am

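I have not done that yet, because:
1. I do not yet have annotated data for private-domain terminology. Producing it takes considerable manpower, or has to be outsourced to a professional data-annotation company, so it needs both data and funding.
2. I have not yet fine-tuned a pretrained deep-learning NLP model, although HanLP provides tutorials for it.
So the solution I proposed only explores whether this technical route is feasible, though I do think it is.
For vertical, industry-specific applications, fine-tuning the model on an annotated domain corpus is indeed the first choice. A user-defined dictionary can then serve as a stopgap between two rounds of fine-tuning, or handle the specialized terms that the fine-tuned model still gets wrong.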

Hiki · March 28, 2024, 1:40am

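Hello. Have you added a large private-domain dictionary? I have roughly 6 million words, and once the custom dictionary is loaded, memory usage is close to 30 GB, which hurts performance badly. Do you have a better way to load a custom dictionary?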

jean2020 · March 28, 2024, 1:51am

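No. Mine is only a conceptual modeling study, and I do not yet have the resources for real iterative development. With 6 million words, roughly how many entries does the custom dictionary have? Would fine-tuning on an industry corpus be more suitable? Also, Python dicts and pandas DataFrames carry a lot of extra memory overhead, especially for the large numbers of short strings that NLP steps such as segmentation and POS tagging produce. Of your 30 GB, the business vocabulary itself may well be under 6 GB, and Python cannot promptly reclaim memory that the underlying C++ libraries manage themselves, so memory usage stays high. Consider writing intermediate results to Parquet files during analysis and reading them back in batches with PyArrow. That saves a lot of memory and amounts to a serial, single-machine MapReduce. See my updated Zhihu article for details. For example: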

# Load data ---------------------------------------------------------------------------------
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
import time, os, gc
import json
import psutil
import objgraph
import numpy as np
import pandas as pd
import re
from igraph import *
import matplotlib.pyplot as plt

parquet_file_path = 'D:/temp/data/工商登记/gsdj_parsed.parquet'

# global referenced data, e.g., id, com_industry, and so on.
t1 = time.time()
table = pq.read_table(parquet_file_path, columns=['id','com_industry'])
df = table.to_pandas()
del table
gc.collect()
t2 = time.time()
print(t2-t1)

# Read the Parquet file with Dask, load a little subset for testing.
ddf = dd.read_parquet(parquet_file_path, columns=['id','com_industry','parse_results'])
# Compute and get the first 100 rows as a pandas DataFrame
df1 = ddf.head(100)

# global cumulative results, some analysis such as word counts, and so on.
cumulative_results={'wordsCount':{},'posCount':{}, 'action':{}, 'target':{}}


# Adjust these values based on your dataset and system's memory capacity
batch_size = 100000  

# Function to update cumulative results based on parsed JSON and possibly using global references
def update_cumulative_results(parsed_data, i, j):
    # Global row index of the first record in this chunk: batch i, offset j within the batch.
    # This assumes every batch before batch i held exactly batch_size rows.
    start_index = i * batch_size + j
    end_index = start_index + len(parsed_data)
    print(f'Processing {start_index} to {end_index-1}.')
    # Analyze each record and update the cumulative results.
    for k, dic in enumerate(parsed_data):
        myanalysis(dic, start_index + k)

# print(df1.iloc[0,2])
#dic = result2dic(df1.iloc[0,2])
#dic = result2dic(df1.iloc[3,2])

# Do some analysis: count words, POS tags, verbs (business activities) and nouns (business targets)
def myanalysis(dic, index):
    wordsCount = cumulative_results['wordsCount']
    posCount = cumulative_results['posCount']
    actions = cumulative_results['action']
    targets = cumulative_results['target']
    # A few records have an empty business scope or no parse result; skip them.
    try:
        words = dic['tok']
        poss = dic['pos']
        for word, pos in zip(words, poss):
            wordsCount[word] = wordsCount.get(word, 0) + 1
            posCount[pos] = posCount.get(pos, 0) + 1
            # Business activities are verbs: collect words whose tag contains 'V'.
            if 'V' in pos:
                actions[word] = actions.get(word, 0) + 1
            # Business targets are nouns: collect words whose tag contains 'N'.
            if 'N' in pos:
                targets[word] = targets.get(word, 0) + 1
    except (KeyError, TypeError):
        pass
            

# Process Parquet in batches, but only for the JSON strings column this time
t1 = time.time()
parquet_file = pq.ParquetFile(parquet_file_path)
for i, batch in enumerate(parquet_file.iter_batches(batch_size=batch_size, columns=['parse_results'])):
    table = pa.Table.from_batches([batch])
    json_strs = table.column('parse_results').to_pylist()

    # Inner batch processing for JSON strings, similar to before
    chunk_size = 10000  # Adjust based on the complexity of JSON and available memory
    for j in range(0, len(json_strs), chunk_size):
        chunk = json_strs[j:j+chunk_size]
        # result2dic is a helper that is not shown in this post; it turns one
        # parse_results JSON string into a dict with 'tok' and 'pos' keys.
        parsed_chunk = [result2dic(item) for item in chunk]
        
        # Update cumulative results with data from this chunk
        update_cumulative_results(parsed_chunk, i, j)
        
        # Memory management
        del chunk, parsed_chunk
        # gc.collect()
        t2 = time.time()
        print(f'Batch {i} chunk {j // chunk_size} processed, {np.round(t2-t1)} seconds.')

    # Memory management
    del table, json_strs
    gc.collect()

# Verb (business activity) counts
actions = pd.DataFrame({'action':cumulative_results['action'].keys(), 'count':cumulative_results['action'].values()})
actions.sort_values(by=['count'], ascending=False, inplace = True)
actions.reset_index(drop = True, inplace = True)
actions.to_csv("D:/temp/data/工商登记/verbs.csv", header=True, index=False, encoding='utf-8')

# Noun (business target) counts
targets = pd.DataFrame({'target':cumulative_results['target'].keys(), 'count':cumulative_results['target'].values()})
targets.sort_values(by=['count'], ascending=False, inplace = True)
targets.reset_index(drop = True, inplace = True)
targets.to_csv("D:/temp/data/工商登记/targets.csv", header=True, index=False, encoding='utf-8')

# POS tag counts
poss = pd.DataFrame({'posCount':cumulative_results['posCount'].keys(), 'count':cumulative_results['posCount'].values()})
poss.sort_values(by=['count'], ascending=False, inplace = True)
poss.reset_index(drop = True, inplace = True)


# Matplotlib settings for Chinese text
plt.rcParams['font.sans-serif']=['SimHei'] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False # render the minus sign correctly
fig, ax = plt.subplots(figsize=(12, 16)) # width and height in inches
ax.bar(list(poss['posCount']), list(poss['count']))
ax.set_ylabel('词频') # "word frequency"
ax.set_title('词性标注统计') # "POS tag statistics"
plt.xticks(rotation=90)
plt.show()
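
The short-string overhead mentioned at the top of this reply can be checked roughly on any machine. The following is only a minimal sketch: it relies on `sys.getsizeof`, whose numbers are specific to CPython and vary by version, and both the `measure` helper and the sample words are made up for illustration rather than taken from the registration data.

```python
import sys

def measure(words):
    """Return (raw UTF-8 bytes, bytes held by the str objects, bytes held by the dict itself)."""
    # Size of the text alone, as it would sit in a compact on-disk encoding.
    raw = sum(len(w.encode('utf-8')) for w in words)
    # Size of the same text as individual Python str objects (header + payload).
    as_objects = sum(sys.getsizeof(w) for w in words)
    # Size of a counting dict's own hash table, excluding its keys and values.
    counts = dict.fromkeys(words, 0)
    table = sys.getsizeof(counts)
    return raw, as_objects, table

# Hypothetical sample of segmented words.
sample_words = ['销售', '建材', '批发', '零售', '技术服务', '货物进出口']
raw, as_objects, table = measure(sample_words)
print(f'raw text: {raw} B, str objects: {as_objects} B, dict table: {table} B')
```

For short tokens like these, the per-object header dominates the payload, which is why millions of small strings in dicts or DataFrames can take several times the space of the text itself.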

Hiki · March 28, 2024, 1:58am

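My earlier idea was also to fine-tune a model for our private domain, but as you said above, annotation takes a great deal of manpower, which is a real headache. We are only practitioners with limited resources. Thank you. I will see whether there are other ways to handle the custom dictionary.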

jean2020 · March 28, 2024, 2:06am

See the updated reply above. That is the approach I am using in my ongoing research.
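
Stripped of the file handling, the batch approach in that reply is a map step (count within one chunk) followed by a reduce step (merge the partial counts). The sketch below shows only that pattern, using the standard library and `collections.Counter` in place of hand-written dict updates. The function names and sample records are hypothetical, and the `{'tok': [...], 'pos': [...]}` record layout is an assumption inferred from the script above. With the real data, the input strings would presumably come from `ParquetFile.iter_batches(...)` followed by `to_pylist()`.

```python
import json
from collections import Counter
from itertools import islice

def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def count_chunk(json_strs):
    """Map step: parse one chunk and return its partial word and POS-tag counts."""
    words, tags = Counter(), Counter()
    for s in json_strs:
        try:
            doc = json.loads(s)
            toks, poss = doc['tok'], doc['pos']
        except (TypeError, ValueError, KeyError):
            continue  # empty business scope or missing parse result
        words.update(toks)
        tags.update(poss)
    return words, tags

def count_all(json_strs, chunk_size=2):
    """Reduce step: merge the partial counts of every chunk into running totals."""
    total_words, total_tags = Counter(), Counter()
    for chunk in iter_chunks(json_strs, chunk_size):
        words, tags = count_chunk(chunk)
        total_words.update(words)
        total_tags.update(tags)
    return total_words, total_tags

# Hypothetical sample records in the assumed layout.
sample = [
    json.dumps({'tok': ['销售', '建材'], 'pos': ['VV', 'NN']}),
    None,  # a record with no parse result
    json.dumps({'tok': ['销售', '家具'], 'pos': ['VV', 'NN']}),
]
total_words, total_tags = count_all(sample)
print(total_words.most_common(1), dict(total_tags))
```

Only one chunk of parsed records is alive at a time, so peak memory depends on `chunk_size` and on the size of the running totals, not on the size of the whole dataset.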

Hiki · March 28, 2024, 6:37am

Thank you very much for the suggestions. I will study them carefully. Thanks again!
