Today I installed HanLP 2.1.0b27 on my laptop. The laptop has an Nvidia GPU, but `pip install hanlp` installs the CPU build of PyTorch, and `pip install hanlp[tf]` installs the CPU build of TensorFlow. So I first created an Anaconda3 virtual environment, installed tensorflow-2.6.0-gpu and torch-1.11.0+cuda11.3 into it, and only then ran `pip install hanlp`. It might be worth adding a tutorial to the documentation on how to install a GPU-enabled HanLP.
One wrinkle: the official documentation only confirms tensorflow-2.6.0-gpu against CUDA 11.2, while torch 1.11.0 only supports CUDA 11.3 and has no CUDA 11.2 build. After searching around and finding reports of people running tensorflow-2.6.0-gpu on CUDA 11.3, I went with CUDA 11.3.
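As a quick sanity check after this kind of install, the following sketch prints whether the installed PyTorch build can actually see the GPU (it only assumes torch may or may not be importable in the environment):

```python
import importlib.util

def has_package(name: str) -> bool:
    """Check whether a package is importable, without actually importing it."""
    return importlib.util.find_spec(name) is not None

if has_package("torch"):
    import torch
    # True only for a CUDA-enabled build that can see a visible GPU.
    print("CUDA available:", torch.cuda.is_available())
else:
    print("torch is not installed in this environment")
```

If this prints `CUDA available: False` despite a working driver, the CPU wheel most likely overwrote the GPU one.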
Next I tried running some models offline: run them online once so they download into the local %HANLP_HOME%, then run them offline. I found that some models reference transformer models hosted on https://huggingface.co/ and still try to download them on every run. Some deployment scenarios require running inside an intranet, so the Native API needs to work fully offline. Referring to this post, I downloaded those transformer models into the local %HANLP_HOME%, then edited the model configs under %HANLP_HOME% so that their transformer references point to the local copies, which solved the problem. I would suggest adding a switch or parameter in the product design so that, when the Native API runs offline, the user can point it at a local directory for the referenced models. For example:
import hanlp
syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB9_DEP_ELECTRA_SMALL)
The model's config file is dep/ctb9_dep_electra_small_20220216_100306/config.json:
{
"average_subword": true,
"average_subwords": false,
"batch_size": null,
"decay": 0.75,
"decay_steps": 5000,
"embed_dropout": 0.33,
"encoder_lr": 0.0001,
"epochs": 30,
"epsilon": 1e-12,
"feat": null,
"finetune": false,
"grad_norm": 1.0,
"gradient_accumulation": 2,
"hidden_dropout": 0.33,
"layer_dropout": 0,
"lowercase": false,
"lr": 0.001,
"max_sequence_length": 512,
"min_freq": 2,
"mlp_dropout": 0.33,
"mu": 0.9,
"n_embed": 100,
"n_lstm_hidden": 400,
"n_lstm_layers": 3,
"n_mlp_arc": 500,
"n_mlp_rel": 100,
"n_rels": 46,
"nu": 0.9,
"patience": 30,
"pretrained_embed": null,
"proj": true,
"punct": true,
"sampler_builder": {
"classpath": "hanlp.common.dataset.SortingSamplerBuilder",
"use_effective_tokens": false,
"batch_max_tokens": null,
"batch_size": 32
},
"scalar_mix": null,
"secondary_encoder": null,
"seed": 1644964061,
"separate_optimizer": false,
"transform": {
"classpath": "hanlp.common.transform.NormalizeCharacter",
"dst": "token",
"src": "token",
"mapper": "https://file.hankcs.com/hanlp/utils/char_table_20210602_202632.json.zip"
},
"transformer": "hfl/chinese-electra-180g-small-discriminator",
"transformer_hidden_dropout": null,
"transformer_lr": 5e-05,
"tree": true,
"unk": "<unk>",
"warmup_steps": 0.1,
"weight_decay": 0,
"word_dropout": 0.1,
"classpath": "hanlp.components.parsers.biaffine.biaffine_dep.BiaffineDependencyParser",
"hanlp_version": "2.1.0-beta.15"
}
After changing the transformer reference as below, the model loads offline:
{
...
"transformer": "D:/HanLP/transformers/hfl/chinese-electra-180g-small-discriminator",
...
}
(All other keys are identical to the config above; only the "transformer" entry changes from the huggingface.co repo id to a local path.)
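The manual edit described above can also be scripted. A minimal sketch using only the standard library (the function name and the example paths are hypothetical; adapt them to your own %HANLP_HOME% layout):

```python
import json
from pathlib import Path

def point_transformer_to_local(config_path, local_dir):
    """Rewrite the 'transformer' key in a HanLP model's config.json
    so it points at a local directory instead of a huggingface.co repo id."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["transformer"] = str(local_dir)
    path.write_text(json.dumps(config, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return config

# Hypothetical usage, with paths matching the example above:
# point_transformer_to_local(
#     r"D:\HanLP\dep\ctb9_dep_electra_small_20220216_100306\config.json",
#     r"D:\HanLP\transformers\hfl\chinese-electra-180g-small-discriminator",
# )
```

A product-level switch could do the same thing at load time instead of mutating files on disk.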
Just a small suggestion that might make the product a bit easier to use.