A small question for discussion: running Native API pretrained models offline

Today I installed HanLP 2.1.0b27 on my laptop. The laptop has an Nvidia GPU, but pip install hanlp installs the CPU build of PyTorch, and pip install hanlp[tf] installs the CPU build of TensorFlow. So I first created an anaconda3 virtual environment, installed tensorflow-2.6.0-gpu and torch-1.11.0+cuda11.3 into it, and only then ran pip install hanlp. It might be worth adding a tutorial to the documentation on how to install a GPU-enabled HanLP.
One wrinkle: the official docs only confirm tensorflow-2.6.0-gpu against CUDA 11.2, while torch 1.11.0 supports CUDA 11.3 and has no CUDA 11.2 build. After searching around and finding reports of people running tensorflow-2.6.0-gpu on CUDA 11.3, I went with CUDA 11.3.
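For anyone in the same situation, here is a small sanity check I would run after pip install hanlp to confirm the GPU builds survived. This is only a sketch of my own (the function name is mine, not HanLP's), and it degrades gracefully when a framework isn't installed:

```python
# Sanity-check sketch: report whether the installed PyTorch/TensorFlow
# builds can actually see a GPU. A value of None means the framework
# is not installed at all; False means a CPU-only build (or no GPU).
def gpu_status():
    status = {}
    try:
        import torch
        status["torch"] = torch.cuda.is_available()
    except ImportError:
        status["torch"] = None
    try:
        import tensorflow as tf
        status["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        status["tensorflow"] = None
    return status

print(gpu_status())
```

If either value comes back False after installing HanLP, pip probably replaced the GPU build with a CPU wheel during dependency resolution.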
Next I tried running some models offline: run each one online once so it is downloaded into the local %HANLP_HOME%, then run it offline. It turns out that some models reference transformer models hosted on https://huggingface.co/ and still try to download them on every run. Some deployment scenarios require running inside an intranet, so the Native API needs to work fully offline. Following this post, I downloaded those transformer models into the local %HANLP_HOME% and then edited the models' references so they point at the local copies, which solved the problem. My suggestion is to add a switch or parameter to the product so that, when the Native API runs offline, you can specify the local directory from which referenced models are loaded. For example:

import hanlp
syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB9_DEP_ELECTRA_SMALL)

dep/ctb9_dep_electra_small_20220216_100306/config.json

{
  "average_subword": true,
  "average_subwords": false,
  "batch_size": null,
  "decay": 0.75,
  "decay_steps": 5000,
  "embed_dropout": 0.33,
  "encoder_lr": 0.0001,
  "epochs": 30,
  "epsilon": 1e-12,
  "feat": null,
  "finetune": false,
  "grad_norm": 1.0,
  "gradient_accumulation": 2,
  "hidden_dropout": 0.33,
  "layer_dropout": 0,
  "lowercase": false,
  "lr": 0.001,
  "max_sequence_length": 512,
  "min_freq": 2,
  "mlp_dropout": 0.33,
  "mu": 0.9,
  "n_embed": 100,
  "n_lstm_hidden": 400,
  "n_lstm_layers": 3,
  "n_mlp_arc": 500,
  "n_mlp_rel": 100,
  "n_rels": 46,
  "nu": 0.9,
  "patience": 30,
  "pretrained_embed": null,
  "proj": true,
  "punct": true,
  "sampler_builder": {
    "classpath": "hanlp.common.dataset.SortingSamplerBuilder",
    "use_effective_tokens": false,
    "batch_max_tokens": null,
    "batch_size": 32
  },
  "scalar_mix": null,
  "secondary_encoder": null,
  "seed": 1644964061,
  "separate_optimizer": false,
  "transform": {
    "classpath": "hanlp.common.transform.NormalizeCharacter",
    "dst": "token",
    "src": "token",
    "mapper": "https://file.hankcs.com/hanlp/utils/char_table_20210602_202632.json.zip"
  },
  "transformer": "hfl/chinese-electra-180g-small-discriminator",
  "transformer_hidden_dropout": null,
  "transformer_lr": 5e-05,
  "tree": true,
  "unk": "<unk>",
  "warmup_steps": 0.1,
  "weight_decay": 0,
  "word_dropout": 0.1,
  "classpath": "hanlp.components.parsers.biaffine.biaffine_dep.BiaffineDependencyParser",
  "hanlp_version": "2.1.0-beta.15"
}

Changing the transformer reference as follows makes it load from the local disk, offline:

(only the "transformer" field changes; all other fields stay exactly as above)

{
  ...
  "transformer": "D:/HanLP/transformers/hfl/chinese-electra-180g-small-discriminator",
  ...
}
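The manual edit can also be scripted. This is just a sketch using the standard library (the helper name and the paths in it are my own, not part of HanLP); run it once per downloaded model under %HANLP_HOME% before going offline:

```python
import json
from pathlib import Path

def point_transformer_to_local(config_path, local_dir):
    """Rewrite the `transformer` field of a model's config.json
    so it points at a local directory instead of a Hugging Face
    hub name like `hfl/chinese-electra-180g-small-discriminator`."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["transformer"] = str(local_dir)
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False),
                           encoding="utf-8")
```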

Just a small suggestion that might make the product a bit easier to use.

One more small issue, though I'm not sure whether it's specific to my environment: if I load a PyTorch pretrained model first and then a TensorFlow pretrained model, it fails; loading either one on its own works fine.

import hanlp
# Used to mount a user-defined dictionary for tokenization.
# Invoice goods/services names contain many proper nouns that rarely appear
# in the training corpora, so tokenization alone is not accurate enough and
# an invoice-specific custom dictionary is needed.
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
# Used for POS tagging on top of the custom dictionary above. Extracting
# goods/services names requires POS tags; these names are clearly nouns,
# forming noun phrases.
from hanlp.components.mtl.tasks.pos import TransformerTagging
# These two are PyTorch models
tokenizer: TransformerTaggingTokenizer = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
tagger: TransformerTagging = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)
# This one is a TensorFlow model
tokenizer = hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE)

The error message:

ImportError: 
TFAutoModel requires the TensorFlow library but it was not found in your environment. Checkout the instructions on the
installation page: https://www.tensorflow.org/install and follow the ones that match your environment.

I haven't dug into the source code yet, so for now I don't know the cause of the error.
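Until the root cause is known, one generic workaround is process isolation: run the TensorFlow model in its own interpreter so the two frameworks never share in-process state. This is only a sketch under that assumption; the commented lines stand in for the real hanlp calls:

```python
import subprocess
import sys

# Placeholder for the real TensorFlow-side code, e.g.
# hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE) plus tokenization.
tf_script = """
# import hanlp
# tokenizer = hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE)
print("tf model ran in its own process")
"""

# Run the TensorFlow pipeline in a fresh interpreter; the parent process
# keeps only the PyTorch models loaded.
result = subprocess.run([sys.executable, "-c", tf_script],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

In a real deployment the child script would write its results to a file or pipe instead of stdout, but the idea is the same.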

O.K., I just found the section on Hugging Face :hugs: Transformers models in the installation docs:

export TRANSFORMERS_OFFLINE=1
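In a Python script, that variable has to be set before transformers (and therefore hanlp) is imported, so something like this at the very top of the script (the variable name comes from the Hugging Face docs; the hanlp import is the usual entry point):

```python
# Set the offline switch before any transformers/hanlp import takes effect.
import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# import hanlp  # safe to import only after the variable is set
```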

Server without Internet

If your server has no Internet access at all, just debug your codes on your local PC and copy the following directories to your server via a USB disk or something.

  1. ~/.hanlp: the home directory for HanLP models.
  2. ~/.cache/huggingface: the home directory for Hugging Face :hugs: Transformers.
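Copying those two directories can be scripted as well. A minimal sketch of my own (the destination layout is my choice; on the server, put the copies back at the same paths under the home directory):

```python
import shutil
from pathlib import Path

def bundle_caches(dest_root, home=None):
    """Copy the HanLP and Hugging Face cache directories under dest_root
    (e.g. a USB disk mount point). Returns the names of what was copied."""
    home = Path(home) if home is not None else Path.home()
    copied = []
    for cache in (home / ".hanlp", home / ".cache" / "huggingface"):
        if cache.is_dir():
            shutil.copytree(cache, Path(dest_root) / cache.name,
                            dirs_exist_ok=True)
            copied.append(cache.name)
    return copied
```

Note that dirs_exist_ok requires Python 3.8+, and on the server the huggingface copy must go back under ~/.cache/, not directly under the home directory.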

Please read the demo:

O.K., it works.