A small question for discussion: running Native API pretrained models offline

Today I installed HanLP 2.1.0b27 on my laptop. The laptop has an Nvidia GPU, but pip install hanlp installs the CPU build of PyTorch, and pip install hanlp[tf] installs the CPU build of TensorFlow. So I first created an anaconda3 virtual environment, installed tensorflow-2.6.0-gpu and torch-1.11.0+cuda11.3 into it, and only then ran pip install hanlp. It might be worth adding a tutorial to the documentation on how to install a GPU-enabled HanLP.
One wrinkle: the official docs only confirm tensorflow-2.6.0-gpu against CUDA 11.2, while torch 1.11.0 supports CUDA 11.3 and has no CUDA 11.2 build. After searching around and finding reports of people running tensorflow-2.6.0-gpu on CUDA 11.3, I went with CUDA 11.3.
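For anyone in the same situation, here is a small sanity check I would run after pip install hanlp to confirm the GPU builds survived. This is only a sketch of my own (the function name is mine, not HanLP's), and it degrades gracefully when a framework isn't installed:

```python
# Sanity-check sketch: report whether the installed PyTorch/TensorFlow
# builds can actually see a GPU. A value of None means the framework
# is not installed at all; False means a CPU-only build (or no GPU).
def gpu_status():
    status = {}
    try:
        import torch
        status["torch"] = torch.cuda.is_available()
    except ImportError:
        status["torch"] = None
    try:
        import tensorflow as tf
        status["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        status["tensorflow"] = None
    return status

print(gpu_status())
```

If either value comes back False after installing HanLP, pip probably replaced the GPU build with a CPU wheel during dependency resolution.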
Next I tried running some models offline: run each one online once so it is downloaded into the local %HANLP_HOME%, then run it offline. It turns out that some models reference transformer models hosted on https://huggingface.co/ and still try to download them on every run. Some deployment scenarios require running inside an intranet, so the Native API needs to work fully offline. Following this post, I downloaded those transformer models into the local %HANLP_HOME% and then edited the models' references so they point at the local copies, which solved the problem. My suggestion is to add a switch or parameter to the product so that, when the Native API runs offline, you can specify the local directory from which referenced models are loaded. For example:

import hanlp
syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB9_DEP_ELECTRA_SMALL)

dep/ctb9_dep_electra_small_20220216_100306/config.json

{
  "average_subword": true,
  "average_subwords": false,
  "batch_size": null,
  "decay": 0.75,
  "decay_steps": 5000,
  "embed_dropout": 0.33,
  "encoder_lr": 0.0001,
  "epochs": 30,
  "epsilon": 1e-12,
  "feat": null,
  "finetune": false,
  "grad_norm": 1.0,
  "gradient_accumulation": 2,
  "hidden_dropout": 0.33,
  "layer_dropout": 0,
  "lowercase": false,
  "lr": 0.001,
  "max_sequence_length": 512,
  "min_freq": 2,
  "mlp_dropout": 0.33,
  "mu": 0.9,
  "n_embed": 100,
  "n_lstm_hidden": 400,
  "n_lstm_layers": 3,
  "n_mlp_arc": 500,
  "n_mlp_rel": 100,
  "n_rels": 46,
  "nu": 0.9,
  "patience": 30,
  "pretrained_embed": null,
  "proj": true,
  "punct": true,
  "sampler_builder": {
    "classpath": "hanlp.common.dataset.SortingSamplerBuilder",
    "use_effective_tokens": false,
    "batch_max_tokens": null,
    "batch_size": 32
  },
  "scalar_mix": null,
  "secondary_encoder": null,
  "seed": 1644964061,
  "separate_optimizer": false,
  "transform": {
    "classpath": "hanlp.common.transform.NormalizeCharacter",
    "dst": "token",
    "src": "token",
    "mapper": "https://file.hankcs.com/hanlp/utils/char_table_20210602_202632.json.zip"
  },
  "transformer": "hfl/chinese-electra-180g-small-discriminator",
  "transformer_hidden_dropout": null,
  "transformer_lr": 5e-05,
  "tree": true,
  "unk": "<unk>",
  "warmup_steps": 0.1,
  "weight_decay": 0,
  "word_dropout": 0.1,
  "classpath": "hanlp.components.parsers.biaffine.biaffine_dep.BiaffineDependencyParser",
  "hanlp_version": "2.1.0-beta.15"
}

Changing the transformer reference as follows makes it load from the local disk, offline:

(only the "transformer" field changes; all other fields stay exactly as above)

{
  ...
  "transformer": "D:/HanLP/transformers/hfl/chinese-electra-180g-small-discriminator",
  ...
}
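The manual edit can also be scripted. This is just a sketch using the standard library (the helper name and the paths in it are my own, not part of HanLP); run it once per downloaded model under %HANLP_HOME% before going offline:

```python
import json
from pathlib import Path

def point_transformer_to_local(config_path, local_dir):
    """Rewrite the `transformer` field of a model's config.json
    so it points at a local directory instead of a Hugging Face
    hub name like `hfl/chinese-electra-180g-small-discriminator`."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["transformer"] = str(local_dir)
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False),
                           encoding="utf-8")
```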

Just a small suggestion that might make the product a bit easier to use.

One more small issue, though I'm not sure whether it's specific to my environment: if I load a PyTorch pretrained model first and then a TensorFlow pretrained model, it fails; loading either one on its own works fine.

import hanlp
# Used to mount a user-defined dictionary for tokenization.
# Invoice goods/services names contain many proper nouns that rarely appear
# in the training corpora, so tokenization alone is not accurate enough and
# an invoice-specific custom dictionary is needed.
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
# Used for POS tagging on top of the custom dictionary above. Extracting
# goods/services names requires POS tags; these names are clearly nouns,
# forming noun phrases.
from hanlp.components.mtl.tasks.pos import TransformerTagging
# These two are PyTorch models
tokenizer: TransformerTaggingTokenizer = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
tagger: TransformerTagging = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)
# This one is a TensorFlow model
tokenizer = hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE)

The error message:

ImportError: 
TFAutoModel requires the TensorFlow library but it was not found in your environment. Checkout the instructions on the
installation page: https://www.tensorflow.org/install and follow the ones that match your environment.

I haven't dug into the source code yet, so for now I don't know the cause of the error.
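Until the root cause is known, one generic workaround is process isolation: run the TensorFlow model in its own interpreter so the two frameworks never share in-process state. This is only a sketch under that assumption; the commented lines stand in for the real hanlp calls:

```python
import subprocess
import sys

# Placeholder for the real TensorFlow-side code, e.g.
# hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE) plus tokenization.
tf_script = """
# import hanlp
# tokenizer = hanlp.load(hanlp.pretrained.tok.LARGE_ALBERT_BASE)
print("tf model ran in its own process")
"""

# Run the TensorFlow pipeline in a fresh interpreter; the parent process
# keeps only the PyTorch models loaded.
result = subprocess.run([sys.executable, "-c", tf_script],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

In a real deployment the child script would write its results to a file or pipe instead of stdout, but the idea is the same.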

O.K., I just found the section on Hugging Face :hugs: Transformers models in the installation docs:

export TRANSFORMERS_OFFLINE=1
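In a Python script, that variable has to be set before transformers (and therefore hanlp) is imported, so something like this at the very top of the script (the variable name comes from the Hugging Face docs; the hanlp import is the usual entry point):

```python
# Set the offline switch before any transformers/hanlp import takes effect.
import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# import hanlp  # safe to import only after the variable is set
```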

Server without Internet

If your server has no Internet access at all, just debug your codes on your local PC and copy the following directories to your server via a USB disk or something.

  1. ~/.hanlp: the home directory for HanLP models.
  2. ~/.cache/huggingface: the home directory for Hugging Face :hugs: Transformers.
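Copying those two directories can be scripted as well. A minimal sketch of my own (the destination layout is my choice; on the server, put the copies back at the same paths under the home directory):

```python
import shutil
from pathlib import Path

def bundle_caches(dest_root, home=None):
    """Copy the HanLP and Hugging Face cache directories under dest_root
    (e.g. a USB disk mount point). Returns the names of what was copied."""
    home = Path(home) if home is not None else Path.home()
    copied = []
    for cache in (home / ".hanlp", home / ".cache" / "huggingface"):
        if cache.is_dir():
            shutil.copytree(cache, Path(dest_root) / cache.name,
                            dirs_exist_ok=True)
            copied.append(cache.name)
    return copied
```

Note that dirs_exist_ok requires Python 3.8+, and on the server the huggingface copy must go back under ~/.cache/, not directly under the home directory.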

Please read the demo:

O.K., it works.