opened 02:40AM - 17 Mar 20 UTC
closed 02:51AM - 24 Dec 21 UTC
Labels: stat:awaiting response, comp:data, type:performance, TF 2.7
**System information**
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de 2.1.0
- Python version: Python 3.6.6
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: CUDA 10.1 / cuDNN for CUDA 10.1
- GPU model and memory: TITAN RTX, 24190 MiB
**Describe the current behavior**
`tf.data.Dataset.from_generator` leaks memory on every call, even when the dataset is deleted and `gc.collect()` is called afterwards.
**Describe the expected behavior**
Memory should be released once no references to the dataset remain.
**Standalone code to reproduce the issue**
```
import gc
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # run on CPU so GPU memory is not a factor
import tensorflow as tf
import tracemalloc
import linecache


def display_top(snapshot, key_type='lineno', limit=3):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)
    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def generator():
    # The generator is never actually iterated in this repro; the leak comes
    # from constructing the dataset itself.
    yield tf.zeros([2], dtype=tf.int32)


tracemalloc.start()
for i in range(1000):
    dataset = tf.data.Dataset.from_generator(
        generator, output_types=tf.int32, output_shapes=[None])
    del dataset    # drop the only reference to the dataset
    gc.collect()   # force a collection; allocations still accumulate
    snapshot = tracemalloc.take_snapshot()
    display_top(snapshot)
```
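For a cross-check that does not depend on `tracemalloc`, process-level memory can be tracked around the same loop. The sketch below is an illustrative addition rather than part of the original report; it assumes `psutil` is installed and simply records the resident set size before and after the loop.

```
# Illustrative cross-check (not from the original report): measure process RSS
# growth around repeated from_generator calls. Assumes `pip install psutil`.
import gc
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import psutil
import tensorflow as tf


def generator():
    yield tf.zeros([2], dtype=tf.int32)


process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss

for i in range(1000):
    dataset = tf.data.Dataset.from_generator(
        generator, output_types=tf.int32, output_shapes=[None])
    del dataset
    gc.collect()

rss_after = process.memory_info().rss
# If from_generator released its per-call state, this delta should stay close
# to zero; a steadily growing delta points at the same leak.
print("RSS growth: %.1f KiB" % ((rss_after - rss_before) / 1024))
```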
**Other info / logs**
```
Top 3 lines
#1: python3.6/_weakrefset.py:84: 159.5 KiB
self.data.add(ref(item, self._remove))
#2: python3.6/_weakrefset.py:37: 38.2 KiB
self.data = set()
#3: python3.6/_weakrefset.py:48: 32.4 KiB
self._iterating = set()
461 other: 306.4 KiB
Total allocated size: 536.4 KiB
Top 3 lines
#1: python3.6/_weakrefset.py:84: 159.5 KiB
self.data.add(ref(item, self._remove))
#2: python3.6/_weakrefset.py:37: 38.2 KiB
self.data = set()
#3: python3.6/_weakrefset.py:48: 32.4 KiB
self._iterating = set()
516 other: 343.1 KiB
Total allocated size: 573.1 KiB
...
Top 3 lines
#1: python3.6/weakref.py:335: 257.8 KiB
self = ref.__new__(type, ob, callback)
#2: debug/tf_dataset_memory_leak.py:45: 189.7 KiB
dataset = tf.data.Dataset.from_generator(generator, output_types=tf.int32, output_shapes=[None])
#3: ops/script_ops.py:257: 174.7 KiB
return "pyfunc_%d" % uid
519 other: 2423.3 KiB
Total allocated size: 3045.5 KiB
```
The script above leaks about 3 MB over 1000 calls. In [some real projects](https://github.com/hankcs/HanLP/issues/1437) the leak can grow to 5 GB and keeps increasing.
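The traceback entries above (the weakref sets and the `pyfunc_%d` name in `ops/script_ops.py`) suggest that state registered per `from_generator` call is not released. A possible mitigation, offered here only as a sketch and not as a confirmed fix, is to call `from_generator` once and feed new data through a mutable source that the single generator closes over; the `source` container below is a hypothetical name introduced for illustration.

```
import tensorflow as tf

# Hypothetical workaround sketch: one from_generator call, reused many times.
source = {"data": None}  # mutable container the generator closes over


def generator():
    yield source["data"]


# Build the dataset once, outside the loop, so per-call state is registered
# only a single time.
dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.int32, output_shapes=[None])

for i in range(1000):
    source["data"] = tf.zeros([2], dtype=tf.int32)  # swap in fresh data
    for batch in dataset:  # creating a new iterator re-runs the generator
        pass               # consume the element as usual
```

This sidesteps the repeated registration rather than fixing the underlying leak, so whether it is acceptable depends on the application.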