Released a SOTA joint Chinese-English AMR model

A permutation-invariant semantic parser trained on the MRP2020 English
and Chinese AMR corpora has been released. It ranked first in the MRP2020 competition.

This release includes several enhancements to runtime speed and robustness. Please refer to the demo for further instructions: HanLP/amr_stl.ipynb at doc-zh · hankcs/HanLP · GitHub
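For a quick start, loading the released model follows the usual hanlp.load pattern. A minimal sketch, assuming the pretrained constant MRP2020_AMR_ENG_ZHO_XLM_BASE mentioned later in this thread and whitespace-tokenized input; the linked notebook is the authoritative reference:

import hanlp

# Sketch only: the constant name comes from this thread, and the
# tokenized-input calling convention is an assumption; see the notebook.
parser = hanlp.load(hanlp.pretrained.amr.MRP2020_AMR_ENG_ZHO_XLM_BASE)
graph = parser('男孩 希望 女孩 相信 他 。'.split())
print(graph)  # an MRP-style dict, as shown later in this thread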

All credit goes to the authors of the paper.


:clap: It's been a while since my last visit, and so many new models have been released! I'll test them over the next few days.

A question about the newly released MRP2020_AMR_ENG_ZHO_XLM_BASE: can we fine-tune this CAMR model on newly annotated data? It seems there is no interface for this yet.

Yes, the training interface is actually already open source; the code just lives in a separate package:

The corpus format is also rather involved: it is the MRP format, not PENMAN, and you need to follow GitHub - ufal/perin: PERIN is Permutation-Invariant Semantic Parser developed for MRP 2020 to install the various dependencies.

Thanks.
What I can find at the moment is the Chinese Abstract Meaning Representation (CAMR) 2.0 corpus in PENMAN format, which ships its data in three notations.
Among them, orgin_camr is the PENMAN format, mrp2020_camr is the MRP format, and dependency_tree_camr is the dependency-tree format, correct?

The one below is in PENMAN format.
<class 'hanlp.components.amr.seq2seq.dataset.penman.AMRGraph'>
(x2 / 希望-01
:arg1 (x4 / 相信-01
:arg0 (x3 / 女孩)
:arg1 x1)
:arg0 (x1 / 男孩))
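To inspect such a graph programmatically, the third-party penman package can decode it. A minimal sketch; using penman here is my suggestion, not necessarily what HanLP uses internally:

import penman  # pip install penman

g = penman.decode('''
(x2 / 希望-01
    :arg1 (x4 / 相信-01
        :arg0 (x3 / 女孩)
        :arg1 x1)
    :arg0 (x1 / 男孩))''')
print(g.top)          # 'x2'
print(g.instances())  # concept triples such as ('x2', ':instance', '希望-01')
print(g.edges())      # relation triples such as ('x4', ':arg0', 'x3')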

Thanks a lot.

Basically, yes.

That is still the PENMAN format; you need to write your own code to convert it. A sample of the MRP format:

Ah, I see. So this from the demo:

{'id': '0',
 'input': '男孩 希望 女孩 相信 他 。',
 'nodes': [{'id': 0, 'label': '男孩', 'anchors': [{'from': 0, 'to': 2}, {'from': 12, 'to': 13}]},
           {'id': 1, 'label': '希望-01', 'anchors': [{'from': 3, 'to': 5}]},
           {'id': 2, 'label': '女孩', 'anchors': [{'from': 6, 'to': 8}]},
           {'id': 3, 'label': '相信-01', 'anchors': [{'from': 9, 'to': 11}]}],
 'edges': [{'source': 1, 'target': 3, 'label': 'arg1'},
           {'source': 1, 'target': 0, 'label': 'arg0'},
           {'source': 3, 'target': 2, 'label': 'arg0'},
           {'source': 3, 'target': 0, 'label': 'arg1'}],
 'tops': [1],
 'framework': 'amr'}

is in the Meaning Representation Parsing (MRP) format, and it is also just a Python dict.
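Since CAMR 2.0 ships PENMAN while training expects MRP, the conversion mentioned above boils down to numbering the variables and rewriting triples as node/edge records. A minimal sketch, assuming the penman package; penman_to_mrp is a hypothetical helper, not part of HanLP or perin-parser:

import penman

def penman_to_mrp(graph_id, text, penman_str):
    # Hypothetical helper: plain PENMAN carries no character alignments,
    # so the 'anchors' fields are omitted here.
    g = penman.decode(penman_str)
    var2id = {src: i for i, (src, _, _) in enumerate(g.instances())}
    nodes = [{'id': var2id[src], 'label': concept}
             for src, _, concept in g.instances()]
    edges = [{'source': var2id[src], 'target': var2id[tgt],
              'label': role.lstrip(':')}
             for src, role, tgt in g.edges()]
    # Attribute triples (constants) are ignored in this sketch.
    return {'id': graph_id, 'input': text, 'nodes': nodes, 'edges': edges,
            'tops': [var2id[g.top]], 'framework': 'amr'}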

Thanks, hankcs. I'm very interested in AMR and have a few questions I'd like to ask.

1. Does perin-parser package PERIN up as a pip package? In PERIN's data preprocessing I see these two kinds of data:

self.validation_data = {
    ("amr", "eng"): f"{base_dir}/2020/cf/validation/amr.mrp",
    ("amr", "zho"): f"{base_dir}/2020/cl/training/amr.zho_val.mrp",
    ...
}

self.companion_data = {
    ("amr", "eng"): f"{base_dir}/2020/cf/companion/combined_udpipe.mrp",
    ("amr", "zho"): f"{base_dir}/2020/cl/companion/combined_zho.mrp",
    ...
}

Where can they be downloaded? The sample data in the demo on the LDC website doesn't seem to include either of them.

2. If I train with perin-parser, is it enough to find suitable data and then run perin_parser/preprocess.py followed by perin_parser/train.py?

3. I've read through the levi-graph-amr-parser paper and code. It appears to use LDC AMR data in PENMAN format and calls amr_parser when parsing the data. Is there any relationship between amr_parser and perin-parser?

Thanks in advance.

amr.mrp and combined_udpipe.mrp are the AMR corpus and the companion data released by the MRP2020 organizing committee; you can try applying to them for access.

Yes. Here is the script I used to train with mengzi:

# Imports are a best guess (HanLP plus the perin-parser package mentioned above);
# adjust to the actual module paths in your installation.
from perin_parser import PerinParser
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.utils.log_util import cprint

parser = PerinParser()
save_dir = 'data/model/amr-zho-mengzi-base'
cprint(f'Model will be saved in [cyan]{save_dir}[/cyan]')
parser.fit(
    training_data={
        # ('amr', 'eng'): 'data/mrp/2020/cf/training/amr.mrp',
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_train.mrp',
    },
    validation_data={
        # ('amr', 'eng'): 'data/mrp/2020/cf/validation/amr.mrp',
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_val.mrp',
    },
    test_data={
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_val.mrp',
    },
    companion_data={
        ("amr", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("amr", "zho"): f"data/mrp/2020/cl/companion/combined_zho.mrp",
        ("drg", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("drg", "deu"): f"data/mrp/2020/cl/companion/combined_deu_translated.mrp",
        ("eds", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ptg", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ptg", "ces"): f"data/mrp/2020/cl/companion/combined_ces.mrp",
        ("ucca", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ucca", "deu"): f"data/mrp/2020/cl/companion/combined_deu.mrp",
    },
    frameworks=[["amr", "zho"]],
    save_dir=save_dir,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    workers=2,
    encoder='Langboat/mengzi-bert-base',
    encoder_learning_rate=5e-5,
)
cprint(f'Model saved in [cyan]{save_dir}[/cyan]')
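Afterwards, restoring the fine-tuned model should follow HanLP's usual component API; a hedged sketch (load(save_dir) is assumed to mirror other HanLP components):

# Assumption: PerinParser exposes HanLP's standard component API,
# so the saved model can be restored from save_dir like this.
parser = PerinParser()
parser.load(save_dir)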

Not much of a relationship: the former predicts concepts autoregressively, while the latter does it in parallel. Deng Cai's paper claims that concept and edge prediction can be improved through iteration; our paper removed the iteration and achieved the same results. PERIN does not even generate concepts iteratively, which further confirms that the iteration contributes little.

Thanks, han. I later saw this conclusion of yours in the GitHub issues.

I'll try requesting the companion data from the MRP2020 organizing committee. Thanks, han.