Released a SOTA joint Chinese-English AMR model

A permutation-invariant semantic parser trained on the MRP2020 English
and Chinese AMR corpora has been released. It ranked first in the MRP2020 competition.

This release includes several enhancements to runtime speed and robustness. Please refer to the demo for further instructions: HanLP/amr_stl.ipynb at doc-zh · hankcs/HanLP · GitHub
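For a quick start, loading the released model follows the usual hanlp.load pattern. A minimal sketch, assuming the pretrained constant MRP2020_AMR_ENG_ZHO_XLM_BASE mentioned later in this thread and whitespace-tokenized input; the linked notebook is the authoritative reference:

import hanlp

# Sketch only: the constant name comes from this thread, and the
# tokenized-input calling convention is an assumption; see the notebook.
parser = hanlp.load(hanlp.pretrained.amr.MRP2020_AMR_ENG_ZHO_XLM_BASE)
graph = parser('男孩 希望 女孩 相信 他 。'.split())
print(graph)  # an MRP-style dict, as shown later in this thread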

All credit goes to the authors of the paper.


:clap: It's been a while since my last visit, and so many new models have been released! I'll test them over the next few days.

A question about the newly released MRP2020_AMR_ENG_ZHO_XLM_BASE: can we fine-tune this CAMR model on newly annotated data? It seems there is no interface for this yet.

Yes, the training interface is actually already open source; the code just lives in a separate package:

The corpus format is also rather involved: it is the MRP format, not PENMAN, and you need to follow GitHub - ufal/perin: PERIN is Permutation-Invariant Semantic Parser developed for MRP 2020 to install the various dependencies.

Thanks.
What I can find at the moment is the Chinese Abstract Meaning Representation (CAMR) 2.0 corpus in PENMAN format, which ships its data in three notations.
Among them, orgin_camr is the PENMAN format, mrp2020_camr is the MRP format, and dependency_tree_camr is the dependency-tree format, correct?

The one below is in PENMAN format.
<class 'hanlp.components.amr.seq2seq.dataset.penman.AMRGraph'>
(x2 / 希望-01
:arg1 (x4 / 相信-01
:arg0 (x3 / 女孩)
:arg1 x1)
:arg0 (x1 / 男孩))
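To inspect such a graph programmatically, the third-party penman package can decode it. A minimal sketch; using penman here is my suggestion, not necessarily what HanLP uses internally:

import penman  # pip install penman

g = penman.decode('''
(x2 / 希望-01
    :arg1 (x4 / 相信-01
        :arg0 (x3 / 女孩)
        :arg1 x1)
    :arg0 (x1 / 男孩))''')
print(g.top)          # 'x2'
print(g.instances())  # concept triples such as ('x2', ':instance', '希望-01')
print(g.edges())      # relation triples such as ('x4', ':arg0', 'x3')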

Thanks a lot.

Basically, yes.

That is still the PENMAN format; you need to write your own code to convert it. A sample of the MRP format:

Ah, I see. So this from the demo:

{'id': '0',
 'input': '男孩 希望 女孩 相信 他 。',
 'nodes': [{'id': 0, 'label': '男孩', 'anchors': [{'from': 0, 'to': 2}, {'from': 12, 'to': 13}]},
           {'id': 1, 'label': '希望-01', 'anchors': [{'from': 3, 'to': 5}]},
           {'id': 2, 'label': '女孩', 'anchors': [{'from': 6, 'to': 8}]},
           {'id': 3, 'label': '相信-01', 'anchors': [{'from': 9, 'to': 11}]}],
 'edges': [{'source': 1, 'target': 3, 'label': 'arg1'},
           {'source': 1, 'target': 0, 'label': 'arg0'},
           {'source': 3, 'target': 2, 'label': 'arg0'},
           {'source': 3, 'target': 0, 'label': 'arg1'}],
 'tops': [1],
 'framework': 'amr'}

is in the Meaning Representation Parsing (MRP) format, and it is also just a Python dict.
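Since CAMR 2.0 ships PENMAN while training expects MRP, the conversion mentioned above boils down to numbering the variables and rewriting triples as node/edge records. A minimal sketch, assuming the penman package; penman_to_mrp is a hypothetical helper, not part of HanLP or perin-parser:

import penman

def penman_to_mrp(graph_id, text, penman_str):
    # Hypothetical helper: plain PENMAN carries no character alignments,
    # so the 'anchors' fields are omitted here.
    g = penman.decode(penman_str)
    var2id = {src: i for i, (src, _, _) in enumerate(g.instances())}
    nodes = [{'id': var2id[src], 'label': concept}
             for src, _, concept in g.instances()]
    edges = [{'source': var2id[src], 'target': var2id[tgt],
              'label': role.lstrip(':')}
             for src, role, tgt in g.edges()]
    # Attribute triples (constants) are ignored in this sketch.
    return {'id': graph_id, 'input': text, 'nodes': nodes, 'edges': edges,
            'tops': [var2id[g.top]], 'framework': 'amr'}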

Thanks, hankcs. I'm very interested in AMR and have a few questions I'd like to ask.

1. Does perin-parser package PERIN up as a pip package? In PERIN's data preprocessing I see these two kinds of data:

self.validation_data = {
    ("amr", "eng"): f"{base_dir}/2020/cf/validation/amr.mrp",
    ("amr", "zho"): f"{base_dir}/2020/cl/training/amr.zho_val.mrp",
    ...
}

self.companion_data = {
    ("amr", "eng"): f"{base_dir}/2020/cf/companion/combined_udpipe.mrp",
    ("amr", "zho"): f"{base_dir}/2020/cl/companion/combined_zho.mrp",
    ...
}

Where can they be downloaded? The sample data in the demo on the LDC website doesn't seem to include either of them.

2. If I train with perin-parser, is it enough to find suitable data and then run perin_parser/preprocess.py followed by perin_parser/train.py?

3. I've read through the levi-graph-amr-parser paper and code. It appears to use LDC AMR data in PENMAN format and calls amr_parser when parsing the data. Is there any relationship between amr_parser and perin-parser?

Thanks in advance.

amr.mrp and combined_udpipe.mrp are the AMR corpus and the companion data released by the MRP2020 organizing committee; you can try applying to them for access.

Yes. Here is the script I used to train with mengzi:

# Imports are a best guess (HanLP plus the perin-parser package mentioned above);
# adjust to the actual module paths in your installation.
from perin_parser import PerinParser
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.utils.log_util import cprint

parser = PerinParser()
save_dir = 'data/model/amr-zho-mengzi-base'
cprint(f'Model will be saved in [cyan]{save_dir}[/cyan]')
parser.fit(
    training_data={
        # ('amr', 'eng'): 'data/mrp/2020/cf/training/amr.mrp',
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_train.mrp',
    },
    validation_data={
        # ('amr', 'eng'): 'data/mrp/2020/cf/validation/amr.mrp',
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_val.mrp',
    },
    test_data={
        ('amr', 'zho'): 'data/mrp/2020/cl/training/amr.zho_val.mrp',
    },
    companion_data={
        ("amr", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("amr", "zho"): f"data/mrp/2020/cl/companion/combined_zho.mrp",
        ("drg", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("drg", "deu"): f"data/mrp/2020/cl/companion/combined_deu_translated.mrp",
        ("eds", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ptg", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ptg", "ces"): f"data/mrp/2020/cl/companion/combined_ces.mrp",
        ("ucca", "eng"): f"data/mrp/2020/cf/companion/combined_udpipe.mrp",
        ("ucca", "deu"): f"data/mrp/2020/cl/companion/combined_deu.mrp",
    },
    frameworks=[["amr", "zho"]],
    save_dir=save_dir,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    workers=2,
    encoder='Langboat/mengzi-bert-base',
    encoder_learning_rate=5e-5,
)
cprint(f'Model saved in [cyan]{save_dir}[/cyan]')
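Afterwards, restoring the fine-tuned model should follow HanLP's usual component API; a hedged sketch (load(save_dir) is assumed to mirror other HanLP components):

# Assumption: PerinParser exposes HanLP's standard component API,
# so the saved model can be restored from save_dir like this.
parser = PerinParser()
parser.load(save_dir)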

Not much of a relationship: the former predicts concepts autoregressively, while the latter does it in parallel. Deng Cai's paper claims that concept and edge prediction can be improved through iteration; our paper removed the iteration and achieved the same results. PERIN does not even generate concepts iteratively, which further confirms that the iteration contributes little.

Thanks, han. I later saw this conclusion of yours in the GitHub issues.

I'll try requesting the companion data from the MRP2020 organizing committee. Thanks, han.