# Learning to Contextually Aggregate Multi-Source Supervision for Sequence...

Learning from multi-domain data is attractive yet non-trivial. This paper augments a shared BiLSTM-CRF with simple source-wise linear transforms, aggregated by domain attention, for multi-source sequence labeling.

## Approach

The structure prediction score is defined as:

$$s({x},{y}) = \sum_{t=1}^T (U_{t, {y}_{t}} + M_{{y}_{t-1},{y}_t}),$$

where $U \in \mathbb{R}^{T \times C}$ holds the token-wise emission scores and $M \in \mathbb{R}^{C \times C}$ holds the tag-transition scores.
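The score above can be sketched directly in NumPy; this is a minimal illustration of the scoring formula (the function name and the handling of the first transition are my own simplifications, not from the paper):

```python
import numpy as np

def crf_score(U, M, y):
    """Score a tag sequence y under emission matrix U (T x C) and
    transition matrix M (C x C). The t = 0 term uses only the emission
    score, i.e. the start transition is omitted for brevity."""
    score = U[0, y[0]]
    for t in range(1, len(y)):
        score += U[t, y[t]] + M[y[t - 1], y[t]]
    return score
```

At decoding time one would maximize this score over all tag sequences with Viterbi; the sketch only evaluates a given sequence.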

To adapt the model to each of the $K$ sources, the emission matrix $U$ and the transition matrix $M$ are transformed source-wise by a learned matrix $A^{(k)}$:

$$s^{(k)}({x},{y}) = \sum_{t=1}^T \left((U A^{(k)})_{t, {y}_t} + (M A^{(k)})_{{y}_{t-1},{y}_t}\right).$$

These per-source transforms $A^{(k)}$ are trained jointly with the shared BiLSTM-CRF parameters.
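The source-wise scoring amounts to right-multiplying both matrices by $A^{(k)}$ before scoring; a minimal NumPy sketch, assuming $A^{(k)}$ is a $C \times C$ matrix (function name is my own):

```python
import numpy as np

def source_score(U, M, A_k, y):
    """Score a tag sequence under source k: the shared emission matrix U
    and transition matrix M are right-multiplied by the source-specific
    transform A_k (C x C) before the usual CRF scoring."""
    Uk = U @ A_k  # source-adapted emissions, shape (T, C)
    Mk = M @ A_k  # source-adapted transitions, shape (C, C)
    score = Uk[0, y[0]]
    for t in range(1, len(y)):
        score += Uk[t, y[t]] + Mk[y[t - 1], y[t]]
    return score
```

With $A^{(k)} = I$ this reduces to the shared score, so the identity is a natural initialization for the transforms.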

To produce the final prediction, the authors aggregate the source-specific transforms with an attention over the sources:

$$\mathbf{A}_i^* = \sum_{k=1}^K {q}_{i,k} A^{(k)},$$

where $q_{i,k}$ is the attention weight of source $k$ at token position $i$, produced by a softmax $\mathbf{q}_i = \text{softmax}(\mathbf{Q}\,\mathbf{h}^{(i)})$, with $\mathbf{Q}\in\mathbb{R}^{K \times 2d}$ and $\mathbf{h}^{(i)} \in \mathbb{R}^{2d}$ the BiLSTM hidden state at position $i$.
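The contextual aggregation can be sketched as follows; a minimal NumPy illustration of the attention step (function names are my own, and the BiLSTM state `h_i` is taken as given):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def aggregate(A_list, Q, h_i):
    """Contextual aggregation: attention weights q_i over the K sources,
    then a convex combination of the source transforms A^{(k)}.
    Q has shape (K, 2d); h_i is the 2d-dim BiLSTM state at position i."""
    q = softmax(Q @ h_i)  # (K,) attention weights, sum to 1
    A_star = sum(w * A for w, A in zip(q, A_list))
    return A_star, q
```

Because the weights depend on $\mathbf{h}^{(i)}$, each token gets its own mixture $\mathbf{A}_i^*$ of the source transforms, which is what makes the aggregation contextual.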