Some style transfer tasks require only very local edits, e.g., deleting or replacing a few tokens. This paper proposes an unsupervised method for such tasks: train two masked language models (MLMs), one on the source domain and one on the target domain, then replace the spans where the two models disagree most in likelihood with the prediction of the target MLM.

A conventional MLM predicts exactly one token per mask, so the number of tokens to generate must be fixed in advance and cannot be inferred. A padded MLM always inserts at least as many masks as there are real tokens in the span, and predicts [PAD] at the positions that correspond to no token.
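
A minimal sketch of how such a padded example could be built, to make the role of [PAD] concrete. The helper name `make_padded_example` and the fixed mask count `max_span_len` are assumptions for illustration, not names from the paper.

```python
from typing import List, Tuple

MASK, PAD = "[MASK]", "[PAD]"

def make_padded_example(
    tokens: List[str], i: int, j: int, max_span_len: int = 4
) -> Tuple[List[str], List[str]]:
    """Replace tokens[i:j] with exactly max_span_len masks.

    Returns the masked input and the per-mask targets. Targets are the
    original span tokens followed by [PAD] for the unused mask positions.
    max_span_len is a hypothetical hyperparameter chosen for this sketch.
    """
    span = tokens[i:j]
    if len(span) > max_span_len:
        raise ValueError("span is longer than the number of masks")
    masked_input = tokens[:i] + [MASK] * max_span_len + tokens[j:]
    targets = span + [PAD] * (max_span_len - len(span))
    return masked_input, targets

# Mask the single token "not" with four masks.
masked_input, targets = make_padded_example(
    ["the", "food", "was", "not", "good"], 3, 4
)
print(masked_input)
print(targets)
```

Because the number of masks is fixed, the model can express a shorter replacement (or a deletion) by predicting [PAD] for some or all positions.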

The likelihood of a span under the padded MLM is defined as the product of the per-position probabilities, one factor for each token/mask.
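
As a small worked example, the product can be computed in log space to avoid underflow. The probabilities below are made up, and counting the [PAD] positions as factors is an assumption of this sketch.

```python
import math
from typing import Sequence

def span_log_likelihood(position_probs: Sequence[float]) -> float:
    """Log of the product of the per-position probabilities.

    position_probs holds one probability per mask position; in this
    sketch the [PAD] positions are assumed to be included as factors.
    """
    return sum(math.log(p) for p in position_probs)

# Made-up probabilities for one real token followed by three [PAD] predictions.
probs = [0.5, 0.8, 0.9, 0.9]
log_l = span_log_likelihood(probs)
likelihood = math.exp(log_l)  # equals 0.5 * 0.8 * 0.9 * 0.9
print(log_l, likelihood)
```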

## Score all possible spans

For the target model, its own prediction should have high likelihood, while the original source span should have low likelihood.
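
Mirroring the source-side score \sscore{i,j} given below, the target-side score presumably takes the following form. This is an assumption by symmetry, with the leading sign flipped so that a likelihood gain under the target model is rewarded:

\tscore{i,j} = \max \Big[ 0, \pl{\widehat{W}_{i:j}^\texttt{target} \mid W_{\backslash{i:j}}; \Theta_\texttt{target}} - \pl{W_{i:j} \mid W_{\backslash{i:j}}; \Theta_\texttt{target}} \Big]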


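For the source model, the target prediction should have low likelihood and the original source span should have high likelihood, so replacing the span should cause as large a drop in likelihood as possible. As written below, the score is zero whenever the replacement is less likely than the original under the source model, and negative otherwise. Not sure why the difference must be rectified (clipped at zero by the max).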

\begin{aligned} \sscore{i,j} = - \max \Big[ 0, \pl{\widehat{W}_{i:j}^\texttt{target} \mid W_{\backslash{i:j}}; \Theta_\texttt{source}} \\ - \pl{W_{i:j} \mid W_{\backslash{i:j}}; \Theta_\texttt{source}} \Big] \end{aligned}

The final score is the sum of the two.

\score{i,j} = \tscore{i,j} + \sscore{i,j}.
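
A minimal sketch of how the combined score could be used to rank candidate spans. The source-side function follows the formula above. The target-side function is an assumed mirror image of it (same rectified difference, without the leading minus sign). The four input values per span stand for the \pl terms and are made-up numbers; all function names are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]

def target_score(pl_pred_tgt: float, pl_orig_tgt: float) -> float:
    # Assumed form: rectified gain of the target prediction over the
    # original span, both evaluated under the target model.
    return max(0.0, pl_pred_tgt - pl_orig_tgt)

def source_score(pl_pred_src: float, pl_orig_src: float) -> float:
    # Formula from the text: a penalty that applies only when the source
    # model finds the target prediction more likely than the original span.
    return -max(0.0, pl_pred_src - pl_orig_src)

def span_score(pl_pred_tgt: float, pl_orig_tgt: float,
               pl_pred_src: float, pl_orig_src: float) -> float:
    # Final score: sum of the target-side and source-side scores.
    return (target_score(pl_pred_tgt, pl_orig_tgt)
            + source_score(pl_pred_src, pl_orig_src))

# Made-up values per span (i, j):
# (pred under target, orig under target, pred under source, orig under source)
candidates: Dict[Span, Tuple[float, float, float, float]] = {
    (3, 4): (-1.0, -6.0, -5.0, -2.0),
    (0, 1): (-1.5, -2.0, -1.0, -2.5),
}

scores = {span: span_score(*vals) for span, vals in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case span (3, 4) wins: the target model strongly prefers its own prediction there, and the source model does not, so no penalty is applied.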