Unsupervised Text Style Transfer with Padded Masked Language Models

Some style transfer tasks require only very local edits, e.g., deleting or replacing a few tokens. This paper proposes an unsupervised method for such tasks: train two masked language models (MLMs), one on the source domain and one on the target domain, then replace the spans on which the two models' likelihoods disagree most with the target MLM's prediction.

Padded Masked Language Models Likelihood

A conventional MLM predicts exactly one token per mask, so the number of tokens to insert must be fixed in advance and cannot be inferred. A padded MLM always creates at least as many masks as there are real tokens, and it predicts [PAD] for positions where no token exists.
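The padding idea can be illustrated with a minimal sketch. The helper name `make_padded_example`, the parameter `n_p` (the fixed number of mask slots), and the choice to put [PAD] labels after the real tokens are assumptions made for illustration, not details confirmed by these notes.

```python
PAD = "[PAD]"
MASK = "[MASK]"


def make_padded_example(tokens, i, j, n_p):
    """Mask the span tokens[i:j] with exactly n_p [MASK] slots.

    Returns (masked_input, labels). The labels are the span's real tokens
    followed by [PAD] for every slot that has no token (assumed layout).
    """
    span = tokens[i:j]
    if len(span) > n_p:
        raise ValueError("span is longer than the number of mask slots")
    masked_input = tokens[:i] + [MASK] * n_p + tokens[j:]
    labels = span + [PAD] * (n_p - len(span))
    return masked_input, labels


# Hypothetical sentence: mask the two-token span "really bad" with 4 slots.
tokens = "the food was really bad".split()
masked_input, labels = make_padded_example(tokens, 3, 5, n_p=4)
```

Because the number of slots is fixed, the model can signal a shorter span (or a deletion) simply by predicting [PAD] in the unused slots.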

The likelihood of a span under the padded MLM is defined as the product of the probabilities predicted at each mask position.
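As a rough sketch of this definition, the product can be computed as a sum of log-probabilities for numerical stability. The function name `span_log_likelihood` and the toy per-slot distributions below are hypothetical; a real padded MLM would supply the distributions.

```python
import math


def span_log_likelihood(slot_distributions, labels):
    """Sum the log-probability of the label at each mask slot.

    slot_distributions: one dict per mask slot, mapping token -> probability.
    labels: the token (or "[PAD]") expected at each slot.
    """
    if len(slot_distributions) != len(labels):
        raise ValueError("need exactly one label per mask slot")
    return sum(math.log(dist[label])
               for dist, label in zip(slot_distributions, labels))


# Toy distributions for two mask slots (made-up numbers).
slot_distributions = [
    {"great": 0.5, "bad": 0.25, "[PAD]": 0.25},
    {"great": 0.25, "bad": 0.25, "[PAD]": 0.5},
]
log_lik = span_log_likelihood(slot_distributions, ["great", "[PAD]"])
likelihood = math.exp(log_lik)  # product of the two slot probabilities
```

Here the [PAD] slot contributes to the product like any other token, which is how span length enters the likelihood.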

Score all possible spans

Under the target model, its own prediction for the span should have high likelihood, while the original source span should have low likelihood.

\begin{split} \texttt{TargetScore}(i,j) = \; & \mathcal{L}\left(\widehat{W}_{i:j}^\texttt{target} \mid W_{\backslash{i:j}}; \Theta_\texttt{target}\right) \\ & - \mathcal{L}\left(W_{i:j} \mid W_{\backslash{i:j}}; \Theta_\texttt{target}\right). \end{split}

Here \mathcal{L} is the span likelihood defined above, W_{i:j} is the original span, and \widehat{W}_{i:j}^\texttt{target} is the target model's prediction for that span.

On the source model side, the target prediction should have low likelihood and the source span should have high likelihood, so replacing the span should cause a large drop in likelihood. Not sure why it must be rectified; as written, the max means this term only penalizes replacements that the source model finds more likely than the original span, and it gives no extra reward for a larger drop.

\begin{aligned} \texttt{SourceScore}(i,j) = - \max \Big[ 0, \; & \mathcal{L}\left(\widehat{W}_{i:j}^\texttt{target} \mid W_{\backslash{i:j}}; \Theta_\texttt{source}\right) \\ & - \mathcal{L}\left(W_{i:j} \mid W_{\backslash{i:j}}; \Theta_\texttt{source}\right) \Big] \end{aligned}

The final score is the sum of the two.

\texttt{Score}(i,j) = \texttt{TargetScore}(i,j) + \texttt{SourceScore}(i,j).
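The scoring formulas can be sketched directly in code. This is only an illustration: the function names, the `best_span` helper, and the numbers in the example are hypothetical, and the four inputs stand for the likelihood values of the target prediction and the original span under each model (shown here as log-likelihood-like values, which is an assumption).

```python
def target_score(l_pred_tgt, l_orig_tgt):
    """Target model: the prediction should beat the original span."""
    return l_pred_tgt - l_orig_tgt


def source_score(l_pred_src, l_orig_src):
    """Source model: rectified penalty, never positive."""
    return -max(0.0, l_pred_src - l_orig_src)


def span_score(l_pred_tgt, l_orig_tgt, l_pred_src, l_orig_src):
    """Final score: sum of the target and source terms."""
    return (target_score(l_pred_tgt, l_orig_tgt)
            + source_score(l_pred_src, l_orig_src))


def best_span(candidates):
    """Return the (i, j) key with the highest score.

    candidates maps (i, j) to the four likelihood values in the order
    (pred under target, orig under target, pred under source, orig under source).
    """
    return max(candidates, key=lambda span: span_score(*candidates[span]))


# Made-up values for two candidate spans.
candidates = {
    (3, 4): (-1.0, -5.0, -6.0, -2.0),  # target prefers the edit, source does not
    (0, 1): (-1.0, -1.5, -1.0, -3.0),  # source also prefers the edit: penalized
}
chosen = best_span(candidates)
```

In this toy example the first span wins: its source term is zero because the source model finds the replacement less likely than the original, while the second span is penalized because the source model prefers the replacement.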

