Structural Knowledge Distillation: Tractably Distilling Information for...

This paper factorizes the knowledge distillation objective for structured prediction problems and shows the benefits of doing so. For structured prediction problems defined as:

\def\sY{{\mathbb{Y}}}
\newcommand{\bm}[1]{\boldsymbol{#1}}
\newcommand{\mcZ}{\mathcal{Z}}
\newcommand{\mcL}{\mathcal{L}}
\newcommand{\Score}{\text{Score}}
\def\vu{{\bm{u}}}
\def\vx{{\bm{x}}}
\def\vy{{\bm{y}}}
\begin{align}
P(\vy|\vx) &= \frac{\exp(\Score(\vy,\vx))}{\sum_{\vy^{\prime} \in \sY(\vx)} \exp(\Score(\vy^{\prime},\vx))} \nonumber \\
&= \frac{\prod_{\vu \in \vy}\exp(\Score(\vu,\vx))}{\mcZ(\vx)} \label{eq:2.1}
\end{align}

The structural KD objective then factorizes as:

\def\sU{{\mathbb{U}}}
\def\1{\bm{1}}
\begin{align}
\mcL_{\text{KD}} &= -\sum_{\vy \in \sY(\vx)} P_t(\vy|\vx) \log P_s(\vy|\vx) \nonumber \\
&= -\sum_{\vy \in \sY(\vx)} P_t(\vy|\vx) \sum_{\vu \in \vy} \Score_s(\vu,\vx) + \log \mcZ_s(\vx) \nonumber \\
&= -\sum_{\vy \in \sY(\vx)} P_t(\vy|\vx) \sum_{\vu \in \sU_s(\vx)} \1_{\vu \in \vy} \Score_s(\vu,\vx) + \log \mcZ_s(\vx) \nonumber \\
&= -\sum_{\vu \in \sU_s(\vx)} \sum_{\vy \in \sY(\vx)} P_t(\vy|\vx) \1_{\vu \in \vy} \Score_s(\vu,\vx) + \log \mcZ_s(\vx) \nonumber \\
&= -\sum_{\vu \in \sU_s(\vx)} P_t(\vu|\vx) \Score_s(\vu,\vx) + \log \mcZ_s(\vx) \label{eq:kl}
\end{align}

As long as the teacher's substructure marginal $P_t(\vu|\vx)$ is tractable, this objective is tractable. Note that the substructures of the teacher and the student do not need to be identical, but they must be compatible enough for that marginal to be computed. Usually, fine-grained substructures are marginalized to fit coarse-grained ones. In an extreme case, at least one of them needs to be computed through dynamic programming.
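
To convince myself of the factorization, here is a minimal sketch that checks it numerically on a toy problem. It is not the paper's code. I assume a linear-chain teacher (hypothetical emission and transition scores `emit_t`, `trans_t`) and a token-level student (hypothetical scores `score_s`), so the student's substructures are single (position, label) pairs and the teacher's finer-grained edge structure gets marginalized down to them with forward-backward. All function names are my own.

```python
import itertools
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis)


def teacher_marginals(emit, trans):
    """Token-level marginals P_t(y_i = l | x) of a linear-chain teacher,
    computed with forward-backward (dynamic programming)."""
    n, num_labels = emit.shape
    alpha = np.zeros((n, num_labels))
    beta = np.zeros((n, num_labels))
    alpha[0] = emit[0]
    for i in range(1, n):
        # trans[a, b] scores moving from label a to label b.
        alpha[i] = emit[i] + logsumexp(alpha[i - 1][:, None] + trans, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(trans + (emit[i + 1] + beta[i + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)
    return np.exp(alpha + beta - log_z)


def factorized_kd(emit_t, trans_t, score_s):
    """Last line of the derivation: -sum_u P_t(u|x) Score_s(u,x) + log Z_s(x)."""
    marg = teacher_marginals(emit_t, trans_t)
    # For a token-level student, log Z_s is a sum of per-position terms.
    log_z_s = np.sum(logsumexp(score_s, axis=1))
    return float(-np.sum(marg * score_s) + log_z_s)


def brute_force_kd(emit_t, trans_t, score_s):
    """First line of the derivation: -sum_y P_t(y|x) log P_s(y|x),
    enumerating every label sequence (only feasible for toy sizes)."""
    n, num_labels = emit_t.shape
    seqs = list(itertools.product(range(num_labels), repeat=n))
    t_scores = np.array([
        sum(emit_t[i, y[i]] for i in range(n))
        + sum(trans_t[y[i - 1], y[i]] for i in range(1, n))
        for y in seqs
    ])
    s_scores = np.array([sum(score_s[i, y[i]] for i in range(n)) for y in seqs])
    log_p_t = t_scores - logsumexp(t_scores, axis=0)
    log_p_s = s_scores - logsumexp(s_scores, axis=0)
    return float(-np.sum(np.exp(log_p_t) * log_p_s))


# Toy check with random scores: 4 tokens, 3 labels.
rng = np.random.default_rng(0)
emit_t = rng.normal(size=(4, 3))
trans_t = rng.normal(size=(3, 3))
score_s = rng.normal(size=(4, 3))
print("brute force:", brute_force_kd(emit_t, trans_t, score_s))
print("factorized: ", factorized_kd(emit_t, trans_t, score_s))
```

The two printed values should agree up to floating-point error. The brute-force version enumerates every label sequence, which grows exponentially with sentence length, while the factorized version only needs the teacher marginals and the student partition function.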


  • It’s mathematically beautiful.
  • The research is skillfully organized: {coarse, fine} × {student, teacher} = 4 combinations. This kind of factorization makes the paper more convincing.
  • I can tell their research advisor is very experienced, and their group basically dominates this topic.
  • Though it is a good research paper, it has little practical significance since the student models perform poorly in comparison to the teachers.
  • 5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
  • 4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
  • 4: Strong: I learned a lot from it. I would like to see it accepted.
  • 3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
  • 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
  • 2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
  • 2: Mediocre: I would rather not see it in the conference.
  • 1.5: Weak: I am pretty confident that it should be rejected.
  • 1: Poor: I would fight to have it rejected.
