The HuggingFaceNmtEngine class currently implements the ATT-OUTPUT approach from this paper; the ATT-INPUT method would produce better-quality alignments. Implementing ATT-INPUT would require two changes: first, shift the attentions one step to the left, which can be done by not prepending a zero matrix to the attention sequence; second, retrieve the attentions from a different layer (the bottom layers rather than the current one). Note that with ATT-INPUT the last token can be left unaligned when the translation hits the maximum generation length, because there is no following decoding step to supply its attention. This edge case should be handled properly.
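The left shift described above can be sketched as follows. This is a minimal illustration, not the actual HuggingFaceNmtEngine code: the helper name, the list-of-rows representation (one cross-attention row per decoding step), and the abstraction away of the prepended zero matrix are all assumptions made for clarity.

```python
import numpy as np

def arrange_alignment_rows(step_attentions, att_input=False):
    """Arrange per-step cross-attention rows into a target-by-source matrix.

    step_attentions: list of 1-D arrays, one per decoding step, each of
    length src_len (illustrative shapes, not the real engine's API).
    """
    rows = [np.asarray(a, dtype=float) for a in step_attentions]
    if att_input:
        # ATT-INPUT: shift one step to the left, so target token t is
        # aligned using the attention computed when it is *fed back in*
        # at step t + 1. The final token gets a zero row: if generation
        # stopped at the max length, no later step exists to align it --
        # this is the edge case noted above.
        rows = rows[1:] + [np.zeros_like(rows[0])]
    # ATT-OUTPUT: use each step's attention for the token produced at
    # that step (no shift in this simplified sketch).
    return np.stack(rows)
```

A caller could detect the unaligned final token by checking whether the last row is all zeros and either dropping that token from the alignment or falling back to the ATT-OUTPUT row for it.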