DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
#deberta #bert #huggingface
DeBERTa by Microsoft is the next iteration of BERT-style self-attention Transformer models, surpassing RoBERTa and setting a new state of the art on multiple NLP tasks. DeBERTa brings two key improvements: First, it treats content and position information separately in a new disentangled attention mechanism. Second, it uses relative positional encodings throughout the body of the transformer and provides absolute positional encodings only at the very end. The resulting model is both more accurate on downstream tasks and needs fewer pretraining steps to reach good accuracy. Models are also available on Hugging Face and GitHub.
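For readers who want a feel for the disentangled attention described above, here is a minimal sketch of the three score terms (content-to-content, content-to-position, position-to-content) for a single head with toy dimensions. All names (W_qc, W_kr, rel_embed, etc.) are illustrative assumptions, not taken from Microsoft's implementation.

```python
# Minimal sketch of DeBERTa-style disentangled attention scores.
# Single head, toy sizes; illustrative only, not the official code.
import torch

d = 64            # head dimension (toy size)
seq_len = 10
k = 4             # maximum relative distance considered

H = torch.randn(seq_len, d)              # content states
rel_embed = torch.randn(2 * k, d)        # relative position embeddings P

W_qc, W_kc = torch.randn(d, d), torch.randn(d, d)  # content projections
W_qr, W_kr = torch.randn(d, d), torch.randn(d, d)  # position projections

Qc, Kc = H @ W_qc, H @ W_kc                  # content queries / keys
Qr, Kr = rel_embed @ W_qr, rel_embed @ W_kr  # position queries / keys

# delta[i, j]: relative distance of i to j, clipped into [0, 2k)
idx = torch.arange(seq_len)
delta = (idx[:, None] - idx[None, :]).clamp(-k, k - 1) + k

c2c = Qc @ Kc.T                              # content-to-content
c2p = torch.gather(Qc @ Kr.T, 1, delta)      # content-to-position: Qc_i . Kr_{delta(i,j)}
p2c = torch.gather(Kc @ Qr.T, 1, delta).T    # position-to-content: Kc_j . Qr_{delta(j,i)}

# sum of the three terms, scaled by sqrt(3d) as in the paper
scores = (c2c + c2p + p2c) / (3 * d) ** 0.5
attn = torch.softmax(scores, dim=-1)
out = attn @ H                               # attention output for this head
```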
OUTLINE:
0:00 - Intro & Overview
2:15 - Position Encodings in Transformer’s Attention Mechanism
9:55 - Disentangling Content & Position Information in Attention
21:35 - Disentangled Query & Key construction in the Attention Formula
25:50 - Efficient Relative Position Encodings
28:40 - Enhanced Mask Decoder using Absolute Position Encodings
35:30 - My Criticism