Please subscribe to keep me alive:
BLOG:
MATH COURSES (7 day free trial)
Mathematics for Machine Learning:
Calculus:
Statistics for Data Science:
Bayesian Statistics:
Linear Algebra:
Probability:
OTHER RELATED COURSES (7 day free trial)
Deep Learning Specialization:
Python for Everybody:
MLOps Course:
Natural Language Processing (NLP):
Machine Learning in Production:
Data Science Specialization:
Tensorflow:
REFERENCES
[1] The main Paper:
[2] Tensor2Tensor has some code with a tutorial:
[3] Transformer very intuitively explained – Amazing:
[4] Medium Blog on intuitive explanation:
[5] Pretrained word embeddings:
[6] Intuitive explanation of Layer normalization:
[7] Paper that gives even better results than transformers (Pervasive Attention):
[8] BERT uses transformers to pretrain neural nets for common NLP tasks:
[9] Stanford Lecture on RNN:
[10] Colah's Blog:
[11] Wiki for time series of events: (machine_learning_model)
What a hugely underrated video. You did a much better job of explaining this across multiple abstraction layers in such a short video than most videos I could find on the topic, which were more than twice as long.
Indeed! By far the best explanation video on YouTube I’ve gone through.
This comment is hugely underrated. You kept me from watching other videos on the topic that are more than twice as long.
Definitely an underrated video. What you taught about how an RNN works in 30 seconds took a 47-minute MIT presentation to cover. I could never understand the transformer model from the original paper, but I think I can relate now. Keep educating.
I agree with this comment.
This is one of the best YouTube videos I have seen on a technical subject. It just made transformers easy.
Great video! Watched it a few times already so these timestamps will help me out:
0:00 Problems with RNNs and LSTMs
3:34 First pass overview over transformer architecture
8:10 Second pass with more detail
10:34 Third pass focusing on attention networks and normalization
11:57 Other resources (code & blog posts)
Thanks for this! It’ll help others watching too.
@CodeEmporium Pin this comment.
Thank you so much! I planned to watch this a few times for reference as I delve into transformer code. This will be very useful.
Outstanding explanations: to the point and well illustrated. Thank you.
Wow.
I’ve seen lectures that are 45m+ long trying to understand this architecture, even lectures from the original authors.
Your video was hands-down the best, really helped me piece some key missing intuition pieces together.
You have a gift for teaching and explaining — I wholeheartedly hope you’re able to leverage that in your professional career!
Great video! I love how you go through multiple passes, each getting into deeper specifics!
The multi-pass approach to progressively explaining the internals worked well. Thanks for your content!
The understanding converges!
That was an awesome explanation. I have a question about the Add & Norm block: do you add the embedded vector before or after performing normalization? Is there even a difference if we do one instead of the other?
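For what it's worth, the original paper applies the residual addition first and then layer normalization, i.e. LayerNorm(x + Sublayer(x)); many later implementations flip this to a "pre-norm" ordering, and the two do behave differently during training even though the output shapes match. Here is a minimal NumPy sketch of both orderings; the function names, toy sublayer, and dimensions are illustrative assumptions, not taken from the video.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector across its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def post_norm(x, sublayer):
    # Ordering in "Attention Is All You Need": Add first, then Norm.
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Common later variant: Norm first, then the sublayer, then Add.
    return x + sublayer(layer_norm(x))

# Toy "sublayer": just a fixed linear map, standing in for attention / feed-forward.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)) * 0.02
sublayer = lambda x: x @ W

x = rng.normal(size=(10, 512))        # 10 token embeddings of width 512
print(post_norm(x, sublayer).shape)   # (10, 512)
print(pre_norm(x, sublayer).shape)    # (10, 512)
```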
Thanks, man. This is a really clear and high-level explanation. Really helpful for people like me who have just stepped into this area.
I read many explanations online. They give tons of detail but fail to explain the abstract concepts: they always use other abstract definitions to explain the one at hand, and the same problem recurs when you look up those other definitions. Sometimes I just forget what I originally wanted to understand. Or, even worse, the definitions go in circles…
Thank you so much! This video helped me a lot in understanding the paper
Great explanation. Could you do another video on positional encoding specifically? It seems to be very important, but I've found it the most confusing part of this architecture.
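In case it helps while waiting for such a video: the fixed sinusoidal encoding in "Attention Is All You Need" (section 3.5) is just a pair of sine/cosine formulas evaluated per position and dimension. Below is a minimal NumPy sketch of it; the function name and the toy sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even feature indices
    pe[:, 1::2] = np.cos(angles)                   # odd feature indices
    return pe

# One 512-dimensional encoding per position, added to the word embeddings.
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```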
Hi… Thanks a lot for the awesome videos.
Wanted to ask: can transformers be used for time-series data, like for activity recognition? Mostly I have seen them being used for NLP tasks.
LOVE the multipass strategy for explaining the architecture. I don’t think I’ve seen this approach used with ML, and it’s a shame as this is an incredibly useful strategy for people like me trying to play catch up. I hopped on the ML train a little late, but stuff like this makes me feel not nearly as lost.
Incredibly well explained and concise. I can’t believe you pulled off such a complete explanation in just 13 minutes!
Thank you for the kind words. Super glad you liked it!
I love the multi-pass way of explanation so that the viewer can process high level concepts and then build upon that knowledge, great job.
Great video!! I am taking a course at my university and one of the lectures was about RNNs and transformers. Your 13-minute video explains it way better than the 100-minute lecture I attended. Thank you!
Really great video. As someone transitioning from pure math into machine learning and AI, I find the language barrier to be the biggest hurdle and you broke down these concepts in a really clear way. I love the multiple layer approach you took to this video, I think it worked really well to first give a big picture overview of the architecture before delving deeper.
Super happy this helped!
Great video :)! But I have one remark: at 8:45 you say that the 8 attention vectors get averaged, but in the original paper “Attention Is All You Need”, on page 4, it says that the outputs from the different attention heads are concatenated rather than averaged, which I think would also make more sense. But maybe this is just a misunderstanding on my side.
You are correct. They are indeed concatenated. Averaging would defeat the whole purpose of multi-head attention.
Nice catch! Looking at the paper again in section 3.2.2, it sounds like 8 vectors of 64 dimensions each are concatenated (not averaged) to give an output vector of 512 for a given word – at least for the base architecture.
Caution, rant ahead: to @A B’s point – most likely, I would agree that averaging would lead to poorer performance than concatenation. With averaging, there are fewer degrees of freedom for representing a word – 64 dimensions vs. 512 dimensions. That said, I’m not entirely convinced that averaging instead of concatenating would be completely useless until I try it out! In other applications outside of attention, averaging a collection of related representations to create another representation is common. For example, if I wanted to represent a user in a recommendation system for Amazon, I could take the representations of the last 10 products a person looked at (64 dimensions each) and average them. I could also concatenate them and get a 640-dimensional vector instead to represent the user. The problem there is that processing / storing such large representations is tough. Hence averaging could be a better option despite potentially worse results. It really depends on what you want to do.
NOTE: The official transformer paper says concatenation, as you correctly pointed out. This last paragraph is just to show that averaging representations can be a viable option in general applications (a minimal concatenation sketch follows at the end of this thread). Pinning this comment for now, and I encourage more discussion. Most likely going to make another video on transformer embeddings in the future as well.
@CodeEmporium I think that is why they added the layer normalization after the concatenation, to enhance speed.
@CodeEmporium
Apply transformers on RNN, GRU, LSTM & CNN, I mean code-wise.
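To make the point in the pinned thread above concrete, here is a minimal NumPy sketch of the two options for combining the 8 attention heads: the concatenation used in the paper's base architecture (8 heads of 64 dimensions each, concatenated to 512 per word and then projected) versus the averaging that was discussed. The variable names and random data are purely illustrative, not taken from the video or any particular implementation.

```python
import numpy as np

num_heads, d_head, seq_len = 8, 64, 10
rng = np.random.default_rng(0)

# One (seq_len, d_head) output per attention head.
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]

# What the paper does: concatenate along the feature axis, then apply the output projection.
concatenated = np.concatenate(head_outputs, axis=-1)           # (10, 512)
W_O = rng.normal(size=(num_heads * d_head, num_heads * d_head)) * 0.02
multi_head_output = concatenated @ W_O                          # (10, 512)

# The averaging alternative discussed above: far fewer degrees of freedom per word.
averaged = np.mean(head_outputs, axis=0)                        # (10, 64)

print(concatenated.shape, averaged.shape)                       # (10, 512) (10, 64)
```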
Oh man I'm going through this now! Didn't even realise you had a vid on it, this is brilliant. Love that you did a bird's-eye pass then dove in.
You’re saying you liked my content without watching this video? You must be a true fan of mine mwahahaa..also thanks for the kind words π
@CodeEmporium I mean, tbf, I clicked like before watching it because it’s you, then I was like dayyyyuuumm this is great.
This is by far the best explanation of the Transformers architecture that I have ever seen! Thanks a lot!
Very well and clearly explained concepts, and nice visualizations. I spent a couple of hours reading multiple blogs/tutorials about transformers, but I learned a lot more from your 13-minute video than from those tutorials. Great job. I subscribed to your channel after watching this. Keep up the good work.
One of the best or probably the best explanation I’ve seen. Thank you very much for the effort.