Transformer Neural Networks – EXPLAINED! (Attention is all you need)

37 Comments

  1. What a hugely underrated video. You did a much better job of explaining this across multiple abstraction layers in such a short video than most videos I could find on the topic, which were more than twice as long.

    1. this comment is hugely underrated. You kept me from watching other videos on the topic which are more than twice as long.

    2. Definitely an underrated video. What you taught about how an RNN works in 30 seconds took a 47-minute MIT presentation to cover! I could never understand the transformer model from the original paper, but I think I can follow it now. Keep educating.

    3. This is one of the best YouTube videos I have seen on a technical subject. It just made transformers easy.

  2. Great video! Watched it a few times already so these timestamps will help me out:

    0:00 Problems with RNNs and LSTMs
    3:34 First pass overview over transformer architecture
    8:10 Second pass with more detail
    10:34 Third pass focusing on attention networks and normalization
    11:57 Other resources (code & blog posts)

    1. Thank you so much! I planned to watch this a few times for reference as I delve into transformer code. This will be very useful.

  3. Wow.
    I’ve seen lectures that are 45m+ long trying to understand this architecture, even lectures from the original authors.
    Your video was hands-down the best, really helped me piece some key missing intuition pieces together.
    You have a gift for teaching and explaining — I wholeheartedly hope you’re able to leverage that in your professional career!

  4. The multi-pass approach to progressively explaining the internals worked well. Thanks for your content!

  5. That was an awesome explanation. I have a question about the Add & Norm block: do you add the embedded vector before or after performing normalization? Is there even a difference if we do one instead of the other?
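
    For what it's worth, the original paper applies the Add step first and the Norm step second: each sub-layer's output is LayerNorm(x + Sublayer(x)). Below is a minimal NumPy sketch of that "post-norm" ordering (the learnable scale and shift parameters of layer normalization are omitted, and the inputs are random placeholders):

    ```python
    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each token's feature vector to zero mean and unit variance
        # (learnable gain/bias omitted for brevity).
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer_out):
        # Post-norm ordering from the original paper: add the residual first,
        # then normalize the sum, i.e. LayerNorm(x + Sublayer(x)).
        return layer_norm(x + sublayer_out)

    # Example: a 4-token sequence with model width 512
    x = np.random.randn(4, 512)
    attention_out = np.random.randn(4, 512)   # stand-in for a sub-layer's output
    print(add_and_norm(x, attention_out).shape)  # (4, 512)
    ```

    The order does matter in practice: later "pre-norm" variants compute x + Sublayer(LayerNorm(x)) instead and tend to train more stably, but the original paper uses the post-norm form above.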

  6. Thanks, man. This is a really clear, high-level explanation. Really helpful for people like me who have just stepped into this area.

    I have read many explanations online. They give tons of detail but fail to explain the abstract concepts: they always use other abstract definitions to explain the one at hand, and the same problem comes up again when explaining those. Sometimes I just forget what I originally wanted to understand. Or, even worse, the definitions go around in a circle…

    Thank you so much! This video helped me a lot in understanding the paper

  7. Great explanation. Could you do another video on positional encoding specifically? It seems to be very important, but I’ve found it the most confusing part of this architecture.
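
    In case a quick reference helps until then: the paper's fixed positional encodings are just sines and cosines of different frequencies that get added to the word embeddings, so the model can distinguish token positions. A minimal NumPy sketch of that formula (assuming an even d_model):

    ```python
    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model // 2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even feature indices
        pe[:, 1::2] = np.cos(angles)                   # odd feature indices
        return pe

    # Each position gets a unique 512-dimensional pattern that is simply
    # added to the corresponding word embedding before the first layer.
    print(sinusoidal_positional_encoding(seq_len=10, d_model=512).shape)  # (10, 512)
    ```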

  8. Hi… Thanks a lot for the awesome videos.
    I wanted to ask: can transformers be used for time-series data, like activity recognition? I have mostly seen them used for NLP tasks.

  9. LOVE the multipass strategy for explaining the architecture. I don’t think I’ve seen this approach used with ML, and it’s a shame as this is an incredibly useful strategy for people like me trying to play catch up. I hopped on the ML train a little late, but stuff like this makes me feel not nearly as lost.

  10. Incredibly well explained and concise. I can’t believe you pulled off such a complete explanation in just 13 minutes!

  11. I love the multi-pass way of explaining, so the viewer can process the high-level concepts first and then build on that knowledge. Great job.

  12. Great video!! I am taking a course at my university, and one of the lectures was about RNNs and transformers. Your 13-minute video explains it way better than the 100-minute lecture I attended. Thank you!

  13. Really great video. As someone transitioning from pure math into machine learning and AI, I find the language barrier to be the biggest hurdle and you broke down these concepts in a really clear way. I love the multiple layer approach you took to this video, I think it worked really well to first give a big picture overview of the architecture before delving deeper.

  14. Great video :)! But I have one remark: at 8:45 you say that the 8 attention vectors get averaged, but in the original paper “Attention Is All You Need”, on page 4, it says that the outputs of the different attention heads are concatenated rather than averaged, which I think also makes more sense. But maybe this is just a misunderstanding on my side.

    1. You are correct. They are indeed concatenated. Averaging would defeat the whole purpose of multi-head attention.

    2. Nice catch! Looking at the paper again in section 3.2.2, it sounds like 8 vectors of 64 dimensions each are concatenated (not averaged) to give an output vector of 512 for a given word – at least for the base architecture.

      Caution, rant ahead: To @A B’s point – I would agree that averaging would most likely lead to poorer performance than concatenation. With averaging, there are fewer degrees of freedom for representing a word – 64 dimensions vs. 512 (see the dimension sketch after this thread). That said, I’m not entirely convinced that averaging instead of concatenating would be completely useless until I try it out! In applications outside of attention, averaging a collection of related representations to create another representation is common. For example, if I wanted to represent a user in a recommendation system for Amazon, I could take the representations of the last 10 products the person looked at (64 dimensions each) and average them. I could also concatenate them and get a 640-dimensional vector to represent the user instead. The problem is that processing and storing such large vectors is expensive, so averaging could be the better option despite potentially worse results. It really depends on what you want to do.

      NOTE: The official Transformer paper says concatenation, as you correctly pointed out. The last paragraph is just to show that averaging representations can be a viable option in other applications. Pinning this comment for now, and I encourage more discussion. I will most likely make another video on transformer embeddings in the future as well.

    3. @CodeEmporium I think that is why they added the layer normalization after the concatenation – to speed up training.
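
      To make the dimensions discussed in this thread concrete, here is a minimal NumPy sketch with random placeholders standing in for one token's per-head outputs: concatenating the 8 heads of 64 dimensions each gives the 512-dimensional vector that the base model then feeds through its output projection, whereas averaging would collapse them to 64 dimensions.

      ```python
      import numpy as np

      num_heads, d_head = 8, 64   # base Transformer: 8 heads, 64 dimensions each
      heads = [np.random.randn(d_head) for _ in range(num_heads)]  # placeholder head outputs

      concatenated = np.concatenate(heads)   # shape (512,) – what the paper does
      averaged = np.mean(heads, axis=0)      # shape (64,)  – the alternative debated above

      print(concatenated.shape, averaged.shape)   # (512,) (64,)
      ```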

  15. Oh man I’m going through this now! Didn’t even realise you had a vid on it, this is brilliant. Love that you did a bird’s eye pass then dove in.

    1. You’re saying you liked my content without watching this video? You must be a true fan of mine mwahahaa..also thanks for the kind words 🙂

    2. @CodeEmporium I mean, tbf, I clicked like before watching it because it’s you, then I was like dayyyyuuumm this is great.

  16. This is by far the best explanation of the Transformers architecture that I have ever seen! Thanks a lot!

  17. Very well and clearly explained concepts, with nice visualizations. I spent a couple of hours reading multiple blogs and tutorials about transformers, but I learned a lot more from your 13-minute video than from those tutorials. Great job. I subscribed to your channel after watching this. Keep up the good work!
