Pre-Train BERT from scratch: Solution for Company Domain Knowledge Data | PyTorch (SBERT 51)

We pretrain a BERT (Bidirectional Encoder Representations from Transformers) model from scratch in PyTorch on domain-specific data (e.g. confidential company data). In Python, we train a tokenizer optimized for our data, design a BERT architecture from scratch, and start pre-training BERT with a masked language modeling (MLM) head. We choose the vocabulary size to suit our needs (from 8K to 60K tokens), set the depth of our BERT architecture (e.g. 96 layers), and train for days on a (single) GPU to encode our domain-specific knowledge.
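As a rough illustration of that pipeline, here is a minimal sketch using the Hugging Face tokenizers, transformers, and datasets libraries. The corpus path company_corpus.txt, the output directories, and all hyperparameters (vocabulary size, depth, sequence length, epochs, batch size) are placeholder assumptions for the sketch, not values from the video.

```python
import os

from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1) Train a WordPiece tokenizer on the domain corpus.
#    vocab_size is a free choice (anywhere from 8K to 60K, as discussed above).
os.makedirs("domain_tokenizer", exist_ok=True)
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["company_corpus.txt"],          # hypothetical corpus, one text per line
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("domain_tokenizer")   # writes vocab.txt

# 2) Define a BERT encoder from scratch; depth and width here are illustrative,
#    sized for whatever fits on the available GPU.
hf_tokenizer = BertTokenizerFast.from_pretrained("domain_tokenizer")
config = BertConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)            # randomly initialised encoder + MLM head

# 3) Pre-train with the masked language modeling objective (15% of tokens masked).
dataset = load_dataset("text", data_files="company_corpus.txt")["train"]
dataset = dataset.map(
    lambda batch: hf_tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.15
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-domain",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-domain")          # reused by the SBERT sketch further down
hf_tokenizer.save_pretrained("bert-domain")
```

On a single GPU and a realistically sized corpus, a run like this takes days, which is why the vocabulary size and network depth are worth tuning to the domain rather than copying the defaults.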

BERT, or Bidirectional Encoder Representations from Transformers, is a transformer-based machine learning technique for natural language processing.

With such a BERT model, pre-trained on our own texts, we can then build an SBERT (Sentence Transformers) model for a neural Information Retrieval (IR) system.
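Once the domain BERT is pre-trained, it can serve as the word-embedding backbone of an SBERT-style bi-encoder. The sketch below assumes the sentence-transformers library and that the model and tokenizer from the previous sketch were saved to the (hypothetical) bert-domain directory; the query and documents are made-up toy examples.

```python
from sentence_transformers import SentenceTransformer, models, util

# Wrap the domain BERT as a bi-encoder: transformer backbone + mean pooling.
word_embedding = models.Transformer("bert-domain", max_seq_length=256)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(), pooling_mode="mean"
)
sbert = SentenceTransformer(modules=[word_embedding, pooling])

# Toy retrieval example: rank candidate documents against a query by cosine similarity.
docs = [
    "Quarterly report on product reliability.",
    "Internal onboarding checklist for new hires.",
]
query_emb = sbert.encode("Which document covers product quality?", convert_to_tensor=True)
doc_embs = sbert.encode(docs, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))
```

In practice the bi-encoder is still fine-tuned on sentence pairs before being used for retrieval; mean pooling over a freshly MLM-pretrained encoder alone gives only weak sentence embeddings.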

Official links to my sources (all rights belong to them):

!! COLAB to follow along:

#sbert
#ai
#naturallanguageprocessing

