[Corpora-List] BERT + NN finetune

Phil Gooch philgooch at gmail.com
Sat Aug 29 11:10:31 CEST 2020

Here's a good tutorial on using the HuggingFace transformers library to fine-tune a BERT classifier


You will need to make a couple of changes to the code to get this to work optimally:

1. When encoding the inputs with tokenizer.encode_plus(), you must explicitly set truncation=True so that sequences longer than max_length are cut to fit

2. Set the number of epochs to 3 (4 is too many and tends to overfit)

Hope this is useful - it should be possible to modify this code to work with your own data by changing the dataframe columns and labels as appropriate.

For quicker inference at run time, you can modify the code to use the smaller DistilBERT model, also from huggingface. Or one of the other BERT-type models for sequence classification, e.g.

https://huggingface.co/transformers/v2.2.0/model_doc/distilbert.html (DistilBertForSequenceClassification, DistilBertConfig, DistilBertTokenizer) - tokenizer/model is 'distilbert-base-uncased'

https://huggingface.co/transformers/v2.2.0/model_doc/roberta.html (RobertaForSequenceClassification, RobertaConfig, RobertaTokenizer) - tokenizer/model is 'roberta-base'

https://huggingface.co/transformers/v2.2.0/model_doc/albert.html (AlbertForSequenceClassification, AlbertConfig, AlbertTokenizer) - tokenizer/model is 'albert-base-v2'


On Sat, Aug 29, 2020 at 10:02 AM s.z. aftabi <s.z.aftabi at gmail.com> wrote:

> Dear all,
> I'm currently working on a two-sentence classification task.
> I aim to use BERT as an embedding layer with a neural network on top (an NN
> with more than one layer), then train the NN model and also fine-tune BERT,
> both with respect to the classification loss.
> I would appreciate your help in finding a good reference or implementation
> for how to do that.
> Best regards,
> S.Zahra
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora