Toxic Comment Classification Using Machine Learning


dc.contributor.author Pramodya, L.A.S.
dc.contributor.author Rathnayaka, R.M.G.U.
dc.contributor.author Lahiru, K.K.S.
dc.contributor.author Thambawita, D.R.V.L.B.
dc.date.accessioned 2019-04-06T06:49:08Z
dc.date.available 2019-04-06T06:49:08Z
dc.date.issued 2019-02
dc.identifier.isbn 9789550481255
dc.identifier.uri http://www.erepo.lib.uwu.ac.lk/bitstream/handle/123456789/112/73.pdf?sequence=1&isAllowed=y
dc.description.abstract Comment classification models are available today for "flagging" comments. However, determining whether or not a comment should be "flagged" is difficult and time-consuming. Another major problem is the lack of sufficient data for training such models, and the available datasets have their own issues: they are annotated by human raters, and those annotations depend on the raters' personal beliefs. The lack of multi-label comment classification models also makes abusive behaviour harder to curb. This paper presents models for multi-label text classification that identify the different levels of toxicity within a comment. We use Wikipedia comments labelled by human raters for toxic behaviour, provided by Kaggle. Comments are categorised into six categories: toxic, severe-toxic, obscene, threat, insult, and identity-hate. The dataset contains 159,572 comments. For data analysis we use the Python seaborn and matplotlib libraries. The dataset is highly skewed: most of the comments do not belong to any of the six categories. We undersampled the majority class to correct the bias in the original dataset. We tested three models: a feed-forward neural network with Keras and word embedding, a Naive Bayes model with Scikit-Learn, and a LightGBM model with 4-fold cross-validation. The neural network took 3.5 hours to train on an Nvidia GeForce 840M, which has 384 CUDA cores; the Naive Bayes model with Scikit-Learn took 3 hours, while LightGBM with k-fold cross-validation took 4 hours. We ran 100 epochs for each model. After 100 epochs, the neural network gave a validation accuracy of 0.9930 with a loss of 0.2714; the Naive Bayes model with Scikit-Learn gave a validation accuracy of 0.9556 with a loss of 0.4121; and LightGBM with k-fold cross-validation gave an accuracy of 0.9000 with a validation loss of 0.4263. The neural network gave the best accuracy at the end of the 100th epoch. en_US
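As an illustrative sketch (not the authors' code), the Scikit-Learn Naive Bayes variant described in the abstract can be framed as a one-vs-rest multi-label problem over the six toxicity categories, with one binary classifier per label; the comments and label matrix below are hypothetical stand-ins for the Kaggle Wikipedia dataset:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy comments standing in for the Kaggle Wikipedia data.
comments = [
    "you are a terrible idiot",
    "i will hurt you",
    "thanks for the helpful edit",
    "what a disgusting obscene rant",
    "people like you do not belong here",
]
# Columns: toxic, severe_toxic, obscene, threat, insult, identity_hate.
labels = np.array([
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
])

# TF-IDF features; one binary Naive Bayes classifier per label column.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)
model = OneVsRestClassifier(MultinomialNB())
model.fit(X, labels)

# Predict the six-label indicator vector for a new comment.
pred = model.predict(vectorizer.transform(["you are an idiot"]))
print(pred.shape)  # one row, six binary label columns
```

The one-vs-rest wrapper keeps each label decision independent, which matches the multi-label setup in the abstract where a single comment may be, for example, both toxic and insulting at once.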
dc.language.iso en en_US
dc.publisher Uva Wellassa University of Sri Lanka en_US
dc.subject Computer Science en_US
dc.subject Information Science en_US
dc.subject Computing and Information Science en_US
dc.title Toxic Comment Classification Using Machine Learning en_US
dc.title.alternative International Research Conference 2019 en_US
dc.type Other en_US

