Detection of Truthful, Semi-Truthful, False and Other News with Arbitrary Topics Using BERT-Based Models

Easy and uncontrolled access to the Internet encourages the wide propagation of false information, which circulates freely online. Researchers usually address the problem of fake news detection (FND) within a known topic and as binary classification. In this paper we study the ability of BERT-based models to detect fake news in a news flow with unknown topics and four categories: true, semi-true, false and other. The object of consideration is the dataset proposed by CheckThat! Lab for the CLEF-2022 conference. The subjects of consideration are the models SBERT, RoBERTa, and mBERT. To improve the quality of classification we use two methods: the addition of a known dataset (LIAR), and the combination of several classes (true + semi-true, false + semi-true). The results outperform the existing achievements, although the state of the art in the FND area is still far from practical applications.


I. INTRODUCTION AND PROBLEM SETTINGS
The easy access and up-to-date content of news portals and social media have made them increasingly popular, attracting a huge number of users. Unfortunately, as traffic grows, the amount of fake news grows too. This makes fake news detection one of the most important tasks in the development of modern human communications.
Fake news covers a wide range of themes, such as politics, Covid-19, and ecology, and the fake news of each theme has its own peculiar properties. Nowadays the fake news detection problem is usually posed as binary classification: news on a specific topic is labeled as either fake or real. Modern models, especially those involving transformers, can achieve very good results in this setting.
However, in real life fake news is usually not tied to a single specific topic; for example, a single article may contain fake news about both politics and Covid-19. Moreover, real cases of fake news often contain both fake and true information. A possible explanation is that news which looks more realistic (and contains a piece of truth) is easier to believe, and the creators of fake news may use this trick, intentionally or not.
For this reason, we concentrate on fake news detection in social media, where we have to deal with fake news on nonspecific topics, and we attempt multi-class classification of fake news. These two conditions, the absence of a specific topic and the presence of multiple classes, bring us as close as possible to the situation we have in the real world.
We have two main goals for our work:

• We will study the possibilities of multi-class fake news detection, where we distinguish false, true, partially false, and other classes of news. We will use state-of-the-art transformers for this purpose;
• We will propose novel classifiers based on the combination of classes to improve the quality of fake news detection.

II. RELATED WORK
The problem of fake news detection is broad and includes many important questions and approaches of interest to researchers. In this Section, we discuss the existing techniques for fake news detection and the works of other researchers that report high results in this area. Our work is devoted to multi-class fake news detection, but to understand how best to implement multi-class classification, we first need to understand how fake news classification works in general. To facilitate comparison and analysis, we have divided the approaches to fake news detection into three main groups: Classical Machine Learning, Neural Networks, and other approaches. In addition, we review the works devoted to the multi-class classification of fake news.

A. Classical Machine Learning Approach for Detection of Fake News
The most commonly used algorithms for fake news detection are classical machine learning algorithms. They demonstrate good results in binary classification.
The authors of [1] analyzed fake news connected with the COVID pandemic. They collected a large dataset using materials of 150 users from different social media including Twitter, email, mobile, Whatsapp and Facebook over 4 months, from March 2020 to June 2020. A traditional K-Nearest Neighbor classifier achieved F1-scores of 0.79 and 0.91 in March and June respectively.
In [2,3] the researchers aimed to detect fake news connected with Covid-19 on a small dataset of 1000 fake and real news items. The researchers compared Logistic Regression, Support Vector Machine, Gradient Boosting and Random Forest. The winners proved to be Support Vector Machine and Random Forest, with a micro-F1 score of 69%. Although this result is worse than the previous one, it may be useful for those who deal with limited datasets.
In [4] the authors compared the same algorithms on a large dataset connected with Covid-19 and taken from Facebook, Twitter and other social media platforms. The best result, an F1-score of 0.93, was achieved with the SVM model.

B. Neural Networks for Detection of Fake News
Researchers use many language models, the most important of which is BERT [5], which stands for Bidirectional Encoder Representations from Transformers. Such models have demonstrated high results in various natural language processing tasks, including text classification [6,7,8].
The DistilBERT model is a modification of BERT with reduced size that runs about 60% faster [9]. CT-BERT (Covid-Twitter-BERT) is a trained variant of DistilBERT, which showed very good results in fake news detection on Twitter messages [10].
RoBERTa is a robust variant of BERT, which needs larger datasets and a longer training time [11]. In [12], researchers used the RoBERTa-base implementation to obtain contextual word embeddings, computing cosine similarity over averaged token vectors. They tuned this model using a dataset of tweets connected with COVID-19. As a result, the domain-adapted BERTScore achieved the best results among the similarity models.
Hierarchical Attention Networks (HAN) are based on the LSTM (Long Short-Term Memory) architecture. A HAN includes four sequential levels: word encoder, word-level attention, sentence encoder and sentence-level attention [13]. It was successfully used for political fact-checking [14].
The authors of [15] used a well-known CNN (Convolutional Neural Network) to detect fake news about Covid-19 in the LIAR dataset. With binary classification they obtained an accuracy of 0.46.
In [16] the authors created an ensemble of the language models XLNet, RoBERTa, XLM-RoBERTa, DeBERTa, ERNIE 2.0, and ELECTRA for the task of fake news detection. Their news items were connected with several given topics. Furthermore, the authors implemented Heuristic Post-Processing, which takes soft-voting prediction vectors into account. Thanks to the careful preprocessing step, the ensemble, and soft voting over prediction vectors instead of hard voting, the authors achieved an F1-score of 0.98.
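The difference between hard and soft voting can be illustrated with a minimal sketch. The probability vectors are made-up numbers and the function names are hypothetical; this is not the implementation used in [16], only an illustration of the two voting rules.

```python
from collections import Counter

def hard_vote(prob_vectors):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_vectors]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(prob_vectors):
    """Average class probabilities over the models, then take the argmax."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Made-up outputs of three models for one message; class 0 = real, class 1 = fake.
probs = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.05, 0.95],
]
print(hard_vote(probs))  # two of the three models prefer class 0
print(soft_vote(probs))  # the confident third model shifts the average to class 1
```

In this toy case the two rules disagree, which shows why soft voting can recover information that hard voting discards.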
To address the problem of fake news detection, the authors of [17] combined topical distributions from Latent Dirichlet Allocation (LDA) with contextualized representations from XLNet. Their dataset included real and fake information about Covid-19. For the implementation, the authors used the Transformers library maintained by the researchers and engineers at Hugging Face [18], which provides a PyTorch interface for XLNet. This complex model allowed the researchers to achieve an F1-score of 0.97 on the test dataset.
The problem of fake news detection was the main focus of the competition named Constraint@AAAI2021 - COVID19 Fake News Detection. The challenge contained the same task for two languages: English and Hindi. The participants needed to create a system for binary classification of fake news and real news. The English dataset contained 10700 messages, of which 5100 were real and 5600 were false. Real news was collected from reliable sources, namely WHO (World Health Organization) and CDC (Centers for Disease Control and Prevention). Fake news was collected from social media such as Facebook posts, Twitter tweets, Instagram posts, etc. The dataset contains 37,503 unique words. 166 teams took part in the challenge with the English dataset, and 114 of them exceeded the baseline of 93% F1-score. The best results proved to be very close to each other and all were higher than 98% F1-score.
The winner of the challenge, the g2tmn team [19], achieved a 98.69% F1-score using an ensemble of three pretrained CT-BERT models. The second result, a 98.65% F1-score, was achieved by the saradhix team [20]. In their research, the authors used several classical machine learning methods, such as Naive Bayes, and also several Transformer models. The third result, a 98.60% F1-score, was obtained by the xiangyangli team [21]. In this case, the authors also used Text Transformers for their research. Additionally, they used a Pseudo Label Algorithm for data augmentation. To summarize the best approaches from the Constraint@2021 Fake News Detection open shared task, we may note that:

• The most successful models were created as ensembles of Text Transformers,
• The most important step was to fine-tune such Transformers,
• The preprocessing steps, which are usually very important for classical ML models, played no significant role here.

C. Other features for Detection of Fake News
Both classical machine learning models and neural networks play an important role in solving the problem of fake news detection, but we may point out many other features and approaches that are popular in natural language processing and could be helpful for fake news detection. We mean here first of all the task of text classification. For example, we can mention algorithms using n-grams of words or characters [22]; GloVe [23], an unsupervised learning algorithm for obtaining vector representations of words; Fast-text [24], a classifier that is often on par with deep learning classifiers in terms of accuracy and many orders of magnitude faster in training and evaluation; Label Smoothing [25], a regularization technique that adds noise to the labels; adversarial training [26], a regularization method that improves the robustness of classifiers to small, approximately worst-case perturbations; and tax2vec [27], a semantic space vectorization algorithm.
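As an illustration of label smoothing, the following minimal sketch applies the common uniform variant, y' = (1 - eps) * y + eps / K, to a one-hot label over K classes. The value eps = 0.1 and the function name are assumptions made for illustration; [25] may use a different formulation.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: move a fraction epsilon of the mass to all K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

# One-hot target for the second of four classes.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
print(smoothed)       # the target class keeps most of the mass, the rest is spread evenly
print(sum(smoothed))  # the smoothed vector is still a probability distribution
```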
There are some interesting works describing the impact of fake news about Covid-19, for example [28], which studies the social impact of fake news in the context of health information in social media. The authors analyzed materials from three social media platforms: Twitter, Facebook and Reddit. Their research questions concentrated on the ways in which social media messages focused on fake health information, and on how interactions based on health evidence with social impact helped overcome fake health information.
As a result, the authors found that the share of misinformation and fake news is higher on Twitter (19%) than on Facebook (4%) or Reddit (7%). They also identified emotional characteristics of the messages: messages focused on false health information are mostly aggressive, while messages based on evidence of social impact prove to be more peaceful and respectful.
The authors of [29,30] studied emotional reactions and their role in fake news detection on the Twitter platform and the Politifact dataset (available at https://www.mpi-inf.mpg.de/dl-credanalysis/). Fake news can reflect completely different topics.
The authors of [31] determined how the contribution of different topics to the publications of well-known media changes over time. Here the multi-class definition of topics can be considered as a consequence of reliable and unreliable (real and fake) information in publications.
The approach taken in [32] could help create a dataset for fake news detection by selecting articles using key phrases; its advantage is that no a priori information is needed.

D. Multi-class Classification of Fake News
As we mentioned previously, real-world data may contain not only fake information and true information, but also both fake and true information in one news item. In this case, we speak about multi-class classification: there are two hypothetical poles, completely fake news and completely real news, and all other news lies between these two poles, containing true and fake information in different proportions.
Here we first mention the paper [33]. The authors classified political materials from the LIAR dataset into 5 classes: false, barely-true, half-true, mostly-true, true. The training set and test set contain 10200 and 1200 texts respectively. With LSTM the authors obtained an accuracy of 0.42. The authors of [34] presented fake news classification into 6 classes: pants-fire, false, barely-true, half-true, mostly-true, and true messages. For the experiments, they used SVM, Logistic Regression, Bi-LSTM, CNN, and Hybrid CNNs, and they obtained the highest accuracy of 0.27 using Hybrid CNNs.
Multi-class fake news detection was one of the central topics of the conference CLEF-2022. The competition was organized within the challenge Shared Task 3 - CheckThat! Lab [35]. The aim of the task was to classify news articles in English and German into 4 classes: true, partially true, false, or other, where partially true articles were a mixture of true and false information, and the 'other' class contained articles which cannot be categorized as true, false, or partially false due to a lack of evidence about their claims. The proposed dataset included several given topics.
The winners of the challenge on English data achieved a macro-averaged F1-score of 0.34 using a BERT-base-uncased model. The authors also conducted experiments with RoBERTa, but the BERT model obtained higher results [36]. The second-best result on the English dataset was a macro-averaged F1-score of 0.33, where researchers created an ensemble of a Funnel Transformer and a Feed Forward Neural Network [37]. For the English-German cross-lingual task, the highest result was a macro-averaged F1-score of 0.29, where researchers used a BERT-large model [38]. The second-best result was a macro-averaged F1-score of 0.23. Researchers conducted experiments with different transformers and the best one proved to be the mDeBERTa model [39].
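Since the macro-averaged F1-score is the main metric of the challenge, a minimal pure-Python sketch of its computation may be helpful. The label lists below are toy data, not results from any cited system. The example shows that a classifier which always predicts the majority class can have reasonable accuracy and still a low macro F1-score, because every class contributes equally to the average.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

LABELS = ["true", "partially false", "false", "other"]
# Toy data: the classifier predicts the majority class "false" for every message.
y_true = ["false", "false", "false", "false", "true", "partially false", "other"]
y_pred = ["false"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))
print(round(macro_f1(y_true, y_pred, LABELS), 3))
```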
Multi-class classification of fake news, that is, classification into more than two classes (fake and real), is not well researched and, as we mentioned above, even complex and well-trained models show quite low results on this task in comparison with binary classification of fake news.

III. DATASETS
In this Section, we describe the datasets we use for the experiments on the multiclass fake news classification.

A. Basic dataset for CheckThat-2022
As the main dataset for the experiments, we use the CheckThat-2022 Task 3 dataset [35].We described this challenge in the previous Section and mentioned the best results that were achieved on the dataset.For the work, we use only the English part of the dataset.
The data for the dataset was collected from 2010 to 2022. The dataset has 4 labels: true, partially true, false, and other. The 'false' group contains articles whose main claim is untrue, the 'true' group contains articles in which the primary elements of the main claim are demonstrably true, and the 'partially true' group contains articles whose information cannot be accepted as completely true. The 'other' group contains articles that cannot be categorized as either 'true', 'false', or 'partially false' due to a lack of evidence about their claims. This category includes articles in dispute and unproven articles. Table I presents the contents of the dataset. The training dataset includes 900 messages, the development dataset consists of 364 messages, and the test dataset includes 612 messages.

B. LIAR dataset
On the assumption that increasing the number of training messages could improve the classification results, we add the LIAR [34] dataset to the training dataset of CheckThat-2022.
The LIAR dataset includes more than 12,000 short messages collected from POLITIFACT.COM between 2007 and 2016, and all messages are labeled with one of six labels: pants-fire, false, barely-true, half-true, mostly-true, and true. As our goal was to implement a three-class classification of fake and non-fake messages (three classes because LIAR has no 'other' class), we relabeled the messages: pants-fire and false messages as fake messages, barely-true, half-true and mostly-true messages as partially false messages, and true messages as real messages.
The statistics of the LIAR dataset are presented in Table II. After relabeling, we have 2,052 true messages, 3,553 false messages, and 7,183 partially false messages. Fig. 1 illustrates the percentage ratio of different messages in the CheckThat-2022 and the LIAR datasets.
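The relabeling described above can be written as a simple lookup table. The sketch below follows the mapping given in the text; the sample labels and the function name are made up for illustration.

```python
from collections import Counter

# Mapping from the six LIAR labels to the three classes used in this work.
LIAR_TO_THREE = {
    "pants-fire": "false",
    "false": "false",
    "barely-true": "partially false",
    "half-true": "partially false",
    "mostly-true": "partially false",
    "true": "true",
}

def relabel(liar_labels):
    """Convert a sequence of original LIAR labels to the three-class scheme."""
    return [LIAR_TO_THREE[label] for label in liar_labels]

# Made-up sample containing each original label once.
sample = ["pants-fire", "half-true", "true", "mostly-true", "false", "barely-true"]
print(Counter(relabel(sample)))
```

Because three of the six original labels map to the partially false class, that class becomes the largest after relabeling, in line with the counts reported above.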

A. BERT-based models
We conducted the experiments on the CheckThat-2022 dataset using modern transformer models, including:
• The mBERT [5] model, a transformer with 12 layers, a hidden size of 768, 12 attention heads and 110M parameters, which was pre-trained on the top 104 languages in Wikipedia using a masked language modeling (MLM) objective. As we showed in our previous work, the BERT model provides high results on the fake news detection task in social media, and mBERT could be used not only for English, but also for multilingual experiments. In this work, we apply it to the English dataset only, with the idea of applying it to multilingual corpora in the future. For the experiments, we used a batch size of 8, 512-token input, and 0.5 dropout.
• XLM-RoBERTa [40] is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on filtered CommonCrawl data in 100 languages. This model is also very promising for our case and could be used for multilingual classification as well. We used a batch size of 8, 128-token input, and 0.5 dropout.
• SBERT [41] is a modification of the pretrained BERT that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Because of this, SBERT is faster than BERT/RoBERTa, but still provides results at the BERT level. SBERT also supports more than 100 languages and could be used for multilingual experiments. We used a batch size of 8, 128-token input, and 0.3 dropout.
For the experiments with all models, we used a learning rate of 1e-7. Using these models, we conducted the experiments on the datasets described earlier: the English part of the CheckThat-2022 dataset, and the combination of the English part of the CheckThat-2022 dataset and the LIAR dataset (where we used the LIAR dataset for training only, and the test dataset included only the CheckThat-2022 test set).
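The hyperparameters listed above can be collected in one place, for example as follows. This is only a bookkeeping sketch: the class and field names are hypothetical, and no specific pretrained checkpoints are implied, since the text does not name them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters of one fine-tuning run, as reported in the text."""
    model: str
    batch_size: int
    max_tokens: int
    dropout: float
    learning_rate: float = 1e-7  # shared by all three models

CONFIGS = [
    RunConfig(model="mBERT", batch_size=8, max_tokens=512, dropout=0.5),
    RunConfig(model="XLM-RoBERTa", batch_size=8, max_tokens=128, dropout=0.5),
    RunConfig(model="SBERT", batch_size=8, max_tokens=128, dropout=0.3),
]

for cfg in CONFIGS:
    print(cfg)
```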

B. Experiments with basic datasets
We conducted experiments with all three transformers (mBERT, SBERT, and XLM-RoBERTa) on the CheckThat-2022 dataset, and the results are presented in Table III. As the highest results were obtained using the mBERT model, we rely on this model as the most promising one, and it is important to examine its results in detail. The full results of classification using mBERT are presented in Table IV. As shown in Table IV, the per-class results are very heterogeneous: the classifier predicts false messages with high scores in comparison with the other classes and with the overall results. Our further experiments aim to improve these classification results.

C. Experiments with extended dataset
As we mentioned previously, the idea of expanding the CheckThat-2022 dataset to improve the results looks promising, so we added the LIAR dataset to the CheckThat-2022 training set. The results of experiments on the expanded dataset using mBERT, XLM-RoBERTa, and SBERT are presented in Table V. Since the results on the expanded dataset are higher than the results on the CheckThat-2022 dataset alone, we conclude that our assumption that more training data improves the classification results holds for this task of multi-class fake news classification.
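The way the extended dataset is assembled can be sketched as follows: relabeled LIAR messages are appended to the training split only, while the test split remains CheckThat-2022 only. The records and the function name are toy placeholders, not the actual data or code used in the experiments.

```python
def build_splits(checkthat_train, checkthat_test, liar_relabeled):
    """Append relabeled LIAR examples to the training split only."""
    train = list(checkthat_train) + list(liar_relabeled)
    test = list(checkthat_test)  # evaluation stays on CheckThat-2022 only
    return train, test

# Toy (text, label, source) records standing in for the real data.
checkthat_train = [
    ("long article A", "false", "checkthat"),
    ("long article B", "other", "checkthat"),
]
checkthat_test = [("long article C", "true", "checkthat")]
liar_relabeled = [
    ("short claim D", "partially false", "liar"),
    ("short claim E", "true", "liar"),
]

train, test = build_splits(checkthat_train, checkthat_test, liar_relabeled)
print(len(train), len(test))
```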

A. Combined classes
As we mentioned above, the state-of-the-art results of multi-class fake news classification on the CheckThat-2022 dataset are low (less than 0.40 macro F1-score), so the most important goal is to improve the performance on this dataset. We propose an approach to multi-class classification based on combining different classes and then finding their intersection.
In the first step of our approach, we combine the false and partially false classes into one joint false class. The idea is to collect all messages with fake news in one class, and to perform a new three-class classification into joint false, true, and other news.
In the same way, we combine the true and partially false classes into one joint true class, and perform a three-class classification into joint true, false, and other news.
In the second step of our approach, we take all the messages that were labeled as joint false and as joint true in the first step, and find the intersection of these two classes. According to our idea, the messages that fall into the intersection belong to the partially false class. A graphic illustration of our idea is presented in Fig. 2. We believe that this two-step approach could improve the quality of partially false message detection and, if successful, could be applied to the other classes too.
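The two-step procedure can be sketched in a few lines. The helper names and the predictions are hypothetical; in practice the two prediction dictionaries would come from two separately trained three-class classifiers, one for each merged label scheme.

```python
def to_joint_false(label):
    """Label scheme of the first classifier: false + partially false are merged."""
    return "joint false" if label in {"false", "partially false"} else label

def to_joint_true(label):
    """Label scheme of the second classifier: true + partially false are merged."""
    return "joint true" if label in {"true", "partially false"} else label

def partially_false_ids(pred_joint_false, pred_joint_true):
    """Step 2: messages predicted as joint false AND as joint true."""
    in_joint_false = {i for i, lab in pred_joint_false.items() if lab == "joint false"}
    in_joint_true = {i for i, lab in pred_joint_true.items() if lab == "joint true"}
    return in_joint_false & in_joint_true

# Hypothetical predictions of the two classifiers for five message ids.
pred_a = {1: "joint false", 2: "joint false", 3: "true", 4: "other", 5: "joint false"}
pred_b = {1: "joint true", 2: "false", 3: "joint true", 4: "other", 5: "joint true"}

print(sorted(partially_false_ids(pred_a, pred_b)))  # ids assigned to the partially false class
```

Messages 1 and 5 are the only ones that both classifiers place in a joint class, so they are the ones assigned to the partially false class in this toy example.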

B. Experiments with combined classes
We implemented the experiments in two steps. First, we combined the false and partially false messages into one class (joint false), and performed a 3-class classification into the true, joint false, and other classes.
We also combined the true and partially false (partially true) messages of the CheckThat-2022 dataset into one class (joint true), and performed a 3-class classification into the false, joint true, and other classes. The results of classification for the joint false and the joint true classes are presented in Table VI. In the second and final step, we selected the messages that appear in both the joint true and the joint false classes, on the assumption that the intersection of these two classes consists of partially false (partially true) messages. The results of classification for partially false messages on the intersection are presented in Table VII. The macro F1-score for the partially false class on the intersection of the joint true and the joint false classes is 0.21, while the macro F1-score for the partially false class on the full CheckThat-2022 dataset is 0.13. We therefore conclude that detecting the intersection of the combined classes can improve the detection of partially false messages.

The graphical illustration of the basic classification results using mBERT model, and the classification results using mBERT model with combined classes are presented in Fig. 3.

Fig. 3. Basic results of classification using mBERT model and results with the combined classes (Macro F1-score).

A. Results
In this paper we reviewed the most common and effective approaches to fake news detection and classification, including not only binary classification into fake and real messages, but also multi-class classification of fake news.
We conducted experiments on the CheckThat-2022 dataset with three transformers: mBERT, SBERT, and XLM-RoBERTa. The best results were obtained with mBERT, namely a macro F1-score of 0.22, while the SOTA result on the dataset is a macro F1-score of 0.34. Although our result is lower than SOTA, we should mention that, firstly, all the results on the dataset are low, and, secondly, all the results obtained on the dataset are relatively close to each other. This could be explained by the unbalanced nature of the dataset, where the biggest class (false) is six times larger than the smallest class (other).
To improve the performance of our models, we expanded the CheckThat-2022 dataset with the LIAR dataset and showed that increasing the number of training messages improves the results of multi-class classification. An important point here is that the average message in the LIAR dataset is much shorter than the average message in the CheckThat-2022 dataset (18 words vs 731 words), which suggests that short and long messages are interchangeable in the context of classification using transformers.
Also, to improve classification we implemented a new two-step procedure for partially false news detection. This approach relabels the false, true, and partially true classes at the first step in order to isolate the partially true news in the intersection of classes at the second step. It allowed us to improve the quality of partially false news detection from a macro F1-score of 0.13 to 0.21.
Although such a value is still far from practical use, it is roughly 1.6 times higher than the previous value, and it also shows one of the ways to improve all the results.
Finally, with combined classes we achieved a relatively good result for the Joint False class, a macro F1-score of 0.75, given that the topics of the news under consideration are unknown. Such a result may already be of practical interest.

B. Future work
In the future, the most interesting and the most challenging task is to improve the results of multi-class classification of fake news. As we showed in this work, one possible way to improve is to expand the dataset using external sources. As the dataset is unbalanced, another possible way to increase the performance is to balance it by collecting data for the classes with the lowest number of messages, such as the other, partially false, and true classes. We demonstrated that adding even short messages to the training dataset improves performance, and it would be interesting to study the case when long news items (of the same length as the messages in the CheckThat-2022 dataset) are added to the dataset.
The second possible way to improve performance is to follow the approach presented in this paper, in which we implement a two-step multi-class classification with the aim of isolating the messages of one specific group. By grouping relevant classes it may be possible to apply our approach to the other classes.
The idea of multi-class classification of fake news looks promising as it is close to real life, where we deal not only with purely true and purely false news, but also with a mix of them. The next step could be splitting the partially false class into more specific classes, which could capture peculiarities of fake news and bring us closer to the goal of effective fake news detection.
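Collecting new data for the small classes is the option discussed above. As a cheaper stand-in that was not used in this work, the training set can also be balanced by random oversampling, i.e. by duplicating examples of the minority classes. The sketch below is an assumption-laden illustration with toy data and a hypothetical function name.

```python
import random
from collections import Counter

def oversample(examples, seed=0):
    """Duplicate random minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

# Toy unbalanced training set: 4 false, 2 true, 1 other.
toy = [(f"msg {i}", "false") for i in range(4)]
toy += [(f"msg {i}", "true") for i in range(4, 6)]
toy += [("msg 6", "other")]

balanced = oversample(toy)
print(Counter(label for _, label in balanced))
```

Oversampling only repeats existing messages, so it does not add new information the way newly collected articles would.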

Fig. 1. Percentage ratio of different messages in the CheckThat-2022 and the LIAR datasets

Fig. 2. Search for messages belonging to the partially false class

TABLE I. STATISTICS OF THE CHECKTHAT-2022 DATASET

TABLE II. STATISTICS OF THE LIAR DATASET

TABLE III. RESULTS OF EXPERIMENTS ON THE CHECKTHAT-2022 DATASET

TABLE IV. RESULTS OF EXPERIMENTS WITH mBERT MODEL

TABLE V. RESULTS OF EXPERIMENTS ON THE CHECKTHAT-2022 + LIAR

TABLE VII. RESULTS FOR THE PARTIALLY FALSE CLASS OF CLASSIFICATION