Posted by Admin: System Admin
Fake news is a major threat to democracy (e.g., by influencing public opinion), and its impact cannot be overstated, particularly in our current socially and digitally connected society. Research communities from different disciplines (e.g., computer science, political science, information science, and linguistics) have studied the dissemination, detection, and mitigation of fake news; however, it remains challenging to detect and prevent the dissemination of fake news in practice. With AI-powered systems on social media, it is crucial to make the detector’s decisions about fake news understandable through proper, user-friendly explanations. Hence, in this paper, we systematically survey existing state-of-the-art approaches designed to detect and mitigate the dissemination of fake news and, based on this analysis, discuss several key challenges and present a potential future research agenda, in particular one incorporating an explainable AI fake news credibility system.
Machine learning is an important component of the growing field of data science. Through the use of statistical methods, different types of algorithms are trained to make classifications or predictions and to uncover key insights in this project. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics. Machine learning algorithms build a model based on this project's data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used on a wide variety of datasets where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Due to the extensive volume of literature on this topic, we focused only on SCI-indexed technical journal articles from the year 2019 (we excluded conference articles and book chapters). Here, we briefly summarize the most recent works on fake news detection in seven detection categories, as shown in this system. The taxonomy comprises four layers. On the first layer, the studies are sorted by research focus. Each color code represents one detection-based research focus; within each focus, we have divided the researchers' work by type of fake news content on layer 2, fake news features on layer 3, and dataset categories on layer 4.
Based on our study, researchers have also put much emphasis on feature identification in detecting fake news. Features play the most important role in models, specifically in fake news detection, because the real and fake classes have very similar characteristics. Figure 3 represents various aspects of the features studied in this survey. We can look at the features from the following points of view: Being imitated: Some features are difficult for malicious users to mimic (topographical features), while most of them can be imitated easily. Sensitive to time: For instance, [37] shows that word n-gram features may provide information that is less relevant at a given point in time. Required level of computational resources: Some features are readily available, while others, like the layer ratio in [38], require processing resources to calculate. Applicability: Some features cannot be applied to all situations and platforms. For example, features related to domain reputation cannot be used on social network platforms. Necessary time to emerge: As an example, the reactions of other users to a news article need some time to appear. Explainable: Features learned by deep learning methods act like a black box and are not interpretable, in contrast to hand-crafted ones.
Pre-processing steps: Unlike features such as n-grams, some characteristics, like those based on user profiles, do not need many pre-processing stages. Potential to be transferred: Pre-trained word embeddings can be used and updated in new situations. Source: Features can be extracted from different sources such as contextual content, images, videos, profiles, etc. Required size of corpus: Features gained from LIWC, NELA, and various embeddings need a large corpus. Language-independent: Features such as word embeddings, word n-grams, and character n-grams depend on the language, while profile-based features are language-independent. Domain-independent: Most content-based features and features extracted from images cannot be applied to all domains. To illustrate, vectors gained from word embeddings need a corpus that contains sufficient data for the underlying domain.
Alatas et al. [16] used a two-step method for fake news detection. The first step converts unstructured data into a structured dataset by applying pre-processors. Term-frequency (TF) weighting and a document-term matrix are applied to represent the dataset in the form of vectors. In the second step, after this preprocessing, the authors applied twenty-three supervised artificial intelligence algorithms to the structured dataset using text mining methods. Alatas et al. evaluated these methods experimentally on public datasets using four evaluation metrics.
To detect fake news, Kaliyar et al. [39] used a deep convolutional neural network (FNDNet). The proposed model is designed to automatically learn the features of fake news and differentiate them from those of real news. For this purpose, the authors used multiple hidden layers in building the deep neural network to classify fake news based on the extracted features. Each layer of the developed deep Convolutional Neural Network (CNN) extracts several features.
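As a rough illustration of the first step of the two-step method in [16] described above, the following minimal sketch builds a term-frequency document-term matrix in plain Python. It is not the authors' implementation: the function names (tokenize, document_term_matrix), the simple regex tokenizer, the choice to normalize counts by document length, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a document and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the relative
    frequency of vocabulary[j] in documents[i]."""
    tokenized = [tokenize(doc) for doc in documents]
    # One column per distinct term seen anywhere in the corpus.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        matrix.append([counts[term] / total for term in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus, for illustration only.
docs = [
    "Shocking cure found, doctors hate it",
    "Council approves budget, council vote was close",
]
vocab, dtm = document_term_matrix(docs)
```

Each row of dtm is a fixed-length vector that a supervised classifier could consume in the second step; a real pipeline would typically also remove stop words and apply stemming before counting.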
Based on experimental results, the performance of Kaliyar’s model is compared with baseline models and is found to reach an accuracy of 98.36% on the test data. The model is trained on benchmark datasets, and the results are validated using six performance evaluation metrics: accuracy, recall, F1 score, precision, true negative rate, and false positive rate.
Henrique et al. [40] proposed a text-feature-based, language-independent methodology for fake news detection. The generated text is independent of the source platform and language. The proposed methodology is tested on five datasets in three language groups and gives satisfactory results in comparison to the benchmark. The results are compared across the benchmark, a custom set of features, bag-of-words, and Word2Vec. The best results were achieved with a bag-of-words approach applied to the feature set. However, this approach generates large matrices when applied to text. Therefore, the DCD algorithm is applied in this scenario to reduce the size of the matrix without a significant loss in the performance metrics. The authors also trained the model using a Natural Language Processing representation with a customized set of features, to make it robust enough to extract features from raw news and to make the model less dependent on language.
Disadvantages
- The system does not implement synthetic training data generation for fake news detection.
- The system does not implement user-profile-based features.
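The six evaluation metrics named above for FNDNet (accuracy, recall, F1 score, precision, true negative rate, and false positive rate) can all be derived from the four counts of a binary confusion matrix. The sketch below shows one way to compute them; the function name binary_metrics, the convention that "fake" is the positive class, and the example counts are assumptions for illustration and do not come from [39].

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute six metrics from binary confusion-matrix counts.

    Assumes "fake" is the positive class:
    tp = fake predicted fake, fp = real predicted fake,
    tn = real predicted real, fn = fake predicted real.
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "true_negative_rate": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical counts, not results from any cited study.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that the true negative rate and the false positive rate always sum to 1 when at least one real article is present, so reporting both mainly serves readability.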
Neves et al. [41] introduce a novel approach (GAN-fingerprint Removal autoencoder, GANprintR) to spoof facial manipulation detection systems. This approach removes the GAN fingerprints without compromising image quality. Thus, machines, like humans, will not be able to distinguish fake images from real ones. GANprintR is trained with face images of real persons instead of synthetic face images that carry GAN fingerprints. This strategy is based on the premise that, by training with real images, GANprintR can learn the main structure of real face images, which can subsequently help to enhance existing fakes. To carry out this study, three state-of-the-art manipulation detection approaches are used: XceptionNet [42], Steganalysis [43], and Local Artifacts [44]. Three different scenarios are also designed: 1) controlled scenarios, 2) in-the-wild scenarios, and 3) GAN-fingerprint removal.
In the pre-processing step, all background information is removed from the images, while the facial regions are kept wherever possible. To obtain unbiased results, only frontal face images are kept. After this stage, images of constant size (224 × 224 pixels) are available as input for the systems. According to the results, for controlled scenarios (when the same samples are used for both development and evaluation of the detection models), XceptionNet achieves excellent manipulation detection accuracy with EER values below 0.5%, Steganalysis performs well, and Local Artifacts shows poor accuracy with an average EER of 35.5%. In contrast, in in-the-wild scenarios, the performance of all manipulation detection models drops sharply. In the third scenario (applying GANprintR), the decrease in performance is even larger, which indicates the success of GANprintR in building enhanced versions of the original fake images.
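The results above are reported as equal error rate (EER), the operating point at which a detector's false positive rate equals its false negative rate. The following minimal sketch approximates the EER by sweeping a threshold over the observed scores. It assumes that a higher score means "more likely fake"; the function name equal_error_rate and the example scores are hypothetical and are not taken from [41].

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A sample is flagged as fake when its score is >= the threshold.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap, best_eer = None, None
    for threshold in thresholds:
        # Real images wrongly flagged as fake.
        false_positive_rate = sum(s >= threshold for s in real_scores) / len(real_scores)
        # Fake images that slip through as real.
        false_negative_rate = sum(s < threshold for s in fake_scores) / len(fake_scores)
        gap = abs(false_positive_rate - false_negative_rate)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (false_positive_rate + false_negative_rate) / 2
    return best_eer


# Hypothetical detector scores, for illustration only.
real = [0.1, 0.2, 0.3, 0.6]
fake = [0.4, 0.7, 0.8, 0.9]
eer = equal_error_rate(real, fake)
```

With only a handful of scores, the two error rates may never be exactly equal, so the sketch returns their average at the closest threshold; evaluations on large datasets usually interpolate along the ROC curve instead.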
This study also analyses how different image transformations, such as resolution downsizing, low-pass filtering, and JPEG image compression, affect facial manipulation detection systems.
Wang et al. [47] presented SemSeq4FD, a graph-based neural network model for fake news detection. The model is designed to detect fake news early and is based on an enhanced text representation. It models the pair-wise semantic relations between sentences and represents them in the form of a graph. A self-attention mechanism is applied using a graph convolutional network to learn a global sentence representation. To capture the exact meaning of a sentence, a 1D CNN is employed to obtain the local context of the sentence. The enhanced sentence representation is obtained by combining these two representations. An LSTM-based network is then applied to model the sequence of enhanced sentence representations. This provides a final document representation, which is used for fake news detection.
Advantages
- The system is more effective due to ensemble-learning-based detection using ML classifiers.
- The system is more suitable because it handles deepfake manipulation in images and videos, which are converted into datasets.