Our study focuses on Twitter messages and, more specifically, on hateful, xenophobic, and racist speech in Greek aimed at refugees and migrants.
Our contribution includes a new dataset for hate speech classification, consisting of tweet IDs, together with the code to obtain their visual appearance as they would be rendered in a web browser. We have also released a pre-trained language model trained on Greek tweets, which we used in our experiments. We report consistently high accuracy in racist and xenophobic speech detection (accuracy = 0.970, F1-score = 0.947 for our best model).
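For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and F1-score are derived from confusion-matrix counts. The counts in the usage note are purely illustrative and are not taken from the study; the function names are our own.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (hateful) class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative counts only (NOT the paper's confusion matrix).
example_accuracy = accuracy(tp=8, fp=2, fn=2, tn=88)
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

With these illustrative counts, accuracy is 0.96 while F1 is 0.80, which shows why both numbers are worth reporting when the hateful class is the minority.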
Hate speech is defined by Cambridge Dictionary as “public speech that expresses hate or encourages violence towards a person or group based on something, such as race, religion, sex, or sexual orientation”.
It is not uncommon for accounts that engage in hate speech to also be prone to general toxic behaviour against LGBTQ communities or other social, and not necessarily ethnic, minorities.
We collected all tweets with the hashtag #απέλαση (deportation), along with two months of tweets containing the racist slang term “λάθρο”, which is used to refer to undocumented immigrants (from λαθραίος, illegal), as prime instances of hateful tweets.
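The seed selection described here can be sketched as a simple text filter. This is an assumption-laden illustration, not the authors' collection code: it works on tweet text that has already been downloaded, the function name is hypothetical, and the two-month date window is not modelled.

```python
import unicodedata

# Seed markers taken from the text above.
SEED_HASHTAG = "#απέλαση"
SEED_TERM = "λάθρο"


def is_seed_tweet(text: str) -> bool:
    """Return True if the tweet contains the seed hashtag or the slang term."""
    # NFC keeps accented Greek letters as single code points so that
    # substring matching behaves the same for differently encoded input.
    lowered = unicodedata.normalize("NFC", text).lower()
    return SEED_HASHTAG in lowered or SEED_TERM in lowered
```

Matching is case-insensitive but accent-sensitive, so the accented slang form is matched literally.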
We then extracted a set of 500 Twitter users from these tweets and further enriched the user base with accounts appearing in mentions and replies in the bootstrapped data.
We also included known media and public figure accounts, resulting in a set of 1263 users in total.
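The construction of the user set can be sketched as follows. The tweet dictionaries and their keys (`author`, `text`, `in_reply_to`) are hypothetical stand-ins for whatever the collection step returns, and the sketch merges seed authors, mentioned and replied-to accounts, and manually added accounts in one pass rather than reproducing the exact procedure used in the study.

```python
import re

# Twitter handles are at most 15 word characters after the "@".
MENTION_RE = re.compile(r"@(\w{1,15})")


def collect_users(tweets, extra_accounts=()):
    """Gather authors, mentioned accounts, reply targets and extra accounts."""
    users = set()
    for tweet in tweets:
        users.add(tweet["author"].lower())
        users.update(name.lower() for name in MENTION_RE.findall(tweet["text"]))
        reply_to = tweet.get("in_reply_to")
        if reply_to:
            users.add(reply_to.lower())
    # Manually curated accounts, e.g. media outlets and public figures.
    users.update(name.lower() for name in extra_accounts)
    return users


sample_tweets = [
    {"author": "UserA", "text": "@UserB δείτε αυτό", "in_reply_to": "UserC"},
    {"author": "userb", "text": "χωρίς αναφορές"},
]
sample_users = collect_users(sample_tweets, extra_accounts=["NewsDesk"])
```

Handles are lower-cased before being added, so the same account reached through different routes is counted once.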
For the annotation task, we used the doccano tool.
The dataset of the tweet IDs we used for this work can be found at https://github.com/kperi/MultimodalHateSpeechDetection (accessed on 12 March 2021).
combine text and image modalities to detect hate speech
It is common for users engaging in hate speech to use visual elements to denote their ideology. This is also common in the Greek context, in which users tend to include the Greek flag in both their usernames and their background images.
using the Greek version of BERT
bert-base-greek-uncased-v1 (12-layer, 768-hidden, 12-heads, 110M parameters).
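The quoted figure of roughly 110M parameters can be sanity-checked from the architecture numbers. The sketch below counts the weights of a standard BERT-base encoder; the feed-forward size (3072), the 512 positions, the two segment types and the 30,522-token vocabulary are the defaults of the original English BERT-base and are assumptions here, not values stated for the Greek model, whose vocabulary size may differ and shift the total. The number of attention heads does not affect the count.

```python
def bert_param_count(vocab_size: int, hidden: int = 768, layers: int = 12,
                     intermediate: int = 3072, max_positions: int = 512,
                     type_vocab: int = 2) -> int:
    """Count weights and biases of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# With the original English BERT-base vocabulary (an assumption, see above).
base_params = bert_param_count(vocab_size=30522)
```

Under these assumptions the total is 109,482,240, which rounds to the 110M quoted above.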
The tweets were lower-cased and their accents were stripped before they were fed to the classifier.
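A minimal sketch of this preprocessing step, using only the Python standard library, is shown below. It is one plausible way to lower-case and strip accents, not necessarily the exact routine used in the study; the function name is our own.

```python
import unicodedata


def normalize_greek(text: str) -> str:
    """Lower-case the text and remove accent marks."""
    lowered = text.lower()
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


normalized = normalize_greek("Απέλαση")
```

Note that this approach also removes the diaeresis, since it too is a combining mark.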