Data Science for Humanity: One of the First-Ever Machine Learning Models to Aid in the Russia-Ukraine War Crisis
towardsai.net
Author(s): MSVPJ Sathvik
Originally published on Towards AI.

We have all watched data science evolve from a tool used mostly for analysis in technical domains into an excellent tool for addressing social and global issues. It has opened new perspectives on major global crises such as poverty and overpopulation.

Lately, data science applications have been brought to bear on one of the most concerning kinds of crisis, the humanitarian crisis, to address the urgent needs of thousands of vulnerable people.

The news headlines speak for themselves about the disastrous Russia-Ukraine conflict, which has affected millions of Ukrainian citizens seeking help. This moved us as researchers and motivated us to build something that would help society and contribute to Ukrainian resilience.

We have crafted the first-ever dataset for detecting help-seeking signals in Ukrainian social media posts amid the ongoing Russia-Ukraine conflict. We used machine learning models and natural language processing (NLP) to identify distress signals. We believe this application of data science to humanitarian aid can bring us closer to society and make a real difference.

The backstory: What motivated us to work on this dataset?

War forces many people to leave their comfortable, well-established lives and enter extreme situations, displacing them from their homes and burdening them psychologically. Traditional aid methods, although effective, often respond slowly in these scenarios, and their resources are often stretched.
We realized that comparatively little research has applied data science and machine learning to mitigating the adverse consequences of war, which pushed us to design this dataset.

Our dataset aims to address these challenges with a technological solution that identifies genuine calls for help on social media platforms, which we treat as data streams.

We categorized the posts into two classes: those seeking help and those that are not. The Ukrainian Resilience dataset is designed for training machine learning models that filter and analyze posts and help deliver support to those in dire need as quickly as possible.

This binary classification filter is a powerful tool for guiding resources to affected communities, creating real-world impact where it is most needed.

Data collection: How did we collect the data and perform annotation?

We collected the data for the Ukrainian Resilience dataset in part from social media platforms like Twitter and Reddit, which are popular and widely used in Ukraine for sharing updates and seeking help.

With the help of Ukrainian language experts, we collected many keywords directly related to the conflict, like "Russia-Ukraine war," "help," and other similar Ukrainian terms, from the posts. The dataset contains 11,677 posts, with 5,782 labeled as non-help-seeking and 5,895 labeled as help-seeking.
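
As a quick check on the counts above, the following minimal Python sketch computes each class's share of the dataset. Only the two counts come from the text; the label strings and the `class_shares` helper are illustrative names, not taken from the paper.

```python
def class_shares(counts):
    """Return each label's fraction of the total number of posts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts reported in the article; the label strings are assumed names.
counts = {"non_help_seeking": 5782, "help_seeking": 5895}
shares = class_shares(counts)

for label, share in shares.items():
    print(f"{label}: {counts[label]} posts ({share:.1%})")
```

The two classes come out at roughly 49.5% and 50.5%, within about one percentage point of an even split.
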
This distribution keeps the dataset balanced for training and improves a model's ability to detect subtleties and differentiate ordinary posts from those that express urgent needs.

Ukrainian language experts, trained to identify linguistic cues that indicate stress or requests for aid, performed the annotation.

We validated these annotations using inter-annotator agreement, which yielded a strong consistency score (Krippendorff's alpha of 0.827), supporting the reliability of the dataset.

Our team's careful selection and annotation of data points has given the dataset a strong foundation for NLP applications that detect distress signals, making further processing more straightforward and accurate.

Performance: How did we train the models, and what were the results?

Once we had the dataset in hand, we put it to the test with various language models. We used XLM-RoBERTa, mBERT, Llama 2, and GPT-3.5 to evaluate performance. We tested the models on subsets of the data using standard metrics like precision, recall, and overall accuracy. Among them, GPT-3.5 achieved the highest accuracy at 81.15%, showing the potential of advanced language models to detect help-seeking posts effectively.
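
To make the metrics named above concrete, here is a minimal Python sketch that computes accuracy, precision, and recall for the binary help-seeking task from gold and predicted labels. The `binary_metrics` function and the toy labels are hypothetical; they are not the paper's evaluation code or data, and the label 1 is assumed to mean help-seeking.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells for the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels for illustration only (1 = help-seeking, 0 = not).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 1, 1, 0]
metrics = binary_metrics(gold, pred)
print(metrics)
```

In this setting, a false positive is an ordinary post flagged as help-seeking, which lowers precision, and a false negative is a missed request for help, which lowers recall.
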
The model's high accuracy indicates its proficiency in capturing nuanced language cues, which help distinguish a general update from a genuine cry for help.

We split the data into training and test sets at a ratio of 80:20, keeping both subsets representative.

We used five epochs across the models, and this consistent approach ensured a fair comparison among the models while each was meticulously tested.

[Figure: Test results: Detection of help-seeking posts]

Speedbumps: Challenges we faced and the importance of error analysis

Despite the strong accuracy of these advanced models, we still ran into challenges, and several limitations persisted. Several false positives occurred on posts that contained emotional language but showed no signs of immediate distress. These misclassifications show that the models can misinterpret strong language in ordinary updates as urgent help-seeking.

Conversely, urgent posts were sometimes classified as non-help-seeking; these false negatives tended to arise in posts with local dialects or indirect language.

We see a clear need for improved cultural and linguistic sensitivity within the model, especially for local dialects and idiomatic expressions, which vary widely across Ukraine.

Moreover, we noted that sarcastic or ambiguous language can confuse the model. Error analysis remains essential for refining the model's capability, since it reveals where additional training data or model adjustments may be necessary.

These findings are noteworthy because they point the way to improvement: enhancing model accuracy, minimizing errors in detecting distress signals, and creating a better impact.

Dive into the real world: Applications and benefits of this dataset

The Ukrainian Resilience dataset is a further step in applying data science to humanitarian crises.
We believe it will help identify help-seeking posts, allowing NGOs, governments, and other agencies to respond quickly with support for those in need, whether shelter, food, or medical aid.

The applications can also be extended to localized use cases, such as enabling volunteers to assist targeted regions or identifying specific groups in need.

Future adaptations could also add multilingual translation, extending the approach to other conflicts and regions and making it a universal tool for detecting distress signals in different crises, broadening the impact of data science on global efforts.

Ethical considerations: Protection and prevention of misuse

The ethical aspects of this work are delicate because of its urgency and sensitivity. We have handled the dataset responsibly from the start, ensuring that it is shared only with NLP researchers and the human-welfare-focused organizations we have interacted with.

We have handled the dataset with integrity and put restrictions in place to prevent misuse.

Another primary concern was that the dataset is text-based, though some distress signals are conveyed through images and videos. We believe that incorporating these elements would expand the scope of the dataset.

Future directions: The scope of our dataset

Though the Ukrainian Resilience dataset has reached significant milestones, it does have limitations, some of which we have already touched on.
The major one is that its focus on Ukrainian-language text limits its applicability: it cannot be applied to posts in other languages or regions.

Moreover, because the dataset is text-only, it lacks image and video data, which may also convey distress signals and could help detect them more accurately.

Conclusion

The Ukrainian Resilience dataset, together with the machine learning models trained on it, illustrates the power of data science in detecting help-seeking signals.

This dataset lays the foundation for timely, data-guided support for vulnerable populations in war and conflict zones. As data science grows, its impact on solving real-world crises will become more significant, offering potent means to deliver aid and transform lives.

The Ukrainian Resilience dataset is a fantastic example of how technology can drive immediate change even in the most complicated and sensitive environments.

Citations: MSVPJ Sathvik, A. Dowpati, and S. Sethi, "Ukrainian Resilience: A Dataset for Detection of Help-Seeking Signals Amidst the Chaos of War," IIIT Dharwad, Cognicore AI, University of Delhi, Zero True, Raickers AI, 2024.

Image source: Images generated by OpenAI's DALL-E model; also adapted from Sathvik, Dowpati, and Sethi (2024), cited above.

We welcome you to explore our work through the link below: https://aclanthology.org/2024.findings-emnlp.16.pdf