Exploration of COVID-19 News Leveraging GDELT
We have all been told that technology has brought the globe into our palm. If so, do we know what is happening inside our palm? A collection of everything happening in every nook and corner of the world would be far too huge for a human to process. But the 21st century is data-driven, and huge datasets are computationally simplified to derive insights. Exciting, isn’t it?
This is exactly what The GDELT project is all about.
What is the GDELT project?
Kalev H. Leetaru, the founder of GDELT, describes it as essentially creating a global catalogue of human society. The idea of the GDELT Project is to create a dashboard of human society. GDELT stands for the Global Database of Events, Language, and Tone, one of the largest real-time networks that graphs or maps human society.
Advantages of GDELT:
There are similar initiatives comparable to GDELT in size, like the 1000 Genomes Project, NASA NEX, and the Freebase dataset acquired for the Google Knowledge Graph, but the GDELT project stands out in certain aspects:
- Being completely Open Source.
- Access to Real-Time Analysis.
- Wide Coverage, spanning across the globe.
- Supports Multilinguality (over 100 languages).
- Local Content (Mass Translation) especially in the non-western world.
Disadvantages of GDELT:
Initially, the GDELT project faced some issues that hindered it from being used extensively, mainly around the mode of accessing the data:
- Huge datasets.
- Difficult to handle (Can’t use traditional means of spreadsheets and dataframes)
- Technical Knowledge Prerequisite.
Accessing the GDELT Data:
No one is perfect at the first go, be it an ML model or a human being; we need to iterate on our issues and improve our solutions. That’s what the GDELT project did by offering two additional modes of accessing the data:
1. GDELT Analysis Service:
This tool is a user-friendly, cloud-based web service and requires no technical knowledge as a prerequisite. It helps to explore, visualize, process, and gain insights from the GDELT Event Database and the GDELT Global Knowledge Graph.
2. Google BigQuery:
Google’s BigQuery database was custom-designed for datasets like GDELT, enabling near-real-time ad-hoc querying over the entire dataset. It is most promising as it can take out the necessary dataset just with a few lines of SQL code.
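As a rough sketch of what “a few lines of SQL” looks like, the snippet below builds a query against GDELT’s public BigQuery events table (`gdelt-bq.gdeltv2.events` is the public dataset name; the exact columns and country-code filter here should be verified against the current GDELT schema, and running it needs the `google-cloud-bigquery` client with valid credentials):

```python
# Sketch: pulling a slice of the GDELT events table from BigQuery.
# Table and column names follow GDELT's public BigQuery dataset;
# verify them against the current schema before use.

def build_covid_query(start_date: int, end_date: int, limit: int = 100) -> str:
    """Return SQL selecting India-located events in a YYYYMMDD date range."""
    return f"""
        SELECT SQLDATE, Actor1Name, AvgTone, SOURCEURL
        FROM `gdelt-bq.gdeltv2.events`
        WHERE SQLDATE BETWEEN {start_date} AND {end_date}
          AND ActionGeo_CountryCode = 'IN'
        LIMIT {limit}
    """

def run_query(sql: str):
    """Execute the query (requires google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return list(client.query(sql).result())

sql = build_covid_query(20200301, 20200331)
print(sql)
```

Keeping the query construction separate from execution makes it easy to inspect the SQL before spending BigQuery quota on it.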
GDELT + BigQuery = Query the Planet
Now, thanks to these upgrades, using the GDELT project is very simple. Researchers download datasets as CSV files and use analysis platforms suited to their use cases and goals.
Our Research Goal:
DAV wanted to study the impact of the coronavirus as a global pandemic using news articles from sources spanning across India. The goal is to make a comparative study of the pandemic’s impact amongst different states in India through text analytics using Natural Language Processing (NLP).
In this research, we analyse various datasets provided by the GDELT and use the one which aligns best with our needs.
There are various specific datasets available on the GDELT blog, each a subset of the GDELT Events database tailored to particular research needs. The available datasets on COVID-19 can be found at the link below: https://blog.gdeltproject.org/?s=Covid-19+Online+News+Narrative
In this case study, we used the “New Dataset For Exploring The Global Multilingual Covid-19 Online News Narrative”.
Some entities in the dataset are shown in the image below.
Some basic information about the dataset is:
- Size: 13.46 Million+ News Articles
- Content: Timestamp, Source and News.
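A minimal sketch of loading such an export with pandas is below. The inline sample rows, file layout, and exact column labels are illustrative assumptions; the real download is millions of rows, so chunked reading would be needed in practice:

```python
import io

import pandas as pd

# Toy stand-in for a downloaded export with Timestamp, Source, and
# News columns; the real dataset is far larger (13.46M+ articles).
sample = io.StringIO(
    "Timestamp,Source,News\n"
    "2020-04-01T06:00:00Z,example-news.in,Lockdown extended in two states\n"
    "2020-04-01T07:30:00Z,sample-daily.in,Hospitals add isolation wards\n"
)

df = pd.read_csv(sample, parse_dates=["Timestamp"])
print(df.shape)            # rows x columns of the toy sample
print(df["Source"].nunique())  # distinct news sources
```

For the full-size file, `pd.read_csv(..., chunksize=...)` processes the export in manageable pieces instead of loading it all at once.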
We searched for the keywords “Virus” and “COVID-19”. Searching for “Virus” increases recall but trades off precision, which we consider a drawback in our search.
The main pros are access to machine translation across about 65 languages, with a daily-updated live coverage dataset. The link for the updated multilingual dataset is below: Global Multilingual Covid-19 Online News Narrative Dataset Now Updating
Why the GDELT dataset in our case study?
A common misconception amongst data science practitioners in academia is that a project starts from the dataset at hand. In industry, however, data science projects revolve around the problem to be solved, and its crux can’t be compromised for lack of existing data.
So when we decided on the problem of analysing the impact of COVID-19 across India, we started writing a Python script to gather all news and shortlist the news articles related to COVID-19 by matching frequently used keywords like ‘COVID-19’, ‘CORONAVIRUS’, ‘CORONA’, ‘QUARANTINE’, and ‘LOCKDOWN’, and extended it to work state-wise.
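The keyword-matching and state-wise grouping step can be sketched as follows (the headlines and state names are illustrative; the keyword list is the one mentioned above):

```python
# Sketch of the keyword-filtering step: keep only pandemic-related
# headlines and bucket them by state.

KEYWORDS = {"covid-19", "coronavirus", "corona", "quarantine", "lockdown"}

def is_covid_news(headline: str) -> bool:
    """True if the headline contains any pandemic keyword."""
    text = headline.lower()
    return any(keyword in text for keyword in KEYWORDS)

def group_by_state(items):
    """items: iterable of (state, headline) -> {state: [matching headlines]}."""
    grouped = {}
    for state, headline in items:
        if is_covid_news(headline):
            grouped.setdefault(state, []).append(headline)
    return grouped

sample = [
    ("Kerala", "Quarantine centres expanded in Kochi"),
    ("Maharashtra", "Monsoon arrives early this year"),
    ("Maharashtra", "Lockdown rules tightened in Mumbai"),
]
print(group_by_state(sample))
```

Substring matching keeps the filter simple, at the cost of the precision issues discussed earlier.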
We also deployed the above web scraper as a news bot on Telegram using its API, which delivers real-time state-wise COVID-19 news to your Telegram feed when hosted on a 24/7 server.
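A hedged sketch of the delivery step is below: it formats a state-wise digest and pushes it through the official Telegram Bot API’s `sendMessage` method over HTTP. The token, chat id, and message layout are placeholders, not the actual bot’s configuration:

```python
import json
import urllib.request

# Sketch: push a state-wise digest to a Telegram chat via the Bot API.
# TOKEN and CHAT_ID below are placeholders; sending requires a real bot.

def format_digest(state: str, headlines: list) -> str:
    """Build a plain-text digest message for one state."""
    lines = [f"COVID-19 news for {state}:"]
    lines += [f"- {headline}" for headline in headlines]
    return "\n".join(lines)

def send_message(token: str, chat_id: str, text: str) -> None:
    """POST to the official Bot API sendMessage endpoint."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

digest = format_digest("Kerala", ["Quarantine centres expanded in Kochi"])
print(digest)
# send_message(TOKEN, CHAT_ID, digest)  # uncomment with real credentials
```

A scheduler (cron or a simple loop on the 24/7 server) would call the scraper and then `send_message` at regular intervals.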
Having worked on this, we could appreciate the beauty of the GDELT dataset and the tedium of the pipeline that must run live every moment to gather the news. Our Python script gathered news from only one source, while GDELT covers numerous (686) sources spanning across India, and even multilingual news articles with machine translation. Best of all, it is open source. A shoutout to its open-hearted creators, who built it with the bottom line: “We’re tremendously excited to see how researchers can use this data to understand how the virus has been covered globally.”
This blog covered insights into the GDELT dataset, accessing it using BigQuery, and understanding its workflow in depth. Stay tuned for an upcoming article that will explain the data processing, the visualisations, and the key insights we interpreted through them.