City, University of London
https://ddj.nicu.md/city/
My name is Nicu Calcea.
I’m a data investigative journalist and City University alumnus.
I work at Global Witness, and I was previously doing data journalism at BBC News and the New Statesman.
My personal website: nicu.md
Introduction to Data Journalism
In its most simple definition, data journalism is the practice of using numbers and trends to tell a story. — Betsy Ladyzhets
Data journalism [is] finding – in data – stories that are of interest to the public and presenting them in the most appropriate manner for public use and reuse. — Bahareh Heravi
An increasing amount of human activity is recorded with data. This means there is a data angle for almost any subject.
We tell some stories every year, month or day. We can greatly simplify or automate stories, giving us more time to focus on in-depth reporting.
Data journalism has its own data quality issues and ethical considerations, but accuracy is central to it.
There are now stories where a data angle is the only or main angle. By using data, journalists can create news instead of just covering it.
Make readers invested in a story by personalising it to their postcode, age or socio-economic status.
Data journalism is exciting (I hope). The pandemic has shown that readers will reward publishers with their clicks.
Since the pandemic, nearly every newsroom has prioritised data journalism and has been hiring heavily for data journalism positions.
New(-ish) platforms like Datawrapper and Flourish allow journalists to create and visualise data stories more easily and without much technical expertise.
However, the increased supply of data journalists from courses like this means there are higher entry requirements (R, Python, SQL).
As with all journalism, data journalism starts with a question that the reporter wants to answer.
Data can come from government sources, third parties, or be collected by the reporter themselves.
In most cases, you will need to filter, sort and clean up any errors or missing information in your dataset.
How do you find the answer to your question in the data?
While data doesn’t lie, data publishers do. Do your findings make sense? Can you verify them?
Communicate data in the most suitable way. Usually, but not always, you will visualise your findings.
Source: Paul Bradshaw
Returns one number added to (+) or subtracted from (-) another.
Returns one number divided (/) or multiplied (*) by another.
Returns the sum of a series of numbers and/or cells.
Returns the numerical average value in a dataset, ignoring text.
Returns the median value in a numeric dataset.
Shows percentage change.
Finds the most common value in a range.
Finds the value that’s right in the middle of a dataset.
Sum all the values and divide by the number of records.
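The summary functions above can also be sketched in code. This is a minimal illustration in Python, assuming the descriptions correspond to the usual spreadsheet functions (=SUM(), =AVERAGE(), =MEDIAN(), =MODE()); the numbers and the percentage_change helper are made up for the example.

```python
from statistics import mean, median, mode

# Hypothetical values, for illustration only (not from a real dataset).
values = [2, 3, 3, 5, 12]

total = sum(values)          # like =SUM(): add everything up
average = mean(values)       # like =AVERAGE(): sum divided by number of records
middle = median(values)      # like =MEDIAN(): the middle value once sorted
most_common = mode(values)   # like =MODE(): the most frequent value


def percentage_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


print(total, average, middle, most_common)
print(percentage_change(80, 100))
```

Note how the single large value (12) pulls the mean above the median, which is why the median is often the safer "typical" figure for skewed data.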
Deadline: Friday, 13 December, 16:00 Marking: 40% of your final mark
Deadline: Friday, 24 January, 16:00 Marking: 60% of your final mark
Introduction to Data Journalism
More sources: ddj.nicu.md
Source: ddj.nicu.md
Returns one number added to (+) or subtracted from (-) another.
Returns one number divided (/) or multiplied (*) by another.
Returns the sum of a series of numbers and/or cells.
Returns the numerical average value in a dataset, ignoring text.
Returns the median value in a numeric dataset.
Shows percentage change.
Returns one value if the result is true, another if it’s false.
Count all the cells that match a condition.
Sum all the cells that match a condition.
Combine multiple bits of text together. Use =SPLIT() for the opposite.
Match the values in a cell with the corresponding row in another dataset.
Same as =XLOOKUP() but more flexible and easier to grasp.
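For readers who want to see the logic behind the conditional and lookup formulas, here is a rough Python equivalent. It is a sketch only: the rows, the population table and the threshold values are hypothetical, and the comments name the spreadsheet function each line is assumed to mirror.

```python
# Hypothetical rows and lookup table, for illustration only.
rows = [
    {"area": "North", "crimes": 120},
    {"area": "South", "crimes": 45},
    {"area": "East", "crimes": 80},
]
population = {"North": 1000, "South": 500, "East": 2000}

# Like =IF(): one value if the test is true, another if it is false.
labels = ["high" if row["crimes"] > 100 else "low" for row in rows]

# Like =COUNTIF(): count the cells that match a condition.
count_over_50 = sum(1 for row in rows if row["crimes"] > 50)

# Like =SUMIF(): sum the cells that match a condition.
sum_over_50 = sum(row["crimes"] for row in rows if row["crimes"] > 50)

# Combining bits of text, and =SPLIT() for the opposite.
joined = " - ".join(["North", "120"])
parts = joined.split(" - ")

# Like a lookup formula: match each row to a value in another dataset.
# .get() returns None when an area has no match, instead of raising an error.
for row in rows:
    row["population"] = population.get(row["area"])

print(labels, count_over_50, sum_over_50, joined, parts)
print(rows)
```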
Introduction to Data Journalism
Source: ddj.nicu.md
Pivot tables are extra tables in your spreadsheet, in which you can summarise data from your original table.
You can calculate averages, counts, max/min values or sums for numbers in a group.
Bonus points: Grab a CSV from police.uk and do it yourself.
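What a pivot table does can be sketched in a few lines of Python: group the rows by a category, then summarise each group. The records below are invented for the example and are not taken from police.uk.

```python
from collections import defaultdict

# Hypothetical (category, value) records, for illustration only.
records = [
    ("Burglary", 3),
    ("Robbery", 1),
    ("Burglary", 5),
    ("Robbery", 4),
]

# Step 1: group the values by category (the "rows" of the pivot table).
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Step 2: summarise each group (the "values" of the pivot table).
summary = {
    category: {
        "count": len(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    for category, values in groups.items()
}

for category, stats in summary.items():
    print(category, stats)
```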
Finds the most common value in a range.
Finds the value that’s right in the middle of a dataset.
Sum all the values and divide by the number of records.
Introduction to Data Journalism
It’s all about:
Documentation
Reproducibility
Source: Datawrapper
Sometimes, records disappear or were never collected. It may not always be obvious when that is the case.
Records can be repeated, either due to technical mishaps or due to repeated input.
Humans make mistakes. Assume any dataset created manually by humans contains misspellings.
Some spreadsheets are designed to be read by humans, not computers. We then need to teach software to read it correctly.
Wrong Excel formula? Üṅṛëċöġṅïṡëḋ characters? Old Excel version? These can all mess with your data.
A column can mix different units, spell categories differently or combine figures recorded under different methodologies.
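Several of the problems above (inconsistent spelling, duplicates, missing values, numbers stored as text) can be handled with a small, documented cleaning step. The sketch below uses invented rows; the column names and the clean_row helper are hypothetical, and a tool like OpenRefine does the same kind of work interactively.

```python
# Hypothetical raw rows, for illustration only.
raw_rows = [
    {"borough": " Camden ", "value": "12"},
    {"borough": "camden", "value": "12"},        # duplicate once normalised
    {"borough": "Hackney", "value": ""},         # missing value
    {"borough": "Islington", "value": "1,204"},  # thousands separator
]


def clean_row(row):
    """Normalise spelling and turn the value into a number (or None)."""
    name = row["borough"].strip().title()
    raw_value = row["value"].replace(",", "").strip()
    value = int(raw_value) if raw_value else None
    return {"borough": name, "value": value}


cleaned = []
seen = set()
for row in raw_rows:
    tidy = clean_row(row)
    key = (tidy["borough"], tidy["value"])
    if key in seen:
        continue  # skip exact duplicates after normalisation
    seen.add(key)
    cleaned.append(tidy)

# Keep a record of what is missing rather than silently dropping it.
missing = [row["borough"] for row in cleaned if row["value"] is None]

print(cleaned)
print("Missing values for:", missing)
```

Writing the steps down as code doubles as documentation: anyone can rerun it on the original file and get the same cleaned table.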
Source: Quartz
Source: ONS
Source: The Guardian
Source: OpenRefine
Source: Food Standards Agency
You will be split into teams of (tbd) to create a narrated PowerPoint presentation critiquing a data project featured in the Sigma Awards. You can choose a winner or a short-listed project.
Deadline: Friday, 13 December, 16:00
Deliverables
Data Stories
Summarising data, as we did in previous lessons, is not always enough to reveal patterns or trends.
Visualising it can provide insight we’d otherwise lose out on.
Position
Size
Width
Height
Area
Colour
Fill
Colour
Opacity
Pattern
Shape
Location
Standard scatter plot
Change scale to log
Size by population
Colour by continent
Animate over time
Source: FT
Use XLOOKUP to join the datasets.
Data [Visualisation]
Source: ddj.nicu.md
Horizontal or vertical rectangles with lengths proportional to the values that they represent.
Good for comparing across different values or showing a trend over time.
Shows values on a continuous scale. Similar to a scatter plot, except all dots are connected.
Good for showing trends over time.
Similar to a line chart but the area underneath the line is coloured in. When stacked, it can show multiple data series as well as their cumulative trend.
Good for showing trends over time.
Plots a dataset across two continuous dimensions, each on a different axis (X and Y).
Good for showing correlation between different data series.
Only works with geographical data (duh!).
Even with geographical data, other charts can often be a better choice.
Source: ddj.nicu.md
Source: Datawrapper
Source: Daily Mail
Source: The Sun
Source: Reuters
Source: CBS News
Source: WTF Visualizations
Source: Reddit
Pre-defined areas such as countries, regions or districts are coloured (using a sequential, diverging or categorical scheme) according to values in a dataset.
Circles are drawn on top of a map, with their size or colour proportional to values in a dataset.
Cartograms resize regions in proportion to a variable in your dataset, such as population.
Hex maps standardise administrative units into same-sized hexagons, squares or triangles.
Try to only use maps when there’s a geographical pattern to your data.
Don’t make a map if it’s going to basically be a population map.
Source: xkcd
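One common way to avoid an accidental population map is to map rates instead of raw counts. A minimal sketch, with invented area names and figures:

```python
# Hypothetical counts and populations, for illustration only.
counts = {"Area A": 500, "Area B": 50}
population = {"Area A": 1_000_000, "Area B": 20_000}

# Rate per 100,000 residents, so large and small areas are comparable.
rate_per_100k = {
    area: counts[area] * 100_000 / population[area]
    for area in counts
}

print(rate_per_100k)
```

In this made-up example, Area A has ten times as many incidents as Area B but a rate five times lower, so a map of raw counts and a map of rates would tell opposite stories.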
This can be polygons (areas), lines or points. Some tools have a few options by default, or you can get additional ones from the ONS, Natural Earth or ArcGIS.
This is the data that will be placed in the shapes on your map. Normally contains region IDs or coordinates (latitude and longitude).
Data Projects
Often, the best narratives warrant going further than simple graphics.
Tailored visualisations designed specifically for a story will almost always be the best way to tell a story.
Interactivity can sometimes help portray the intricacies of a story better.
Source: Gurman Bhatia
Source: Masters of Media
Trackers are data visualisations connected to a data source that is periodically updated.
Examples include FiveThirtyEight’s Biden approval rating tracker, Bloomberg’s Pret Index and the New Statesman’s Covid-19 tracker.
Source: New Statesman
Calculators allow readers to input their own data and receive a result.
Examples include the FT personal data worth calculator, the New Statesman election calculator and the BBC’s energy calculator.
Source: BBC News
Scrollytelling is the use of a browser’s scrolling functionality to interactively tell a data story.
Examples include The Impatient List, the New York Times delta variant story and the New Statesman million years lost investigation.
Source: BBC News
Information visualisation is meant to clarify data, but too much interactivity hinders understanding by transferring responsibility from the designer to the reader to work out the important points. — Martin Stabe (FT)
Readers just want to scroll, […] if you make the reader click or do anything other than scroll, something spectacular has to happen. — Archie Tse (NYT)
Interactive graphics are not just a fun addition but can actually increase the transparency of our work, open us for criticism, and thereby, hopefully, help re-build some trust in journalism. — Gregor Aisch (Datawrapper)
Scraping
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. — Cloudflare
Many organisations still publish data in PDFs, a format built for display rather than analysis and difficult to work with. Sometimes, they even do it on purpose.
If you can’t find the data behind a chart, ask the author. If you can’t do that either, read it from the image.
If you’ve got structured information on your page, you’ll most likely be able to download it in a format that you can analyse.
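As an illustration of what "structured information" means in practice, the sketch below pulls the rows out of an HTML table using only Python's standard library. The TableParser class and the sample HTML are hypothetical; a real scraper would first download the page and would need to cope with messier markup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>1</td></tr>
  <tr><td>Beta</td><td>2</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

for row in rows:
    print(row)
```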
Source: Tabula
Source: WebPlotDigitizer
Source: Map Digitizer
Source: Online Journalism Blog
Source: Inspect Element
Use the =IMPORTHTML() formula to import one of the tables.
{"_id":"missingpersons","startUrl":["https://missingpersons.police.uk/en-gb/case-search/?page=[1-10]&orderBy=dateDesc"],"selectors":[{"id":"gender-age","parentSelectors":["case"],"type":"SelectorText","selector":"a:nth-of-type(7) div:nth-of-type(1) div","multiple":false,"regex":""},{"id":"reference","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(2) div","multiple":false,"regex":""},{"id":"location","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(3) div","multiple":false,"regex":""},{"id":"case","parentSelectors":["_root"],"type":"SelectorElement","selector":"a.CaseThumbnail","multiple":true},{"id":"link","parentSelectors":["case"],"type":"SelectorLink","selector":"_parent_","multiple":false},{"id":"ethnicity","parentSelectors":["link"],"type":"SelectorText","selector":"div.Entry:nth-of-type(3) div.Value","multiple":false,"regex":""}]}
At COP29, after delayed negotiations, advanced economies pledged to provide $300bn a year to low- and middle-income countries by 2035.
Climate-vulnerable countries said this isn’t enough, and the number should be closer to $1.3tn.
Can we contextualise how much $300bn is?
The OECD tracks how much climate financing developed countries have provided and mobilised for developing countries so far.
In partnership with the IISD, they also track how much countries spend on fossil fuel subsidies.
Download both datasets and put them in a new Google Sheet.
We’ll also need a list of “developed countries”.
We need to pivot and filter both tables. There’s some additional cleaning we need to do as well, for example, making sure the units are consistent between the two datasets.
Copy the data into Datawrapper and visualise it as a grouped bar chart.
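Before comparing the two datasets, it helps to put every figure in the same unit. The sketch below shows one way to do that, plus the simple ratio between the $300bn pledge and the $1.3tn request. The to_billions helper and the unit labels are assumptions for illustration; check which units the downloaded files actually use.

```python
# Multipliers for the unit labels a dataset might use (assumed, not confirmed).
UNIT_MULTIPLIERS = {
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
}


def to_billions(value, unit):
    """Convert a value in the given unit to billions."""
    return value * UNIT_MULTIPLIERS[unit] / UNIT_MULTIPLIERS["billion"]


# Figures quoted in the COP29 story, both expressed in billions of dollars.
pledge_bn = 300
requested_bn = 1_300  # $1.3tn

ratio = requested_bn / pledge_bn      # how many times larger the request is
shortfall_bn = requested_bn - pledge_bn

print(to_billions(2_500, "million"))  # a figure in millions, shown in billions
print(round(ratio, 2), shortfall_bn)
```

The same to_billions step can be applied to a whole column before pivoting, so that the grouped bar chart compares like with like.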