Intro to Data Journalism

City, University of London
https://ddj.nicu.md/city/

👋 Welcome to Data Journalism

Who am I?

My name is Nicu Calcea.

I’m a data investigative journalist and City University alumnus.

I work at Global Witness, and I was previously doing data journalism at BBC News and the New Statesman.

Some stories

Get in touch

Who are you?

  • What’s your name?
  • What course are you in?
  • Do you have any experience in data journalism?
  • Why did you choose this module?

The plan

  • Week 1: Introduction
  • Weeks 2-3: Analysis
  • Week 4: Cleaning
  • Week 5: Stories
  • Week 6: Visualisation
  • Week 7: Maps
  • Week 8: Projects
  • Week 9: Scraping
  • Week 10: Conclusions

Week 1

Introduction to Data Journalism

What is Data Journalism?

In its most simple definition, data journalism is the practice of using numbers and trends to tell a story. — Betsy Ladyzhets

Data journalism [is] finding – in data – stories that are of interest to the public and presenting them in the most appropriate manner for public use and reuse. — Bahareh Heravi

History

Why do we need data journalism?

Tell richer stories

An increasing amount of human activity is recorded with data. This means there is a data angle for almost any subject.

Be more efficient

We tell some stories every year, month or day. We can greatly simplify or automate stories, giving us more time to focus on in-depth reporting.

Be more accurate

Though not without data quality issues and ethical considerations, accuracy is central to data journalism.

Unique angles

There are now stories where a data angle is the only or main angle. By using data, journalists can create news instead of just covering it.

Personalise news

Make readers invested in a story by personalising it to their postcode, age or socio-economic status.

New audiences

Data journalism is exciting (I hope). The pandemic has shown that readers will reward publishers with their clicks.

The process

Question

As with all journalism, data journalism starts with a question that the reporter wants to answer.

Source data

Data can come from government sources, third parties, or be collected by the reporter themselves.

Clean data

In most cases, you will need to filter, sort and clean up any errors or missing information in your dataset.

The process

Analyse data

How do you find the answer to your question in the data?

Review

While data doesn’t lie, data publishers do. Do your findings make sense? Can you verify them?

Present results

Communicate data in the most suitable way. Usually, but not always, you will visualise your findings.

The process

Baby names

  1. Make a copy of this spreadsheet and pick one tab to work in. Data from the ONS.
  2. The yellow cells indicate where you need to fill in formulas.
  3. What are some other potential stories that you can think of? Are there more babies named after the royals? What about Game of Thrones characters? What are the most popular gender-neutral names? Long-term trends?

=A1+A2

Returns one number added to (+) or subtracted from (-) another.

=A1/B$1

Returns one number divided (/) or multiplied (*) by another.

=SUM()

Returns the sum of a series of numbers and/or cells.

=AVERAGE()

Returns the numerical average value in a dataset, ignoring text.

=MEDIAN()

Returns the median value in a numeric dataset.

=(NEW-OLD)/OLD

Shows percentage change.
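
For anyone who prefers to see the percentage change formula outside a spreadsheet, here is a minimal Python sketch of the same calculation. The function name and the baby-name counts are made up for illustration. Note that the spreadsheet formula returns a fraction (0.25) that you then format as a percentage, while this sketch multiplies by 100 itself.

```python
def percentage_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        # Dividing by zero has no meaning here, so fail loudly.
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) / old * 100


# Hypothetical counts: 200 babies given a name one year, 250 the next.
change = percentage_change(200, 250)
print(f"{change:+.1f}%")  # prints "+25.0%"
```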

Averages

=MODE()

Finds the most common value in a range.

   

=MEDIAN()

Finds the value that’s right in the middle of a dataset.

   

=AVERAGE()

Sum all the values and divide by the number of records.
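
To see why the choice of average matters, here is a small Python sketch using the standard library's statistics module. The salary figures are invented: five ordinary salaries and one very large one.

```python
from statistics import mean, median, mode

# Hypothetical salaries in a small newsroom, in thousands of pounds.
salaries = [24, 26, 26, 30, 34, 250]

most_common = mode(salaries)   # 26, the value that appears most often
middle = median(salaries)      # 28.0, halfway between 26 and 30
average = mean(salaries)       # 65, the total (390) divided by 6 records

print(most_common, middle, average)
```

One outlier pulls the mean far above what most people in the list earn, while the median and mode barely move.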

How did the Mail do it?

2018

2019

Assignments

Critique a data journalism project
  • A 20-25 minute long narrated group PowerPoint presentation critiquing a data project that won or was shortlisted for the Sigma Awards.
  • 500-word group reflection, with appropriate references.
  • A 200-word reflection on your own learning.

Deadline: Friday, 13 December, 16:00
Marking: 40% of your final mark

Data journalism portfolio
  • One news story (400 words).
  • EITHER one feature story OR one news investigation (800 words), substantially based on data techniques and published digitally with appropriate visualisations.
  • A 200-word reflective blog-post-style log on your own learning journey.

Deadline: Friday, 24 January, 16:00
Marking: 60% of your final mark

Contact

Week 2

Introduction to Data Journalism

Sourcing data

Open Data

Plan ahead

Closed data

FOIs

Scraping

Census exercise

  1. Make a copy of this spreadsheet (here’s the original data).
  2. Before you do anything else, what are some questions you would like to answer?
  3. Freeze the first row, then filter the table to the region you’re from (or London).
  4. Fill in the columns at the end
  5. What are some other potential stories that you can think of?

=A1+A2

Returns one number added to (+) or subtracted from (-) another.

=A1/B$1

Returns one number divided (/) or multiplied (*) by another.

=SUM()

Returns the sum of a series of numbers and/or cells.

=AVERAGE()

Returns the numerical average value in a dataset, ignoring text.

=MEDIAN()

Returns the median value in a numeric dataset.

=(NEW-OLD)/OLD

Shows percentage change.

=IF()

Returns one value if the result is true, another if it’s false.

=COUNTIF()

Count all the cells that match a condition.

=SUMIF()

Sum all the cells that match a condition.
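
If the conditional formulas feel abstract, this rough Python sketch shows what IF, COUNTIF and SUMIF each do to the same small table. The boroughs and numbers are invented, and the variable names are only for illustration.

```python
# Hypothetical rows: (borough, recorded crimes).
rows = [("Islington", 120), ("Camden", 95), ("Islington", 80), ("Hackney", 110)]

# Like =IF(): pick one of two values depending on a condition.
labels = ["high" if crimes > 100 else "low" for _, crimes in rows]

# Like =COUNTIF(): count the rows that match a condition.
islington_rows = sum(1 for borough, _ in rows if borough == "Islington")

# Like =SUMIF(): add up the values of the rows that match a condition.
islington_total = sum(crimes for borough, crimes in rows if borough == "Islington")

print(labels)           # ['high', 'low', 'low', 'high']
print(islington_rows)   # 2
print(islington_total)  # 200
```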

=CONCATENATE()

Combine multiple bits of text together. Use =SPLIT() for the opposite.

=VLOOKUP()

Match the values in a cell with the corresponding row in another dataset.

=XLOOKUP()

Same as =VLOOKUP() but more flexible and easier to grasp.

Contact

Week 3

Introduction to Data Journalism

Toolbox

XLOOKUP

=XLOOKUP(search_term, col_to_search, col_to_return)
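
As a rough illustration of what the formula does, here is a Python sketch of a lookup with the same three arguments. The if_not_found parameter and the sample columns are my own additions for the example. In a spreadsheet, a failed lookup returns an #N/A error unless you supply a fallback value.

```python
def xlookup(search_term, col_to_search, col_to_return, if_not_found=None):
    """Find search_term in one column and return the matching item from another."""
    for key, value in zip(col_to_search, col_to_return):
        if key == search_term:
            return value  # first match wins
    return if_not_found


# Two columns that line up row by row.
country_codes = ["GBR", "FRA", "MDA"]
country_names = ["United Kingdom", "France", "Moldova"]

print(xlookup("MDA", country_codes, country_names))             # Moldova
print(xlookup("XXX", country_codes, country_names, "no match"))  # no match
```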

XLOOKUP exercise

  1. Make a copy of this spreadsheet.
  2. Fill in the empty columns with formulas we learned last time.

Pivot Tables

Pivot tables are extra tables in your spreadsheet, in which you can summarise data from your original table.

You can calculate averages, counts, max/min values or sums for numbers in a group.
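
Under the hood, a pivot table groups rows by a category and then summarises each group. The Python sketch below does the same with invented crime records. It illustrates the idea and is not how any spreadsheet actually implements it.

```python
from collections import defaultdict

# Hypothetical records: (crime type, number of offences in one month).
records = [
    ("Burglary", 40),
    ("Vehicle crime", 25),
    ("Burglary", 60),
    ("Vehicle crime", 35),
    ("Shoplifting", 10),
]

# Step 1: put each value in the bucket for its group.
groups = defaultdict(list)
for crime_type, offences in records:
    groups[crime_type].append(offences)

# Step 2: summarise every bucket.
pivot = {
    crime_type: {
        "count": len(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    for crime_type, values in groups.items()
}

for crime_type, summary in pivot.items():
    print(crime_type, summary)
```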

Pivot table exercise

  1. Make a copy of this spreadsheet.

Bonus points: Grab a CSV from police.uk and do it yourself.

Averages

=MODE()

Finds the most common value in a range.

   

=MEDIAN()

Finds the value that’s right in the middle of a dataset.

   

=AVERAGE()

Sum all the values and divide by the number of records.

Contact

Week 4

Introduction to Data Journalism

Make your spreadsheets fact-checkable

It’s all about:

Documentation

  • keep the reference to the source data unaltered
  • if needed add a data dictionary or write down what you did
  • use sensible names for your columns (not “weight”) and sheets (not “Sheet1”)
  • one question = one sheet

Reproducibility

  • do analysis on a new sheet, never on the original data
  • use range or sheet protection
  • back up often

Data prep

Missing data

Sometimes, records disappear or were never collected. It may not always be obvious when that is the case.

Duplicated data

Records can be repeated, either due to technical mishaps or due to repeated input.

Misspellings

Humans make mistakes. Assume any dataset manually created by humans has misspellings.


Poor formatting

Some spreadsheets are designed to be read by humans, not computers. We then need to teach software to read it correctly.


Parsing errors

Wrong Excel formula? Üṅṛëċöġṅïṡëḋ characters? Old Excel version? These can all mess with your data.


Inconsistent data

A column can have different units, spell categories differently or record different methodologies.
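
A few of these problems can be handled with very simple rules. The sketch below, using invented records and a hypothetical clean() helper, tidies stray whitespace, keeps missing ratings explicit instead of silently dropping them, and skips repeated rows. It does not fix genuine misspellings, which need fuzzier matching of the kind OpenRefine offers.

```python
def clean(records):
    """Tidy names, mark missing ratings as None and drop repeated rows."""
    seen = set()
    cleaned = []
    for name, rating in records:
        tidy_name = " ".join(name.split())               # collapse stray whitespace
        value = int(rating) if rating.strip() else None  # empty text means missing
        key = (tidy_name.casefold(), value)              # ignore case when comparing
        if key in seen:
            continue  # same name and rating already kept
        seen.add(key)
        cleaned.append((tidy_name, value))
    return cleaned


# Hypothetical messy records: (establishment name, rating as text).
raw = [
    ("  Joe's Cafe", "5"),
    ("joe's cafe ", "5"),
    ("Pizza Palace", ""),
    ("Pizza Palace", "4"),
]

for row in clean(raw):
    print(row)
```

The two Pizza Palace rows are both kept on purpose: one has a missing rating, and deciding which row is right is a reporting question, not a formatting one.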

The guide to bad data

Baby names exercise

  1. Download the file with baby names in England and Wales for girls or boys.
  2. Upload the spreadsheet to Google Drive and open it in Google Sheets.
  3. Find and clean Table 1.
  4. Read about Shart.

OpenRefine

Download OpenRefine

Food hygiene data

Food hygiene exercise

  • Make a copy of the food hygiene ratings data.
  • Answer these questions:
    1. What kind of establishments are the best rated in Islington?
    2. Which establishments with a rating of 0-1 are due a follow-up visit?
    3. How many establishments received a 5-star rating this month?
    4. What are the cleanest hotspots (highest average hygiene rating with 10+ ratings)?
    5. Any other interesting angles?

Group assignment

You will be split into teams of (tbd) to create a narrated PowerPoint presentation critiquing a data project featured in the Sigma Awards. You can choose a winner or a short-listed project.

   

Deadline: Friday, 13 December, 16:00

   

Deliverables

  • A 20-25 minute long narrated PowerPoint presentation.
  • A 500-word group reflection, with appropriate references (Harvard style; I suggest using Zotero as a citation manager).
  • Fill in this spreadsheet with your name and the project you’re working on.

Contact

Week 5

Data Stories

Why do we visualise data?

Summarising data, like we did in previous lessons, is not always enough to reveal patterns or trends.

Visualising it can provide insight we’d otherwise lose out on.

What can we visualise?

Position

Size

    Width

    Height

    Area

Colour

    Fill

    Colour

    Opacity

    Pattern

Shape

Location

People suck at sizes

What visual encoding can you find in this chart?

The FT’s scatter plot

Standard scatter plot

Change scale to log

Size by population

Colour by continent

Animate over time

FT Vocabulary

Many types of charts

Gapminder

Gapminder exercise

  1. Download the “Life expectancy” and “Fertility rate” (and “Population” if you want) datasets from Gapminder.
  2. Use XLOOKUP to join the datasets.
  3. Reproduce Hans Rosling’s chart in Google Sheets.

Police crime exercise

  1. Download the “Police recorded crime open data Police Force Area tables, year ending March 2013 onwards” table.
  2. Working in groups of two, think of stories to tell based on this data.
  3. Write a couple of headlines based on your analysis.
  4. Choose a headline and make a chart.

Contact

Week 6

Data Visualisation

Dataviz Toolbox

Bar/column charts

Horizontal or vertical rectangles with lengths proportional to the values that they represent.

Good for comparing across different values or showing a trend over time.

Line chart

Shows values on a continuous scale. Similar to a scatter plot, except all dots are connected.

Good for showing trends over time.

Area chart

Similar to a line chart but the area underneath the line is coloured in. When stacked, it can show multiple data series as well as their cumulative trend.

Good for showing trends over time.

Scatter plot

Plots a dataset across two continuous dimensions, each on a different axis (X and Y).

Good for showing correlation between different data series.

Map

Only works with geographical data (duh!).

Even with geographical data, other charts can often be a better choice.

Anatomy of a chart

Colour scales

Colour Toolbox

Dataviz Fonts

Bad charts: Daily Mail

Bad charts: The Sun

Bad charts: Reuters

Bad charts: CBS News

WTF Visualizations

Cities exercise

  1. Create a new chart in Datawrapper.
  2. Select the Rural and urban dataset.
  3. Make a stacked bar chart.

Gapminder exercise

  1. Download the Gapminder data.
  2. Copy the data into Datawrapper.
  3. Create a scatter plot.
  4. Customise axis labels, colours, size, tooltips, title, description and source.
  5. Add annotations.
  6. Publish or download the chart.

Flourish example (on your own)

  1. Copy this spreadsheet.
  2. Copy the data into Flourish.
  3. Create a bar chart race.
  4. Customise axis labels, colours, size, title, description and source.
  5. Add annotations (optionally).
  6. Publish or download the chart.

Contact

Week 7

Maps

Choropleth

Pre-defined areas such as countries, regions or districts are coloured (on a sequential, diverging or categorical scale) according to values in a dataset.

Bubble/symbol map

Circles are drawn on top of a map, with their size or colour proportional to values in a dataset.

Cartogram / hex map

Cartograms resize regions in proportion to a variable in your dataset, such as population.

Hex maps standardise administrative units into same-sized hexagons, squares or triangles.

The 2019 general election

Not always the right choice

Try to only use maps when there’s a geographical pattern to your data.

Don’t make a map if it’s going to basically be a population map.

What data you’ll need

Geographical data

This can be polygons (areas), lines or points. Some tools have a few options by default, or you can get additional ones from the ONS, Natural Earth or ArcGIS.

Numerical or categorical data

This is the data that will be placed in the shapes on your map. Normally contains region IDs or coordinates (latitude and longitude).

Internet speed exercise

  1. Make a copy of the internet speed in Europe data.
  2. Create a choropleth map (NUTS3, 2021).
  3. Copy the data in.
  4. Customise colours, title, description and source.
  5. Add annotations.
  6. Publish or download the chart.

Internet speed - example

UK elections exercise

  1. Download this CSV.
  2. Create a symbol map in Datawrapper (UK constituencies, 2023).
  3. Copy the data in.
  4. Customise colours, title, description and source.
  5. Publish or download the chart.

Missing migrants exercise (Flourish)

  1. Go to the missing migrants tab in the spreadsheet you’ve copied earlier.
  2. Copy the data into Datawrapper or Flourish.
  3. Create a symbol map.
  4. Customise it (colour, size, opacity, etc).

Missing migrants - example

Contact

Week 8

Data Projects

Often, the best narratives warrant going further than simple graphics.

Tailored visualisations designed specifically for a story will almost always be the best way to tell it.

Interactivity can sometimes help portray the intricacies of a story better.

The Martini Glass principle

Trackers

Trackers are data visualisations connected to a data source that is periodically updated.

Examples include FiveThirtyEight’s Biden approval rating tracker, Bloomberg’s Pret Index and the New Statesman’s Covid-19 tracker.

Calculators

Calculators allow readers to input their own data and receive a result.

Examples include the FT personal data worth calculator, the New Statesman election calculator and the BBC’s energy calculator.

Scrollable stories

Scrollytelling is the use of a browser’s scrolling functionality to interactively tell a data story.

Examples include The Impatient List, the New York Times delta variant story and the New Statesman million years lost investigation.

Information visualisation is meant to clarify data, but too much interactivity hinders understanding by transferring responsibility from the designer to the reader to work out the important points. — Martin Stabe (FT)

Readers just want to scroll, […] if you make the reader click or do anything other than scroll, something spectacular has to happen. — Archie Tse (NYT)

Interactive graphics are not just a fun addition but can actually increase the transparency of our work, open us for criticism, and thereby, hopefully, help re-build some trust in journalism. — Gregor Aisch (Datawrapper)

Various resources

Let’s hear from you

Contact

Week 9

Scraping

Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. — Cloudflare

Data can be stuck behind inaccessible formats

PDFs

Many organisations still publish data in PDFs, a format designed for printing rather than analysis, which makes it difficult to work with. Sometimes, they even do it on purpose.

Images

If you can’t find the data behind a chart, ask the author. If you can’t do that either, read it from the image.

Websites

If a web page contains structured information, you’ll most likely be able to download it in a format that you can analyse.

Use Tabula to scrape PDF tables

Digitise image charts

Digitise maps

Scrape online charts

Hidden APIs

Import HTML tables into Sheets

  1. Create a new Google Sheets document.
  2. Open this Wikipedia page in another tab.
  3. Use the =IMPORTHTML() formula to import one of the tables.
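
To get a feel for what =IMPORTHTML() does behind the scenes, here is a minimal Python sketch that pulls a table out of a snippet of HTML using only the standard library. The TableParser class, the import_html_table() helper and the sample page are all made up for this example, and the sketch does not handle nested tables or merged cells. Like the spreadsheet formula, it counts tables starting from 1.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return table number `index` (starting at 1) from an HTML string."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


# A made-up page with one small table.
page = """
<table>
  <tr><th>Town</th><th>Population</th></tr>
  <tr><td>Town A</td><td>100</td></tr>
  <tr><td>Town B</td><td>50</td></tr>
</table>
"""

for row in import_html_table(page, 1):
    print(row)
```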

Missing persons exercise

  1. Download and install the Web Scraper browser extension.
  2. Navigate to the Missing Persons website.
  3. Scrape it.
{"_id":"missingpersons","startUrl":["https://missingpersons.police.uk/en-gb/case-search/?page=[1-10]&orderBy=dateDesc"],"selectors":[{"id":"gender-age","parentSelectors":["case"],"type":"SelectorText","selector":"a:nth-of-type(7) div:nth-of-type(1) div","multiple":false,"regex":""},{"id":"reference","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(2) div","multiple":false,"regex":""},{"id":"location","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(3) div","multiple":false,"regex":""},{"id":"case","parentSelectors":["_root"],"type":"SelectorElement","selector":"a.CaseThumbnail","multiple":true},{"id":"link","parentSelectors":["case"],"type":"SelectorLink","selector":"_parent_","multiple":false},{"id":"ethnicity","parentSelectors":["link"],"type":"SelectorText","selector":"div.Entry:nth-of-type(3) div.Value","multiple":false,"regex":""}]}
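
The [1-10] in the sitemap's start URL is a page range: the scraper visits pages 1 to 10 of the search results. As a rough sketch of that expansion (my own illustration, not Web Scraper's actual code, and it only handles the plain start-end form), here is how the range turns into ten URLs. The expand_range() helper is hypothetical and the sitemap is trimmed to the two fields the example needs. Nothing is downloaded.

```python
import json
import re

# A trimmed copy of the sitemap above: only the fields this sketch needs.
sitemap_json = (
    '{"_id":"missingpersons","startUrl":'
    '["https://missingpersons.police.uk/en-gb/case-search/'
    '?page=[1-10]&orderBy=dateDesc"]}'
)


def expand_range(url):
    """Turn a URL containing [start-end] into one URL per number in the range."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]  # no range, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{number}{suffix}" for number in range(start, end + 1)]


sitemap = json.loads(sitemap_json)
pages = [page for url in sitemap["startUrl"] for page in expand_range(url)]

print(len(pages))  # 10
print(pages[0])
print(pages[-1])
```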

Petitions exercise

  1. Go to the UK petitions website.
  2. Scrape the name of the petition, text, number of signatures, etc.

Contact

Week 10

Recap

Think of a story idea

At COP29, after delayed negotiations, advanced economies pledged to provide $300bn a year to low- and middle-income countries by 2035.

Climate vulnerable countries said this isn’t enough, and the number should be closer to $1.3tn.

Can we contextualise how much $300bn is?
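
A first, very quick way to put the two numbers side by side is to work out the pledge as a share of what was asked for. This sketch uses only the two figures above; the variable names are mine.

```python
pledge_bn = 300       # pledged per year by 2035, in billions of dollars
requested_bn = 1_300  # $1.3tn expressed in billions of dollars

share = pledge_bn / requested_bn
shortfall_bn = requested_bn - pledge_bn

print(f"The pledge covers {share:.0%} of the requested amount.")  # 23%
print(f"The gap is ${shortfall_bn}bn a year.")                    # $1000bn
```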

Find data

The OECD tracks how much climate finance developed countries have provided and mobilised for developing countries so far.

In partnership with the IISD, they also track how much countries spend on fossil fuel subsidies.

Download both datasets and put them in a new Google Sheet.

We’ll also need a list of “developed countries”.

Clean and transform the data

We need to pivot and filter both tables. There’s some additional cleaning we need to do as well, for example, making sure the units are consistent between the two datasets.
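
Here is a rough Python sketch of those steps on a toy table. Everything in it is an assumption for illustration: the country names, the figures, the wide layout with one column per year, and the idea that one dataset is in millions while the other is in billions. Check the real files before copying any of this.

```python
# Hypothetical wide table: one row per country, one column per year,
# with values in millions of dollars.
wide = {
    "Country A": {"2021": 1500, "2022": 2000},
    "Country B": {"2021": 500, "2022": 750},
}

# Hypothetical list of "developed countries" to keep.
developed = {"Country A"}

# Pivot to long format (one row per country and year), filter to the
# listed countries and convert millions to billions in one pass.
long_rows = [
    (country, int(year), value_mn / 1000)
    for country, by_year in wide.items()
    if country in developed
    for year, value_mn in by_year.items()
]

for row in long_rows:
    print(row)
```

Once both datasets are in the same long format and the same unit, they can be placed next to each other for the chart.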

Visualise

Copy the data into Datawrapper and visualise it as a grouped bar chart.