Intro to Data Journalism

City, University of London
https://ddj.nicu.md/city/

👋 Welcome to Introduction to Data Journalism

Who am I?

My name is Nicu Calcea.

I’m a data journalist and City University alumnus, originally from Moldova, currently based in London.

I work as a data journalist at BBC News and was previously Data Projects Editor at the New Statesman.

Some stories

Who are you?

  • What’s your name?
  • What course are you in?
  • Do you have any experience in data journalism?
  • Why did you choose this module?

Plan

  • Week 1: Introduction
  • Weeks 2-3: Analysis
  • Week 4: Cleaning
  • Week 5: Stories
  • Week 6: Visualisation
  • Week 7: Maps
  • Week 8: Projects
  • Week 9: Scraping
  • Week 10: Conclusions

Week 1

Introduction to Data Journalism

What is Data Journalism?

In its most simple definition, data journalism is the practice of using numbers and trends to tell a story. — Betsy Ladyzhets

Data journalism [is] finding – in data – stories that are of interest to the public and presenting them in the most appropriate manner for public use and reuse. — Bahareh Heravi

History of Data Journalism

Why do we need data journalism?

Tell richer stories

An increasing amount of human activity is recorded with data. This means there is a data angle for almost any subject.

Be more efficient

We tell some stories every year, month or day. We can greatly simplify or even automate those stories, giving us more time to focus on in-depth reporting.

Be more accurate

Though not without data quality issues and ethical considerations, accuracy is central to data journalism.

Unique angles

There are now stories where a data angle is the only or main angle. By using data, journalists can create news instead of just covering it.

Personalise news

Make readers invested in a story by personalising it to their postcode, age or socio-economic status.

New audiences

Data journalism is exciting (I hope). The pandemic has shown that readers like clear, beautiful data stories and will reward publishers with their clicks.

The process

Question

As is the case with all journalism, data journalism starts with a question that the reporter wants to answer.

Source data

Data can come from government sources, third parties, or be collected by the reporter themselves.

Clean data

In most cases, you will need to filter, sort and clean up any errors or missing information in your dataset.

The process

Analyse data

How do you find the answer to your question in the data?

Review

While data doesn’t lie, data publishers do. Do your findings make sense? Can you verify using other sources? Have you made any mistakes?

Present results

Communicate data in the most suitable way. Usually, you will visualise your findings, but that is not always necessary.

The lifecycle of a data release

Baby names

  1. Make a copy of this spreadsheet and pick one tab to work in. The data comes from the Office for National Statistics (ONS).
  2. The yellow cells indicate where you need to fill in formulas.
  3. What are some other potential stories that you can think of? Are there more babies named after the royals? What about Game of Thrones characters? What are the most popular gender-neutral names? Long-term trends?

[=A1+A2]

Returns one number added to (+) or subtracted from (-) another.

[=A1/B$1]

Returns one number divided (/) or multiplied (*) by another.

[=SUM()]

Returns the sum of a series of numbers and/or cells.

=AVERAGE()

Returns the numerical average value in a dataset, ignoring text.

=MEDIAN()

Returns the median value in a numeric dataset.

[=(NEW-OLD)/OLD]

Shows percentage change.
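
For anyone who wants to see the same logic outside a spreadsheet, here is a minimal Python sketch of the percentage change calculation. The function name and the numbers are invented for illustration and are not taken from the ONS data. Note that the spreadsheet formula returns a fraction (0.25) that you then format as a percentage, while this sketch multiplies by 100 itself.

```python
def percentage_change(old, new):
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        # Dividing by zero is meaningless here, so fail loudly instead.
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) / old * 100


# Hypothetical counts: a name given to 200 babies one year and 250 the next.
change = percentage_change(200, 250)
print(f"{change:.1f}%")  # 25.0%
```

The order matters: swapping NEW and OLD gives a different answer, because the change is always measured against the starting value.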

How did the Mail do it?

[2018]

[2019]

Assignments

[Critique a data journalism project]
  • A 20-25 minute long narrated group PowerPoint presentation critiquing a data project that won or was shortlisted for the Sigma Awards.
  • 500-word group reflection, with appropriate references.
  • A 200-word reflection on your own learning.

Deadline: Friday, December 10, 4pm
Marking: 40% of your final mark

[Data journalism portfolio]
  • One news story (400 words).
  • EITHER one feature story OR one news investigation (800 words), substantially based on data techniques and published digitally with appropriate visualisations.
  • A 200-word reflective, blog-post-style log on your own learning journey.

Deadline: Friday 7 January 2021, 4pm
Marking: 60% of your final mark

Contact

Week 2

Introduction to [Data Journalism]

Sourcing data

Plan ahead

Closed data

FOIs

Scraping

Census exercise

  1. Make a copy of this spreadsheet (here’s the original data).
  2. Before you do anything else, what are some questions you would like to answer?
  3. Freeze the first row, then filter the table to the region you’re from (or London).
  4. Fill in the columns at the end.
  5. What are some other potential stories that you can think of?

[=A1+A2]

Returns one number added to (+) or subtracted from (-) another.

[=A1/B$1]

Returns one number divided (/) or multiplied (*) by another.

[=SUM()]

Returns the sum of a series of numbers and/or cells.

=AVERAGE()

Returns the numerical average value in a dataset, ignoring text.

=MEDIAN()

Returns the median value in a numeric dataset.

[=(NEW-OLD)/OLD]

Shows percentage change.

[=IF()]

Returns one value if a condition is true, another if it’s false.

[=COUNTIF()]

Count all the cells that match a condition.

[=SUMIF()]

Sum all the cells that match a condition.
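
If it helps to see the conditional formulas as plain logic, the sketch below mimics =IF(), =COUNTIF() and =SUMIF() in Python. The rows, the region names and the 200 threshold are all made up for illustration and do not come from the census spreadsheet.

```python
# Hypothetical rows: (region, population). Not taken from the census data.
rows = [
    ("London", 300),
    ("Wales", 120),
    ("London", 250),
    ("Scotland", 90),
]

# Like =COUNTIF(): count the rows whose region matches a condition.
london_count = sum(1 for region, _ in rows if region == "London")

# Like =SUMIF(): add up the population only for the matching rows.
london_total = sum(pop for region, pop in rows if region == "London")

# Like =IF(): pick one of two values depending on a test.
labels = ["large" if pop >= 200 else "small" for _, pop in rows]

print(london_count)  # 2
print(london_total)  # 550
print(labels)        # ['large', 'small', 'large', 'small']
```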

[=CONCATENATE()]

Combine multiple bits of text together. Use =SPLIT() for the opposite.

[=VLOOKUP()]

Match the values in a cell with the corresponding row in another dataset.

[=XLOOKUP()]

Same as =VLOOKUP() but more flexible and easier to grasp.

Contact

Week 3

Introduction to [Data Journalism]

Bloomberg jobs

Toolbox

XLOOKUP

=XLOOKUP(search_term, col_to_search, col_to_return)
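
As a rough mental model, the three arguments can be read as "find this, in that column, and give me the value next to it from this other column". The Python sketch below imitates that behaviour. It is an illustration, not how the spreadsheet implements it; the `missing_value` parameter and the sample codes and names are hypothetical.

```python
def xlookup(search_term, col_to_search, col_to_return, missing_value=None):
    """Return the item in col_to_return that sits at the position where
    search_term first appears in col_to_search."""
    for key, value in zip(col_to_search, col_to_return):
        if key == search_term:
            return value
    return missing_value  # nothing matched


# Hypothetical area codes and names, standing in for two spreadsheet columns.
codes = ["A1", "B2", "C3"]
names = ["Northtown", "Eastville", "Westbury"]

print(xlookup("B2", codes, names))               # Eastville
print(xlookup("Z9", codes, names, "not found"))  # not found
```

Unlike =VLOOKUP(), the column you search and the column you return are passed separately, so the result column can sit to the left of the search column.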

Pivot Tables

Pivot tables are extra tables in your spreadsheet, in which you can summarise data from your original table.

You can calculate averages, counts, max/min values or sums for numbers in a group.
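
Under the hood, a pivot table groups rows by a category and then summarises each group. A minimal Python sketch of that idea follows; the categories and values are invented, and a real pivot table offers more options than shown here.

```python
from collections import defaultdict

# Hypothetical records: (category, value).
records = [
    ("Restaurant", 5),
    ("Pub", 3),
    ("Restaurant", 4),
    ("Pub", 5),
    ("School", 5),
]

# Step 1: group the values by category.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Step 2: summarise each group.
pivot = {
    category: {
        "count": len(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    for category, values in groups.items()
}

for category, summary in sorted(pivot.items()):
    print(category, summary)
```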

Averages

=MODE()

Finds the most common value in a range.

   

=MEDIAN()

Finds the value that’s right in the middle of a dataset.

   

=AVERAGE()

Sum all the values and divide by the number of records.
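
The three averages can give very different answers on the same data, which is why the choice matters. The sketch below uses Python's built-in statistics module on a made-up list of salaries (in thousands of pounds) where one very high value drags the mean upwards.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands of pounds; one outlier skews the mean.
salaries = [22, 25, 25, 28, 30, 34, 250]

print(mode(salaries))             # 25, the most common value
print(median(salaries))           # 28, the middle value once sorted
print(round(mean(salaries), 2))   # 59.14, the sum divided by the count
```

Here the mean suggests a "typical" salary of around 59, even though six of the seven people earn 34 or less. The median is usually the safer figure to report for skewed data like this.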

Census exercise v2

  1. Make a copy of this spreadsheet.
  2. Fill in the empty columns with formulas we learned last time.

Contact

Week 4

Data [Cleaning]

[Missing data]

Sometimes, records disappear or were never collected. It may not always be obvious when that is the case.

[Duplicated data]

Records can be repeated, either due to technical mishaps or due to repeated input.

[Misspellings]

Humans make mistakes. Assume any dataset manually created by humans to have missspelings.

[Poor formatting]

Some spreadsheets are designed to be read by humans, not computers. We then need to teach software to read it correctly.

[Parsing errors]

Wrong Excel formula? Üáč…áč›Ă«Ä‹Ă¶ÄĄáč…ĂŻáčĄĂ«áž‹ characters? Old Excel version? These can all mess with your data.

[Inconsistent data]

A column can mix different units, spell categories differently or combine figures recorded under different methodologies.
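
To make a few of these problems concrete, here is a small Python sketch that tidies inconsistent text, flags missing values and drops duplicates. The rows are invented, and the approach is only one of many; it does not attempt to catch misspellings, which need fuzzier matching than an exact comparison.

```python
# Hypothetical raw rows: (name, borough, rating). None marks a missing value.
raw = [
    ("Cafe Uno", " islington ", "5"),
    ("Cafe Uno", "Islington", "5"),   # duplicate once spacing and case are fixed
    ("Bar Dos", "ISLINGTON", None),   # missing rating
    ("Tres Grill", "Camden", "4"),
]

cleaned = []   # rows that are safe to analyse
missing = []   # names of records we could not use
seen = set()   # rows we have already kept

for name, borough, rating in raw:
    borough = borough.strip().title()    # fix stray spaces and inconsistent case
    if rating is None:
        missing.append(name)             # flag it rather than guess a value
        continue
    row = (name, borough, int(rating))   # parse the text rating as a number
    if row not in seen:                  # keep only the first copy of each row
        seen.add(row)
        cleaned.append(row)

print(cleaned)  # [('Cafe Uno', 'Islington', 5), ('Tres Grill', 'Camden', 4)]
print(missing)  # ['Bar Dos']
```

Keeping a separate list of what was dropped, and why, makes it easier to explain your methodology later.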

The guide to bad data

Data prep

Baby names exercise

  1. Download the Baby names for girls/boys in England and Wales file.
  2. Upload the spreadsheet to Google Drive and open it in Google Sheets.
  3. Find and clean Table 1.
  4. Read about Shart.

Also check last year’s tables: here.

OpenRefine

Download OpenRefine

Food hygiene data

Food hygiene exercise

  • Make a copy of the food hygiene ratings data.
  • Answer these questions:
    1. What kind of establishments are the best rated in Islington?
    2. Which establishments with a rating of 0-1 are due a follow-up visit?
    3. How many establishments received a 5-star rating this month?
    4. What are the cleanest hotspots (highest average hygiene with 10+ ratings)?
    5. Any other interesting angles?
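
One possible way to think about question 4 is sketched below in Python: group ratings by area, discard areas with too few ratings, then rank by average. This assumes a "hotspot" means an area such as a postcode district, which is my reading rather than something the exercise specifies. The rows are invented, and the threshold is lowered to 2 so the tiny example produces a result.

```python
from collections import defaultdict


def best_rated_areas(rows, min_ratings=10):
    """Rank areas by average rating, ignoring areas with fewer than
    min_ratings ratings."""
    by_area = defaultdict(list)
    for area, rating in rows:
        by_area[area].append(rating)
    averages = {
        area: sum(ratings) / len(ratings)
        for area, ratings in by_area.items()
        if len(ratings) >= min_ratings  # small samples are too noisy to rank
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (area, rating) pairs.
rows = [("N1", 5), ("N1", 4), ("N7", 5), ("N19", 3), ("N19", 4), ("N19", 5)]
print(best_rated_areas(rows, min_ratings=2))  # [('N1', 4.5), ('N19', 4.0)]
```

The minimum count is there because an area with a single 5 rating would otherwise top the list, which says little about the area as a whole.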

Group assignment

You will be split into teams of four to create a narrated PowerPoint presentation critiquing a data project featured in the Sigma Awards. You can choose a winner or a short-listed project (at the bottom of the page).

   

Deadline: Friday, December 8, 4pm

   

Deliverables

  • A 20-25 minute long narrated PowerPoint presentation.
  • A 500-word group reflection, with appropriate references (Harvard style; I suggest using Zotero as a citation manager).
  • Fill in this spreadsheet with your name and the project you’re working on.

Contact

Week 5

Data [Stories]

30-day challenge (Erwan)

30-day challenge (Jana)

Why do we visualise data?

Summarising data, like we did in previous lessons, is not always enough to reveal patterns or trends.

Visualising it can provide insight we’d otherwise lose out on.

What can we visualise?

Position

Size

    Width

    Height

    Area

Colour

    Fill

    Colour

    Opacity

    Pattern

Shape

Location

People suck at sizes

What visual encodings can you find in this chart?

The FT’s scatter plot

Standard scatter plot

Change scale to log

Size by population

Colour by continent

Animate over time

FT Vocabulary

Many types of charts

Gapminder

Gapminder exercise

  1. Make a copy of the Gapminder data.
  2. Working in groups, think of stories to tell based on this data.
  3. What would you visualise to support your story/stories?
  4. Would you look for any other datasets you can combine with this?
  5. Sketch out your visualisation ideas using pen & paper, Google Sheets charts or a platform like Datawrapper or Flourish.
  6. If you have time left, discuss plans for the group assignment.

Gapminder exercise

IMDb exercise

  1. Make a copy of the IMDb data.
  2. Working in groups, think of stories to tell based on this data.
  3. What would you visualise to support your story/stories?
  4. Would you look for any other datasets you can combine with this?
  5. Sketch out your visualisation ideas using pen & paper, Google Sheets charts or a platform like Datawrapper or Flourish.
  6. If you have time left, discuss plans for the group assignment.

Contact

Week 6

Data [Visualisation]

Dataviz Toolbox

Bar/column charts

Horizontal or vertical rectangles with lengths proportional to the values that they represent.

Good for comparing across different values or showing a trend over time.

Line chart

Shows values on a continuous scale. Similar to a scatter plot, except all dots are connected.

Good for showing trends over time.

Area chart

Similar to a line chart but the area underneath the line is coloured in. When stacked, it can show multiple data series as well as their cumulative trend.

Good for showing trends over time.

Scatter plot

Plots a dataset across two continuous dimensions, each on a different axis (X and Y).

Good for showing correlation between different data series.
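
A scatter plot shows correlation visually, but you can also put a number on it. The Python sketch below computes the Pearson correlation coefficient by hand, which ranges from -1 (perfect negative relationship) to 1 (perfect positive relationship). The data is invented, and a strong correlation on its own does not show that one thing causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How the two series vary together...
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...scaled by how much each varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical series: ys rises exactly in step with xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(pearson(xs, ys))  # 1.0
```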

Map

Only works with geographical data (duh!).

Even with geographical data, other charts can often be a better choice.

Anatomy of a chart

Colour scales

Colour Toolbox

Dataviz Fonts

Bad charts: Daily Mail

Bad charts: The Sun

Bad charts: Reuters

Bad charts: CBS News

WTF Visualizations

Covid exercise

  1. Download a copy of the Covid-19 data.
  2. Filter it to only the deaths and vaccination rates in Europe.
  3. Copy the data into Datawrapper.
  4. Create a scatter plot.
  5. Customise axis labels, colours, size, tooltips, title, description and source.
  6. Add annotations.
  7. Publish or download the chart.

Covid exercise - example

Taxis vs Uber exercise

  1. Copy this spreadsheet.
  2. Copy the data into Flourish.
  3. Create a bar chart race.
  4. Customise axis labels, colours, size, title, description and source.
  5. Add annotations (optionally).
  6. Publish or download the chart.

Taxis vs Uber exercise - example

Contact

Week 7

[Maps]

Choropleth

Pre-defined areas such as countries, regions or districts are coloured (using a sequential, diverging or categorical scale) according to values in a dataset.

Bubble/symbol map

Circles are drawn on top of a map, with their size or colour proportional to values in a dataset.

Cartogram / hex map

Cartograms resize regions in proportion to a variable in your dataset, such as population.

Hex maps standardise administrative units into same-sized hexagons, squares or triangles.

The 2019 general election

Not always the right choice

Try to only use maps when there’s a geographical pattern to your data.

Don’t make a map if it’s going to basically be a population map.

What data you’ll need

Geographical data

This can be polygons (areas), lines or points. Some tools have a few options by default, or you can get additional ones from the ONS, Natural Earth or ArcGIS.

Numerical or categorical data

This is the data that will be placed in the shapes on your map. It normally contains region IDs or coordinates (latitude and longitude) so it can be matched to the geography.
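
Mapping tools do this matching for you, but it is worth understanding what happens when the IDs in the two datasets do not line up. The Python sketch below joins made-up values to made-up shapes by region ID and reports what falls through the gaps. The IDs and names are hypothetical, and a plain string stands in for each polygon.

```python
# Hypothetical geography: region ID -> shape (a name stands in for the polygon).
shapes = {"R1": "North", "R2": "South", "R3": "East"}

# Hypothetical values keyed by the same kind of region ID.
values = {"R1": 42.0, "R2": 17.5, "R4": 8.0}

# Regions present in both datasets can be drawn and coloured.
joined = {rid: (shapes[rid], values[rid]) for rid in shapes if rid in values}

# Shapes with no value would be left blank on the map.
no_data = sorted(set(shapes) - set(values))

# Values with no shape have nowhere to go and are silently dropped.
no_shape = sorted(set(values) - set(shapes))

print(joined)    # {'R1': ('North', 42.0), 'R2': ('South', 17.5)}
print(no_data)   # ['R3']
print(no_shape)  # ['R4']
```

Checking both leftover lists before publishing helps catch mismatched ID formats or outdated boundaries.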

Internet speed exercise

  1. Make a copy of the internet speed in Europe data.
  2. Create a choropleth map (NUTS3, 2021).
  3. Copy the data in.
  4. Customise colours, title, description and source.
  5. Add annotations.
  6. Publish or download the chart.

Internet speed - example

Students exercise

  1. Go to the international students tab in the spreadsheet you copied earlier.
  2. Create a 3D globe flow map in Flourish.
  3. Copy the data in.
  4. Customise colours, title, description and source.
  5. Publish or download the chart.

Missing migrants exercise

  1. Go to the missing migrants tab in the spreadsheet you copied earlier.
  2. Copy the data into Datawrapper or Flourish.
  3. Create a symbol map.
  4. Customise it (colour, size, opacity, etc).

Missing migrants - example

Contact

Week 8

Data [Projects]

Often, the best narratives warrant going further than simple graphics.

Tailored visualisations designed specifically for a story will almost always be the best way to tell it.

Interactivity can sometimes help portray the intricacies of a story better.

The Martini Glass principle

Trackers

Trackers are data visualisations connected to a data source that is periodically updated.

Examples include FiveThirtyEight’s Biden approval rating tracker, Bloomberg’s Pret Index and the New Statesman’s Covid-19 tracker.

Calculators

Calculators allow readers to input their own data and receive a result.

Examples include the FT personal data worth calculator, the New Statesman election calculator and the BBC’s energy calculator.

Scrollable stories

Scrollytelling is the use of a browser’s scrolling functionality to interactively tell a data story.

Examples include The Impatient List, the New York Times delta variant story and the New Statesman million years lost investigation.

Information visualisation is meant to clarify data, but too much interactivity hinders understanding by transferring responsibility from the designer to the reader to work out the important points. — Martin Stabe (FT)

Readers just want to scroll, […] if you make the reader click or do anything other than scroll, something spectacular has to happen. — Archie Tse (NYT)

Interactive graphics are not just a fun addition but can actually increase the transparency of our work, open us for criticism, and thereby, hopefully, help re-build some trust in journalism. — Gregor Aisch (Datawrapper)

Let’s hear from you

Contact

Week 9

Scraping

Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. — Cloudflare

Data can be stuck behind inaccessible formats

PDFs

Many organisations still publish data in PDFs, a format built for presentation rather than analysis, which makes the data difficult to work with. Sometimes, they even do it on purpose.

Images

If you can’t find the data behind a chart, ask the author. If you can’t do that either, read it from the image.

Websites

If there’s structured information on a web page, you’ll most likely be able to extract it in a format that you can analyse.

Use Tabula to scrape PDF tables

Digitise image charts

Digitise maps

Scrape online charts

Hidden APIs

Import HTML tables into Sheets

  1. Create a new Google Sheets document.
  2. Open this Wikipedia page in another tab.
  3. Use the =IMPORTHTML() formula to import one of the tables.
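
In Sheets, the formula takes a URL, the type of element ("table" or "list") and the position of that element on the page. To show roughly what happens behind the scenes, the Python sketch below pulls the rows out of an HTML table using only the standard library. It works on a small invented HTML string rather than a live page, the class name is my own, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect cell text from every <table>, as one list of rows per table."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])            # start a new table
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])        # start a new row
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")    # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


# Invented HTML standing in for a web page.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.tables[0])
# [['Name', 'Value'], ['Alpha', '10'], ['Beta', '20']]
```

Every value comes back as text, so numbers still need converting before analysis, just as imported cells sometimes need reformatting in Sheets.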

Missing persons exercise

  1. Download and install the Web Scraper browser extension.
  2. Navigate to the Missing Persons website.
  3. Scrape it.
{"_id":"missingpersons","startUrl":["https://missingpersons.police.uk/en-gb/case-search/?page=[1-10]&orderBy=dateDesc"],"selectors":[{"id":"gender-age","parentSelectors":["case"],"type":"SelectorText","selector":"a:nth-of-type(7) div:nth-of-type(1) div","multiple":false,"regex":""},{"id":"reference","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(2) div","multiple":false,"regex":""},{"id":"location","parentSelectors":["case"],"type":"SelectorText","selector":"div.Detail:nth-of-type(3) div","multiple":false,"regex":""},{"id":"case","parentSelectors":["_root"],"type":"SelectorElement","selector":"a.CaseThumbnail","multiple":true},{"id":"link","parentSelectors":["case"],"type":"SelectorLink","selector":"_parent_","multiple":false},{"id":"ethnicity","parentSelectors":["link"],"type":"SelectorText","selector":"div.Entry:nth-of-type(3) div.Value","multiple":false,"regex":""}]}
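
The `[1-10]` in the sitemap's start URL is a page range: as I understand the Web Scraper convention, it stands for ten separate URLs, one per results page. The Python sketch below shows that expansion. It illustrates the idea only and is not the extension's own code; the function name is hypothetical, and it makes no network requests.

```python
import re


def expand_range_url(url):
    """Expand a single [start-end] placeholder into one URL per number."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]  # no range in the URL, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


start_url = (
    "https://missingpersons.police.uk/en-gb/case-search/"
    "?page=[1-10]&orderBy=dateDesc"
)
urls = expand_range_url(start_url)

print(len(urls))  # 10
print(urls[0])
print(urls[-1])
```

To scrape more or fewer pages, you would change the numbers inside the brackets in the sitemap.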

The Gazette exercise

  1. Go to the Gazette website.
  2. Scrape company names, notice IDs, types, company IDs, date, etc.

Module Evaluation Survey

Take part in the module evaluation survey.

Contact