%pip install duckduckgo_search
%pip install trafilatura
%pip install openai
%pip install pydantic
%pip install pandas
Identifying fossil fuel lobbyists with AI
Without looking it up, would you know if the “Clean Resource Innovation Network” is an oil and gas lobbying organisation?
These kinds of organisations are often present at events like the Bonn Climate Change Conference, the WEF in Davos, or the COP events. But it’s not always clear what interests they represent.
The work of uncovering their identities can be time-consuming and challenging. At COP28 in Dubai in 2023, some 85,000 people registered to attend. Figuring out how many of them are there to push the fossil fuel industry’s agenda is a daunting task.
Kick Big Polluters Out (of which Global Witness is a member) has been doing this work every year. Last year, we found 2,456 fossil fuel lobbyists who had been granted access to COP28.
This workshop will outline an approach to using web scraping and Large Language Models (LLMs), like those powering ChatGPT, to help speed up this kind of work. These techniques could also be adapted to other climate projects, such as categorising meetings between lobbyists and politicians, or identifying climate disinformation campaigns.
We will be looking at COP28 attendees and attempting to pick out the fossil fuel lobbyists. You can download the notebook and run it yourself, or you can run it in the cloud using Google Colab. Both links are also in the sidebar to the right, or at the bottom of the page on mobile.
Install and load libraries
# %pip install --upgrade duckduckgo-search
import requests
import os
import pandas as pd
from duckduckgo_search import DDGS
from trafilatura import extract
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI
import json
You will need an API key from OpenAI or another provider.
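The OpenAI client reads the key from the `OPENAI_API_KEY` environment variable, so you can set it once rather than pasting it into every notebook. A minimal sketch (the key shown is a placeholder):

```python
import os

# Store the key in an environment variable instead of hard-coding it in the
# notebook. The OpenAI client picks up OPENAI_API_KEY automatically.
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"  # placeholder, use your own key
```

Alternatively, export it in your shell before launching the notebook so it never appears in the code at all.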
Prep data
The UNFCCC website published an Excel sheet of COP28 participants. Let’s download it to our local project.
cop_file = 'data/plop28.xlsx'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'
}

url = "https://unfccc.int/sites/default/files/resource/PLOP%20COP28_on-site.xlsx"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    # Check if the 'data' folder exists, create it if it doesn't
    if not os.path.exists("data"):
        os.makedirs("data")

    with open("data/plop28.xlsx", "wb") as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")
The participants are spread across multiple sheets, so we need to read them in separately and bind them together.
# Create an empty list to store the dataframes from each sheet
cop_participants = []

# Read the Excel file
xls = pd.ExcelFile(cop_file)

# Iterate through the sheets and append the dataframes to the list
for sheet_name in xls.sheet_names:
    df = pd.read_excel(cop_file, sheet_name=sheet_name)
    cop_participants.append(df)

# Concatenate the dataframes into a single dataframe
cop_participants = pd.concat(cop_participants, ignore_index=True)

cop_participants.head()
| | nominator | name | func_title | department | organization | relation |
|---|---|---|---|---|---|---|
| 0 | Albania | H.E. Mr. Edi Rama | Prime Minister | Prime Minister Office | Prime Minister Office | Choose not to disclose |
| 1 | Albania | H.E. Ms. Mirela Furxhi | Minister of Tourism and Environment | Ministry of Tourism and Environment | Ministry of Tourism and Environment | Choose not to disclose |
| 2 | Albania | H.E. Ms. Belinda Balluku | Deputy Prime Minister and Minister of Infrastr... | Ministry of Infrastructure and Energy | Ministry of Infrastructure and Energy | Choose not to disclose |
| 3 | Albania | Ms. Lindita Rama | Spouse of the Prime Minister | Not applicable | Not applicable | Choose not to disclose |
| 4 | Albania | H.E. Mr. Ridi Kurtezi | Ambassador of the Republic of Albania to the UAE | Albanian Embassy in United Arab Emirates | Albanian Embassy in United Arab Emirates | Choose not to disclose |
This list contains all participants registered to attend the 2023 United Nations Climate Change Conference or Conference of the Parties (COP28).
We’re not interested in the individuals, just the organisations they represent. Classifying individuals is a much more complex task that doesn’t work well with this process. Additionally, it would be too difficult to verify the results.
Let’s extract the organisations. We’ll only keep a random sample of 5 rows for this example; you can remove the `sample` call to process the entire dataset.

cop_orgs = cop_participants[['organization']].drop_duplicates().sample(5)
cop_orgs
| | organization |
|---|---|
| 37190 | IUCN Regional Office for West Asia |
| 33297 | The Climate Reality Project Philippines |
| 18860 | Seychelles Meteorological Authorithy |
| 7675 | Ministry of Local Government, Lands, Regional ... |
| 23338 | Chairperson s Secretariat |
Search
Large Language Models are prone to hallucinations. If you ask an LLM a question it doesn’t know the answer to, it will confidently make up a plausible-sounding answer that is completely wrong. This is particularly the case with lesser-known organisations that wouldn’t feature prominently in the training data.
Let’s ask ChatGPT if “Clean Resource Innovation Network” is a fossil fuel organisation or not.
= "Clean Resource Innovation Network" org
def make_request_openai_simple(prompt_system: str, prompt_user: str, model: str = "gpt-4o-mini", **kwargs) -> str:
= OpenAI(
client # api_key = ''
)= client.beta.chat.completions.parse(
response =model,
model=0,
temperature=[
messages"role": "system", "content": prompt_system},
{"role": "user", "content": prompt_user}
{
]
)return response.choices[0].message.content
="You are an AI whose job it is to help researchers identify fossil fuel organisations",
make_request_openai_simple(prompt_system=f"Is ${org} a fossil fuel organisation? Respond with YES or NO") prompt_user
'NO'
One way to minimise (but not completely eliminate) hallucinations is Retrieval-Augmented Generation (RAG). Very simply, this means providing the AI with some additional factual context for the question you’re asking.
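The pattern can be sketched in a few lines: retrieved documents are pasted into the prompt ahead of the question, so the model answers from the supplied context rather than from memory. `build_rag_prompt` is a hypothetical helper for illustration, not part of the workflow below:

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Prepend retrieved documents to the question so the model
    answers from the supplied context rather than from memory."""
    context = "\n\n".join(documents)
    return f"{context}\n\nUsing only the documents above, answer: {question}"
```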
One example comes from The San Francisco Chronicle, who launched a chatbot that answers questions about Kamala Harris.
In our case, we’ll provide the AI with relevant search results related to our organisation so that it knows who we’re asking about.
We’ll use DuckDuckGo because it has a free API. For better results, you can use the Google API or a SERP API.
Let’s search for the organisation and extract the first 5 results.
= f'"{org}" oil gas coal'
search print(search)
= DDGS().text(search, max_results=5)
results results
"Clean Resource Innovation Network" oil gas coal
[{'title': 'Clean Resource Innovation Network',
'href': 'https://www.cleanresourceinnovation.com/',
'body': 'The Clean Resource Innovation Network (CRIN) is a pan-Canadian network founded to enable cleaner energy development by commercializing and adopting technologies for the oil and gas industry. We bring together diverse expertise from industry, entrepreneurs, investors, academia, governments, and many others to enable solutions that improve the ...'},
{'title': '17 new technologies funded by CRIN competition to address economic ...',
'href': 'https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/',
'body': "March 9, 2022 CALGARY, Alberta - Clean Resource Innovation Network (CRIN) Today CRIN is announcing funding of over $44 million CAD for 17 projects identified through its Reducing Environmental Footprint oil and gas technology competition. This brings the total investment through three competitions to $80 million. CRIN's competitions are designed to…"},
{'title': 'CRIN Funds an Additional Nineteen Projects through the Oil - GlobeNewswire',
'href': 'https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html',
'body': 'The Clean Resource Innovation Network (CRIN) is proud to announce the funding of nineteen (19) additional high-impact projects, totaling $16.1 million CAD in support. With this new commitment ...'},
{'title': 'New technologies identified for funding by CRIN competitions will ...',
'href': 'https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/',
'body': "CALGARY, Alberta, Jan. 26, 2022 (GLOBE NEWSWIRE) -- Clean Resource Innovation Network (CRIN) Achieving CRIN's goal of reducing 100 megatonnes of CO2 equivalent (CO2e) emissions from producing Canada's oil and gas resources by 2033 is within reach! Today, CRIN is announcing the first projects identified for funding awards through CRIN's $80…"},
{'title': 'Clean Resource Innovation Network (CRIN) on LinkedIn: Oil & Gas ...',
'href': 'https://www.linkedin.com/posts/crin_oil-gas-cleantech-challenge-activity-7231359278865952768-q2m_',
'body': 'The 2024 Colorado Oil & Gas Cleantech Challenge, ... Clean Resource Innovation Network (CRIN) 7,102 followers 2d Report this post The 2024 Colorado Oil & Gas Cleantech Challenge, a product ...'}]
Scrape search results
Now, we want to extract the text from each of those URLs. We’ll use Trafilatura, a library that will help us extract the main text without headers, footers and other irrelevant text.
def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            results.append((url, f"Error: {str(e)}"))

    return results

# Run the function
texts = extract_text([result['href'] for result in results if 'href' in result])
texts = [(url, text) for url, text in texts if text is not None]  # remove empty scrapes
Scraping https://www.cleanresourceinnovation.com/...
Scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/...
Error scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/
Scraping https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html...
Scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/...
Error scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/
Scraping https://www.linkedin.com/posts/crin_oil-gas-cleantech-challenge-activity-7231359278865952768-q2m_...
# Paste text together
prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()

prompt_system = 'You will be provided with a collection of documents collected from Google search results. Your task is to determine whether an organization is a fossil fuel company or lobbying group or not.'

prompt_instructions = f'''
## Instructions
You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:

- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Includes companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''
Send request to LLM
There are various LLMs available, each with different capabilities and costs.
For our task, there are a few things we need to consider:
- Performance: Is the model intelligent enough to understand the task?
- Cost: If you are running tens of thousands of requests, the cost can add up quickly. Models like Claude 3 Opus quickly become unaffordable.
- Rate limits: Some platforms impose limits on how many times you can call the API in a given time period (minute, hour, day) and how big the requests can be.
- Other features: Some models offer additional features like better support for various languages, prompt caching, or structured outputs.
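Rate limits in particular are worth handling up front: a simple wrapper with exponential backoff will get you past transient rate-limit errors. This is a generic sketch (the function name and parameters are illustrative); in practice you would catch your provider's specific rate-limit exception rather than bare `Exception`:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```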
We’ll use OpenAI’s gpt-4o-mini for this classification. One advantage of this particular model is its support for Structured Outputs. This means you can force the response to follow a certain set of rules.
Let’s define what we want the output to be.
class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description="Is this a fossil fuel organization?")
    explanation: str = Field(description="A brief explanation of your decision, in English")
    source: str = Field(description="A link to the SINGLE most relevant source that supports your classification")
Now, let’s make the request to OpenAI. First, we define a function like we did before, with a few tweaks.
def make_request_openai(prompt_system: str, prompt_instructions: str, prompt_documents: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""
    client = OpenAI(
        # api_key = ''
    )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": f'{prompt_documents}\n\n{prompt_instructions}'}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content
Now, let’s run the function on our example.
openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)
print(openai_response)
{"fossil_fuel_link":true,"explanation":"The Clean Resource Innovation Network (CRIN) is focused on enabling cleaner energy development specifically for the oil and gas industry. It supports projects that aim to improve the environmental performance of this sector, which indicates a direct involvement with fossil fuels. The organization is dedicated to commercializing technologies that benefit the oil and gas industry, which aligns with the characteristics of a fossil fuel organization.","source":"https://www.cleanresourceinnovation.com/"}
Scale
We’ve seen an example of how to classify one organisation. The advantage of using AI is that we can scale this to (hundreds of) thousands of operations.
Let’s define a few functions to help us with this.
def classify_org(org: str):
    search = f'"{org}" oil gas coal'

    results = DDGS().text(search, max_results=5)

    texts = extract_text([result['href'] for result in results if 'href' in result])
    texts = [(url, text) for url, text in texts if text is not None]  # remove empty scrapes

    prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()

    prompt_instructions = f'''
## Instructions
You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:

- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Includes companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

    openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)

    return openai_response
def apply_classify_org(df):
    df['classification'] = df.apply(lambda row: classify_org(org=row['organization']), axis=1)
    df['classification'] = df['classification'].apply(json.loads)
    df = pd.concat([df.drop(['classification'], axis=1), df['classification'].apply(pd.Series)], axis=1)

    return df
Now let’s run this on our sample of organisations.
cop_orgs_classified = apply_classify_org(cop_orgs)
Scraping https://climatereality.ph/reenergizeph/...
Scraping https://climatereality.ph/2021/08/21/climate-reality-ph-geop-a-potent-weapon-against-unreliable-coal-sourced-power/...
Scraping https://climatereality.ph/2023/09/28/climate-reality-ph-builds-momentum-for-geop-implementation-in-mindanao/...
Scraping https://mirror.pia.gov.ph/news/2021/08/22/climate-reality-geop-a-potent-weapon-vscoal-sourced-power...
Scraping https://www.facebook.com/climaterealityphilippines/posts/green-energy-option-program-a-potent-weapon-against-unreliable-coal-sourced-powe/4243220149098727/...
cop_orgs_classified
| | organization | fossil_fuel_link | explanation | source |
|---|---|---|---|---|
| 37190 | IUCN Regional Office for West Asia | False | The IUCN Regional Office for West Asia is prim... | https://www.iucn.org/regions/west-asia |
| 33297 | The Climate Reality Project Philippines | False | The Climate Reality Project Philippines focuse... | https://climatereality.ph/reenergizeph/ |
| 18860 | Seychelles Meteorological Authorithy | False | The Seychelles Meteorological Authority is a g... | https://www.seychelles.gov.sc/Departments/mete... |
| 7675 | Ministry of Local Government, Lands, Regional ... | False | The Ministry of Local Government, Lands, Regio... | https://www.example.com/ministry-local-government |
| 23338 | Chairperson s Secretariat | False | The 'Chairperson's Secretariat' does not appea... | https://www.example.com/chairpersons-secretariat |
cop_orgs_classified.to_csv('data/cop_orgs_classified.csv')
What next?
There are lots of things we can improve about this process. Here are some ideas:
- Play around with the prompt.
- Change search engine. DuckDuckGo is free and good for a prototype. However, their API isn’t meant to be used in this way and will often deny requests. It also doesn’t return the best results. I recommend switching to Google.
- Try other models. If you find that gpt-4o-mini is insufficient, you can use the smarter gpt-4o.
- Validate the output. If you use other models without Structured Output support, you can use Guardrails to validate their output. It also lets you validate other things, like the language of the output.
- Cache things. Don’t start over if something goes wrong, save the search results, scrapes and LLM outputs and continue where you left off.
- Multithreading. You can use Python’s multithreading to run multiple classifications in parallel, significantly speeding up the process.
- Verify. LLMs are still dumb and shouldn’t be trusted. Manually verify the classifications if you’re going to publish the results!
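For instance, the multithreading point above can be sketched with `concurrent.futures`; the classification is network-bound (search, scraping, the LLM API), so threads help a lot. `classify_many` is an illustrative helper, with `classify` standing in for the `classify_org` function defined earlier:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_many(orgs, classify, max_workers=8):
    """Run a classification function over many organisations in parallel.

    Results come back in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, orgs))
```

Keep `max_workers` modest so you don't run into API rate limits faster than a retry strategy can absorb.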