Identifying fossil fuel lobbyists with AI

Without looking it up, would you know if the “Clean Resource Innovation Network” is an oil and gas lobbying organisation? Lobbyists often hide under unintuitive or misleading affiliations that obscure their origins. The work of uncovering their identities can be time-consuming and challenging.

Kick Big Polluters Out (of which Global Witness is a member) have been doing this work every year. We found that 2,456 fossil fuel lobbyists were granted access to the COP28 summit in Dubai last year.

This kind of work involves a lot of manual research. Are there ways to speed it up?

This workshop will outline an approach to using web scraping and Large Language Models (LLMs), like those powering ChatGPT, to systematically identify organisations that are affiliated with the fossil fuel industry. These techniques could also be adapted to other climate projects, such as identifying climate disinformation.

You can download the notebook and run it yourself, or you can run it in the cloud using Google Colab. Both links are also in the sidebar to the right, or at the bottom of the page on mobile.

Install and load libraries

%pip install duckduckgo_search
%pip install trafilatura
%pip install openai
%pip install pydantic
%pip install pandas
# %pip install --upgrade duckduckgo-search
import requests
import os
import pandas as pd
from duckduckgo_search import DDGS
from trafilatura import extract
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI
import json

You will need an API key from OpenAI or another provider.
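
The OpenAI client reads the key from the `OPENAI_API_KEY` environment variable by default, so one option is to set that variable rather than pasting the key into the code. A minimal sketch of a check you could run first; the helper name `require_api_key` is ours, not part of any library:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored in `name`, or raise a helpful error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or set it in the "
            "notebook before creating the client."
        )
    return key
```

Calling `require_api_key()` at the top of the notebook fails early with a clear message instead of halfway through a long run.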

Prep data

The UNFCCC website published an Excel sheet of COP28 participants. Let’s download it to our local project.

cop_file = 'data/plop28.xlsx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'
}

url = "https://unfccc.int/sites/default/files/resource/PLOP%20COP28_on-site.xlsx"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    # Create the 'data' folder if it doesn't exist yet
    os.makedirs("data", exist_ok=True)

    with open(cop_file, "wb") as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

The participants are spread across multiple sheets, so we need to read them in separately and bind them together.

# Create an empty list to store the dataframes from each sheet
cop_participants = []

# Read the Excel file
xls = pd.ExcelFile(cop_file)

# Iterate through the sheets and append the dataframes to the list
for sheet_name in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet_name)
    cop_participants.append(df)

# Concatenate the dataframes into a single dataframe
cop_participants = pd.concat(cop_participants, ignore_index=True)
cop_participants.head()
  | nominator | name | func_title | department | organization | relation
0 | Albania | H.E. Mr. Edi Rama | Prime Minister | Prime Minister Office | Prime Minister Office | Choose not to disclose
1 | Albania | H.E. Ms. Mirela Furxhi | Minister of Tourism and Environment | Ministry of Tourism and Environment | Ministry of Tourism and Environment | Choose not to disclose
2 | Albania | H.E. Ms. Belinda Balluku | Deputy Prime Minister and Minister of Infrastr... | Ministry of Infrastructure and Energy | Ministry of Infrastructure and Energy | Choose not to disclose
3 | Albania | Ms. Lindita Rama | Spouse of the Prime Minister | Not applicable | Not applicable | Choose not to disclose
4 | Albania | H.E. Mr. Ridi Kurtezi | Ambassador of the Republic of Albania to the UAE | Albanian Embassy in United Arab Emirates | Albanian Embassy in United Arab Emirates | Choose not to disclose

This list contains all participants registered to attend the 2023 United Nations Climate Change Conference or Conference of the Parties (COP28).

We’re not interested in the individuals, just the organisations they represent. Classifying individuals is a much more complex task that doesn’t work well with this process. Additionally, it would be too difficult to verify the results.

Let’s extract the organisations. We’ll only keep a random sample of 5 rows for this example; you can remove the sample function to process the entire dataset.

cop_orgs = cop_participants[['organization']].drop_duplicates().sample(5)
cop_orgs
organization
37190 IUCN Regional Office for West Asia
33297 The Climate Reality Project Philippines
18860 Seychelles Meteorological Authorithy
7675 Ministry of Local Government, Lands, Regional ...
23338 Chairperson s Secretariat
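
To classify an organisation, we first need some web pages about it. The single-organisation walkthrough that follows relies on two variables, `org` and `results`, holding the organisation name and its search results. A minimal sketch of that step, mirroring the query that `classify_org` uses later on; the helper names `build_query` and `search_org` are ours, and we assume the example organisation is the Clean Resource Innovation Network from the introduction:

```python
def build_query(org):
    """Quote the organisation name and add fossil fuel keywords."""
    return f'"{org}" oil gas coal'


def search_org(org, search_fn=None, max_results=5):
    """Return a list of search results (dicts that should contain an 'href' key).

    `search_fn` can be swapped for another search engine or a test stub;
    by default we use DuckDuckGo via the duckduckgo_search package.
    """
    if search_fn is None:
        # Imported lazily so a stub can be used without the package installed
        from duckduckgo_search import DDGS

        def search_fn(query, max_results):
            return DDGS().text(query, max_results=max_results)

    return list(search_fn(build_query(org), max_results))


org = "Clean Resource Innovation Network"  # assumed example organisation
# results = search_org(org)  # uncomment to run a live search
```

Keeping the search engine behind a `search_fn` argument makes it easier to switch providers later without touching the rest of the pipeline.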

Scrape search results

Now, we want to extract the text from each of the URLs in the search results. We’ll use Trafilatura, a library that will help us extract the main text without headers, footers and other irrelevant text.

def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            results.append((url, None))  # failed scrapes are filtered out below, so error text never reaches the prompt

    return results
# Run the function
texts = extract_text([result['href'] for result in results if 'href' in result])
texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes
Scraping https://www.cleanresourceinnovation.com/...
Scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/...
Error scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/
Scraping https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html...
Scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/...
Error scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/
Scraping https://www.linkedin.com/posts/crin_oil-gas-cleantech-challenge-activity-7231359278865952768-q2m_...
# Paste text together
prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()
prompt_system = 'You will be provided with a collection of documents collected from web search results. Your task is to determine whether an organization is a fossil fuel company or lobbying group or not.'

prompt_instructions = f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

Send request to LLM

There are various LLMs available, each with different capabilities and costs.

For our task, there are a few things we need to consider:

  • Performance: Is the model intelligent enough to understand the task?
  • Cost: If you are running tens of thousands of requests, the cost can add up quickly. Models like Claude 3 Opus quickly become unaffordable.
  • Rate limits: Some platforms impose limits on how many times you can call the API in a given time period (minute, hour, day) and how big the requests can be.
  • Other features: Some models offer additional features like better support for various languages, prompt caching, or structured outputs.
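
To get a feel for the cost point above, it helps to do the arithmetic before launching a large run. A rough sketch; the token counts and per-million-token prices below are placeholder assumptions, not real prices, so check your provider's current pricing:

```python
def estimate_cost(n_requests, input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Estimate the total cost of a batch of LLM requests.

    Token counts are averages per request; prices are per million tokens.
    """
    cost_in = n_requests * input_tokens * price_in_per_million / 1_000_000
    cost_out = n_requests * output_tokens * price_out_per_million / 1_000_000
    return cost_in + cost_out


# Hypothetical numbers: 20,000 organisations, about 8,000 input tokens of
# scraped text and 150 output tokens each, at made-up prices of 0.50 and 2.00.
total = estimate_cost(20_000, 8_000, 150, 0.50, 2.00)
print(f"Estimated cost: {total:.2f}")
```

Most of the cost comes from the scraped documents in the input, so trimming long pages is usually the cheapest saving.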

We’ll use OpenAI’s gpt-4o-mini for this classification. One advantage of this particular model is its support for Structured Outputs. This means you can force the response to follow a certain set of rules.

Let’s define what we want the output to be.

class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description = "Is this a fossil fuel organization?")
    explanation: str = Field(description = "A brief explanation of your decision, in English")
    source: str = Field(description = "A link to the SINGLE most relevant source that supports your classification")

Now, let’s make the request to OpenAI. First, we define a function like we did before, with a few tweaks.

def make_request_openai(prompt_system: str, prompt_instructions: str, prompt_documents: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""
    client = OpenAI(
        # api_key = ''
        )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": f'{prompt_documents}\n\n{prompt_instructions}'}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content

Now, let’s run the function on our example.

openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)
print(openai_response)
{"fossil_fuel_link":true,"explanation":"The Clean Resource Innovation Network (CRIN) is focused on enabling cleaner energy development specifically for the oil and gas industry. It supports projects that aim to improve the environmental performance of this sector, which indicates a direct involvement with fossil fuels. The organization is dedicated to commercializing technologies that benefit the oil and gas industry, which aligns with the characteristics of a fossil fuel organization.","source":"https://www.cleanresourceinnovation.com/"}
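
The function returns the model's answer as a JSON string. A small sketch of how to turn it into a Python dictionary with the standard library; the sample string is an abbreviated copy of the response above, with the explanation shortened:

```python
import json

# Abbreviated copy of the response above (explanation shortened for readability)
sample_response = (
    '{"fossil_fuel_link":true,'
    '"explanation":"The Clean Resource Innovation Network (CRIN) is focused on '
    'enabling cleaner energy development specifically for the oil and gas industry.",'
    '"source":"https://www.cleanresourceinnovation.com/"}'
)

classification = json.loads(sample_response)

if classification["fossil_fuel_link"]:
    print(f"Flagged, see {classification['source']}")
else:
    print("Not flagged")
```

Because the schema guarantees the three fields are present, we can index into the dictionary without checking for missing keys.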

Scale

We’ve seen an example of how to classify one organisation. The advantage of using AI is that we can scale this to (hundreds of) thousands of operations.

Let’s define a few functions to help us with this.

def classify_org(org: str):
    search = f'"{org}" oil gas coal'

    results = DDGS().text(search, max_results=5)

    texts = extract_text([result['href'] for result in results if 'href' in result])
    texts = [(url, text) for url, text in texts if text is not None] # remove empty scrapes

    prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()
    prompt_instructions = f'''
## Instructions

You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:
- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

    openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)

    return openai_response


def apply_classify_org(df):
    # Classify each organisation; the model returns a JSON string
    df['classification'] = df.apply(lambda row: classify_org(org = row['organization']), axis=1)
    # Parse the JSON strings into dictionaries
    df['classification'] = df['classification'].apply(json.loads)
    # Expand each dictionary into its own columns
    df = pd.concat([df.drop(['classification'], axis=1), df['classification'].apply(pd.Series)], axis=1)

    return df

Now let’s run this on our sample of organisations.

cop_orgs_classified = apply_classify_org(cop_orgs)
Scraping https://climatereality.ph/reenergizeph/...
Scraping https://climatereality.ph/2021/08/21/climate-reality-ph-geop-a-potent-weapon-against-unreliable-coal-sourced-power/...
Scraping https://climatereality.ph/2023/09/28/climate-reality-ph-builds-momentum-for-geop-implementation-in-mindanao/...
Scraping https://mirror.pia.gov.ph/news/2021/08/22/climate-reality-geop-a-potent-weapon-vscoal-sourced-power...
Scraping https://www.facebook.com/climaterealityphilippines/posts/green-energy-option-program-a-potent-weapon-against-unreliable-coal-sourced-powe/4243220149098727/...
cop_orgs_classified
  | organization | fossil_fuel_link | explanation | source
37190 | IUCN Regional Office for West Asia | False | The IUCN Regional Office for West Asia is prim... | https://www.iucn.org/regions/west-asia
33297 | The Climate Reality Project Philippines | False | The Climate Reality Project Philippines focuse... | https://climatereality.ph/reenergizeph/
18860 | Seychelles Meteorological Authorithy | False | The Seychelles Meteorological Authority is a g... | https://www.seychelles.gov.sc/Departments/mete...
7675 | Ministry of Local Government, Lands, Regional ... | False | The Ministry of Local Government, Lands, Regio... | https://www.example.com/ministry-local-government
23338 | Chairperson s Secretariat | False | The 'Chairperson's Secretariat' does not appea... | https://www.example.com/chairpersons-secretariat
cop_orgs_classified.to_csv('data/cop_orgs_classified.csv')

What next?

There are lots of things we can improve about this process. Here are some ideas:

  • Play around with the prompt.
  • Change search engine. DuckDuckGo is free and good for a prototype. However, their API isn’t meant to be used this way and will often deny requests. It also doesn’t return the best results. I recommend switching to Google.
  • Try other models. If you find that gpt-4o-mini is insufficient, you can use the smarter gpt-4o.
  • Validate the output. If you use other models without Structured Output support, you can use Guardrails to validate their output. It also lets you validate other things, like the language of the output.
  • Cache things. Don’t start over if something goes wrong, save the search results, scrapes and LLM outputs and continue where you left off.
  • Multithreading. You can use Python’s multithreading to run multiple classifications in parallel, significantly speeding up the process.
  • Verify. LLMs are still dumb and shouldn’t be trusted. Manually verify the classifications if you’re going to publish the results!
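
Two of the ideas above, caching and multithreading, can be combined in a few lines. A minimal sketch, assuming a `classify` function that takes an organisation name and returns a JSON-serialisable result (for example a wrapper around `classify_org`); the names `classify_all` and `cache_path` are ours, and the worker count is a guess that you should tune to your rate limits:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def classify_all(orgs, classify, cache_path="data/classification_cache.json", max_workers=4):
    """Classify organisations in parallel, skipping any already in the cache."""
    # Load earlier results so an interrupted run can pick up where it left off
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)

    # De-duplicate while keeping order, then drop anything already cached
    todo = [org for org in dict.fromkeys(orgs) if org not in cache]

    def safe_classify(org):
        # Catch errors per organisation so one failure doesn't stop the batch
        try:
            return org, classify(org), None
        except Exception as e:
            return org, None, str(e)

    def save():
        folder = os.path.dirname(cache_path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f, ensure_ascii=False, indent=2)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Results are collected in the main thread, so only one writer touches the file
        for org, result, error in pool.map(safe_classify, todo):
            if error is None:
                cache[org] = result
                save()  # write after every result so progress survives a crash
            else:
                print(f"Failed to classify {org}: {error}")

    return cache
```

For example, `classify_all(cop_orgs['organization'], lambda org: json.loads(classify_org(org)))` would store the parsed results keyed by organisation name. Failed organisations are not cached, so they are retried on the next run.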