%pip install duckduckgo_search
%pip install trafilatura
%pip install openai
%pip install pydantic
%pip install pandas
Identifying fossil fuel lobbyists with AI
Without looking it up, would you know if the “Clean Resource Innovation Network” is an oil and gas lobbying organisation?
These kinds of organisations are often present at events like the Bonn Climate Change Conference, the WEF in Davos, or the COP events. But it’s not always clear what interests they represent.
The work of uncovering their identities can be time-consuming and challenging. At COP28 in Dubai in 2023, some 85,000 people registered to attend. Figuring out how many of them are there to push the fossil fuel industry’s agenda is a daunting task.
Kick Big Polluters Out (of which Global Witness is a member) has been doing this work every year. Last year, we found 2,456 fossil fuel lobbyists who had been granted access to COP28.
This workshop will outline an approach to using web scraping and Large Language Models (LLMs), like those powering ChatGPT, to help speed up this kind of work. These techniques could also be adapted to other climate projects, such as categorising meetings between lobbyists and politicians, or identifying climate disinformation campaigns.
We will be looking at COP28 attendees and attempting to pick out the fossil fuel lobbyists. You can download the notebook and run it yourself, or you can run it in the cloud using Google Colab. Both links are also in the sidebar to the right, or at the bottom of the page on mobile.
Install and load libraries
# %pip install --upgrade duckduckgo-search
import requests
import os
import pandas as pd
from duckduckgo_search import DDGS
from trafilatura import extract
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI
import json
You will need an API key from OpenAI or another provider.
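The OpenAI client reads the key from the `OPENAI_API_KEY` environment variable, so you can set it once rather than pasting it into every notebook. A minimal sketch (the key shown is a placeholder):

```python
import os

# Store the key in an environment variable instead of hard-coding it in the
# notebook. The OpenAI client picks up OPENAI_API_KEY automatically.
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"  # placeholder, use your own key
```

Alternatively, export it in your shell before launching the notebook so it never appears in the code at all.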
Prep data
The UNFCCC website published an Excel sheet of COP28 participants. Let’s download it to our local project.
cop_file = 'data/plop28.xlsx'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'
}

url = "https://unfccc.int/sites/default/files/resource/PLOP%20COP28_on-site.xlsx"

response = requests.get(url, headers=headers)

if response.status_code == 200:
    # Check if the 'data' folder exists, create it if it doesn't
    if not os.path.exists("data"):
        os.makedirs("data")

    with open("data/plop28.xlsx", "wb") as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")
The participants are spread across multiple sheets, so we need to read them in separately and bind them together.
# Create an empty list to store the dataframes from each sheet
cop_participants = []

# Read the Excel file
xls = pd.ExcelFile(cop_file)

# Iterate through the sheets and append the dataframes to the list
for sheet_name in xls.sheet_names:
    df = pd.read_excel(cop_file, sheet_name=sheet_name)
    cop_participants.append(df)

# Concatenate the dataframes into a single dataframe
cop_participants = pd.concat(cop_participants, ignore_index=True)

cop_participants.head()
| | nominator | name | func_title | department | organization | relation |
|---|---|---|---|---|---|---|
| 0 | Albania | H.E. Mr. Edi Rama | Prime Minister | Prime Minister Office | Prime Minister Office | Choose not to disclose |
| 1 | Albania | H.E. Ms. Mirela Furxhi | Minister of Tourism and Environment | Ministry of Tourism and Environment | Ministry of Tourism and Environment | Choose not to disclose |
| 2 | Albania | H.E. Ms. Belinda Balluku | Deputy Prime Minister and Minister of Infrastr... | Ministry of Infrastructure and Energy | Ministry of Infrastructure and Energy | Choose not to disclose |
| 3 | Albania | Ms. Lindita Rama | Spouse of the Prime Minister | Not applicable | Not applicable | Choose not to disclose |
| 4 | Albania | H.E. Mr. Ridi Kurtezi | Ambassador of the Republic of Albania to the UAE | Albanian Embassy in United Arab Emirates | Albanian Embassy in United Arab Emirates | Choose not to disclose |
This list contains all participants registered to attend the 2023 United Nations Climate Change Conference or Conference of the Parties (COP28).
We’re not interested in the individuals, just the organisations they represent. Classifying individuals is a much more complex task that doesn’t work well with this process. Additionally, it would be too difficult to verify the results.
Let’s extract the organisations. We’ll only keep a random sample of 5 rows for this example; you can remove the `sample` call to process the entire dataset.

cop_orgs = cop_participants[['organization']].drop_duplicates().sample(5)
cop_orgs
| | organization |
|---|---|
| 37190 | IUCN Regional Office for West Asia |
| 33297 | The Climate Reality Project Philippines |
| 18860 | Seychelles Meteorological Authorithy |
| 7675 | Ministry of Local Government, Lands, Regional ... |
| 23338 | Chairperson s Secretariat |
Search
Large Language Models are prone to hallucinations. If you ask an LLM a question it doesn’t know the answer to, it will confidently make up a plausible-sounding answer that is completely wrong. This is particularly the case with lesser-known organisations that wouldn’t feature prominently in the training data.
Let’s ask ChatGPT if “Clean Resource Innovation Network” is a fossil fuel organisation or not.
= "Clean Resource Innovation Network" org
def make_request_openai_simple(prompt_system: str, prompt_user: str, model: str = "gpt-4o-mini", **kwargs) -> str:
= OpenAI(
client # api_key = ''
)= client.beta.chat.completions.parse(
response =model,
model=0,
temperature=[
messages"role": "system", "content": prompt_system},
{"role": "user", "content": prompt_user}
{
]
)return response.choices[0].message.content
="You are an AI whose job it is to help researchers identify fossil fuel organisations",
make_request_openai_simple(prompt_system=f"Is ${org} a fossil fuel organisation? Respond with YES or NO") prompt_user
'NO'
One way to minimise (but not completely eliminate) hallucinations is Retrieval-Augmented Generation (RAG). Very simply, this means providing the AI with some additional factual context for the question you’re asking.
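The pattern can be sketched in a few lines: retrieved documents are pasted into the prompt ahead of the question, so the model answers from the supplied context rather than from memory. `build_rag_prompt` is a hypothetical helper for illustration, not part of the workflow below:

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Prepend retrieved documents to the question so the model
    answers from the supplied context rather than from memory."""
    context = "\n\n".join(documents)
    return f"{context}\n\nUsing only the documents above, answer: {question}"
```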
One example comes from The San Francisco Chronicle, who launched a chatbot that answers questions about Kamala Harris.
In our case, we’ll provide the AI with relevant search results related to our organisation so that it knows who we’re asking about.
We’ll use DuckDuckGo because it has a free API. For better results, you can use the Google API or a SERP API.
Let’s search for the organisation and extract the first 5 results.
= f'"{org}" oil gas coal'
search print(search)
= DDGS().text(search, max_results=5)
results results
"Clean Resource Innovation Network" oil gas coal
[{'title': 'Clean Resource Innovation Network',
'href': 'https://www.cleanresourceinnovation.com/',
'body': 'The Clean Resource Innovation Network (CRIN) is a pan-Canadian network founded to enable cleaner energy development by commercializing and adopting technologies for the oil and gas industry. We bring together diverse expertise from industry, entrepreneurs, investors, academia, governments, and many others to enable solutions that improve the ...'},
{'title': '17 new technologies funded by CRIN competition to address economic ...',
'href': 'https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/',
'body': "March 9, 2022 CALGARY, Alberta - Clean Resource Innovation Network (CRIN) Today CRIN is announcing funding of over $44 million CAD for 17 projects identified through its Reducing Environmental Footprint oil and gas technology competition. This brings the total investment through three competitions to $80 million. CRIN's competitions are designed to…"},
{'title': 'CRIN Funds an Additional Nineteen Projects through the Oil - GlobeNewswire',
'href': 'https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html',
'body': 'The Clean Resource Innovation Network (CRIN) is proud to announce the funding of nineteen (19) additional high-impact projects, totaling $16.1 million CAD in support. With this new commitment ...'},
{'title': 'New technologies identified for funding by CRIN competitions will ...',
'href': 'https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/',
'body': "CALGARY, Alberta, Jan. 26, 2022 (GLOBE NEWSWIRE) -- Clean Resource Innovation Network (CRIN) Achieving CRIN's goal of reducing 100 megatonnes of CO2 equivalent (CO2e) emissions from producing Canada's oil and gas resources by 2033 is within reach! Today, CRIN is announcing the first projects identified for funding awards through CRIN's $80…"},
{'title': 'Clean Resource Innovation Network (CRIN) on LinkedIn: Oil & Gas ...',
'href': 'https://www.linkedin.com/posts/crin_oil-gas-cleantech-challenge-activity-7231359278865952768-q2m_',
'body': 'The 2024 Colorado Oil & Gas Cleantech Challenge, ... Clean Resource Innovation Network (CRIN) 7,102 followers 2d Report this post The 2024 Colorado Oil & Gas Cleantech Challenge, a product ...'}]
Scrape search results
Now, we want to extract the text from each of those URLs. We’ll use Trafilatura, a library that will help us extract the main text without headers, footers and other irrelevant text.
def extract_text(urls):
    results = []

    for url in urls:
        print(f"Scraping {url}...")
        try:
            response = requests.get(url, timeout=30, verify=False)  # Note: verify=False is not recommended for production use
            response.raise_for_status()  # Raises an HTTPError for bad responses
            extracted_text = extract(response.text, output_format="markdown")
            results.append((url, extracted_text))
        except requests.RequestException as e:
            print(f"Error scraping {url}: {str(e)}")
            results.append((url, f"Error: {str(e)}"))

    return results

# Run the function
texts = extract_text([result['href'] for result in results if 'href' in result])
texts = [(url, text) for url, text in texts if text is not None]  # remove empty scrapes
Scraping https://www.cleanresourceinnovation.com/...
Scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/...
Error scraping https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/03/17-new-technologies-funded-by-crin-competition-to-address-economic-challenges-of-canadas-oil-and-gas-industry/
Scraping https://www.globenewswire.com/news-release/2023/11/03/2773454/0/en/CRIN-Funds-an-Additional-Nineteen-Projects-through-the-Oil-Gas-Technology-Competitions.html...
Scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/...
Error scraping https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/: 403 Client Error: Forbidden for url: https://energynow.ca/2022/01/new-technologies-identified-for-funding-by-crin-competitions-will-enable-emissions-reduction-and-improve-safety-in-oil-and-gas/
Scraping https://www.linkedin.com/posts/crin_oil-gas-cleantech-challenge-activity-7231359278865952768-q2m_...
# Paste text together
prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()

prompt_system = 'You will be provided with a collection of documents collected from Google search results. Your task is to determine whether an organization is a fossil fuel company or lobbying group or not.'

prompt_instructions = f'''
## Instructions
You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:

- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Includes companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''
Send request to LLM
There are various LLMs available, each with different capabilities and costs.
For our task, there are a few things we need to consider:
- Performance: Is the model intelligent enough to understand the task?
- Cost: If you are running tens of thousands of requests, the cost can add up quickly. Models like Claude 3 Opus quickly become unaffordable.
- Rate limits: Some platforms impose limits on how many times you can call the API in a given time period (minute, hour, day) and how big the requests can be.
- Other features: Some models offer additional features like better support for various languages, prompt caching, or structured outputs.
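Rate limits in particular are worth handling up front: a simple wrapper with exponential backoff will get you past transient rate-limit errors. This is a generic sketch (the function name and parameters are illustrative); in practice you would catch your provider's specific rate-limit exception rather than bare `Exception`:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```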
We’ll use OpenAI’s gpt-4o-mini for this classification. One advantage of this particular model is its support for Structured Outputs. This means you can force the response to follow a certain set of rules.
Let’s define what we want the output to be.
class Classification(BaseModel):
    fossil_fuel_link: bool = Field(description="Is this a fossil fuel organization?")
    explanation: str = Field(description="A brief explanation of your decision, in English")
    source: str = Field(description="A link to the SINGLE most relevant source that supports your classification")
Now, let’s make the request to OpenAI. First, we define a function like we did before, with a few tweaks.
def make_request_openai(prompt_system: str, prompt_instructions: str, prompt_documents: str, model: str = "gpt-4o-mini", **kwargs) -> str:
    """Make a request to OpenAI models that support structured outputs."""
    client = OpenAI(
        # api_key = ''
    )
    response = client.beta.chat.completions.parse(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_system},
            {"role": "user", "content": f'{prompt_documents}\n\n{prompt_instructions}'}
        ],
        response_format=Classification
    )
    return response.choices[0].message.content
Now, let’s run the function on our example.
openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)
print(openai_response)
{"fossil_fuel_link":true,"explanation":"The Clean Resource Innovation Network (CRIN) is focused on enabling cleaner energy development specifically for the oil and gas industry. It supports projects that aim to improve the environmental performance of this sector, which indicates a direct involvement with fossil fuels. The organization is dedicated to commercializing technologies that benefit the oil and gas industry, which aligns with the characteristics of a fossil fuel organization.","source":"https://www.cleanresourceinnovation.com/"}
Scale
We’ve seen an example of how to classify one organisation. The advantage of using AI is that we can scale this to (hundreds of) thousands of operations.
Let’s define a few functions to help us with this.
def classify_org(org: str):
    search = f'"{org}" oil gas coal'

    results = DDGS().text(search, max_results=5)

    texts = extract_text([result['href'] for result in results if 'href' in result])
    texts = [(url, text) for url, text in texts if text is not None]  # remove empty scrapes

    prompt_documents = "\n\n".join(f"URL: {url}\n{text}" for url, text in texts).strip()

    prompt_instructions = f'''
## Instructions
You are a researcher investigating whether "{org}" is a fossil fuel organization.

A fossil fuel organization:

- Aims to influence policy or legislation in the interests of fossil fuel companies and shareholders.
- Has significant business activities in exploration, extraction, refining, trading, specialized transportation of oil, gas, coal, or blue hydrogen, or sale of electricity derived from them.
- Publicly declares involvement in fossil fuels or promotes significant investments in such companies.
- Can be an NGO, foundation, think tank, or lobbying group funded by fossil fuel companies or their executives.
- May include larger companies that own fossil fuel subsidiaries (e.g., BASF owning Wintershall).
- Includes companies selling energy from fossil fuels (e.g., Octopus Energy).
- Includes companies that currently produce or sell fossil fuels, regardless of their plans to divest in the future.

Analyze the text above, which was extracted from an internet search for "{org}", to determine if it is a fossil fuel organization. Use common sense and respond only in English, even if the original content is not in English.
'''

    openai_response = make_request_openai(prompt_system, prompt_instructions, prompt_documents)

    return openai_response
def apply_classify_org(df):
    df['classification'] = df.apply(lambda row: classify_org(org=row['organization']), axis=1)
    df['classification'] = df['classification'].apply(json.loads)
    df = pd.concat([df.drop(['classification'], axis=1), df['classification'].apply(pd.Series)], axis=1)

    return df
Now let’s run this on our sample of organisations.
cop_orgs_classified = apply_classify_org(cop_orgs)
Scraping https://climatereality.ph/reenergizeph/...
Scraping https://climatereality.ph/2021/08/21/climate-reality-ph-geop-a-potent-weapon-against-unreliable-coal-sourced-power/...
Scraping https://climatereality.ph/2023/09/28/climate-reality-ph-builds-momentum-for-geop-implementation-in-mindanao/...
Scraping https://mirror.pia.gov.ph/news/2021/08/22/climate-reality-geop-a-potent-weapon-vscoal-sourced-power...
Scraping https://www.facebook.com/climaterealityphilippines/posts/green-energy-option-program-a-potent-weapon-against-unreliable-coal-sourced-powe/4243220149098727/...
cop_orgs_classified
| | organization | fossil_fuel_link | explanation | source |
|---|---|---|---|---|
| 37190 | IUCN Regional Office for West Asia | False | The IUCN Regional Office for West Asia is prim... | https://www.iucn.org/regions/west-asia |
| 33297 | The Climate Reality Project Philippines | False | The Climate Reality Project Philippines focuse... | https://climatereality.ph/reenergizeph/ |
| 18860 | Seychelles Meteorological Authorithy | False | The Seychelles Meteorological Authority is a g... | https://www.seychelles.gov.sc/Departments/mete... |
| 7675 | Ministry of Local Government, Lands, Regional ... | False | The Ministry of Local Government, Lands, Regio... | https://www.example.com/ministry-local-government |
| 23338 | Chairperson s Secretariat | False | The 'Chairperson's Secretariat' does not appea... | https://www.example.com/chairpersons-secretariat |
cop_orgs_classified.to_csv('data/cop_orgs_classified.csv')
What next?
There are lots of things we can improve about this process. Here are some ideas:
- Play around with the prompt.
- Change search engine. DuckDuckGo is free and good for a prototype. However, their API isn’t meant to be used in this way and will often deny requests. It also doesn’t return the best results. I recommend switching to Google.
- Try other models. If you find that gpt-4o-mini is insufficient, you can use the smarter gpt-4o.
- Validate the output. If you use other models without Structured Output support, you can use Guardrails to validate their output. It also lets you validate other things, like the language of the output.
- Cache things. Don’t start over if something goes wrong, save the search results, scrapes and LLM outputs and continue where you left off.
- Multithreading. You can use Python’s multithreading to run multiple classifications in parallel, significantly speeding up the process.
- Verify. LLMs are still dumb and shouldn’t be trusted. Manually verify the classifications if you’re going to publish the results!
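For instance, the multithreading point above can be sketched with `concurrent.futures`; the classification is network-bound (search, scraping, the LLM API), so threads help a lot. `classify_many` is an illustrative helper, with `classify` standing in for the `classify_org` function defined earlier:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_many(orgs, classify, max_workers=8):
    """Run a classification function over many organisations in parallel.

    Results come back in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, orgs))
```

Keep `max_workers` modest so you don't run into API rate limits faster than a retry strategy can absorb.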