# load system libraries
import os
from dotenv import load_dotenv
load_dotenv()
# load Claude library
from anthropic import Anthropic
client = Anthropic()
Document classification in Python
These notes are mostly inspired by the Practical AI for (investigative) journalism sessions.
Google Sheets is a good way to work on smaller batches of data, but you may want to use code for larger datasets or a more robust approach. In this tutorial, we’ll use Python to classify documents based on their content.
Make sure you have Python installed on your computer, or you can run Python code in the cloud using Google Colab.
We’ll use Claude’s Haiku model for this exercise, because it’s fast, fairly smart and, most importantly, cheap.
You can use a more sophisticated model for more sophisticated tasks. Other LLM providers will have their own libraries, so you might have to adapt parts of this tutorial to your specific model.
Setting up
Create a new folder for your project somewhere on your computer and navigate to it in your terminal.
We’ll need a Claude API key to communicate with the model. Once you have your key, run the following command in your terminal:
Terminal
pip install python-dotenv
This is a library that will allow us to store our API key in a file called .env in the root of our project. Create a new file called .env (just the extension, without the file name) in your project folder and add the following line to it:
.env
ANTHROPIC_API_KEY=your-api-key
We do this because it’s generally a bad idea to store passwords, keys or other sensitive information directly in your code. By storing the key in a separate file, we can add that file to our .gitignore file and make sure it’s not uploaded to a public repository.
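Once the key is in .env, a common stumbling block is a misspelled variable name or a file saved in the wrong folder. The sketch below is one way to check from Python that the key is visible without printing it in full. `describe_key` is a hypothetical helper name, not part of any library, and the snippet assumes the key has already been loaded into the environment (for example by `load_dotenv()`).

```python
import os

def describe_key(value):
    """Return a short, safe-to-print summary of an API key (hypothetical helper)."""
    if not value:
        return "missing"
    # Show only the last four characters so the full key never lands in logs.
    return f"set ({len(value)} characters, ending in ...{value[-4:]})"

# Assumes load_dotenv() (or your shell) has already put the key in the environment.
print("ANTHROPIC_API_KEY:", describe_key(os.environ.get("ANTHROPIC_API_KEY")))
```

If this prints "missing", the .env file is probably in a different folder from the one you are running Python in, or the variable name is spelled differently.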
Since we’re here, let’s also install the Claude library:
Terminal
pip install anthropic
Now, create a new Python file in your project folder and name it classify.py. We’ll write our code in this file.
In your classify.py file, start with the code shown at the very top of these notes, which loads the libraries we need and creates the client.
You can now talk to Claude directly from Python.
message = client.messages.create(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Hello, Claude",
        }
    ],
    # https://docs.anthropic.com/claude/docs/models-overview
    model="claude-3-haiku-20240307",
)
print(message.content[0].text)
Hello! It's nice to meet you. How can I assist you today?
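The `message.content[0].text` expression works because the response's `content` is a list of content blocks, and the first one holds the text. A slightly more defensive version, sketched here, joins every text block instead of assuming there is exactly one. `response_text` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a real response so the snippet runs without an API call.

```python
from types import SimpleNamespace

def response_text(message):
    """Join the text of all text blocks in a response (hypothetical helper)."""
    return "".join(
        block.text
        for block in message.content
        if getattr(block, "type", None) == "text"
    )

# Stand-in for a real response object, shaped like the one used above.
fake_message = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="Hello! "),
        SimpleNamespace(type="text", text="How can I help?"),
    ]
)
print(response_text(fake_message))
```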
Classifying documents
We can classify documents with an approach similar to the one we used in the Google Sheets tutorial.
# first, build a prompt template
prompt = """
Below is the text to a piece of legislation. Classify it as one of the following categories:
- environment
- taxes
- school
- crime
- other
Only provide the category name in your response. Use only lowercase letters.
Bill text:
{text}
"""
# then load the text of the bill
legisation = """
"> HLS 24RS-53 **[REENGROSSED]{.underline}**
>
> 2024 Regular Session
>
> HOUSE BILL NO. 12
>
> BY REPRESENTATIVE JORDAN
CRIME: Provides relative to the crime of nonconsensual disclosure of a
private image
> 1 AN ACT
>
> 2 To amend and reenact R.S. 14:283.2(A)(1) and to enact R.S.
> 14:283.2(C)(5), relative to the
>
> 3 nonconsensual disclosure of private images; to provide for elements
> of the offense;
>
> 4 to provide for a definition; and to provide for related matters.
>
> 5 Be it enacted by the Legislature of Louisiana:
>
> 6 Section 1. R.S. 14:283.2(A)(1) is hereby amended and reenacted and
> R.S.
>
> 7 14:283.2(C)(5) is hereby enacted to read as follows:
>
> 8 §283.2. Nonconsensual disclosure of a private image
>
> 9 A. A person commits the offense of nonconsensual disclosure of a
> private
10 image when all of the following occur:
11 (1) The person intentionally discloses an image of another person who
is
12 seventeen years of age or older, who is identifiable from the image
or information
13 displayed in connection with the image, and [who is either engaged in
a sexual]{.underline}
14 [performance or]{.underline} whose intimate parts are exposed in
whole or in part.
15 \* \* \*
16 C. For purposes of this Section:
17 \* \* \*
18 [(5) \""Sexual performance\"" means any performance or part thereof
that]{.underline}
> 19 [includes actual or simulated sexual intercourse, deviate sexual
> intercourse, sexual]{.underline}
Page 1 of 2
> CODING: Words in ~~struck through~~ type are deletions from existing
> law; words [underscored]{.underline}
>
> are additions.
>
> HLS 24RS-53 **[REENGROSSED]{.underline}** HB NO. 12
+-----------------------------------+-----------------------------------+
| 1\ | > bestiality, masturbation, |
| 2\ | > sadomasochistic abuse, or lewd |
| 3 | > exhibition of the genitals [or |
| | > anus.]{.underline} |
| | |
| | \* \* \* |
+===================================+===================================+
+-----------------------------------+-----------------------------------+
DIGEST
> The digest printed below was prepared by House Legislative Services.
> It constitutes no part of the legislative instrument. The keyword,
> one-liner, abstract, and digest do not constitute part of the law or
> proof or indicia of legislative intent. \[R.S. 1:13(B) and 24:177(E)\]
+-----------------------+-----------------------+-----------------------+
| HB 12 Reengrossed | > 2024 Regular | Jordan |
| | > Session | |
+=======================+=======================+=======================+
+-----------------------+-----------------------+-----------------------+
> **Abstract:** Amends the elements of nonconsensual disclosure of a
> private image and provides for a definition.
>
> [Present law]{.underline} provides for the crime of nonconsensual
> disclosure of a private image and provides for elements of the
> offense, penalties, and definitions.
>
> [Proposed law]{.underline} retains [present law]{.underline}.
[Present law]{.underline} provides that a person commits this offense
when all of the following occur:
> \(1\) The person intentionally discloses an image of another person
> who is 17 years of age or older, who is identifiable from the image or
> information displayed in connection with the image, and whose intimate
> parts are exposed in whole or in part.
>
> \(2\) The person who discloses the image obtained it under
> circumstances in which a reasonable person would know or understand
> that the image was to remain private.
>
> \(3\) The person who discloses the image knew or should have known
> that the person in the image did not consent to the disclosure of the
> image.
>
> \(4\) The person who discloses the image has the intent to harass or
> cause emotional distress to the person in the image, and the person
> who commits the offense knew or should have known that the disclosure
> could harass or cause emotional distress to the person in the image.
>
> [Proposed law]{.underline} retains [present law]{.underline}, but
> changes the element relative to the disclosure of an image of an
> identifiable person to encompass [either]{.underline} the exposing of
> intimate parts of [or]{.underline} the engaging in a sexual
> performance by the identifiable person.
>
> [Present law]{.underline} defines the terms \""criminal justice
> agency\"", \""disclosure\"", \""image\"", and \""intimate parts\"".
>
> [Proposed law]{.underline} retains [present law]{.underline} and
> provides a definition for \""sexual performance\"".
>
> (Amends R.S. 14:283.2(A)(1); Adds R.S. 14:283.2(C)(5))
>
> [The House Floor Amendments to the engrossed bill:]{.underline}
>
> 1\. Clarify the elements of [present law]{.underline} relative to the
> exposure of intimate parts or the engaging in a sexual performance by
> the identifiable person.
Page 2 of 2
> CODING: Words in ~~struck through~~ type are deletions from existing
> law; words [underscored]{.underline} are additions."
"""
# then ask Claude to classify it
message = client.messages.create(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": prompt.format(text=legisation),
        }
    ],
    model="claude-3-haiku-20240307",
)
print(message.content[0].text)
<>:19: SyntaxWarning: invalid escape sequence '\*'
<>:19: SyntaxWarning: invalid escape sequence '\*'
C:\Users\mail\AppData\Local\Temp\ipykernel_7500\1971032123.py:19: SyntaxWarning: invalid escape sequence '\*'
legisation = """
crime
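The `SyntaxWarning` lines above do not come from Claude: Python is complaining about the `\*` sequences in the pasted bill text, which are not valid escapes in a normal string. The last line, `crime`, is the model's answer. One way to avoid the warning, sketched below with a shortened stand-in for the bill text, is a raw string (the `r` prefix), which keeps backslashes as literal characters. The sketch also shows that `str.format` only interprets braces in the template, so markup such as `{.underline}` inside the substituted text is left alone. The names `snippet`, `template` and `filled` are made up for this example.

```python
# Shortened stand-in for the bill text; the r prefix keeps backslashes literal.
snippet = r"""15 \* \* \*
16 C. For purposes of this Section: [Present law]{.underline}"""

template = "Bill text:\n{text}"

# Only {text} in the template is replaced; braces inside the value are untouched.
filled = template.format(text=snippet)
print(filled)
```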
Classifying one piece of text at a time isn’t particularly useful. You can use Python to read a spreadsheet of documents and classify them all at once.
Let’s read in the spreadsheet of bills from the Google Sheets exercise.
import pandas as pd
bills = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRly_QUcMdN_iIcwKdx6YZvGu8tuP9JU7DnCWUFT9nfLFloRzzxS8aSf4gTdKbU6kf47DFm05nVygrN/pub?gid=0&single=true&output=csv")
bills.to_csv("../data/bills.csv", index=False)
bills
 | bill_text | ai_category | about retirement? |
---|---|---|---|
0 | > HLS 24RS-94 **[ENGROSSED]{.underline}**\n>\n... | #NAME? | NaN |
1 | > HLS 24RS-88 **[REENGROSSED]{.underline}**\n>... | NaN | NaN |
2 | > HLS 24RS-53 **[REENGROSSED]{.underline}**\n>... | NaN | NaN |
3 | > 2024 Regular Session **[ENROLLED]{.underline... | NaN | NaN |
4 | > 2024 Regular Session **[ENROLLED]{.underline... | NaN | NaN |
5 | > HLS 24RS-1606 **[ORIGINAL]{.underline}**\n>\... | NaN | NaN |
6 | > HLS 24RS-2151 **[ORIGINAL]{.underline}**\n>\... | NaN | NaN |
7 | > HLS 24RS-1646 **[ENGROSSED]{.underline}**\n>... | NaN | NaN |
8 | > HLS 24RS-1553 **[ORIGINAL]{.underline}**\n>\... | NaN | NaN |
We’re now going to write a function that takes a piece of text and classifies it using Claude.
# cache results to avoid having to reclassify
from joblib import Memory
memory = Memory("cachedir", verbose=0)
# define the function
@memory.cache
def classify(row):
    prompt = """
Below is the text to a piece of legislation. Classify it as one of the following categories:
- environment
- taxes
- school
- crime
- other
Only provide the category name in your response. Use only lowercase letters.
Bill text:
{text}
"""
    message = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": prompt.format(text=row['bill_text']),
            }
        ],
        model="claude-3-haiku-20240307",
        temperature=0,
    )
    return pd.Series({
        'content': message.content[0].text
    })
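The caching step matters because every call to Claude costs time and money: with a cache, rerunning the script only sends the bills it has not seen before. The sketch below illustrates that idea with the standard library only. It is not how joblib is implemented; `cached_call`, the JSON file name and the `fake_classify` stand-in are all assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile

def cached_call(func, text, cache_path):
    """Return func(text), reusing a result saved on disk if one exists."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    # Key the cache on a hash of the input so long bill texts stay manageable.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = func(text)
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return cache[key]

calls = []

def fake_classify(text):
    """Stand-in for the real API call; records how often it runs."""
    calls.append(text)
    return "crime"

cache_file = os.path.join(tempfile.mkdtemp(), "cache.json")
first = cached_call(fake_classify, "HOUSE BILL NO. 12", cache_file)
second = cached_call(fake_classify, "HOUSE BILL NO. 12", cache_file)
print(first, second, len(calls))
```

The second call returns the saved answer without running `fake_classify` again, which is the behaviour we rely on when a long batch is interrupted and restarted.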
Now, let’s apply this function to our bills dataframe.
# run the function
bills['ai_category'] = bills.apply(classify, axis=1)
# save the results
bills.to_csv("../data/bills-classified.csv", index=False)
# print the results
bills
 | bill_text | ai_category | about retirement? |
---|---|---|---|
0 | > HLS 24RS-94 **[ENGROSSED]{.underline}**\n>\n... | school | NaN |
1 | > HLS 24RS-88 **[REENGROSSED]{.underline}**\n>... | crime | NaN |
2 | > HLS 24RS-53 **[REENGROSSED]{.underline}**\n>... | crime | NaN |
3 | > 2024 Regular Session **[ENROLLED]{.underline... | other | NaN |
4 | > 2024 Regular Session **[ENROLLED]{.underline... | environment | NaN |
5 | > HLS 24RS-1606 **[ORIGINAL]{.underline}**\n>\... | school | NaN |
6 | > HLS 24RS-2151 **[ORIGINAL]{.underline}**\n>\... | other | NaN |
7 | > HLS 24RS-1646 **[ENGROSSED]{.underline}**\n>... | crime | NaN |
8 | > HLS 24RS-1553 **[ORIGINAL]{.underline}**\n>\... | health | NaN |
As you can see, we now have a classified dataset of bills.
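Look closely at row 8, though: the model answered `health`, which is not one of the five categories in the prompt. Even with clear instructions, it is worth checking the output against the allowed list before analysing it. A minimal sketch, where `ALLOWED` and `check_category` are hypothetical names and flagging unexpected answers for review is just one possible choice:

```python
ALLOWED = {"environment", "taxes", "school", "crime", "other"}

def check_category(raw):
    """Return a cleaned category, or flag answers outside the allowed list."""
    cleaned = raw.strip().lower()
    if cleaned in ALLOWED:
        return cleaned
    # Keep the unexpected answer visible instead of silently discarding it.
    return f"REVIEW: {cleaned}"

answers = ["school", "crime", " Crime\n", "health"]
print([check_category(a) for a in answers])
```

With pandas, something like `bills['ai_category'].map(check_category)` would apply the same check to the whole column.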