Document classification in Python

Note

These notes are mostly inspired from the Practical AI for (investigative) journalism sessions.

Google Sheets is a good way to work on smaller batches of data, but you may want to use code for larger datasets or a more robust approach. In this tutorial, we’ll use Python to classify documents based on their content.

Make sure you have Python installed on your computer, or you can run Python code in the cloud using Google Colab.

We’ll use Claude’s Haiku model for this exercise, because it’s fast, fairly smart and, most importantly, cheap.

You can use a more sophisticated model for more sophisticated tasks. Other LLM providers will have their own libraries, so you might have to adapt parts of this tutorial to your specific model.

Setting up

Create a new folder for your project somewhere on your computer and navigate to it in your terminal.

We’ll need a Claude API key to communicate with the model. Once you have your key, run the following command in your terminal:

Terminal
pip install python-dotenv

This is a library that will allow us to store our API key in a file called .env in the root of our project. Create a new file called .env (just the extension, without the file name) in your project folder and add the following line to it:

.env
ANTHROPIC_API_KEY=your-api-key

The reason we do this is because it’s generally a bad idea to store passwords, keys or other sensitive information directly in your code. By storing it in a separate file, we can add this file to our .gitignore file and make sure it’s not uploaded to a public repository.

Since we’re here, let’s also install the Claude library:

Terminal
pip install anthropic

Now, create a new Python file in your project folder and name it classify.py. We’ll write our code in this file.

In your classify.py file, add the following code to load some libraries we need:

# load system libraries
import os
from dotenv import load_dotenv
load_dotenv()

# load Claude library
from anthropic import Anthropic
client = Anthropic()

You can now talk to Claude directly from Python.

message = client.messages.create(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Hello, Claude",
        }
    ],
    # https://docs.anthropic.com/claude/docs/models-overview
    model="claude-3-haiku-20240307",
)
print(message.content[0].text)
Hello! It's nice to meet you. How can I assist you today?

Classifying documents

We can use a similar approach to classify documents as we did in the Google Sheets tutorial.

# first, build a prompt template
prompt = """
Below is the text to a piece of legislation. Classify it as one of the following categories:

- environment
- taxes
- school
- crime
- other

Only provide the category name in your response. Use only lowercase letters.

Bill text:

{text}
"""

# then load the text of the bill
legisation = """
"> HLS 24RS-53 **[REENGROSSED]{.underline}**
>
> 2024 Regular Session
>
> HOUSE BILL NO. 12
>
> BY REPRESENTATIVE JORDAN

CRIME: Provides relative to the crime of nonconsensual disclosure of a
private image

> 1 AN ACT
>
> 2 To amend and reenact R.S. 14:283.2(A)(1) and to enact R.S.
> 14:283.2(C)(5), relative to the
>
> 3 nonconsensual disclosure of private images; to provide for elements
> of the offense;
>
> 4 to provide for a definition; and to provide for related matters.
>
> 5 Be it enacted by the Legislature of Louisiana:
>
> 6 Section 1. R.S. 14:283.2(A)(1) is hereby amended and reenacted and
> R.S.
>
> 7 14:283.2(C)(5) is hereby enacted to read as follows:
>
> 8 §283.2. Nonconsensual disclosure of a private image
>
> 9 A. A person commits the offense of nonconsensual disclosure of a
> private

10 image when all of the following occur:

11 (1) The person intentionally discloses an image of another person who
is

12 seventeen years of age or older, who is identifiable from the image
or information

13 displayed in connection with the image, and [who is either engaged in
a sexual]{.underline}

14 [performance or]{.underline} whose intimate parts are exposed in
whole or in part.

15 \* \* \*

16 C. For purposes of this Section:

17 \* \* \*

18 [(5) \""Sexual performance\"" means any performance or part thereof
that]{.underline}

> 19 [includes actual or simulated sexual intercourse, deviate sexual
> intercourse, sexual]{.underline}

Page 1 of 2

> CODING: Words in ~~struck through~~ type are deletions from existing
> law; words [underscored]{.underline}
>
> are additions.
>
> HLS 24RS-53 **[REENGROSSED]{.underline}** HB NO. 12

+-----------------------------------+-----------------------------------+
| 1\                                | > bestiality, masturbation,       |
| 2\                                | > sadomasochistic abuse, or lewd  |
| 3                                 | > exhibition of the genitals [or  |
|                                   | > anus.]{.underline}              |
|                                   |                                   |
|                                   | \* \* \*                          |
+===================================+===================================+
+-----------------------------------+-----------------------------------+

DIGEST

> The digest printed below was prepared by House Legislative Services.
> It constitutes no part of the legislative instrument. The keyword,
> one-liner, abstract, and digest do not constitute part of the law or
> proof or indicia of legislative intent. \[R.S. 1:13(B) and 24:177(E)\]

+-----------------------+-----------------------+-----------------------+
| HB 12 Reengrossed     | > 2024 Regular        | Jordan                |
|                       | > Session             |                       |
+=======================+=======================+=======================+
+-----------------------+-----------------------+-----------------------+

> **Abstract:** Amends the elements of nonconsensual disclosure of a
> private image and provides for a definition.
>
> [Present law]{.underline} provides for the crime of nonconsensual
> disclosure of a private image and provides for elements of the
> offense, penalties, and definitions.
>
> [Proposed law]{.underline} retains [present law]{.underline}.

[Present law]{.underline} provides that a person commits this offense
when all of the following occur:

> \(1\) The person intentionally discloses an image of another person
> who is 17 years of age or older, who is identifiable from the image or
> information displayed in connection with the image, and whose intimate
> parts are exposed in whole or in part.
>
> \(2\) The person who discloses the image obtained it under
> circumstances in which a reasonable person would know or understand
> that the image was to remain private.
>
> \(3\) The person who discloses the image knew or should have known
> that the person in the image did not consent to the disclosure of the
> image.
>
> \(4\) The person who discloses the image has the intent to harass or
> cause emotional distress to the person in the image, and the person
> who commits the offense knew or should have known that the disclosure
> could harass or cause emotional distress to the person in the image.
>
> [Proposed law]{.underline} retains [present law]{.underline}, but
> changes the element relative to the disclosure of an image of an
> identifiable person to encompass [either]{.underline} the exposing of
> intimate parts of [or]{.underline} the engaging in a sexual
> performance by the identifiable person.
>
> [Present law]{.underline} defines the terms \""criminal justice
> agency\"", \""disclosure\"", \""image\"", and \""intimate parts\"".
>
> [Proposed law]{.underline} retains [present law]{.underline} and
> provides a definition for \""sexual performance\"".
>
> (Amends R.S. 14:283.2(A)(1); Adds R.S. 14:283.2(C)(5))
>
> [The House Floor Amendments to the engrossed bill:]{.underline}
>
> 1\. Clarify the elements of [present law]{.underline} relative to the
> exposure of intimate parts or the engaging in a sexual performance by
> the identifiable person.

Page 2 of 2

> CODING: Words in ~~struck through~~ type are deletions from existing
> law; words [underscored]{.underline} are additions."
"""

# then ask Claude to classify it
message = client.messages.create(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": prompt.format(text=legisation),
        }
    ],
    model="claude-3-haiku-20240307",
)

print(message.content[0].text)
<>:19: SyntaxWarning: invalid escape sequence '\*'
<>:19: SyntaxWarning: invalid escape sequence '\*'
C:\Users\mail\AppData\Local\Temp\ipykernel_7500\1971032123.py:19: SyntaxWarning: invalid escape sequence '\*'
  legisation = """
crime

Doing it one piece of text at a time isn’t particularly useful. You can use Python to read a spreadsheet of documents and classify them all at once.

Let’s read in the spreadsheet of bills from the Google Sheets exercise.

import pandas as pd

bills = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRly_QUcMdN_iIcwKdx6YZvGu8tuP9JU7DnCWUFT9nfLFloRzzxS8aSf4gTdKbU6kf47DFm05nVygrN/pub?gid=0&single=true&output=csv")
bills.to_csv("../data/bills.csv", index=False)
bills
bill_text ai_category about retirement?
0 > HLS 24RS-94 **[ENGROSSED]{.underline}**\n>\n... #NAME? NaN
1 > HLS 24RS-88 **[REENGROSSED]{.underline}**\n>... NaN NaN
2 > HLS 24RS-53 **[REENGROSSED]{.underline}**\n>... NaN NaN
3 > 2024 Regular Session **[ENROLLED]{.underline... NaN NaN
4 > 2024 Regular Session **[ENROLLED]{.underline... NaN NaN
5 > HLS 24RS-1606 **[ORIGINAL]{.underline}**\n>\... NaN NaN
6 > HLS 24RS-2151 **[ORIGINAL]{.underline}**\n>\... NaN NaN
7 > HLS 24RS-1646 **[ENGROSSED]{.underline}**\n>... NaN NaN
8 > HLS 24RS-1553 **[ORIGINAL]{.underline}**\n>\... NaN NaN

We’re now going to write a function that takes a piece of text and classifies it using Claude.

# cache results to avoid having to reclassify
from joblib import Memory
memory = Memory("cachedir", verbose=0)
@memory.cache

# define the function
def classify(row):
    prompt = """
    Below is the text to a piece of legislation. Classify it as one of the following categories:

    - environment
    - taxes
    - school
    - crime
    - other

    Only provide the category name in your response. Use only lowercase letters.

    Bill text:

    {text}
    """
    
    message = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": prompt.format(text=row['bill_text']),
            }
        ],
        model="claude-3-haiku-20240307",
        temperature=0,
    )

    return pd.Series({
        'content': message.content[0].text
    })

Now, let’s apply this function to our bills dataframe.

# run the function
bills['ai_category'] = bills.apply(classify, axis=1)

# save the results
bills.to_csv("../data/bills-classified.csv", index=False)

# print the results
bills
bill_text ai_category about retirement?
0 > HLS 24RS-94 **[ENGROSSED]{.underline}**\n>\n... school NaN
1 > HLS 24RS-88 **[REENGROSSED]{.underline}**\n>... crime NaN
2 > HLS 24RS-53 **[REENGROSSED]{.underline}**\n>... crime NaN
3 > 2024 Regular Session **[ENROLLED]{.underline... other NaN
4 > 2024 Regular Session **[ENROLLED]{.underline... environment NaN
5 > HLS 24RS-1606 **[ORIGINAL]{.underline}**\n>\... school NaN
6 > HLS 24RS-2151 **[ORIGINAL]{.underline}**\n>\... other NaN
7 > HLS 24RS-1646 **[ENGROSSED]{.underline}**\n>... crime NaN
8 > HLS 24RS-1553 **[ORIGINAL]{.underline}**\n>\... health NaN

As you can see, we now have a classified dataset of bills.