Historic AFL Data Web Scraper

Python script to gather historic AFL data from afltables.

AFL Tables Scraper & Player Stats Pipeline

This project is a data-engineering pipeline that turns raw HTML from AFL Tables into clean, ML-ready datasets for:

  • Match-level results (scores, rounds, venues, margins)
  • Player-game stats (disposals, goals, tackles, ruck work, etc.)

The goal is to build a solid data foundation for modelling, forecasting and exploratory analysis of AFL performance.

AFL Tables scraper

Stack
Python, requests, BeautifulSoup, pandas, dataclasses
Domain
AFL match & player statistics from AFL Tables
Outputs
Matches CSV + Player-games CSV
Focus
Robust parsing, caching & schema design

Key Outcomes

Match rows
1 per game
Player-game rows
1 per player per game
Seasons
Configurable (e.g. a single season like 2025 now, all years later)

The scraper:

  • Caches raw HTML locally so I can iterate without hammering AFL Tables
  • Parses season pages into a structured match dataset
  • Follows each game’s “Match stats” link to pull full player box scores
  • Exports clean CSVs ready for pandas / notebooks / ML experiments

Problem I Wanted to Solve

If you want to do anything serious with AFL data, you immediately hit a few issues:

  1. Everything lives in HTML tables
    AFL Tables is an incredible resource, but not built for programmatic access.

  2. Different pages = different structures

    • Season pages: rounds, two rows per match, result text, match stats link
    • Match pages: separate tables for each team with dense per-player stats
  3. I want reproducible datasets, not manual exports
    The aim is to treat the data as a versioned dataset regenerated by code, not spreadsheets and copy–paste.

Data Model

High-level view: matches joined to player-games on match_id

Matches table (examples):

  • match_id – stable key: {season}_{date}_{home_team}_{away_team}
  • season, round, date
  • home_team, away_team
  • Quarter-by-quarter scores (home_q1–home_q4, away_q1–away_q4)
  • home_points, away_points, margin, winner (home / away / draw)
  • venue, attendance
  • stats_path – relative URL to the match stats page

Player-games table (examples):

  • match_id, season, date
  • team, opponent, is_home
  • jumper, player_name
  • kicks, marks, handballs, disposals, goals, behinds
  • hitouts, tackles, clearances, inside_50s
  • contested_possessions, uncontested_possessions
  • goal_assists, one_percenters, percent_time

This schema makes it trivial to join matches to player_games and build features for modelling.
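The join can be sketched in pandas. This is a minimal illustration with invented team and player values; the real frames come from the exported CSVs.

```python
import pandas as pd

# Tiny illustrative frames; real data is loaded from the matches and
# player-games CSVs. Team, venue, and player values here are made up.
matches = pd.DataFrame({
    "match_id": ["2025_2025-03-13_Carlton_Richmond"],
    "round": ["Round 1"],
    "venue": ["M.C.G."],
    "margin": [12],
})
player_games = pd.DataFrame({
    "match_id": ["2025_2025-03-13_Carlton_Richmond"] * 2,
    "player_name": ["Player A", "Player B"],
    "disposals": [25, 18],
})

# SQL-style left join on the shared key: every player-game row picks up
# its match context (round, venue, margin).
df = player_games.merge(matches, on="match_id", how="left")
```

From here, season-level features (rolling form, venue effects) are a `groupby` away.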

Implementation

1. Config & Fetch

I started by wrapping the scraper settings in a small config object:

  • base_url for AFL Tables
  • cache_dir for raw HTML (data/cache)
  • delay between HTTP requests to be polite to the site

The fetch layer:

  • Builds season URLs like .../seas/2025.html
  • Downloads and caches HTML on first run
  • Reads from disk on subsequent runs, so I can repeatedly refine the parser without hitting the network
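The cache-first fetch layer can be sketched as below. The directory, delay value, and function name are illustrative (the project uses requests; stdlib urllib is used here to keep the sketch self-contained):

```python
import time
from pathlib import Path
from urllib.request import urlopen  # the project uses requests; urllib keeps this self-contained

CACHE_DIR = Path("data/cache")
DELAY_SECONDS = 1.0  # assumed polite delay between requests

def fetch(url: str, cache_name: str) -> str:
    """Return page HTML, reading from the local cache when available."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / cache_name
    if cache_file.exists():
        # Cache hit: no network traffic at all.
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: wait, download, then persist for future runs.
    time.sleep(DELAY_SECONDS)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Because parsing only ever reads the cached file after the first run, the parser can be re-run dozens of times with zero extra load on AFL Tables.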

2. Season Pages → Match Dataset

Season pages encode each match as two rows:

  • Row 1 (home): team, quarter scores, total points, date/time, attendance, venue
  • Row 2 (away): team, quarter scores, total points, result text, [Match stats] link

The parser:

  • Walks all <tr> rows with BeautifulSoup
  • Detects round headings (e.g. “Round 1”, “Qualifying Final”) and tracks current_round
  • Uses a regex to extract:
    • Team name
    • Quarter scores (q1–q4)
    • Total points
    • A “tail” segment after the score (which might contain date/Att/Venue or result text)
  • Pulls:
    • date_raw, attendance, venue from the first row’s tail
    • The stats_path link from the second row (the one with [Match stats])

Once it has two team rows buffered, it:

  • Computes the margin and winner
  • Parses the date into a proper date object where possible
  • Constructs a deterministic match_id
  • Emits a single match record, later written to CSV
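The final emit step can be sketched as a small record builder. The field names follow the schema above; the dataclass shape, sign convention for margin, and ISO date format inside match_id are assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Match:
    # A subset of the match schema; remaining columns omitted for brevity.
    match_id: str
    season: int
    match_date: date
    home_team: str
    away_team: str
    home_points: int
    away_points: int
    margin: int
    winner: str

def build_match(season: int, match_date: date, home_team: str, away_team: str,
                home_points: int, away_points: int) -> Match:
    # Margin is home minus away here (sign convention assumed).
    margin = home_points - away_points
    winner = "home" if margin > 0 else "away" if margin < 0 else "draw"
    # Deterministic key, so re-scraping always regenerates the same ids.
    match_id = f"{season}_{match_date.isoformat()}_{home_team}_{away_team}"
    return Match(match_id, season, match_date, home_team, away_team,
                 home_points, away_points, margin, winner)
```

A deterministic match_id means the matches and player-games CSVs can be regenerated independently and still join cleanly.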

3. Match Stats Pages → Player-Games

For each match that has a stats_path, a second script:

  1. Builds the full stats URL (e.g. https://www.afltables.com/afl/stats/games/2025/...)
  2. Fetches + caches the match stats HTML
  3. Parses each team table using the header row (<thead>) to map AFL shorthand to internal names

I use a central mapping like:

# in config
{
  "KI": "kicks",
  "MK": "marks",
  "HB": "handballs",
  "DI": "disposals",
  "GL": "goals",
  "BH": "behinds",
  "HO": "hitouts",
  "TK": "tackles",
  "CL": "clearances",
  "CP": "contested_possessions",
  "UP": "uncontested_possessions",
  "GA": "goal_assists",
  "%P": "percent_time"
}

Each <tr> for a player is normalised into:

  • Clean numeric fields (blank cells coerced to 0 where appropriate)
  • Context columns like match_id, season, date, team, opponent, is_home, venue

All of these rows are appended into a single player_games CSV.
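The row-normalisation step can be sketched as follows. STAT_COLUMNS mirrors a subset of the config mapping shown above; the function name, cell values, and context keys are illustrative:

```python
# Subset of the shorthand -> internal-name mapping from the config.
STAT_COLUMNS = {"KI": "kicks", "MK": "marks", "HB": "handballs", "DI": "disposals"}

def normalise_row(headers: list[str], cells: list[str], context: dict) -> dict:
    """Turn one player's <tr> cells into a flat record with context columns."""
    row = dict(context)  # match_id, season, team, opponent, is_home, ...
    for header, cell in zip(headers, cells):
        name = STAT_COLUMNS.get(header)
        if name is None:
            continue  # column we don't track (e.g. jumper number handled elsewhere)
        # Blank cells on AFL Tables mean zero for that stat.
        row[name] = int(cell) if cell.strip().isdigit() else 0
    return row
```

Driving the loop off the header row, rather than hard-coded column positions, means a reordered or extended table on AFL Tables degrades gracefully instead of silently shifting stats into the wrong columns.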

Impact

This turns AFL Tables from a manual, page-by-page resource into a reproducible dataset I can use for modelling, visualisation, and analysis.

With this pipeline, I can now:

  • Run exploratory data analysis across seasons
  • Build features like rolling form, venue effects, and player usage profiles
  • Prototype models for match outcomes or player stat lines (for example, disposal or goal totals)

What I Learned

  • How to design HTML scrapers that are robust but not brittle, using a mix of BeautifulSoup and regex
  • The value of caching and small, composable scripts when iterating on web data
  • How to structure data so it’s friendly to:
    • SQL-style joins
    • pandas workflows
    • Downstream ML / analytics

This project sits at the intersection of sports, data engineering, and analytics, and gives me a reusable foundation for any AFL-related modelling I want to explore next.