Historic AFL Data Web Scraper

Python script to gather historic AFL data from afltables.

AFL Tables Scraper & Player Stats Pipeline

This project is a data-engineering pipeline that turns raw HTML from AFL Tables into clean, ML-ready datasets for:

  • Match-level results (scores, rounds, venues, margins)
  • Player-game stats (disposals, goals, tackles, ruck work, etc.)

The goal is to build a solid data foundation for modelling, forecasting and exploratory analysis of AFL performance.

AFL Tables scraper

Stack
Python, requests, BeautifulSoup, pandas, dataclasses
Domain
AFL match & player statistics from AFL Tables
Outputs
Matches CSV + Player-games CSV
Focus
Robust parsing, caching & schema design

Key Outcomes

Match rows
1 per game
Player-game rows
1 per player per game
Seasons
Configurable (e.g. a single season like 2025 now, all years later)

The scraper:

  • Caches raw HTML locally so I can iterate without hammering AFL Tables
  • Parses season pages into a structured match dataset
  • Follows each game’s “Match stats” link to pull full player box scores
  • Exports clean CSVs ready for pandas / notebooks / ML experiments

Problem I Wanted to Solve

If you want to do anything serious with AFL data, you immediately hit a few issues:

  1. Everything lives in HTML tables
    AFL Tables is an incredible resource, but not built for programmatic access.

  2. Different pages = different structures

    • Season pages: rounds, two rows per match, result text, match stats link
    • Match pages: separate tables for each team with dense per-player stats
  3. I want reproducible datasets, not manual exports
    The aim is to treat the data as a versioned dataset regenerated by code, not spreadsheets and copy–paste.

Data Model

High-level view: matches joined to player-games on match_id

Matches table (examples):

  • match_id – stable key: {season}_{date}_{home_team}_{away_team}
  • season, round, date
  • home_team, away_team
  • Quarter-by-quarter scores (home_q1–home_q4, away_q1–away_q4)
  • home_points, away_points, margin, winner (home / away / draw)
  • venue, attendance
  • stats_path – relative URL to the match stats page

Player-games table (examples):

  • match_id, season, date
  • team, opponent, is_home
  • jumper, player_name
  • kicks, marks, handballs, disposals, goals, behinds
  • hitouts, tackles, clearances, inside_50s
  • contested_possessions, uncontested_possessions
  • goal_assists, one_percenters, percent_time

This schema makes it trivial to join matches to player_games and build features for modelling.
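The join can be sketched in pandas. This is a minimal illustration with invented team and player values; the real frames come from the exported CSVs.

```python
import pandas as pd

# Tiny illustrative frames; real data is loaded from the matches and
# player-games CSVs. Team, venue, and player values here are made up.
matches = pd.DataFrame({
    "match_id": ["2025_2025-03-13_Carlton_Richmond"],
    "round": ["Round 1"],
    "venue": ["M.C.G."],
    "margin": [12],
})
player_games = pd.DataFrame({
    "match_id": ["2025_2025-03-13_Carlton_Richmond"] * 2,
    "player_name": ["Player A", "Player B"],
    "disposals": [25, 18],
})

# SQL-style left join on the shared key: every player-game row picks up
# its match context (round, venue, margin).
df = player_games.merge(matches, on="match_id", how="left")
```

From here, season-level features (rolling form, venue effects) are a `groupby` away.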

Implementation

1. Config & Fetch

I started by wrapping the scraper settings in a small config object:

  • base_url for AFL Tables
  • cache_dir for raw HTML (data/cache)
  • delay between HTTP requests to be polite to the site

The fetch layer:

  • Builds season URLs like .../seas/2025.html
  • Downloads and caches HTML on first run
  • Reads from disk on subsequent runs, so I can repeatedly refine the parser without hitting the network
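The cache-first fetch layer can be sketched as below. The directory, delay value, and function name are illustrative (the project uses requests; stdlib urllib is used here to keep the sketch self-contained):

```python
import time
from pathlib import Path
from urllib.request import urlopen  # the project uses requests; urllib keeps this self-contained

CACHE_DIR = Path("data/cache")
DELAY_SECONDS = 1.0  # assumed polite delay between requests

def fetch(url: str, cache_name: str) -> str:
    """Return page HTML, reading from the local cache when available."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / cache_name
    if cache_file.exists():
        # Cache hit: no network traffic at all.
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: wait, download, then persist for future runs.
    time.sleep(DELAY_SECONDS)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Because parsing only ever reads the cached file after the first run, the parser can be re-run dozens of times with zero extra load on AFL Tables.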

2. Season Pages → Match Dataset

Season pages encode each match as two rows:

  • Row 1 (home): team, quarter scores, total points, date/time, attendance, venue
  • Row 2 (away): team, quarter scores, total points, result text, [Match stats] link

The parser:

  • Walks all <tr> rows with BeautifulSoup
  • Detects round headings (e.g. “Round 1”, “Qualifying Final”) and tracks current_round
  • Uses a regex to extract:
    • Team name
    • Quarter scores (q1–q4)
    • Total points
    • A “tail” segment after the score (which might contain date/Att/Venue or result text)
  • Pulls:
    • date_raw, attendance, venue from the first row’s tail
    • The stats_path link from the second row (the one with [Match stats])

Once it has two team rows buffered, it:

  • Computes the margin and winner
  • Parses the date into a proper date object where possible
  • Constructs a deterministic match_id
  • Emits a single match record, later written to CSV
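The final emit step can be sketched as a small record builder. The field names follow the schema above; the dataclass shape, sign convention for margin, and ISO date format inside match_id are assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Match:
    # A subset of the match schema; remaining columns omitted for brevity.
    match_id: str
    season: int
    match_date: date
    home_team: str
    away_team: str
    home_points: int
    away_points: int
    margin: int
    winner: str

def build_match(season: int, match_date: date, home_team: str, away_team: str,
                home_points: int, away_points: int) -> Match:
    # Margin is home minus away here (sign convention assumed).
    margin = home_points - away_points
    winner = "home" if margin > 0 else "away" if margin < 0 else "draw"
    # Deterministic key, so re-scraping always regenerates the same ids.
    match_id = f"{season}_{match_date.isoformat()}_{home_team}_{away_team}"
    return Match(match_id, season, match_date, home_team, away_team,
                 home_points, away_points, margin, winner)
```

A deterministic match_id means the matches and player-games CSVs can be regenerated independently and still join cleanly.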

3. Match Stats Pages → Player-Games

For each match that has a stats_path, a second script:

  1. Builds the full stats URL (e.g. https://www.afltables.com/afl/stats/games/2025/...)
  2. Fetches + caches the match stats HTML
  3. Parses each team table using the header row (<thead>) to map AFL shorthand to internal names

I use a central mapping like:

# in config
{
  "KI": "kicks",
  "MK": "marks",
  "HB": "handballs",
  "DI": "disposals",
  "GL": "goals",
  "BH": "behinds",
  "HO": "hitouts",
  "TK": "tackles",
  "CL": "clearances",
  "CP": "contested_possessions",
  "UP": "uncontested_possessions",
  "GA": "goal_assists",
  "%P": "percent_time"
}

Each <tr> for a player is normalised into:

  • Clean numeric fields (blank cells coerced to 0 where appropriate)
  • Context columns like match_id, season, date, team, opponent, is_home, venue

All of these rows are appended into a single player_games CSV.
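The row-normalisation step can be sketched as follows. STAT_COLUMNS mirrors a subset of the config mapping shown above; the function name, cell values, and context keys are illustrative:

```python
# Subset of the shorthand -> internal-name mapping from the config.
STAT_COLUMNS = {"KI": "kicks", "MK": "marks", "HB": "handballs", "DI": "disposals"}

def normalise_row(headers: list[str], cells: list[str], context: dict) -> dict:
    """Turn one player's <tr> cells into a flat record with context columns."""
    row = dict(context)  # match_id, season, team, opponent, is_home, ...
    for header, cell in zip(headers, cells):
        name = STAT_COLUMNS.get(header)
        if name is None:
            continue  # column we don't track (e.g. jumper number handled elsewhere)
        # Blank cells on AFL Tables mean zero for that stat.
        row[name] = int(cell) if cell.strip().isdigit() else 0
    return row
```

Driving the loop off the header row, rather than hard-coded column positions, means a reordered or extended table on AFL Tables degrades gracefully instead of silently shifting stats into the wrong columns.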

Impact

This turns AFL Tables from a manual, page-by-page resource into a reproducible dataset I can use for modelling, visualisation, and analysis.

With this pipeline, I can now:

  • Run exploratory data analysis across seasons
  • Build features like rolling form, venue effects, and player usage profiles
  • Prototype models for match outcomes or player stat lines (for example, disposal or goal totals)

What I Learned

  • How to design HTML scrapers that are robust but not brittle, using a mix of BeautifulSoup and regex
  • The value of caching and small, composable scripts when iterating on web data
  • How to structure data so it’s friendly to:
    • SQL-style joins
    • pandas workflows
    • Downstream ML / analytics

This project sits at the intersection of sports, data engineering, and analytics, and gives me a reusable foundation for any AFL-related modelling I want to explore next.