AFL Tables Scraper & Player Stats Pipeline
This project is a data-engineering pipeline that turns raw HTML from AFL Tables into clean, ML-ready datasets for:
- Match-level results (scores, rounds, venues, margins)
- Player-game stats (disposals, goals, tackles, ruck work, etc.)
The goal is to build a solid data foundation for modelling, forecasting and exploratory analysis of AFL performance.
Key Outcomes
The scraper:
- Caches raw HTML locally so I can iterate without hammering AFL Tables
- Parses season pages into a structured match dataset
- Follows each game’s “Match stats” link to pull full player box scores
- Exports clean CSVs ready for pandas / notebooks / ML experiments
Problem I Wanted to Solve
If you want to do anything serious with AFL data, you immediately hit a few issues:
- **Everything lives in HTML tables.** AFL Tables is an incredible resource, but it is not built for programmatic access.
- **Different pages = different structures.**
  - Season pages: rounds, two rows per match, result text, match stats link
  - Match pages: separate tables for each team with dense per-player stats
- **I want reproducible datasets, not manual exports.** The aim is to treat the data as a versioned dataset regenerated by code, not spreadsheets and copy–paste.
Data Model
Matches table (examples):
- `match_id` – stable key: `{season}_{date}_{home_team}_{away_team}`
- `season`, `round`, `date`
- `home_team`, `away_team`
- Quarter-by-quarter scores (`home_q1`–`home_q4`, `away_q1`–`away_q4`)
- `home_points`, `away_points`, `margin`, `winner` (home/away/draw)
- `venue`, `attendance`
- `stats_path` – relative URL to the match stats page
Player-games table (examples):
- `match_id`, `season`, `date`
- `team`, `opponent`, `is_home`
- `jumper`, `player_name`
- `kicks`, `marks`, `handballs`, `disposals`, `goals`, `behinds`
- `hitouts`, `tackles`, `clearances`, `inside_50s`
- `contested_possessions`, `uncontested_possessions`
- `goal_assists`, `one_percenters`, `percent_time`
This schema makes it trivial to join `matches` ↔ `player_games` and build features for modelling.
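The join above is a one-liner in pandas. A minimal sketch, using tiny in-memory stand-ins for the two CSVs the pipeline writes (the rows and file contents here are invented for illustration):

```python
import pandas as pd

# Tiny stand-ins for the matches and player_games tables.
matches = pd.DataFrame({
    "match_id": ["2025_2025-03-14_GEE_FRE"],
    "venue": ["Kardinia Park"],
    "margin": [12],
    "winner": ["home"],
})
player_games = pd.DataFrame({
    "match_id": ["2025_2025-03-14_GEE_FRE", "2025_2025-03-14_GEE_FRE"],
    "player_name": ["Player A", "Player B"],
    "disposals": [28, 19],
})

# SQL-style left join on the shared key: one row per player per game,
# enriched with match context.
df = player_games.merge(matches, on="match_id", how="left")
```

In the real pipeline the same `merge` runs on the CSVs loaded with `pd.read_csv`.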
Implementation
1. Config & Fetch
I started by wrapping the scraper settings in a small config object:
- `base_url` for AFL Tables
- `cache_dir` for raw HTML (`data/cache`)
- `delay` between HTTP requests, to be polite to the site
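A minimal sketch of what that config object might look like as a frozen dataclass (the default values here are assumptions, not the project's actual settings):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperConfig:
    base_url: str = "https://www.afltables.com/afl"  # assumed base URL
    cache_dir: str = "data/cache"                    # raw HTML cache
    delay: float = 1.0                               # seconds between requests
```

Freezing the dataclass keeps the settings immutable once the scraper starts.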
The fetch layer:
- Builds season URLs like `.../seas/2025.html`
- Downloads and caches HTML on first run
- Reads from disk on subsequent runs, so I can repeatedly refine the parser without hitting the network
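The cache-then-fetch behaviour can be sketched in a single function. This is an illustrative stand-in, not the project's actual fetch layer; the cache filename scheme (last URL segment) is an assumption:

```python
import time
import urllib.request
from pathlib import Path

def fetch(url: str, cache_dir: str = "data/cache", delay: float = 1.0) -> str:
    """Return page HTML, hitting the network only on a cache miss."""
    cache_file = Path(cache_dir) / url.rstrip("/").split("/")[-1]
    if cache_file.exists():
        # Cache hit: no HTTP request, so the parser can be re-run freely.
        return cache_file.read_text(encoding="utf-8")
    time.sleep(delay)  # be polite to the site
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Because the cache key is derived from the URL, re-running the pipeline is idempotent: only never-seen pages trigger a download.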
2. Season Pages → Match Dataset
Season pages encode each match as two rows:
- Row 1 (home): team, quarter scores, total points, date/time, attendance, venue
- Row 2 (away): team, quarter scores, total points, result text, `[Match stats]` link
The parser:
- Walks all `<tr>` rows with BeautifulSoup
- Detects round headings (e.g. “Round 1”, “Qualifying Final”) and tracks `current_round`
- Uses a regex to extract:
  - Team name
  - Quarter scores (`q1`–`q4`)
  - Total points
  - A “tail” segment after the score (which might contain date/Att/Venue or result text)
- Pulls:
  - `date_raw`, `attendance`, `venue` from the first row’s tail
  - The `stats_path` link from the second row (the one with `[Match stats]`)
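A regex of that shape might look like the following. This is a simplified sketch: the sample row text is invented, and the real AFL Tables layout differs in detail (whitespace, abbreviations, finals rows):

```python
import re

# Named groups mirror the fields the parser extracts from each team row.
ROW_RE = re.compile(
    r"^(?P<team>[A-Za-z ]+?)\s+"                      # team name
    r"(?P<q1>\d+\.\d+)\s+(?P<q2>\d+\.\d+)\s+"         # quarter scores as
    r"(?P<q3>\d+\.\d+)\s+(?P<q4>\d+\.\d+)\s+"         # goals.behinds
    r"(?P<points>\d+)"                                # total points
    r"(?P<tail>.*)$"                                  # date/Att/Venue or result
)

m = ROW_RE.match(
    "Geelong 4.2 8.5 12.7 15.10 100 Fri 14-Mar-2025 Att: 38,372 Venue: Kardinia Park"
)
```

Keeping the tail as one capture and post-processing it separately is what lets the same regex handle both the home row (date/attendance/venue) and the away row (result text).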
Once it has two team rows buffered, it:
- Computes the margin and winner
- Parses the date into a proper `date` object where possible
- Constructs a deterministic `match_id`
- Emits a single match record, later written to CSV
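The steps above can be sketched as a small function. The function name and the input dicts are hypothetical; the `match_id` format and the margin/winner logic follow the schema described earlier:

```python
from datetime import date

def build_match(home: dict, away: dict, season: int, match_date: date) -> dict:
    """Turn two buffered team rows into one match record (sketch)."""
    margin = home["points"] - away["points"]
    winner = "home" if margin > 0 else "away" if margin < 0 else "draw"
    # Deterministic key: same inputs always yield the same match_id.
    match_id = f"{season}_{match_date.isoformat()}_{home['team']}_{away['team']}"
    return {"match_id": match_id, "margin": margin, "winner": winner}

rec = build_match({"team": "Geelong", "points": 100},
                  {"team": "Fremantle", "points": 88},
                  2025, date(2025, 3, 14))
```

The deterministic key is what makes the dataset reproducible: regenerating the CSVs never shuffles identifiers.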
3. Match Stats Pages → Player-Games
For each match that has a `stats_path`, a second script:
- Builds the full stats URL (e.g. `https://www.afltables.com/afl/stats/games/2025/...`)
- Fetches + caches the match stats HTML
- Parses each team table using the header row (`<thead>`) to map AFL shorthand to internal names
I use a central mapping like:
```python
# in config
{
    "KI": "kicks",
    "MK": "marks",
    "HB": "handballs",
    "DI": "disposals",
    "GL": "goals",
    "BH": "behinds",
    "HO": "hitouts",
    "TK": "tackles",
    "CL": "clearances",
    "CP": "contested_possessions",
    "UP": "uncontested_possessions",
    "GA": "goal_assists",
    "%P": "percent_time",
}
```
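Applying that mapping to one parsed row might look like this. `STAT_MAP` is an assumed name for the mapping above (abbreviated here), and `header`/`cells` stand in for what BeautifulSoup would yield from the `<thead>` and a player `<tr>`; the values are invented:

```python
STAT_MAP = {"KI": "kicks", "MK": "marks", "HB": "handballs", "DI": "disposals"}

def normalise_row(header: list[str], cells: list[str]) -> dict:
    """Map shorthand headers to internal column names, coercing blanks to 0."""
    row = {}
    for short, value in zip(header, cells):
        col = STAT_MAP.get(short)
        if col is None:
            continue  # skip columns we don't track (e.g. jumper, name)
        row[col] = int(value) if value.strip().isdigit() else 0
    return row

row = normalise_row(["#", "Player", "KI", "MK", "HB", "DI"],
                    ["7", "Player A", "21", "6", "", "21"])
```

Driving the column mapping from the header row, rather than by position, means a reordered or extra column on AFL Tables degrades gracefully instead of silently shifting every stat.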
Each `<tr>` for a player is normalised into:
- Clean numeric fields (blank cells coerced to 0 where appropriate)
- Context columns like `match_id`, `season`, `date`, `team`, `opponent`, `is_home`, `venue`
All of these rows are appended into a single `player_games` CSV.
Why this matters
This turns AFL Tables from a manual, page-by-page resource into a reproducible dataset I can use for modelling, visualisation, and analysis.
With this pipeline, I can now:
- Run exploratory data analysis across seasons
- Build features like rolling form, venue effects, and player usage profiles
- Prototype models for match outcomes or player stat lines (for example, disposal or goal totals)
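As one concrete example of a "rolling form" feature: a player's mean disposals over their previous three games. The data here is an invented stand-in for the `player_games` table; the `shift(1)` keeps the feature leak-free by excluding the current game:

```python
import pandas as pd

pg = pd.DataFrame({
    "player_name": ["A"] * 5,
    "date": pd.date_range("2025-03-14", periods=5, freq="7D"),
    "disposals": [20, 25, 30, 15, 22],
})
pg = pg.sort_values(["player_name", "date"])

# Rolling mean of the *previous* 3 games, computed per player.
pg["form_disposals_3"] = (
    pg.groupby("player_name")["disposals"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```

The same pattern extends to venue effects or usage profiles by changing the group key and the aggregated column.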
What I Learned
- How to design HTML scrapers that are robust but not brittle, using a mix of BeautifulSoup and regex
- The value of caching and small, composable scripts when iterating on web data
- How to structure data so it’s friendly to:
- SQL-style joins
- pandas workflows
- Downstream ML / analytics
This project sits at the intersection of sports, data engineering, and analytics, and gives me a reusable foundation for any AFL-related modelling I want to explore next.