Why GIS data needs better AI defaults

A year of cleaning maps for models — and the tiny choices that broke them. Notes for anyone who has stared at a CRS error at midnight.

01 / coordinates — The world is round, your CSV is not.

Every dataset arrives in some coordinate reference system. Most of them lie about it. The default — assume EPSG:4326, hope for the best — works in tutorials and breaks in production.

# what I do now, every time
import geopandas as gpd

gdf = gpd.read_file("layer.geojson")
if gdf.crs is None:
    # raise instead of assert: asserts are stripped under `python -O`
    raise ValueError("missing CRS: refuse the data")
gdf = gdf.to_crs(epsg=4326)

If a layer doesn't declare its CRS, it shouldn't load. That single rule saved me three weeks.
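A declared CRS can still be the wrong one. A minimal sketch of a cheap sanity check, assuming the layer claims EPSG:4326: its bounding box should fit inside the longitude/latitude range. The helper name looks_like_lonlat and the tolerance are my own illustration, not part of geopandas. With a GeoDataFrame, the bounds would come from gdf.total_bounds.

```python
def looks_like_lonlat(bounds, tol=1e-9):
    """Return True if (minx, miny, maxx, maxy) fits inside the lon/lat range.

    Hypothetical helper: passing it proves nothing, failing it proves
    the layer is not really in degrees.
    """
    minx, miny, maxx, maxy = bounds
    return (
        -180 - tol <= minx <= maxx <= 180 + tol
        and -90 - tol <= miny <= maxy <= 90 + tol
    )


# Coordinates in degrees: plausible for EPSG:4326.
degrees_ok = looks_like_lonlat((-122.5, 37.7, -122.3, 37.9))

# Coordinates in metres (for example a UTM zone) labelled as 4326: rejected.
metres_ok = looks_like_lonlat((500000.0, 4100000.0, 510000.0, 4110000.0))
```

This is a necessary condition only. A projected layer whose coordinates happen to be small would still pass, so treat a failure as a hard stop and a pass as "not obviously wrong".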

02 / nulls — Missing data has shape.

Models don't know that 0 for elevation means "ocean" in one column and "no reading" in another. The AI defaults — fill with mean, fill with zero — both lie. I've started shipping a tiny missing_kind channel alongside features. It's ugly. It works.
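A minimal sketch of what a missing_kind channel could look like, assuming the two masks (ocean, no reading) are already known from the source metadata. The integer codes, the function name with_missing_kind, and the 0.0 placeholder are illustrative choices, not a fixed scheme.

```python
import numpy as np

# Hypothetical codes for the missing_kind channel.
OBSERVED, NO_READING, OCEAN = 0, 1, 2


def with_missing_kind(values, no_reading_mask, ocean_mask):
    """Return (filled_values, missing_kind) as two aligned arrays."""
    values = np.asarray(values, dtype=float)
    kind = np.full(values.shape, OBSERVED, dtype=np.int8)
    kind[np.asarray(ocean_mask, dtype=bool)] = OCEAN
    # "No reading" wins if both masks are set for the same cell.
    kind[np.asarray(no_reading_mask, dtype=bool)] = NO_READING
    # The placeholder value is arbitrary; the meaning lives in `kind`.
    filled = np.where(kind == OBSERVED, values, 0.0)
    return filled, kind


elevation = [12.0, 0.0, np.nan, 340.5]
filled, kind = with_missing_kind(
    elevation,
    no_reading_mask=[False, False, True, False],
    ocean_mask=[False, True, False, False],
)
```

The model sees both arrays, so a 0.0 next to OCEAN and a 0.0 next to NO_READING are no longer the same input.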

03 / scale — Resolution is a feature, not a setting.

You can train a model on 30m Landsat and apply it to 10m Sentinel and the loss goes down. It also goes wrong, quietly. Resolution leaks into the model as texture. The fix isn't more data — it's labeling the resolution and letting the model condition on it.
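One way to label the resolution, sketched under assumptions: append the ground resolution as an extra feature column so the model can condition on it. The function name add_resolution_channel and the log10 scaling are my own choices for illustration; other encodings (one-hot per sensor, raw metres) would serve the same purpose.

```python
import numpy as np


def add_resolution_channel(features, resolution_m):
    """Append log10(ground resolution in metres) as an extra feature column."""
    features = np.asarray(features, dtype=float)
    res = np.full((features.shape[0], 1), np.log10(resolution_m))
    return np.hstack([features, res])


# Two made-up Landsat samples (30 m) and one made-up Sentinel sample (10 m).
landsat = add_resolution_channel([[0.21, 0.34], [0.18, 0.40]], resolution_m=30)
sentinel = add_resolution_channel([[0.25, 0.31]], resolution_m=10)
```

Both arrays now share a layout with the resolution in the last column, so they can be stacked into one training set without the model having to guess which sensor a row came from.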

thanks for reading