Why GIS data needs better AI defaults
A year of cleaning maps for models — and the tiny choices that broke them. Notes for anyone who has stared at a CRS error at midnight.
01 / coordinates — The world is round, your CSV is not.
Every dataset arrives in some coordinate reference system. Most of them lie about it. The default — assume EPSG:4326, hope for the best — works in tutorials and breaks in production.
# what I do now, every time
import geopandas as gpd

gdf = gpd.read_file("layer.geojson")
if gdf.crs is None:
    # raise, don't assert — asserts vanish under `python -O`
    raise ValueError("missing CRS — refuse the data")
gdf = gdf.to_crs(epsg=4326)  # reproject explicitly, never assume
If a layer doesn't declare its CRS, it shouldn't load. That single rule saved me three weeks.
02 / nulls — Missing data has shape.
Models don't know that 0 for elevation means "ocean" in one column and "no reading" in another. The AI defaults — fill with mean, fill with zero — both lie. I've started shipping a tiny missing_kind channel alongside features. It's ugly. It works.
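A minimal sketch of that missing_kind channel, assuming a pandas DataFrame with an ambiguous elevation column; the column names (`is_ocean`, `elevation_missing_kind`) and the 0/1/2 encoding are my own, not a standard:

```python
import numpy as np
import pandas as pd

# hypothetical frame: elevation of 0.0 is ambiguous on its own
df = pd.DataFrame({
    "elevation": [12.5, 0.0, 0.0, 340.2],
    "is_ocean":  [False, True, False, False],  # known from a land/water layer
})

# record WHY a value is zero, not just that it is:
# 0 = valid reading, 1 = ocean (a true zero), 2 = no reading
df["elevation_missing_kind"] = np.select(
    [df["is_ocean"], (df["elevation"] == 0.0) & ~df["is_ocean"]],
    [1, 2],
    default=0,
)

# impute only the true gaps; leave ocean zeros alone
gap = df["elevation_missing_kind"] == 2
df.loc[gap, "elevation"] = df.loc[~gap, "elevation"].median()
```

The model gets both columns, so a zero that means "ocean" and a zero that meant "sensor dropout" are no longer the same input.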
03 / scale — Resolution is a feature, not a setting.
You can train a model on 30m Landsat and apply it to 10m Sentinel and the loss goes down. It also goes wrong, quietly. Resolution leaks into the model as texture. The fix isn't more data — it's labeling the resolution and letting the model condition on it.
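One way to do that conditioning, sketched as an extra input channel carrying the ground sample distance; the function name and the log-scaling choice are mine, not a fixed recipe:

```python
import numpy as np

def add_resolution_channel(patch: np.ndarray, gsd_m: float) -> np.ndarray:
    """Append a constant resolution channel to an (H, W, C) image patch.

    gsd_m: ground sample distance in metres/pixel,
           e.g. 30.0 for Landsat, 10.0 for Sentinel-2.
    """
    h, w, _ = patch.shape
    # log-scale so 10 m vs 30 m becomes a modest, learnable offset
    res = np.full((h, w, 1), np.log(gsd_m), dtype=patch.dtype)
    return np.concatenate([patch, res], axis=-1)

landsat  = add_resolution_channel(np.zeros((64, 64, 6), np.float32), gsd_m=30.0)
sentinel = add_resolution_channel(np.zeros((64, 64, 6), np.float32), gsd_m=10.0)
```

Now the same network sees, in its input, which texture regime it is looking at, instead of silently absorbing resolution into its learned features.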