The Dzud — Teaching a Machine to Predict Disaster

By the end of this notebook you will have trained two models, compared their accuracy, and used the better one to predict what would happen to a Mongolian herding family given a specific forecast of winter temperature and summer drought.

Step 1: Load the data and imports

We are going to train the machine using scikit-learn, the most widely used machine learning library in Python. We will use just three things from it: a LinearRegression model, a function to split data into training and test sets, and a function to measure prediction error.

import httpx
import io
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

url = "https://raw.githubusercontent.com/simonguest/codercub/main/labs/03/notebooks/mongolia_dzud_1990_2013.csv"
response = httpx.get(url)
response.raise_for_status()  # stop here with a clear error if the download failed
df = pd.read_csv(io.StringIO(response.text))

print(f"Loaded {len(df)} rows")
df.head()

Step 2: The idea behind linear regression

Before writing any model code, it is worth understanding what linear regression actually does.

In the previous notebook you found that colder winters tend to produce higher mortality. If you drew a straight line through that scatter plot that best fits the data, you would have a linear regression model. Given a new winter temperature value, the line tells you the predicted mortality.

The line has two properties:

- A slope: how much mortality changes for each one-degree change in temperature
- An intercept: where the line crosses the y-axis (the predicted mortality at 0°C)

A model with two input variables (temperature and drought) fits a flat plane through the data instead of a line — the same idea, one dimension higher.
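
To make the slope and intercept concrete, here is a minimal sketch that fits a line by hand with NumPy. The five temperature and mortality values are made up for illustration and are not rows from the dataset. The formulas are the standard least-squares ones for a single input.

```python
import numpy as np

# Made-up illustrative values, NOT rows from the dzud dataset
temps = np.array([-26.0, -23.0, -20.0, -17.0, -14.0])   # winter temperature, °C
mortality = np.array([14.0, 11.0, 7.0, 5.0, 3.0])       # mortality, %

x_mean, y_mean = temps.mean(), mortality.mean()

# Least-squares slope: how x and y vary together, divided by how much x varies
slope = ((temps - x_mean) * (mortality - y_mean)).sum() / ((temps - x_mean) ** 2).sum()

# The best-fit line always passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print(f"slope:     {slope:.3f} percentage points per °C")
print(f"intercept: {intercept:.3f} %")
print(f"prediction at -22°C: {intercept + slope * -22.0:.2f} %")
```

With these made-up numbers the slope comes out negative: as temperature rises, predicted mortality falls. Notice also that the intercept describes a 0°C winter, far outside the range of the data, so it is not meaningful on its own.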

Before training the model: if winter temperature drops by 5°C, what do you predict happens to mortality? What if the summer drought index rises from 0.2 to 0.8? Write your prediction — you can check it against the model’s coefficients later.

Step 3: Prepare the data

Before training, we need to decide on three things:

1. Which columns are our inputs (features): the variables the model will learn from
2. Which column is our output (target): the value the model will try to predict
3. How to split the data into a training set (data the model learns from) and a test set (data we hold back to evaluate it)

Keeping the test set hidden during training is critical. If we evaluated the model on data it had already seen, we would have no idea how well it generalizes to genuinely new situations.
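
To see why this matters, here is a small sketch on synthetic data (generated for this example, not the dzud dataset). It uses a decision tree, a different kind of model from the one in this notebook, because an unrestricted tree can memorize its training rows. Its error on rows it has seen looks perfect, while its error on held-back rows tells the real story.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the dzud dataset):
# a noisy straight-line relationship between one input and one output
rng = np.random.default_rng(0)
X_demo = rng.uniform(-30, -10, size=(200, 1))
y_demo = -0.8 * X_demo[:, 0] - 8 + rng.normal(0, 2, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
  X_demo, y_demo, test_size=0.2, random_state=42
)

# An unrestricted decision tree can memorize every training row
memorizer = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

train_mae = mean_absolute_error(yd_train, memorizer.predict(Xd_train))
test_mae = mean_absolute_error(yd_test, memorizer.predict(Xd_test))

print(f"MAE on rows the model has seen:     {train_mae:.2f}")
print(f"MAE on rows the model has not seen: {test_mae:.2f}")
```

If we only looked at the first number, we would think the model was flawless.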

# Input features and target
X = df[["winter_temp_c"]]          # start with one variable only
y = df["mortality_pct"]

# Split: 80% for training, 20% held back for testing
# random_state=42 ensures everyone gets the same split
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

print(f"Training rows: {len(X_train)}")
print(f"Test rows:     {len(X_test)}")
print(f"\nThe model will learn from {len(X_train)} aimag-years")
print(f"and be evaluated on {len(X_test)} aimag-years it has never seen")
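
If you are curious what random_state actually does, the sketch below splits a stand-in list of 20 row numbers (not the real data) three times. The same random_state picks the same test rows every time. A different value will usually pick different ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20)   # stand-in for 20 row numbers, not the real data

# Same random_state twice: identical test rows
_, test_a = train_test_split(rows, test_size=0.2, random_state=42)
_, test_b = train_test_split(rows, test_size=0.2, random_state=42)

# A different random_state: usually a different selection
_, test_c = train_test_split(rows, test_size=0.2, random_state=7)

print("random_state=42:      ", np.sort(test_a))
print("random_state=42 again:", np.sort(test_b))
print("random_state=7:       ", np.sort(test_c))
```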

Step 4: Train the first model: temperature only

Two lines of code train a model: one creates it and one fits it. That is all scikit-learn requires.

The fit call is where the learning happens: it finds the slope and intercept that best describe the relationship between winter temperature and mortality in the training data.

model_1 = LinearRegression()
model_1.fit(X_train, y_train)

print(f"Model learned:")
print(f"  Slope (per 1°C change in winter temp): {model_1.coef_[0]:.3f}")
print(f"  Intercept:                             {model_1.intercept_:.3f}")
print()
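# For a winter 1°C colder, the predicted change is minus the slope.
# Printing it with its sign (instead of abs) shows the direction the model actually learned.
print(f"Interpretation: for every 1°C colder the winter is,")
print(f"the model predicts mortality changes by {-model_1.coef_[0]:+.2f} percentage points")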

The model learned a specific slope — how much mortality increases per degree of cold. Does that number feel intuitively reasonable based on the scatter plots you saw earlier? Is it larger or smaller than you expected?

Step 5: Evaluate the first model

Training a model is straightforward. The harder question is: how good is it?

We evaluate on the test set — the 20% of data the model has never seen.

We use MAE (Mean Absolute Error): on average, how many percentage points is the model’s prediction off from the real value? A lower MAE means the model is more accurate.
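
To show what that calculation looks like, here is MAE worked out by hand on three made-up pairs of values, then checked against scikit-learn's mean_absolute_error. The numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration (percent mortality)
actual = np.array([10.0, 20.0, 30.0])
predicted = np.array([12.0, 18.0, 33.0])

# Absolute errors: 2, 2 and 3 percentage points (the sign is dropped)
abs_errors = np.abs(actual - predicted)

# MAE is their mean: (2 + 2 + 3) / 3
mae_by_hand = abs_errors.mean()
mae_sklearn = mean_absolute_error(actual, predicted)

print(f"By hand:      {mae_by_hand:.2f}")   # 2.33
print(f"scikit-learn: {mae_sklearn:.2f}")   # 2.33
```

Because the sign is dropped, an overestimate of 3 points and an underestimate of 3 points count the same.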

y_pred_1 = model_1.predict(X_test)

mae_1 = mean_absolute_error(y_test, y_pred_1)

print(f"Model 1 — temperature only")
print(f"  MAE: {mae_1:.2f} percentage points average error")

The MAE tells you something concrete — on average, the model’s prediction is off by that many percentage points.

For context, in a year like 2001 when average national mortality was around 23%, an MAE of several percentage points represents a meaningful margin of error for a herder trying to decide whether to move their animals or seek insurance support.
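
Another way to put an MAE in context is to compare it with a baseline that ignores the inputs entirely and always predicts the average mortality from the training data. The sketch below is not part of the original notebook. It uses scikit-learn's DummyRegressor on synthetic stand-in data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the dzud dataset)
rng = np.random.default_rng(1)
temp = rng.uniform(-30, -10, size=(200, 1))
mort = -0.8 * temp[:, 0] - 8 + rng.normal(0, 2, size=200)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
  temp, mort, test_size=0.2, random_state=42
)

# Baseline: ignore the input and always predict the training-set mean
baseline = DummyRegressor(strategy="mean").fit(Xb_train, yb_train)
linear = LinearRegression().fit(Xb_train, yb_train)

mae_baseline = mean_absolute_error(yb_test, baseline.predict(Xb_test))
mae_linear = mean_absolute_error(yb_test, linear.predict(Xb_test))

print(f"Always-predict-the-mean baseline MAE: {mae_baseline:.2f}")
print(f"Linear regression MAE:                {mae_linear:.2f}")
```

To try this on the real data, you could swap in X_train, X_test, y_train and y_test from Step 3. A model is only earning its keep if it beats this baseline.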

Step 6: Train the second model with both temperature and drought

Now we add the summer drought index as a second input feature. The change to the code is minimal — we just add the second column to X. Scikit-learn handles the rest.

# Two input features this time
X2 = df[["winter_temp_c", "summer_drought_idx"]]

X2_train, X2_test, y2_train, y2_test = train_test_split(
  X2, y, test_size=0.2, random_state=42   # same random_state = same split
)

model_2 = LinearRegression()
model_2.fit(X2_train, y2_train)

print(f"Model 2 learned:")
print(f"  Coefficient for winter_temp_c:        {model_2.coef_[0]:.3f}")
print(f"  Coefficient for summer_drought_idx:   {model_2.coef_[1]:.3f}")
print(f"  Intercept:                            {model_2.intercept_:.3f}")
print()
print(f"Interpretation:")
# Signed values (instead of abs) show the direction each coefficient actually points
print(f"  Each 1°C colder  → mortality changes by {-model_2.coef_[0]:+.2f} pp")
print(f"  Each 0.1 increase in drought index → mortality changes by {model_2.coef_[1]*0.1:+.2f} pp")
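
One way to check your reading of the coefficients is to rebuild a prediction by hand: intercept, plus each coefficient times its input. The sketch below does this on synthetic stand-in data, so demo_model and its numbers are illustrative and are not the notebook's model_2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the dzud dataset)
rng = np.random.default_rng(2)
temp = rng.uniform(-30, -10, size=100)
drought = rng.uniform(0, 1, size=100)
mort = -0.8 * temp + 10 * drought - 10 + rng.normal(0, 1, size=100)

X_demo = np.column_stack([temp, drought])
demo_model = LinearRegression().fit(X_demo, mort)

# One hypothetical scenario: a -24°C winter after a drought index of 0.75
scenario = np.array([[-24.0, 0.75]])

# prediction = intercept + coef_temp * temp + coef_drought * drought
by_hand = demo_model.intercept_ + demo_model.coef_ @ scenario[0]
by_model = demo_model.predict(scenario)[0]

print(f"By hand:   {by_hand:.3f}")
print(f"predict(): {by_model:.3f}")
```

The two numbers match because predict does nothing more than this arithmetic.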

Step 7: Evaluate the second model

Let’s measure whether adding drought genuinely improves the model.

y_pred_2 = model_2.predict(X2_test)

mae_2 = mean_absolute_error(y2_test, y_pred_2)

print(f"Model 1 — temperature only")
print(f"  MAE: {mae_1:.2f} percentage points average error")
print()
print(f"Model 2 — temperature + drought")
print(f"  MAE: {mae_2:.2f} percentage points average error")
print()
mae_improvement = mae_1 - mae_2
print(f"Adding drought index reduced the average error by {mae_improvement:.2f} percentage points")

The MAE should decrease when drought is added — meaning the model’s predictions are closer to the real values on average. A lower MAE means a more useful model.
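
A single 80/20 split leaves a fairly small test set, so part of the gap between two MAE values can come down to which rows happened to land in it. One common check, which is not part of the original notebook, is k-fold cross-validation: every row takes a turn in the test set and the errors are averaged. The sketch below uses synthetic stand-in data, and cv_mae is a helper name made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the dzud dataset)
rng = np.random.default_rng(3)
temp = rng.uniform(-30, -10, size=200)
drought = rng.uniform(0, 1, size=200)
mort = -0.8 * temp + 10 * drought - 10 + rng.normal(0, 2, size=200)

X_one = temp.reshape(-1, 1)                 # temperature only
X_two = np.column_stack([temp, drought])    # temperature + drought

def cv_mae(X, y):
  """Average MAE over 5 folds. scikit-learn reports it negated, so flip the sign."""
  scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error"
  )
  return -scores.mean()

mae_one = cv_mae(X_one, mort)
mae_two = cv_mae(X_two, mort)

print(f"Temperature only:      {mae_one:.2f}")
print(f"Temperature + drought: {mae_two:.2f}")
```

If the two-feature model still comes out ahead when averaged over several splits, the improvement is less likely to be a lucky accident of one split.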

Was the improvement from adding drought larger or smaller than you expected? Does it change your view of how important the summer drought variable is compared to winter temperature?

Step 8: Make a prediction

The model is now a tool. Given a forecast of winter temperature and summer drought, it can output a predicted mortality percentage — before the winter even arrives.

Let’s use it to answer a concrete question: what does the model predict for three different scenarios a Mongolian herder might face?

scenarios = pd.DataFrame({
  "scenario": [
    "Mild winter, no drought",
    "Cold winter, no drought",
    "Cold winter, severe drought",
  ],
  "winter_temp_c":      [-16.0, -24.0, -24.0],
  "summer_drought_idx": [ 0.10,  0.15,  0.75],
})

scenarios["predicted_mortality_pct"] = model_2.predict(
  scenarios[["winter_temp_c", "summer_drought_idx"]]
).round(1)

print(scenarios[["scenario", "winter_temp_c", "summer_drought_idx",
         "predicted_mortality_pct"]].to_string(index=False))

Look at the difference between the second and third scenarios — the same cold winter, but one is preceded by drought and one is not.

Now try building your own scenario!

my_winter_temp = -20 #@param {type:"slider", min:-30, max:-10, step:1}
my_drought = 0.4 #@param {type:"slider", min:0.0, max:1.0, step:0.05}
herd_size = 2000 #@param {type:"slider", min:500, max:5000, step:500}

my_scenario = pd.DataFrame({
  "winter_temp_c":      [my_winter_temp],
  "summer_drought_idx": [my_drought]
})

prediction = model_2.predict(my_scenario)[0]
print(f"Winter temperature:   {my_winter_temp}°C")
print(f"Summer drought index: {my_drought}")
print(f"Predicted mortality:  {prediction:.1f}%")
print()

predicted_losses = int(herd_size * prediction / 100)
print(f"For a herder with {herd_size:,} animals,")
print(f"that is approximately {predicted_losses:,} animals lost.")
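
One thing to keep in mind while you experiment: a straight-line model does not know that mortality must sit between 0% and 100%, so for unusually mild or extreme inputs it can return impossible values such as a negative percentage. A minimal sketch of one way to guard against that, using np.clip on made-up raw outputs (not real predictions from the model):

```python
import numpy as np

# Hypothetical raw model outputs, in percent (not real predictions)
raw_predictions = np.array([-3.2, 12.5, 104.0])

# Mortality cannot be below 0% or above 100%
clipped = np.clip(raw_predictions, 0.0, 100.0)

herd_size = 2000
losses = (herd_size * clipped / 100).astype(int)

for raw, safe, lost in zip(raw_predictions, clipped, losses):
  print(f"raw {raw:6.1f}%  ->  clipped {safe:5.1f}%  ->  {lost:,} animals")
```

Clipping only hides the symptom. A prediction that lands outside the valid range is also a hint that the inputs are far from anything the model saw during training.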

Check Your Understanding

{ "question_type": "multiple_choice", "question": "Why do we keep a test set separate from the training data?", "options": [ { "key": "a", "text": "To make the model train faster" }, { "key": "b", "text": "To evaluate how well the model generalizes to data it has never seen" }, { "key": "c", "text": "To reduce the total amount of data needed" }, { "key": "d", "text": "To improve the slope calculation" } ], "answer": "b", "submitted_answer": "" }

{ "question_type": "multiple_choice", "question": "Model 1 has an MAE of 4.2 and Model 2 has an MAE of 3.1. Which model is better?", "options": [ { "key": "a", "text": "Model 1, because a higher MAE means it learned more" }, { "key": "b", "text": "Model 2, because a lower MAE means smaller prediction errors" }, { "key": "c", "text": "They are the same; the difference is too small to matter" }, { "key": "d", "text": "Model 1, because it uses fewer variables" } ], "answer": "b", "submitted_answer": "" }

{ "question_type": "true_false", "question": "A model with a lower MAE is making predictions that are, on average, closer to the actual values.", "answer": "True", "submitted_answer": "" }

{ "question_type": "multiple_choice", "question": "Why did Model 2 (temperature + drought) outperform Model 1 (temperature only)?", "options": [ { "key": "a", "text": "It was trained on more rows of data" }, { "key": "b", "text": "Drought adds independent information that temperature alone does not capture" }, { "key": "c", "text": "Temperature was given less weight in Model 2" }, { "key": "d", "text": "Model 2 trained for a longer time" } ], "answer": "b", "submitted_answer": "" }