Exploring Data with Pandas

In this notebook you’ll learn how to work with structured data using Pandas, one of the most widely used tools in data science and AI.

What You’ll Learn

  • What a DataFrame is and how to create one
  • How to inspect data with head(), describe(), and shape
  • How to filter rows to find specific data
  • How to sort a DataFrame by any column
  • How to load a CSV file into Pandas

Part 1: Hello, Pandas!

Pandas is a Python library for working with structured data, such as spreadsheets or tables.

The key building block is a DataFrame. Think of a DataFrame as a table with rows and columns, just like a spreadsheet.

Let’s create a DataFrame using average monthly temperatures in New York City:

import pandas as pd

data = {
  'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
  'avg_temp_c': [0, 1, 6, 12, 18, 23, 26, 25, 21, 15, 9, 3]
}

df = pd.DataFrame(data)
df

What is a Series?

A Series is a single column of data, like one column pulled from your DataFrame.

You can access a column by name using square brackets []. Let’s look at just the temperature column:

temperatures = df['avg_temp_c']
print(temperatures)
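
To make the difference between a Series and a DataFrame concrete, here is a minimal sketch using only the first three months of the data above. The names `df_small`, `column`, and `table` are made up for this illustration.

```python
import pandas as pd

# First three months of the temperature data (names here are illustrative)
df_small = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'avg_temp_c': [0, 1, 6],
})

column = df_small['avg_temp_c']     # single brackets give a Series
table = df_small[['avg_temp_c']]    # double brackets give a one-column DataFrame

print(type(column).__name__)
print(type(table).__name__)

# A Series has its own handy methods, such as max()
print(column.max())
```

Single brackets with one column name return a Series, while a list of names inside the brackets returns a DataFrame, even when the list holds just one name.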

Exploring Your Data

Pandas has some handy methods for getting a quick look at your data:

  • df.head() — shows the first 5 rows
  • df.describe() — shows summary statistics like min, max, and average
  • df.shape — tells you how many rows and columns you have

Run each cell below to see what they do:

df.head()
df.describe()
print(f"This dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

What did you notice in the output of df.describe()? What was the highest average monthly temperature? What was the lowest?
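
As a rough sketch of how to read the summary, the example below uses a small made-up table of quiz scores (the names and numbers are hypothetical, not part of the weather data). It shows that describe() returns a table you can look values up in, and that idxmax() can tell you which row holds the largest value.

```python
import pandas as pd

# Hypothetical quiz scores, used only for illustration
scores = pd.DataFrame({
    'student': ['Ana', 'Ben', 'Cy', 'Dee'],
    'score': [72, 85, 90, 64],
})

# By default, describe() summarizes only the numeric columns
summary = scores.describe()
print(summary.index.tolist())        # the names of the statistics it reports
print(summary.loc['max', 'score'])   # highest score
print(summary.loc['min', 'score'])   # lowest score

# describe() reports the value, not the row it came from.
# idxmax() returns the index label of the row with the largest value.
top_row = scores['score'].idxmax()
print(scores.loc[top_row, 'student'])
```

The same two steps should work on the temperature data: read the max from describe(), then use idxmax() on the column to find the matching month.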

Part 2: Filtering Data

One of Pandas’ most useful features is filtering, which enables you to select only the rows that match a condition.

The syntax looks like this:

df[df['column_name'] > value]

Think of it as asking: “Give me all the rows where this column meets this condition.”

Let’s find all months where the average temperature is above 15°C:

warm_months = df[df['avg_temp_c'] > 15]
warm_months
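
It can help to see the filter in two steps. The sketch below uses the first six months of the data; the names `df_half`, `mask`, and `warm` are made up for this illustration. The inner comparison produces a Series of True/False values, and the outer square brackets keep only the rows marked True.

```python
import pandas as pd

# First six months of the temperature data (names here are illustrative)
df_half = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'avg_temp_c': [0, 1, 6, 12, 18, 23],
})

# Step 1: the comparison gives one True/False value per row
mask = df_half['avg_temp_c'] > 15
print(mask.tolist())

# Step 2: passing that mask to the DataFrame keeps only the True rows
warm = df_half[mask]
print(warm['month'].tolist())
```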

You can use any comparison operator:

  • > greater than
  • < less than
  • == equal to (note the double equals!)
  • >= greater than or equal to

# ✏️ YOUR TURN: find months where the temperature is below 5°C
# Try changing the number, or flip the > to < to find cold months

cold_months = df[df['avg_temp_c'] < 5]
cold_months

Part 3: Sorting Data

You can sort a DataFrame by any column using df.sort_values().

By default it sorts from smallest to largest. Add ascending=False to flip the order:

# Sort months from warmest to coldest
df.sort_values('avg_temp_c', ascending=False)

# YOUR TURN: sort from coldest to warmest
# Hint: change ascending=False to ascending=True

df.sort_values('avg_temp_c', ascending=False)
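
One detail worth seeing for yourself: sort_values() returns a new, sorted DataFrame and leaves the original in its existing order. The sketch below checks this with the first three months of the data; `df_small` and `sorted_df` are names made up for this illustration.

```python
import pandas as pd

# First three months of the temperature data (names here are illustrative)
df_small = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'avg_temp_c': [0, 1, 6],
})

# sort_values() hands back a sorted copy
sorted_df = df_small.sort_values('avg_temp_c', ascending=False)

print(sorted_df['month'].tolist())   # warmest first
print(df_small['month'].tolist())    # original order is unchanged

# To keep the sorted order, assign the result to a variable
df_small = df_small.sort_values('avg_temp_c', ascending=False)
```

Assigning the result back to a variable, as in the last line, is a common way to keep the sorted version.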

Part 4: Loading from a CSV

So far you’ve created DataFrames by typing the data directly into Python. In the real world, data usually lives in a file, most commonly a CSV (comma-separated values) file.

A CSV file looks like this:

month,avg_temp_c,precipitation_mm,sunny_days
Jan,0,94,8
Feb,1,88,9
...

Each row is a new line and values are separated by commas. Loading it into Pandas takes just one line:

df = pd.read_csv('nyc_weather.csv')
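
If you don’t have a CSV file handy, you can still try read_csv() on text. This sketch wraps the header and the two sample rows shown above in io.StringIO, which makes a string behave like a file. The names `csv_text` and `df_sample` are made up for this illustration, and the rest of the file’s rows are left out.

```python
import io
import pandas as pd

# Header plus the two sample rows from the CSV excerpt; other rows are omitted
csv_text = """month,avg_temp_c,precipitation_mm,sunny_days
Jan,0,94,8
Feb,1,88,9
"""

# io.StringIO lets read_csv() treat the string as if it were a file
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.columns.tolist())   # column names come from the first line
print(df_sample.shape)              # (rows, columns)
```

The first line of the text becomes the column names, and each following line becomes one row.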

Since we’re working in the browser, we’ll load it from the web, but the idea is exactly the same:

import httpx
import io

url = "https://raw.githubusercontent.com/simonguest/codercub/main/labs/02/notebooks/nyc_weather.csv"

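# Download the CSV text and stop with an error if the request failed
response = httpx.get(url)
response.raise_for_status()

# Wrap the downloaded text so read_csv() can treat it like a file
df_weather = pd.read_csv(io.StringIO(response.text))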
df_weather

# What new columns does the CSV have?
df_weather.describe()

# YOUR TURN: try filtering or sorting by one of the new columns
# For example: find months with more than 100mm of precipitation

wet_months = df_weather[df_weather['precipitation_mm'] > 100]
wet_months

What new columns does the CSV have that our original DataFrame didn’t? Try filtering or sorting by one of them — what did you find?

Check Your Understanding

Question 1 (multiple choice): What is a Pandas DataFrame?

  • a) A single column of data
  • b) A table of data with rows and columns
  • c) A type of chart
  • d) A Python function for reading files

Answer: b

Question 2 (multiple choice): Which Pandas method shows the first 5 rows of a DataFrame?

  • a) df.show()
  • b) df.top()
  • c) df.first()
  • d) df.head()

Answer: d

Question 3 (multiple choice): Which of the following correctly filters a DataFrame to rows where 'score' is greater than 80?

  • a) df.filter('score' > 80)
  • b) df['score' > 80]
  • c) df[df['score'] > 80]
  • d) df.where(score > 80)

Answer: c

Question 4 (true or false): Calling df.sort_values() permanently changes the order of the original DataFrame.

Answer: False