Class Notes: Data Handling Using Pandas – I
What is Pandas?
Pandas is a super useful Python library for working with data—like a supercharged Excel for coders. It helps you:
- Organize data neatly (like tables).
- Clean messy data (handling missing values, duplicates, etc.).
- Slice, filter, and analyze data quickly.
Pandas Data Structures
Pandas has two main data containers:
- Series: A single column of data with labels (like a list with a name tag).
- DataFrame: A full table with rows and columns (like an Excel sheet).
1. Series
- What? A 1D labeled array (e.g., [10, 15, 18] with index labels [0, 1, 2]).
- Features:
  - Values can be changed in place (mutable). A Series is usually described as size-immutable, though assigning to a new label still adds an element.
  - Index labels make data easy to access.
How to Create a Series:
    import pandas as pd

    data = [10, 15, 18, 22]
    s = pd.Series(data, index=['a', 'b', 'c', 'd'])
    print(s)
Output:
    a    10
    b    15
    c    18
    d    22
    dtype: int64
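A minimal sketch of label-based access and value mutability, using the same Series as above. The extra label 'e' is an assumption added for illustration; it shows that assigning to a new label grows the Series, so the "size can't change" rule is looser in practice than it sounds.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22], index=['a', 'b', 'c', 'd'])

# Access a value by its index label.
print(s['b'])          # 15

# Values are mutable: overwrite an existing label in place.
s['b'] = 16

# Assigning to a label that does not exist yet adds a new element
# ('e' is a made-up label for this example).
s['e'] = 30

print(len(s))          # 5
print(list(s.index))   # ['a', 'b', 'c', 'd', 'e']
```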
Cool Tricks with Series:
- Math ops: s * 2 (multiplies all values by 2).
- Filtering: s[s > 2] (shows values > 2).
- Head/Tail: s.head(3) (first 3 rows) or s.tail(2) (last 2 rows).
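The tricks above can be tried together in one short script. This is only a sketch on the sample Series from these notes; the threshold 15 is chosen here (instead of 2) so that the filter actually drops some values.

```python
import pandas as pd

s = pd.Series([10, 15, 18, 22], index=['a', 'b', 'c', 'd'])

doubled = s * 2      # element-wise multiplication
mask = s > 15        # boolean Series: False, False, True, True
large = s[mask]      # keeps only the rows where the mask is True

print(doubled.tolist())      # [20, 30, 36, 44]
print(large.tolist())        # [18, 22]
print(s.head(3).tolist())    # [10, 15, 18]
print(s.tail(2).tolist())    # [18, 22]
```

Note that s * 2 and s[mask] return new Series; s itself is unchanged.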
2. DataFrame
- What? A 2D table (rows + columns).
- Features:
  - Columns can hold different data types (numbers, text, etc.).
  - Size and data can be changed (mutable).
How to Create a DataFrame:
    import pandas as pd

    data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
    df = pd.DataFrame(data)
    print(df)
Output:
        Name  Age
    0  Alice   25
    1    Bob   30
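To see the "different data types per column" feature in practice, a quick sketch is to inspect the shape, column names, and dtypes of the same DataFrame. The exact dtype names printed depend on the pandas version, so no output is shown for that line.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(df.shape)          # (2, 2): 2 rows, 2 columns
print(list(df.columns))  # ['Name', 'Age']
print(df.dtypes)         # one dtype per column: text for Name, integer for Age
```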
Working with DataFrames:
- Add a column: df['Salary'] = [5000, 6000]
- Delete a column: del df['Age'] (changes df in place) or df = df.drop('Age', axis=1) (drop returns a new DataFrame, so assign the result).
- Select data:
  - Single column: df['Name']
  - Multiple columns: df[['Name', 'Salary']]
  - Rows: df.loc[0:2] (by label, end label included) or df.iloc[0:2] (by position, end position excluded)
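The operations in this list can be sketched end to end as follows. The third row ('Cara') and her values are made up for illustration, so that the difference between loc and iloc slicing is visible.

```python
import pandas as pd

# 'Cara' is a hypothetical extra row, not part of the original notes.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cara'], 'Age': [25, 30, 35]})

df['Salary'] = [5000, 6000, 7000]   # add a column in place
no_age = df.drop('Age', axis=1)     # returns a new DataFrame; df still has 'Age'

print(list(df.columns))       # ['Name', 'Age', 'Salary']
print(list(no_age.columns))   # ['Name', 'Salary']

names = df['Name']                # single column -> Series
subset = df[['Name', 'Salary']]   # list of columns -> DataFrame

# With the default 0, 1, 2 index:
# loc slices by label and includes the end label,
# iloc slices by position and excludes the end position.
print(len(df.loc[0:1]))    # 2 (rows 0 and 1)
print(len(df.iloc[0:1]))   # 1 (row 0 only)
```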
DataFrame Operations
- Filtering: df[df['Age'] > 25] (people older than 25).
- Math: df['Salary'].sum() (total salary).
- Merge/Join: Combine two DataFrames (like SQL joins).
    df1.merge(df2, on='ID', how='inner')  # Keeps matching rows only.
Reading/Writing CSV Files
- Read CSV: data = pd.read_csv('file.csv')
- Save to CSV: df.to_csv('new_file.csv') (add index=False to leave out the row index column)
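A round trip makes the two calls easier to check. This sketch writes to a temporary folder (an assumption made here so the example does not leave files behind) and reads the file straight back.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'new_file.csv')

    df.to_csv(path, index=False)   # index=False: do not write the 0, 1 row labels
    loaded = pd.read_csv(path)     # the first line of the file becomes the column names

print(list(loaded.columns))    # ['Name', 'Age']
print(loaded['Age'].tolist())  # [25, 30]
```

Without index=False, the saved file gets an extra unnamed first column holding the row labels, which then shows up as a column when the file is read back.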
Key Takeaways
- Series = 1D labeled data (single column).
- DataFrame = 2D table (rows + columns).
- Use loc/iloc to access data.
- Pandas is your best friend for data cleaning, analysis, and quick lookups!