Machine learning combines computer science, statistics, and mathematics, using algorithms and models to make predictions and discover patterns in data.
Key Components:
| Component | Definition | Example |
|---|---|---|
| Model 🧮 | A mathematical function describing relationships between inputs and outputs | Linear regression equation: y = mx + b |
| Algorithm ⚙️ | A procedure or set of decision rules to carry out ML tasks | Decision tree rules, gradient descent |
| Dataset 📁 | Collection of information containing features and instances | Rideshare trip data with prices, distances |
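To make the difference between a model and an algorithm concrete, the sketch below treats the linear equation y = mx + b as the model and uses gradient descent as the algorithm that searches for m and b. The four data points, the learning rate, and the step count are illustrative assumptions and do not come from the rideshare data.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def predict(x, m, b):
    """The MODEL: a linear function mapping an input to an output."""
    return m * x + b


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """The ALGORITHM: repeatedly adjust m and b to reduce mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(x, m, b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


m, b = fit_gradient_descent(xs, ys)
print(f"Learned model: y = {m:.2f}x + {b:.2f}")
```

With these settings the learned values should land close to m = 2 and b = 1, the line the toy data was generated from. A different learning rate may need more steps or may fail to converge.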
┌─────────────────────────────────────────────────┐
│                    DATAFRAME                    │
├──────────┬──────────┬──────────┬──────────┬─────┤
│ Feature1 │ Feature2 │ Feature3 │ Feature4 │ ... │ ← FEATURES (Columns)
├──────────┼──────────┼──────────┼──────────┼─────┤
│ 1.30     │ Uber     │ 2018-... │ Theatre  │ ... │ ← INSTANCE (Row 1)
├──────────┼──────────┼──────────┼──────────┼─────┤
│ 1.35     │ Lyft     │ 2018-... │ South    │ ... │ ← INSTANCE (Row 2)
├──────────┼──────────┼──────────┼──────────┼─────┤
│ 1.10     │ Lyft     │ 2018-... │Financial │ ... │ ← INSTANCE (Row 3)
└──────────┴──────────┴──────────┴──────────┴─────┘
| Term | Definition | Example |
|---|---|---|
| Instance 📍 | Individual data point or observational unit (ROW) | Each rideshare trip |
| Feature 🏷️ | Characteristic measured on an instance (COLUMN) | Distance, price, cab_type |
| Dataset 📦 | Collection of instances and features | Complete rideshare data table |
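The row/column vocabulary can be checked directly in pandas. This minimal sketch builds a tiny DataFrame from three trips (values borrowed from the rideshare example, with only three of its features kept for brevity) and reads off the number of instances and features.

```python
import pandas as pd

# Each dictionary is one instance (a trip); each key is a feature.
trips = [
    {"distance": 1.30, "cab_type": "Uber", "price": 17.5},
    {"distance": 1.35, "cab_type": "Lyft", "price": 7.0},
    {"distance": 1.10, "cab_type": "Lyft", "price": 13.5},
]
rides = pd.DataFrame(trips)

# shape is (number of rows, number of columns)
n_instances, n_features = rides.shape
print(f"{n_instances} instances, {n_features} features")
print("Features:", list(rides.columns))

# One instance is one row
print("First instance:")
print(rides.iloc[0])
```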
Common Import Functions:
| Function | Purpose | File Type |
|---|---|---|
| pd.read_csv() | Import CSV files | .csv |
| pd.read_excel() | Import Excel files | .xlsx, .xls |
| pd.read_json() | Import JSON files | .json |
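The import functions share the same calling pattern: pass a path or a file-like object and get a DataFrame back. The sketch below uses in-memory text in place of real files so that it runs anywhere. The file contents and the name `rides.xlsx` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file contents, held in memory so no files are needed
csv_text = "distance,cab_type,price\n1.30,Uber,17.5\n1.35,Lyft,7.0\n"
json_text = (
    '[{"distance": 1.30, "cab_type": "Uber", "price": 17.5},'
    ' {"distance": 1.35, "cab_type": "Lyft", "price": 7.0}]'
)

# Both readers accept a path or a file-like object
csv_rides = pd.read_csv(io.StringIO(csv_text))
json_rides = pd.read_json(io.StringIO(json_text), orient="records")

print(csv_rides)
print(json_rides)

# pd.read_excel follows the same pattern, e.g. pd.read_excel("rides.xlsx"),
# but it needs an Excel engine such as openpyxl installed, so it is not run here.
```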
| Function/Method | Purpose | Syntax Example | Returns |
|---|---|---|---|
| pd.read_csv() | Load CSV file | pd.read_csv('file.csv') | DataFrame |
| dataframe[['col']] | Select column(s) | df[['distance']] | DataFrame |
| dataframe.iloc[x, y] | Select by position | df.iloc[0, 1] | Single element (scalar) |
| dataframe.head() | Show first rows | df.head() | DataFrame (first 5 rows) |
| : (slice notation) | Define a range (end index excluded) | df.iloc[:5, 1:3] | DataFrame subset |
# Import necessary library
import pandas as pd
# 📥 Load the rideshare dataset
rides = pd.read_csv('rideshare_data.csv')
# 👀 Display first 5 rows
print("First 5 rows of data:")
print(rides.head())
# 🎯 Select specific features (columns)
distance_data = rides[['distance']] # Returns DataFrame
print("\nDistance column:")
print(distance_data.head())
# 🔍 Select multiple features
selected_features = rides[['distance', 'price', 'destination']]
print("\nSelected features:")
print(selected_features.head())
# 📍 Access a specific element (row 0, column 1)
element = rides.iloc[0, 1]
print(f"\nElement at position [0, 1]: {element}")
# 📊 Slice data (first 5 rows; columns 1 and 2, because the end index 3 is excluded)
subset = rides.iloc[:5, 1:3]
print("\nSubset of data:")
print(subset)
# 📈 Get basic information
print("\nDataset shape:", rides.shape) # (rows, columns)
print("Column names:", rides.columns.tolist())
Output:
First 5 rows of data:
   distance cab_type           time_stamp         destination  price  surge_multiplier
0      1.30     Uber  2018-12-01 13:08:04    Theatre District   17.5               1.0
1      1.35     Lyft  2018-11-29 12:22:57       South Station    7.0               1.0
2      1.10     Lyft  2018-12-18 09:15:09  Financial District   13.5               1.0
3      1.51     Lyft  2018-11-28 10:11:07       South Station   27.5               1.5
4      0.63     Uber  2018-11-26 20:08:09  Financial District    4.5               1.0

Distance column:
   distance
0      1.30
1      1.35
2      1.10
3      1.51
4      0.63

Dataset shape: (1000, 6)
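The selection methods differ mainly in what they return. As a rough sketch, the snippet below rebuilds three trips by hand (so it does not depend on `rideshare_data.csv`) and checks the type of each result. The shortened column set is an assumption made for brevity.

```python
import pandas as pd

rides = pd.DataFrame({
    "distance": [1.30, 1.35, 1.10],
    "cab_type": ["Uber", "Lyft", "Lyft"],
    "destination": ["Theatre District", "South Station", "Financial District"],
    "price": [17.5, 7.0, 13.5],
})

one_col_series = rides["distance"]    # single brackets -> Series
one_col_frame = rides[["distance"]]   # double brackets -> DataFrame
element = rides.iloc[0, 1]            # row 0, column 1 -> single value
subset = rides.iloc[:2, 1:3]          # rows 0-1, columns 1-2 (end index excluded)

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(element)
print(list(subset.columns))
```

Double brackets are worth the extra typing when a later step expects a two-dimensional table rather than a one-dimensional Series.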
┌─────────────────────────────────────────────────────┐
│               MACHINE LEARNING MODEL                │
│                                                     │
│  INPUT FEATURES          MODEL          OUTPUT      │
│  (Explanatory) ────────► [🤖] ────────► (Target)    │
│                                                     │
│  • Distance                             • Price     │
│  • Time                                             │
│  • Location                                         │
│  • Vehicle Type                                     │
└─────────────────────────────────────────────────────┘
| Feature Type | Alternative Names | Role | Example |
|---|---|---|---|
| Input Features ⬅️ | Explanatory features, Predictors, X | Used to make predictions | Distance, time, location |
| Output Feature ➡️ | Target feature, Response, Y | What we want to predict | Price of rideshare |
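When there are several input features, one common way to separate inputs from the output is to drop the target column. The sketch below assumes a small hand-built table with `price` as the target. The variable name `target` is illustrative.

```python
import pandas as pd

rides = pd.DataFrame({
    "distance": [1.30, 1.35, 1.10],
    "cab_type": ["Uber", "Lyft", "Lyft"],
    "destination": ["Theatre District", "South Station", "Financial District"],
    "price": [17.5, 7.0, 13.5],
})

target = "price"

# X holds every column except the target; y holds only the target
X = rides.drop(columns=[target])
y = rides[target]

print("Input features (X):", list(X.columns))
print("Output feature (y):", y.name)
```

Here `y` is a Series because single brackets are used. Selecting with `rides[[target]]` would give a one-column DataFrame instead.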
Task: Predict rideshare price based on distance
import pandas as pd
import matplotlib.pyplot as plt
# Sample rideshare data
data = {
    'distance': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    'price': [8, 12, 15, 18, 22, 25, 28, 32, 35, 40]
}
rides = pd.DataFrame(data)
# 🎯 Define Input and Output
X = rides[['distance']] # INPUT: Distance
y = rides[['price']] # OUTPUT: Price
print("Input Features (X):")
print(X.head())
print("\nOutput Feature (y):")
print(y.head())
# 📊 Visualize relationship
plt.figure(figsize=(10, 6))
plt.scatter(rides['distance'], rides['price'], color='blue', s=100, alpha=0.6)
plt.xlabel('Distance (miles)', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Rideshare Price vs Distance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
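The scatter plot suggests a roughly linear relationship, so one simple way to approach the prediction task is a straight-line fit. The sketch below uses `numpy.polyfit` with degree 1 (ordinary least squares) on the same ten sample points. This is one possible approach chosen for illustration, and the helper `predict_price` is a hypothetical name.

```python
import numpy as np

# Same sample data as above
distance = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
price = np.array([8, 12, 15, 18, 22, 25, 28, 32, 35, 40])

# Fit price ≈ slope * distance + intercept by least squares
slope, intercept = np.polyfit(distance, price, deg=1)


def predict_price(miles):
    """Predict a price from a distance using the fitted line."""
    return slope * miles + intercept


print(f"price ≈ {slope:.2f} * distance + {intercept:.2f}")
print(f"Predicted price for 3.0 miles: {predict_price(3.0):.2f}")
```

For this sample the fitted slope is about 6.87 dollars per mile with an intercept of about 4.60 dollars. Those numbers describe only these ten made-up points, not real rideshare pricing.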