CodeSphere - Expand Your Coding Knowledge

Lesson 2: Data Collection and Cleaning

Objectives:

Understand different methods of data collection.
Learn how to clean and preprocess data for analysis.
Work with real-world datasets using Python.

1. Methods of Data Collection

Data can be collected from various sources, including:

Web Scraping: Extracting data from websites.
APIs: Using Application Programming Interfaces to retrieve data from online services.
Databases: Querying data from SQL or NoSQL databases.
CSV/Excel Files: Importing data from file formats commonly used for data storage.

Code Example: Web Scraping with BeautifulSoup

import requests
from bs4 import BeautifulSoup

# Fetching a web page
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')

# Extracting titles from the page
titles = [title.text for title in soup.find_all('h1')]
print(titles)

2. Data Cleaning

Data cleaning is crucial for ensuring that your data is accurate, consistent, and usable. Common tasks include:

Handling Missing Values: Filling in, interpolating, or removing missing data.
Removing Duplicates: Identifying and removing duplicate entries.
Data Transformation: Converting data types and normalizing values.

Code Example: Data Cleaning with Pandas

import pandas as pd

# Sample DataFrame with missing values and duplicates
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'Age': [24, 27, None, 24]}
df = pd.DataFrame(data)

# Handling missing values
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Removing duplicates
df.drop_duplicates(inplace=True)

print(df)

3. Data Preprocessing Techniques

Normalization: Scaling data to a range (e.g., 0 to 1).
Encoding Categorical Variables: Converting categorical data into numerical format using techniques like one-hot encoding.
Feature Engineering: Creating new features or modifying existing ones to improve model performance.

Code Example: Encoding Categorical Variables

from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame with categorical data
data = {'Color': ['Red', 'Green', 'Blue']}
df = pd.DataFrame(data)

# One-hot encoding
encoder = OneHotEncoder(sparse=False)
encoded_data = encoder.fit_transform(df[['Color']])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Color']))
print(encoded_df)

4. Working with Real-World Datasets

Practice loading and exploring datasets from real-world sources:

Kaggle Datasets: Kaggle offers a variety of datasets for practice.
UCI Machine Learning Repository: UCI provides numerous datasets for machine learning research.

Code Example: Loading a CSV File

# Loading a CSV file
df = pd.read_csv('data.csv')

# Displaying the first few rows
print(df.head())

# Basic data exploration
print(df.info())
print(df.describe())

5. Key Takeaways

Data collection methods vary based on the source and format of data.
Effective data cleaning and preprocessing are critical steps before analysis or modeling.
Familiarize yourself with tools and techniques to handle real-world data challenges.

6. Homework/Practice:

Practice web scraping or use an API to collect a dataset.
Clean and preprocess a dataset of your choice.
Explore and visualize the dataset using Pandas and Matplotlib.

Data Science Fundamentals