Feature Engineering

This project, contained in the Feature_Engineering_Solution.ipynb notebook, applies feature engineering techniques, combining domain knowledge with data manipulation to improve predictive modeling of real estate prices.

Mounting Google Drive in Google Colab:
- Enables seamless access to files stored in Google Drive for data loading.
Importing Required Libraries:
- Pandas: For DataFrame manipulation.
- Matplotlib: For data visualization, with inline plotting enabled within the notebook.
- Seaborn: For enhanced visualizations.
Importing the Cleaned Dataset (cleaned__df.csv):
- Load the dataset into a Pandas DataFrame.
Exploring the Dataset:
- Display the first two records using the .head(2) function.
- Generate summary statistics with the .describe() function.
Quick EDA Hack:
- Install the profiling library using !pip install ydata-profiling.
- Import the ydata_profiling package to generate a Pandas Profiling Report, including:
  - Overview
  - Variables
  - Interactions
  - Correlations
  - Missing Values
  - Samples
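
The loading and exploration steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the in-memory CSV text stands in for cleaned__df.csv, the columns price, beds, and baths are assumed names, and build_profile_report is a hypothetical helper that is defined but not called here, since it requires ydata-profiling to be installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for cleaned__df.csv; in Colab the real file
# would be read from a path on the mounted Google Drive.
csv_text = """price,beds,baths,year_built,year_sold,property_type
250000,2,2,1995,2011,Condo
410000,4,3,2003,2015,House
180000,1,1,1980,2012,Condo
"""
df = pd.read_csv(io.StringIO(csv_text))

# First two records and summary statistics for the numeric columns.
first_two = df.head(2)
summary = df.describe()
print(first_two)
print(summary)


def build_profile_report(frame):
    """Build a profiling report (assumes ydata-profiling is installed)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from ydata_profiling import ProfileReport

    return ProfileReport(frame, title="Real Estate EDA")
```

In a notebook, the returned report can be displayed with its to_notebook_iframe() method.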

I. Domain Knowledge:

Popular Properties – 2 Bedrooms and 2 Bathrooms:
- Create an indicator variable df['popular'] for properties with 2 beds and 2 baths.
- Check the number of properties with 2 baths and 2 beds using .value_counts().
Housing Market Recession – Lowest Housing Prices (2010–2013):
- Create a new variable df['recession'] to identify properties sold during this period.
- Check how many properties were sold during the recession using .value_counts().
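
A minimal sketch of the two indicator variables described above, on a tiny made-up DataFrame. The column names beds and baths are assumptions, and using year_sold as the sale-year column for the recession flag is also an assumption; the 2010 to 2013 range is treated as inclusive.

```python
import pandas as pd

# Made-up sample rows; column names beds and baths are assumed.
df = pd.DataFrame({
    "beds": [2, 3, 2, 4],
    "baths": [2, 2, 1, 3],
    "year_sold": [2011, 2015, 2010, 2013],
})

# 1 if the property has exactly 2 beds and 2 baths, else 0.
df["popular"] = ((df["beds"] == 2) & (df["baths"] == 2)).astype(int)

# 1 if the property was sold from 2010 through 2013 (inclusive), else 0.
df["recession"] = df["year_sold"].between(2010, 2013).astype(int)

# Count how many rows fall into each group.
popular_counts = df["popular"].value_counts()
recession_counts = df["recession"].value_counts()
print(popular_counts)
print(recession_counts)
```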

II. Interaction Features:

Feature Engineering from Domain Knowledge:
- Property Age:
  - Create a new feature, df['property_age'], by subtracting year_built from year_sold.
  - Perform a sanity check by running df.describe() to verify the statistics for property_age.
  - Count observations where property_age < 0 by applying .value_counts() to that condition.
  - Remove rows where property_age is less than 0 to clean the dataset.
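
The property age steps might look like the sketch below. The sample values are invented to include one row that was sold before it was built, so the sanity check has something to catch; resetting the index after filtering is an optional choice, not something the notebook is known to do.

```python
import pandas as pd

# Made-up rows; the second was "sold" before it was built.
df = pd.DataFrame({
    "year_built": [1990, 2005, 2012],
    "year_sold": [2010, 2004, 2012],
})

# Age of the property at the time of sale.
df["property_age"] = df["year_sold"] - df["year_built"]

# Sanity check: the min in describe() reveals negative ages.
print(df["property_age"].describe())

# Count rows with a negative age, then drop them.
negative_age = df["property_age"] < 0
print(negative_age.value_counts())
n_negative = int(negative_age.sum())
df = df[~negative_age].reset_index(drop=True)
```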

III. Dummy Variables:

Creating Dummy Variables:
- Generate dummy variables for all categorical features using the pd.get_dummies() function.
- Create dummy variables specifically for the property_type column.
- Perform a final check using the df.info() function.
Saving the Dataset:
- Save the processed dataset as final.csv using .to_csv().
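
A minimal sketch of the encoding and saving steps, assuming a made-up price column alongside property_type. It writes final.csv to a temporary directory so it runs anywhere; in the notebook the path would presumably point to Google Drive. Passing index=False is an assumption made here to keep the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample; price is an assumed column name.
df = pd.DataFrame({
    "property_type": ["Condo", "House", "Condo"],
    "price": [200000, 350000, 210000],
})

# Replace property_type with one indicator column per category.
df = pd.get_dummies(df, columns=["property_type"])

# Final check of column names and dtypes.
df.info()

# Save the processed dataset (temporary directory used for illustration).
out_path = Path(tempfile.mkdtemp()) / "final.csv"
df.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
```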

By creating indicator variables, engineering interaction features, and encoding categorical variables, this project produces a clean, ready-to-train dataset for machine learning, saved as final.csv.