Feature Engineering

The Feature_Engineering_Solution.ipynb notebook applies feature engineering techniques, combining domain knowledge with data manipulation to improve predictive modeling of real estate prices.

  • Mounting Google Drive in Google Colab:
    • Enables seamless access to files stored in Google Drive for data loading.
  • Importing Required Libraries:
    • Pandas: For DataFrame manipulation.
    • Matplotlib: For data visualization.
    • Inline plotting: Enabled so that figures render within the notebook.
    • Seaborn: For enhanced visualizations.
  • Importing the Cleaned Dataset (cleaned__df.csv):
    • Load the dataset into a Pandas DataFrame.
  • Exploring the Dataset:
    • Display the first two records using the .head(2) function.
    • Generate summary statistics with the .describe() function.
  • Quick EDA Hack:
    • Install the profiling library using !pip install ydata-profiling.
    • Import the ydata_profiling package to generate a Pandas Profiling Report, including:
      • Overview
      • Variables
      • Interactions
      • Correlations
      • Missing Values
      • Samples
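The setup and exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the in-memory CSV stands in for cleaned__df.csv, and every column name except year_built, year_sold, and property_type is an assumption. The Colab and profiling calls are left as comments because they only run inside Colab or after installing ydata-profiling.

```python
import io

import pandas as pd

# In Colab, Google Drive would be mounted first (only works inside Colab):
# from google.colab import drive
# drive.mount("/content/drive")

# Hypothetical stand-in for pd.read_csv("cleaned__df.csv").
# Column names "price", "beds", and "baths" are assumptions.
sample_csv = io.StringIO(
    "price,beds,baths,year_built,year_sold,property_type\n"
    "295000,2,2,2005,2012,Apartment\n"
    "410000,3,2,1998,2015,Single-Family\n"
    "180000,1,1,2010,2011,Apartment\n"
)
df = pd.read_csv(sample_csv)

# First two records and summary statistics for the numeric columns.
first_two = df.head(2)
summary = df.describe()
print(first_two)
print(summary)

# Optional profiling report (requires: pip install ydata-profiling):
# from ydata_profiling import ProfileReport
# ProfileReport(df, title="Profiling Report").to_notebook_iframe()
```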

I. Domain Knowledge:

  • Popular Properties – 2 Bedrooms and 2 Bathrooms:
    • Create an indicator variable df['popular'] for properties with 2 beds and 2 baths.
    • Check the number of properties with 2 baths and 2 beds using .value_counts().
  • Housing Market Recession – Lowest Housing Prices (2010–2013):
    • Create a new variable df['recession'] to identify properties sold during this period.
    • Check how many properties were sold during the recession using .value_counts().
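The two indicator variables above might be built along these lines. This is a sketch under assumptions: the column names beds and baths are hypothetical, the sale year is taken to live in year_sold, and the 2010 to 2013 range is treated as inclusive on both ends.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for demonstration.
df = pd.DataFrame(
    {
        "beds": [2, 3, 2, 4],
        "baths": [2, 2, 1, 3],
        "year_sold": [2009, 2011, 2013, 2015],
    }
)

# 1 if the property has exactly 2 beds and 2 baths, else 0.
df["popular"] = ((df["beds"] == 2) & (df["baths"] == 2)).astype(int)

# 1 if the property sold during 2010-2013 (between() is inclusive by default).
df["recession"] = df["year_sold"].between(2010, 2013).astype(int)

# Count how many rows fall into each group.
popular_counts = df["popular"].value_counts()
recession_counts = df["recession"].value_counts()
print(popular_counts)
print(recession_counts)
```

Storing the flags as 0/1 integers rather than booleans keeps them directly usable as numeric model inputs.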

II. Interaction Features:

  • Feature Engineering from Domain Knowledge:
    • Property Age:
      • Create a new feature, df['property_age'], by subtracting year_built from year_sold.
      • Perform a sanity check by running df.describe() to verify the statistics for property_age.
      • Identify observations where property_age < 0 using .value_counts().
      • Remove rows where property_age is less than 0 to clean the dataset.
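A minimal sketch of the property_age steps above, using made-up values. One row is deliberately given a sale year earlier than its build year so that the sanity check has something to catch.

```python
import pandas as pd

# Illustrative data; the second row sold before it was built.
df = pd.DataFrame(
    {
        "year_built": [2000, 2012, 1995],
        "year_sold": [2010, 2011, 2015],
    }
)

# Age of the property at the time of sale.
df["property_age"] = df["year_sold"] - df["year_built"]

# Sanity check: the minimum in describe() reveals negative ages.
print(df["property_age"].describe())

# Count rows with a negative age (True) versus valid rows (False).
negative_counts = (df["property_age"] < 0).value_counts()
print(negative_counts)

# Keep only rows with a non-negative age.
df = df[df["property_age"] >= 0].reset_index(drop=True)
```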

III. Dummy Variables:

  • Creating Dummy Variables:
    • Generate dummy variables for all categorical features using the pd.get_dummies() function.
    • Create dummy variables specifically for the property_type column.
    • Perform a final check using the df.info() function.
  • Saving the Dataset:
    • Save the processed dataset as final.csv using .to_csv().
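The encoding and saving steps above could look roughly like this. The category values and the price column are hypothetical, and passing index=False to .to_csv() is an assumption about how the file was written.

```python
import pandas as pd

# Illustrative data; category labels are assumptions.
df = pd.DataFrame(
    {
        "price": [295000, 410000, 180000],
        "property_type": ["Apartment", "Single-Family", "Apartment"],
    }
)

# One-hot encode property_type; the original column is replaced by
# one indicator column per category, prefixed with "property_type_".
df = pd.get_dummies(df, columns=["property_type"])

# Final check of column names, dtypes, and non-null counts.
df.info()

# Save the processed dataset.
df.to_csv("final.csv", index=False)
```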

By creating indicator variables, engineering an interaction feature, and encoding categorical variables, this project produces a clean, ready-to-train dataset saved as final.csv.