Feature Engineering
This project, implemented in the `Feature_Engineering_Solution.ipynb` notebook, applies feature engineering techniques that combine domain knowledge with data manipulation to improve predictive modeling of real estate prices.
- Mounting Google Drive in Google Colab:
- Enables seamless access to files stored in Google Drive for data loading.
- Importing Required Libraries:
- Pandas: For DataFrame manipulation.
- Matplotlib: For data visualization.
- Enable inline plotting within the notebook.
- Seaborn: For enhanced visualizations.
- Importing the Cleaned Dataset (`cleaned__df.csv`):
- Load the dataset into a Pandas DataFrame.
- Exploring the Dataset:
- Display the first two records using the `.head(2)` function.
- Generate summary statistics with the `.describe()` function.
- Quick EDA Hack:
- Install the profiling library using `!pip install ydata-profiling`.
- Import the `ydata_profiling` package to generate a Pandas Profiling Report, including:
- Overview
- Variables
- Interactions
- Correlations
- Missing Values
- Samples
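The loading, exploration, and profiling steps above might look like the following minimal sketch. The inline CSV is a toy stand-in for `cleaned__df.csv`, the column names `tx_price`, `beds`, `baths`, and `property_type` are assumptions, and `build_profile_report` is a hypothetical helper name. In the notebook, `pd.read_csv` would take the Drive path to the file instead.

```python
import io

import pandas as pd

# Toy stand-in for cleaned__df.csv; values and most column names are assumptions.
csv_text = """tx_price,beds,baths,year_built,year_sold,property_type
295850,1,1,2013,2013,Apartment
216500,1,1,1965,2006,Apartment
279900,2,2,1963,2012,Single-Family
"""

# Load the data into a DataFrame (read_csv accepts a path or a file-like object).
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)   # first two records
summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns

print(first_two)
print(summary)


def build_profile_report(df: pd.DataFrame, title: str = "Profiling Report"):
    """Return a ydata-profiling report object for df (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the package.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title)


# Usage in a notebook, after `!pip install ydata-profiling`:
# report = build_profile_report(df)
# report.to_notebook_iframe()  # render the report inline
```

By default `describe()` summarizes only the numeric columns of a mixed-type DataFrame, so a text column such as `property_type` does not appear in the summary.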
I. Domain Knowledge:
- Popular Properties – 2 Bedrooms and 2 Bathrooms:
- Create an indicator variable `df['popular']` for properties with 2 beds and 2 baths.
- Check the number of properties with 2 baths and 2 beds using `.value_counts()`.
- Housing Market Recession – Lowest Housing Prices (2010–2013):
- Create a new variable `df['recession']` to identify properties sold during this period.
- Check how many properties were sold during the recession using `.value_counts()`.
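The two indicator variables could be built roughly as follows. This is a sketch on toy data, not the notebook's exact code: `beds` and `baths` are assumed column names, and the recession window is taken as 2010 through 2013 inclusive.

```python
import pandas as pd

# Toy data; `beds` and `baths` are assumed column names.
df = pd.DataFrame(
    {
        "beds": [2, 3, 2, 1, 2],
        "baths": [2, 2, 1, 1, 2],
        "year_sold": [2008, 2010, 2012, 2013, 2014],
    }
)

# 1 when the property has exactly 2 beds and 2 baths, otherwise 0.
df["popular"] = ((df["beds"] == 2) & (df["baths"] == 2)).astype(int)

# 1 when the sale year falls in 2010-2013; between() includes both endpoints.
df["recession"] = df["year_sold"].between(2010, 2013).astype(int)

popular_counts = df["popular"].value_counts()
recession_counts = df["recession"].value_counts()

print(popular_counts)
print(recession_counts)
```

Casting the boolean mask to `int` stores the flags as 0/1, which most modeling libraries accept directly.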
II. Interaction Features:
- Feature Engineering from Domain Knowledge:
- Property Age:
- Create a new feature, `df['property_age']`, by subtracting `year_built` from `year_sold`.
- Perform a sanity check by running `df.describe()` to verify the statistics for `property_age`.
- Identify observations where `property_age` < 0 using `.value_counts()`.
- Remove rows where `property_age` is less than 0 to clean the dataset.
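A minimal sketch of the property-age steps on toy data (the values are illustrative only). A negative age means the recorded sale year precedes the build year, and those rows are dropped.

```python
import pandas as pd

# Toy data; the third row is sold before it is built.
df = pd.DataFrame(
    {
        "year_built": [1990, 2005, 2015, 2010],
        "year_sold": [2010, 2005, 2013, 2016],
    }
)

# Age of the property at the time of sale.
df["property_age"] = df["year_sold"] - df["year_built"]

# Sanity check: a negative minimum signals inconsistent rows.
print(df["property_age"].describe())

# Count rows with a negative age (True) versus a valid age (False).
negative_age = (df["property_age"] < 0).value_counts()
print(negative_age)

# Keep only rows with a non-negative age.
df = df[df["property_age"] >= 0]
```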
III. Dummy Variables:
- Creating Dummy Variables:
- Generate dummy variables for all categorical features using the `pd.get_dummies()` function.
- Create dummy variables specifically for the `property_type` column.
- Perform a final check using the `df.info()` function.
- Saving the Dataset:
- Save the processed dataset as `final.csv` using `.to_csv()`.
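The encoding and saving steps can be sketched as below, using toy data with assumed category values. The file is written to a temporary directory so the example leaves nothing behind, and `index=False` is an assumption about intent: it keeps the row index out of the saved file.

```python
import os
import tempfile

import pandas as pd

# Toy data; the category values are illustrative.
df = pd.DataFrame(
    {
        "beds": [1, 2, 3],
        "property_type": ["Apartment", "Single-Family", "Apartment"],
    }
)

# One indicator column per category; the original column is dropped.
encoded = pd.get_dummies(df, columns=["property_type"])

# Final check of column names, dtypes, and non-null counts.
encoded.info()

# Save the result as final.csv (here inside a temporary directory).
out_path = os.path.join(tempfile.mkdtemp(), "final.csv")
encoded.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```

Calling `pd.get_dummies(df)` without `columns` encodes every object, string, or categorical column at once, which corresponds to the "all categorical features" step.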
By creating indicator variables, engineering interaction features, and encoding categorical variables, this project produces a clean, ready-to-train dataset for machine learning, saved as `final.csv`.