# Data Cleaning
In Data_Cleaning_Solution.ipynb, I focus on identifying and addressing issues in the dataset, such as handling missing values, detecting and removing outliers, and ensuring data consistency to prepare it for effective model building.
- Mounting Google Drive in Google Colab:
  - Access files stored in Google Drive to enable seamless data loading.
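In Colab this is a one-time setup cell. It only runs inside Colab itself, so it is shown here as environment configuration; the mount point `/content/drive` is Colab's standard location:

```python
# Colab-only setup: makes Google Drive contents visible under /content/drive.
from google.colab import drive

drive.mount('/content/drive')
```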
- Importing Libraries and Loading the Dataset:
  - Import the necessary Python libraries:
    - NumPy: for numerical computing.
    - Pandas: for data manipulation.
    - Matplotlib: for visualization.
    - Seaborn: for enhanced visualization.
  - Load the `real__estate.csv` file into a DataFrame.
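The loading step might look like the sketch below. The real notebook reads the CSV from Google Drive; here a tiny synthetic DataFrame (with the column names used later in this section) stands in so the snippet runs anywhere:

```python
import numpy as np
import pandas as pd

# In the notebook the data comes from Drive, e.g.:
# df = pd.read_csv('/content/drive/MyDrive/real__estate.csv')
# Synthetic stand-in with the columns referenced later in this README:
df = pd.DataFrame({
    'property_type': ['house', 'condo', 'house'],
    'beds': [3, 2, 4],
    'sqft': [1500, 900, 2200],
    'lot_size': [6000.0, np.nan, 8500.0],
    'basement': [1.0, np.nan, np.nan],
})
print(df.shape)  # (3, 5)
```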
- Displaying Basic Dataset Information:
  - Use `.head()`, `.tail()`, and `.sample()` to view subsets of the data.
- Displaying Formal Summary Statistics:
  - Summarize numerical features using the `.describe()` function.
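A minimal sketch of the summary step, again on made-up values rather than the real dataset:

```python
import pandas as pd

df = pd.DataFrame({'sqft': [900, 1500, 2200], 'beds': [2, 3, 4]})

# .describe() reports count, mean, std, min, quartiles and max per numeric column.
stats = df.describe()
print(stats.loc['mean', 'beds'])  # 3.0
```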
- Handling Missing Data:
  - Display unique values for `basement` and `property_type`.
  - Recognize that NaN values for `basement` indicate properties without a basement.
  - Replace `NaN` values with `0` using the `.fillna()` function.
  - Convert the `basement` column data type to integer.
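The missing-data steps above can be sketched as follows, using a toy `basement` column in place of the real one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'basement': [1.0, np.nan, np.nan],
    'property_type': ['house', 'condo', 'house'],
})

print(df['basement'].unique())  # shows 1.0 and NaN; NaN marks "no basement"

# NaN means the property has no basement, so 0 is the correct fill value;
# the column can then safely become an integer type.
df['basement'] = df['basement'].fillna(0).astype(int)
print(df['basement'].tolist())  # [1, 0, 0]
```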
- Removing Outliers:
  - Import the `warnings` module to suppress warnings during visualization.
  - Create violin plots for `beds`, `sqft`, and `lot_size` to identify potential outliers.
  - Sort the `lot_size` column and display the top 5 largest values.
  - Examine rows with unusually large lot sizes.
  - Remove observations where `lot_size` exceeds 500,000 sqft, as they are deemed outliers.
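The notebook inspects the distributions visually with Seaborn violin plots; the snippet below shows only the sort-and-filter logic on made-up lot sizes, with the 500,000 sqft threshold taken from the notebook:

```python
import pandas as pd

df = pd.DataFrame({'lot_size': [6000.0, 8500.0, 1_200_000.0, 7200.0]})

# Inspect the largest lot sizes before committing to a cutoff.
print(df['lot_size'].sort_values(ascending=False).head(5))

# Keep only observations at or below the 500,000 sqft threshold.
df = df[df['lot_size'] <= 500_000]
print(len(df))  # 3
```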
- Saving the Cleaned Dataset:
  - Save the cleaned DataFrame as `cleaned_df.csv`.
  - Verify the saved file by reloading it with Pandas.
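The save-and-verify round trip might look like this (the toy frame is illustrative; `index=False` is an assumption that keeps the row index out of the CSV):

```python
import pandas as pd

df = pd.DataFrame({'beds': [2, 3], 'basement': [0, 1]})

# Persist without the index column, then reload to confirm a faithful round trip.
df.to_csv('cleaned_df.csv', index=False)
reloaded = pd.read_csv('cleaned_df.csv')
print(reloaded.equals(df))  # True
```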
This section highlights my ability to clean and preprocess data systematically, culminating in a cleaned dataset saved as `cleaned_df.csv`, which serves as the foundation for further analysis and model development.