Data Cleaning

In Data_Cleaning_Solution.ipynb, I identify and address data-quality issues in the dataset: handling missing values, detecting and removing outliers, and ensuring consistency, so the data is ready for effective model building.

  • Mounting Google Drive in Google Colab:
    • Access files stored in Google Drive to enable seamless data loading.
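The mounting step above is Colab-specific boilerplate; a minimal sketch (it only runs inside a Colab runtime, and `/content/drive` is Colab's default mount point):

```python
# Mount Google Drive so files stored there are visible under /content/drive.
# This prompts for authorization the first time it runs in a Colab session.
from google.colab import drive

drive.mount('/content/drive')
```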
  • Importing Libraries and Loading the Dataset:
    • Import the necessary Python libraries:
      • NumPy: For numerical computing.
      • Pandas: For data manipulation.
      • Matplotlib: For visualization.
      • Seaborn: For enhanced visualization.
    • Load the real__estate.csv file into a DataFrame.
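A sketch of the import-and-load step; since real__estate.csv itself isn't available here, a tiny inline sample stands in for it, and the column names are assumptions:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Inline sample standing in for real__estate.csv (column names assumed).
csv_text = """beds,sqft,lot_size,basement,property_type
3,1400,6000,1.0,Single Family
2,900,5000,,Condo
"""

# In the notebook this would be pd.read_csv() on the Drive path instead.
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 5)
```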
  • Displaying Basic Dataset Information:
    • Use .head(), .tail(), and .sample() to view subsets of the data.
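The inspection calls above can be sketched on a toy frame (column names are assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real-estate data.
df = pd.DataFrame({
    "beds": [2, 3, 4, 3, 5],
    "sqft": [900, 1400, 2100, 1600, 3000],
})

print(df.head(3))                     # first 3 rows
print(df.tail(2))                     # last 2 rows
print(df.sample(2, random_state=0))   # 2 random rows, seeded for repeatability
```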
  • Displaying Summary Statistics:
    • Summarize the numerical features with the .describe() method.
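For example, on an assumed `sqft` column, `.describe()` returns count, mean, standard deviation, min, quartiles, and max:

```python
import pandas as pd

df = pd.DataFrame({"sqft": [900, 1400, 2100, 1600, 3000]})

# One row per summary statistic, one column per numeric feature.
stats = df.describe()
print(stats)
```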
  • Handling Missing Data:
    • Display unique values for basement and property_type.
    • Recognize that NaN values for basement indicate properties without a basement.
    • Replace NaN values with 0 using the .fillna() function.
    • Convert the basement column data type to integer.
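The missing-value steps above can be sketched on a toy frame; the values are invented, but the logic mirrors the notebook's reasoning that NaN in `basement` means "no basement":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "basement": [1.0, np.nan, np.nan, 1.0],
    "property_type": ["Single Family", "Condo", "Condo", "Single Family"],
})

# Inspect the distinct values first to justify the imputation.
print(df["basement"].unique())        # [ 1. nan]
print(df["property_type"].unique())

# NaN means no basement, so fill with 0 and convert to integer.
df["basement"] = df["basement"].fillna(0).astype(int)
print(df["basement"].tolist())        # [1, 0, 0, 1]
```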
  • Removing Outliers:
    • Import the warnings module to suppress warnings during visualization.
    • Create violin plots for beds, sqft, and lot_size to identify potential outliers.
    • Sort the lot_size column and display the top 5 largest values.
    • Examine rows with unusually large lot sizes.
    • Remove observations where lot_size exceeds 500,000 sqft, as they are deemed outliers.
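The outlier workflow above can be sketched as follows; the frame is a toy stand-in with one implausibly large lot, and the 500,000 sqft cutoff is taken from the notebook:

```python
import warnings

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")  # suppress warnings during plotting

# Toy frame with one implausibly large lot (values invented).
df = pd.DataFrame({
    "beds": [3, 4, 2, 5, 3],
    "sqft": [1400, 2100, 900, 3000, 1600],
    "lot_size": [6000, 8000, 5000, 1_200_000, 7000],
})

# Violin plots surface long tails that suggest outliers.
for col in ["beds", "sqft", "lot_size"]:
    sns.violinplot(x=df[col])
    plt.close()

# Inspect the largest lot sizes, then drop rows above the cutoff.
print(df["lot_size"].sort_values(ascending=False).head(5))
cleaned = df[df["lot_size"] <= 500_000]
print(len(cleaned))  # 4
```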
  • Saving the Cleaned Dataset:
    • Save the cleaned DataFrame as cleaned_df.csv.
    • Verify the saved file by reloading it with Pandas.
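The save-and-verify round trip can be sketched as (toy frame; only the filename cleaned_df.csv comes from the notebook):

```python
import pandas as pd

df = pd.DataFrame({"beds": [3, 4], "sqft": [1400, 2100]})

# Write without the index so reloading doesn't gain an extra column.
df.to_csv("cleaned_df.csv", index=False)

# Verify by reading the file back and comparing.
reloaded = pd.read_csv("cleaned_df.csv")
print(reloaded.equals(df))  # True
```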

This section highlights my ability to clean and preprocess data systematically, culminating in the creation of a cleaned dataset stored as cleaned_df.csv, which serves as the foundation for further analysis and model development.