Data Modeling and Feature Engineering

Afzal Badshah, PhD
Apr 1, 2024 · 7 min read

Data modeling is the cornerstone of successful data analysis and machine learning projects. It’s the crucial first step where you define the structure and organization of your data. Imagine a construction project: before you start building, you need a blueprint to ensure everything fits together. Data modeling acts as the blueprint for your data, organizing it in a way that facilitates efficient exploration and model building. You can find the detailed tutorial here.

This process involves selecting a specific data model that best represents the relationships within your data and aligns with the intended use case. Here, we’ll explore some common data models, each with its own strengths and applications.

Data Modeling: Understanding the Landscape

Relational Model: This is the most widely used model, structured with tables containing rows (records) and columns (attributes). Each table represents an entity (e.g., customer) and its attributes (e.g., name, address). Relational databases like MySQL and PostgreSQL utilize this model for efficient data storage and retrieval.
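
To make this concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are invented purely for illustration.

```python
import sqlite3

# Illustrative schema only: a "customers" entity and its related "orders".
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', '12 Main St')")
conn.execute("INSERT INTO orders VALUES (100, 1, 59.99)")

# A join retrieves related rows across tables.
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()
print(rows)  # [('Alice', 59.99)]
```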

Dimensional Model: This model is specifically designed for data warehousing and business intelligence applications. It focuses on facts (measures you want to analyze, e.g., sales figures) and dimensions (categories that provide context to the facts, e.g., time, product, customer). This structure allows for efficient aggregation and analysis of large datasets.
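
A small, hypothetical sketch of the same idea with pandas: a fact table of sales joined to a product dimension and aggregated by category and month. All values are made up for illustration, and pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical fact table: one row per sale (the measures to analyze).
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-07", "2024-02-15"]),
    "revenue": [120.0, 80.0, 200.0, 150.0],
})

# Hypothetical dimension table: descriptive context for each product.
products = pd.DataFrame({
    "product_id": [1, 2],
    "category": ["books", "electronics"],
})

# Join facts to their dimension, then aggregate by category and month.
report = (
    sales.merge(products, on="product_id")
         .assign(month=lambda d: d["date"].dt.to_period("M"))
         .groupby(["category", "month"])["revenue"]
         .sum()
)
print(report)
```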

Hierarchical Model: This model represents data with inherent parent-child relationships. It’s often used for representing organizational structures (e.g., company departments), file systems (folders and subfolders) or biological classifications (kingdom, phylum, class, etc.).
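
A hierarchy can be sketched with nothing more than nested Python dictionaries; the department names below are invented for illustration.

```python
# A hierarchy as nested dictionaries: each key is a parent, each value its children.
company = {
    "Engineering": {
        "Platform": {},
        "Data": {"Analytics": {}, "ML": {}},
    },
    "Sales": {"EMEA": {}, "APAC": {}},
}

def walk(tree, depth=0):
    """Print the hierarchy with indentation showing parent-child depth."""
    for name, children in tree.items():
        print("  " * depth + name)
        walk(children, depth + 1)

walk(company)
```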

Graph Model: This model uses nodes and edges to represent entities and the relationships between them. Nodes can represent people, products, or any other entity, while edges depict the connections between them. Social networks like Facebook and Twitter leverage graph models to connect users and their interactions.
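
A minimal sketch of a graph as a plain Python adjacency structure; the user names and “follows” edges are made up.

```python
# A tiny social graph as an adjacency structure: nodes are users,
# edges are "follows" relationships.
follows = {
    "alice": {"bob", "carol"},
    "bob":   {"carol"},
    "carol": {"alice"},
}

def mutual_follows(graph, a, b):
    """True if a follows b and b follows a."""
    return b in graph.get(a, set()) and a in graph.get(b, set())

print(mutual_follows(follows, "alice", "carol"))  # True
print(mutual_follows(follows, "alice", "bob"))    # False
```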

The choice of data model depends on the structure of your data and the intended use case. Consider the type of relationships between entities and the kind of analysis you want to perform when selecting the most suitable model.

Feature Engineering: The Art of Extracting Insights

Feature engineering is the art of transforming raw data into meaningful features that a machine-learning model can understand and use for predictions. Here’s a closer look at key concepts:

Feature Selection

Feature selection is a crucial step in building effective machine learning models. It involves identifying the most relevant features from your dataset that significantly contribute to predicting the target variable (what you’re trying to forecast). Focusing on these key features improves the efficiency and accuracy of your model by reducing noise and irrelevant information. Here are some key feature selection techniques, with a short code sketch after the list:

  • Correlation Analysis: This technique measures the linear relationship between features and the target variable. Features with high positive or negative correlations with the target variable are likely to be informative for the model. For instance, a dataset predicting house prices might find a high positive correlation between square footage and price, indicating its significance.
  • Information Gain: This technique goes beyond correlation, calculating how much information a specific feature provides about the target variable. Features that effectively differentiate between different target values are more valuable. Imagine a dataset predicting customer churn (cancellations). Features like “frequency of purchases” or “recent customer service interactions” might have high information gain if they help distinguish between customers likely to churn and those likely to stay.
  • Feature Importance Scores: Some machine learning models can calculate feature importance scores that indicate how much each feature contributes to the model’s predictions. These scores can be a powerful tool for identifying the most important features for your specific model. For example, an image recognition model might assign high importance scores to features related to color and shape for accurate object classification.
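
The sketch below ties these three techniques together on a small synthetic dataset (scikit-learn’s make_regression standing in for real data); it assumes pandas and scikit-learn are installed, and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_regression          # synthetic stand-in for real data
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

# Synthetic data standing in for something like a house-price dataset.
X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

# 1. Correlation analysis: linear relationship of each feature with the target.
correlations = X.corrwith(pd.Series(y)).abs().sort_values(ascending=False)

# 2. Information gain (mutual information): dependence on the target, linear or not.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)

# 3. Model-based importance scores from a random forest.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)

print(correlations, mi.sort_values(ascending=False),
      importances.sort_values(ascending=False), sep="\n\n")
```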

Feature Engineering

Feature engineering is the art of transforming raw data into features that are more interpretable and informative for your machine-learning model. Imagine you’re building a model to predict house prices. Raw features like “total square footage” and “number of bedrooms” are helpful, but what about capturing the influence of location? Here’s where feature engineering comes in, with a code sketch following the list below:

  • Binning: Unearthing Hidden Patterns: Let’s say you have a continuous feature like “house age.” While the exact age might be useful, it might also be insightful to group houses into categories like “new (0–5 years old),” “mid-age (6–20 years old),” and “older (21+ years old).” This process, called binning, can help uncover non-linear relationships. For example, very old houses might require significant renovations, reducing their value compared to mid-age houses, even though their exact age might differ by just a few years.
  • Encoding Categorical Features: Speaking the Model’s Language: Imagine a feature for “property type” with values like “apartment,” “condo,” and “single-family home.” These can’t be directly fed into a model. Encoding techniques like one-hot encoding transform these categories into numerical representations (e.g., one-hot encoding creates separate binary features for each category, so “apartment” becomes [1, 0, 0] and “condo” becomes [0, 1, 0]). This allows the model to understand the relationships between these categories and the target variable (price).
  • Normalization and Standardization: Creating a Level Playing Field: Features can come in different scales. For instance, “house age” might range from 0 to 100 years, while “lot size” might be in square feet (potentially thousands). Some machine learning models are sensitive to these differences in scale. Normalization and standardization techniques scale all features to a common range (e.g., between 0 and 1 or with a mean of 0 and a standard deviation of 1). This ensures that features with larger scales don’t dominate the model’s learning process, allowing it to focus on the relationships between the features themselves and the target variable.
  • Feature Creation: Inventing New Weapons: Feature engineering isn’t just about transformation; it’s about creating entirely new features based on domain knowledge or mathematical operations. In our house price example, you could create a new feature such as “lot size per bedroom” by dividing the lot size by the number of bedrooms. (A ratio like “price per square foot” is useful for analysis, but because it is derived from the sale price itself, it would leak the target into the inputs of a price-prediction model.) Such derived features can be more informative for the model than the raw features alone.
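
The sketch below applies binning, one-hot encoding, and feature creation to a small, made-up housing table with pandas; scaling is covered in the next section.

```python
import pandas as pd

# A tiny, made-up housing table to illustrate the techniques above.
houses = pd.DataFrame({
    "age_years": [2, 15, 48, 30],
    "property_type": ["apartment", "condo", "single-family", "condo"],
    "lot_sqft": [900, 1200, 6500, 4000],
    "bedrooms": [1, 2, 4, 3],
})

# Binning: group continuous age into the categories described above.
houses["age_band"] = pd.cut(houses["age_years"],
                            bins=[-1, 5, 20, 200],
                            labels=["new", "mid-age", "older"])

# Encoding: one-hot encode the categorical property type.
houses = pd.get_dummies(houses, columns=["property_type"])

# Feature creation: a derived ratio built only from input features (no target leakage).
houses["lot_sqft_per_bedroom"] = houses["lot_sqft"] / houses["bedrooms"]

print(houses)
```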

Feature Transformation Techniques: Polishing the Data for Better Predictions

Feature transformation involves modifying existing features to improve their quality and, ultimately, the performance of your machine learning model. Here’s a closer look at some common techniques for handling real-world data challenges:

Taming Missing Values: The Imputation Rescue Mission: Missing data is a frequent roadblock in machine learning. Here’s how to address it (a short sketch follows the list):

  • Imputation: This strategy fills in missing values with estimated values. Imagine you have a dataset predicting customer churn (cancellations) with a missing value for a customer’s “last purchase amount.” You could use mean/median imputation to fill it with the average or median purchase amount of similar customers. More sophisticated techniques like K-Nearest Neighbors (KNN) imputation can find similar customers based on other features and use their purchase amounts to estimate the missing value.
  • Deletion: If a feature has a very high percentage of missing values, or imputation proves ineffective, removing rows or columns with missing data might be necessary. However, this approach can discard potentially valuable data, so it’s often a last resort.
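
A brief sketch of both imputation strategies with scikit-learn; the tiny churn-style table is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up churn-style data with a missing "last_purchase_amount".
df = pd.DataFrame({
    "tenure_months": [3, 24, 12, 36],
    "last_purchase_amount": [50.0, np.nan, 80.0, 120.0],
})

# Median imputation: replace the missing value with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(df)

# KNN imputation: estimate the missing value from the most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)

print(median_filled)
print(knn_filled)
```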

Outlier Wrangling: Taming the Extremes: Outliers are data points that fall far outside the typical range for a feature. They can skew your model’s predictions. Here are some ways to handle them (see the sketch after this list):

  • Winsorization: This technique caps outliers at a certain percentile (e.g., the 95th percentile) of the data distribution. Imagine a dataset on income with a single entry of $1 million (far above the average). Winsorization would replace this with the value at the 95th percentile, effectively capping the outlier’s influence.
  • Capping: Similar to winsorization, capping replaces outliers with a predefined value at the upper or lower end of the remaining data’s range.
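
A quick sketch of percentile-based capping with NumPy; the income values are invented.

```python
import numpy as np

# Made-up income values with one extreme outlier.
incomes = np.array([30_000, 45_000, 52_000, 61_000, 75_000, 1_000_000])

# Winsorization: cap values at chosen percentiles (here the 5th and 95th),
# pulling extremes at both ends back toward the bulk of the data.
low, high = np.percentile(incomes, [5, 95])
winsorized = np.clip(incomes, low, high)

print(winsorized)
```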

Scaling for Harmony: Normalization and Standardization: Some machine learning models are sensitive to the scale of features. For instance, imagine features like “income” (in dollars) and “age” (in years). The vastly different scales can cause the model to prioritize features with larger values. Here’s how to address this (a brief example follows the list):

  • Normalization: This scales features to a common range, typically between 0 and 1. It ensures all features contribute proportionally to the model’s learning process.
  • Standardization: This technique scales features to have a mean of 0 and a standard deviation of 1. It achieves a similar goal to normalization but can be more effective for certain algorithms.
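
A short example of both techniques using scikit-learn’s MinMaxScaler and StandardScaler; the income and age values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: income (dollars) and age (years).
X = np.array([[30_000, 25],
              [60_000, 40],
              [90_000, 55]], dtype=float)

# Normalization: rescale each feature to the [0, 1] range.
normalized = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

print(normalized)
print(standardized)
```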

By employing these feature transformation techniques, you can ensure your data is clean, consistent, and ready to be used by your machine learning model for accurate and reliable predictions.

Afzal Badshah, PhD

Dr Afzal Badshah focuses on academic skills, pedagogy (teaching skills) and life skills.