Machine learning (ML) is a critical application of artificial intelligence (AI) used by Simility, a PayPal service, to help beat the fraudsters. The powerful algorithms we design can spot even the most sophisticated fraud attempts and are continually evolving to optimize detection, even as fraud patterns change. But how do we build these advanced models? A large part of the behind-the-scenes work that goes into this crucial task revolves around feature engineering and feature selection.
WHAT IS FEATURE ENGINEERING?
Raw data is the basic building block of ML algorithms. But on its own, it can’t be used to accurately train these models. Instead, it must be refined into “features” – variables or attributes that can be used for analysis, such as name, age, sex, or address. Feature engineering is the process of extracting features from raw data, and creating new relevant features from existing ones, in order to improve the predictive power of an ML algorithm.
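As a minimal sketch of what this refinement looks like in practice, the snippet below turns a raw user record into model-ready features. The field names and derived features here are purely illustrative, not an actual production schema:

```python
from datetime import date

def extract_features(raw):
    """Refine a raw user record into features an ML model can consume.
    Field names are hypothetical, for illustration only."""
    signup = date.fromisoformat(raw["signup_date"])
    return {
        # Categorical feature derived from a free-text field
        "email_domain": raw["email"].split("@")[-1],
        # Numeric feature derived from a date (fixed reference date
        # used here so the example is reproducible)
        "account_age_days": (date(2024, 1, 1) - signup).days,
        "name_length": len(raw["name"]),
    }

record = {"name": "Alice Smith", "email": "alice@example.com",
          "signup_date": "2023-10-01"}
features = extract_features(record)
# features["email_domain"] → "example.com"
# features["account_age_days"] → 92
```

The raw strings themselves are useless to most models; the derived domain, age, and length are the kinds of variables an algorithm can actually learn from.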
Feature engineering is usually done by aggregating or combining attributes to create new features. Experts use this technique to generate high quality new features which may not be directly derived from the original raw data, and in so doing uncover hidden patterns in that data. But this isn’t the end of the story. The next step in the process is to eliminate any irrelevant, redundant or correlated features. This is known as feature selection.
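To make the aggregation idea concrete, here is a small sketch that combines raw transaction rows into per-account features. The schema and feature names are assumptions for illustration, not a description of any real pipeline:

```python
from collections import defaultdict

def aggregate_by_account(transactions):
    """Aggregate raw transaction rows into new per-account features:
    a transaction count and an average amount. These derived features
    don't appear anywhere in the raw data itself."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in transactions:
        sums[t["account"]] += t["amount"]
        counts[t["account"]] += 1
    return {
        acct: {"txn_count": counts[acct],
               "avg_amount": sums[acct] / counts[acct]}
        for acct in counts
    }

txns = [
    {"account": "A", "amount": 10.0},
    {"account": "A", "amount": 30.0},
    {"account": "B", "amount": 5.0},
]
features = aggregate_by_account(txns)
# features["A"] → {"txn_count": 2, "avg_amount": 20.0}
```

An unusually high transaction count or average amount per account is exactly the kind of hidden pattern that aggregation can surface and that no single raw row reveals on its own.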
HOW CAN FEATURE SELECTION HELP?
The features we use to train an ML model are crucial to its performance. This means selecting the most relevant ones possible is absolutely vital. A particular challenge for data scientists is known as the curse of dimensionality, which arises when a dataset has a very large number of attributes or features. Feature selection is an effective way to address this, by removing irrelevant, noisy features.
When done well, feature selection will:
- Enhance generalization: That is, improve the ML model’s ability to predict outcomes for previously unseen data. It does this by reducing the number of variables or “noisy” features
- Reduce training times: Reducing the number of variables reduces the computational cost and speeds up model building
- Increase model interpretability: By reducing the number of features the model becomes simpler and easier to interpret
- Improve accuracy: Features within the data are often highly correlated, which makes them redundant. Removing highly correlated features means the model will be less prone to making predictions based on noise
- Reduce prediction time: Reducing the number of variables will simplify the resulting model, which may speed up prediction times
Of course, not all feature selection approaches are created equal. They can be split into three main types: filter methods, wrapper methods and embedded methods. Filter methods eschew the ML algorithm entirely, and instead rely on characteristics of the data to filter features based on a given metric. In wrapper methods, a model is trained on a subset of features, and based on the inferences drawn from the previous model, a decision is made to add or remove features. Embedded methods combine elements of filter and wrapper methodologies, with feature selection embedded within the ML algorithm.
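The filter approach is the simplest of the three to illustrate, since it looks only at the data. Below is a minimal sketch of one common filter metric, a variance threshold: features that barely vary carry little signal and are dropped before any model is trained. The data and threshold are illustrative assumptions:

```python
def variance_filter(columns, threshold=0.01):
    """Filter method: keep features whose variance exceeds a threshold.
    No ML model is consulted; the decision rests on the data alone."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)
    return {name: xs for name, xs in columns.items()
            if variance(xs) > threshold}

data = {
    "amount":       [10.0, 50.0, 30.0],
    "is_test_flag": [0.0, 0.0, 0.0],   # constant: zero variance, dropped
}
kept = variance_filter(data)
# kept contains "amount" but not "is_test_flag"
```

Wrapper methods, by contrast, would repeatedly retrain a model on candidate subsets (as in recursive feature elimination), while embedded methods let the training procedure itself shrink unhelpful features, as L1 regularization does.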
Simility, a PayPal service, is constantly evaluating new approaches to feature selection, and feature engineering in general, to improve the performance of our products. Global fraudsters are constantly innovating to circumvent existing detection tools and techniques. So we must also be at the cutting edge of data science to help minimize fraud costs and friction for our customers.