Anyone who has built models knows that modeling works like an assembly line. Across the whole pipeline, we have singled out two "mosts" that deserve attention: the most critical step and the most error-prone step. These are naturally the two things newcomers should understand best. The most critical step is feature filtering; the most error-prone step is feature backtracking.
First, a word on features. Put bluntly, a feature describes some aspect of a real-world object. For example, risk control models often use features drawn from credit reports, consumer payments, multi-platform lending, device data, and so on.
Let's start with the first "most": feature filtering, the most critical step. The overall feature filtering process is currently as follows:
[Figure: overall feature filtering process]
As the saying goes, data determines the upper limit of a model, and the model can only approximate that limit. The figure below shows how data is used in the model development stage and in the online call stage.
[Figure: how data is used in the model development stage and the online call stage]
In feature filtering, once you grasp the most important screening indicators, you can do 80% of the work. These indicators are the variable's descriptive statistics (missing rate, unique values, distribution ratio), its stability (PSI), its discrimination (IV), and the filtering threshold chosen for each indicator. We will explain in detail how to filter on each of these indicators.
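As a rough illustration of the indicators listed above, here is a minimal sketch in plain Python. The function names, the binning by fixed edges, and especially the default thresholds in `passes_filter` are assumptions made for this example (common rules of thumb, not values given in this article); PSI and IV use their usual textbook definitions.

```python
import math

def missing_rate(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def _bin_shares(values, edges):
    """Fraction of values in each bin; `edges` are sorted inner cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1  # edges below v = bin index
    return [c / len(values) for c in counts]

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    total = 0.0
    for e, a in zip(_bin_shares(expected, edges), _bin_shares(actual, edges)):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

def iv(values, labels, edges, eps=1e-6):
    """Information Value: sum((g - b) * ln(g / b)), label 1 = bad, 0 = good."""
    good = [v for v, y in zip(values, labels) if y == 0]
    bad = [v for v, y in zip(values, labels) if y == 1]
    total = 0.0
    for g, b in zip(_bin_shares(good, edges), _bin_shares(bad, edges)):
        g, b = max(g, eps), max(b, eps)
        total += (g - b) * math.log(g / b)
    return total

def passes_filter(miss, psi_value, iv_value,
                  max_missing=0.8, max_psi=0.1, min_iv=0.02):
    """Hypothetical thresholds; tune them to your own portfolio."""
    return miss <= max_missing and psi_value <= max_psi and iv_value >= min_iv
```

In practice `expected` would be the development sample and `actual` a later or out-of-time sample, and the bin edges would come from the development sample only.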
Having covered the first "most", let's turn to the second: the step where mistakes are most likely, feature backtracking.
First, what is feature backtracking? The model is developed at the current point in time, but feature backtracking happens after the sample design stage, so each feature usually has to be traced back to a historical time point before the default outcome occurred. Only features reconstructed this way are valid.
Wherever data is backtracked, data travel can occur. Data travel means using an x that already carries information about y to predict y (commonly described as using y to predict y). It is also the mistake that people who build models make most often. Typical examples are using the number of overdue payments or the number of collection contacts to predict overdue.
So how can data travel be avoided? Here is the common method:
The key point is to use observation points, so that a feature's statistical point never falls inside the performance period.
Backtracking is generally done with the customer's three elements (typically name, ID number, and phone number) plus the observation point. The statistical point of the feature must be before the sample's observation point; otherwise data travel occurs, because the customer's future information is being borrowed to predict the future.
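The observation-point rule can be sketched as a point-in-time filter. This is only an illustration under assumptions: the customer key tuple, the event layout, the 90-day window, and the function names are hypothetical and not taken from any particular system.

```python
from datetime import date, timedelta

def backtrack_feature(events, customer_key, observation_date, window_days=90):
    """Count a customer's events in the window that ends just before the
    observation date. Events on or after the observation date are ignored,
    so nothing from the performance period can enter the feature."""
    window_start = observation_date - timedelta(days=window_days)
    return sum(
        1
        for key, event_date in events
        if key == customer_key and window_start <= event_date < observation_date
    )

def has_data_travel(feature_stat_date, observation_date):
    """True when the feature was computed at or after the observation point."""
    return feature_stat_date >= observation_date

# Hypothetical sample: the key is the customer's three elements.
customer = ("Zhang San", "ID0001", "13800000000")
other = ("Li Si", "ID0002", "13900000000")
events = [
    (customer, date(2023, 1, 10)),
    (customer, date(2023, 2, 20)),
    (customer, date(2023, 3, 1)),   # on the observation date: excluded
    (customer, date(2023, 4, 5)),   # in the performance period: excluded
    (other, date(2023, 2, 1)),      # different customer: excluded
]
observation = date(2023, 3, 1)
applications_90d = backtrack_feature(events, customer, observation)
```

Keeping the comparison strict (`event_date < observation_date`) is a deliberate choice here: it matches the requirement that the statistical point come before the observation point rather than coincide with it.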
~Original article