In the previous post of this series about feature selection WhizzML scripts, we introduced the problem of having too many features in our dataset, and we saw how Recursive Feature Elimination helps us detect and remove useless fields. In this second post, we will learn about another useful script: Boruta.
We talked previously about this feature selection script. If you want to know more about it, visit its info page, which also contains the WhizzML code.
The Boruta script uses the field importances obtained from an ensemble to mark fields as important or unimportant. It works iteratively: on each iteration it labels the fields that are clearly important or clearly unimportant, and leaves the rest to be labeled in later iterations (a conceptual sketch of this loop follows the list below). The previous version of this script didn’t have any configuration options, so we made the two main parameters of the algorithm configurable by the user:
- min-gain: The minimum gain in importance that a field must show over a field with random values. If a field’s gain is higher than this value, it is marked as important.
- max-runs: Maximum number of iterations.
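To make the idea more concrete, here is a minimal Python sketch of a Boruta-style loop. This is not the WhizzML implementation behind the script: it assumes a scikit-learn random forest in place of the BigML ensemble, and the function name, the shuffled "shadow" feature trick, and the default values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_like_selection(X, y, min_gain=0.0, max_runs=10, random_state=0):
    """Rough sketch of a Boruta-style loop: compare each field's importance
    against shuffled "shadow" copies of the fields and keep the fields that
    repeatedly beat the best shadow importance by at least `min_gain`.

    X: 2-D numpy array of inputs, y: array of labels."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    hits = np.zeros(n_features)

    for _ in range(max_runs):
        # Build shadow features by shuffling each column independently,
        # so they keep the original distributions but lose any real signal.
        shadows = np.apply_along_axis(rng.permutation, 0, X)
        X_aug = np.hstack([X, shadows])

        model = RandomForestClassifier(n_estimators=100, random_state=random_state)
        model.fit(X_aug, y)

        importances = model.feature_importances_
        real_imp = importances[:n_features]
        shadow_max = importances[n_features:].max()

        # A field scores a "hit" when it beats the best shadow by min_gain.
        hits += (real_imp > shadow_max + min_gain)

    # Keep the fields that won in a clear majority of the runs.
    return np.where(hits > max_runs / 2)[0]
```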
As you can see, there is no n parameter specifying the number of features to keep. This is the main difference with respect to other algorithms: Boruta assumes that the user doesn’t need to know the optimal number of features in advance.
Let’s apply Boruta. We will use the dataset described in our previous post, which contains readings from multiple sensors inside trucks. These will be the inputs that we will use:
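If you prefer to launch the script programmatically rather than from the dashboard, a minimal sketch with the BigML Python bindings could look like the following. The script and dataset IDs are placeholders, the input names are assumed to match the min-gain and max-runs options described above, and the values are illustrative, not the exact configuration used for the results below.

```python
from bigml.api import BigML

api = BigML()  # reads credentials from BIGML_USERNAME / BIGML_API_KEY

# Placeholder IDs: replace with your Boruta script and truck-sensor dataset.
BORUTA_SCRIPT = "script/000000000000000000000000"
TRUCK_DATASET = "dataset/000000000000000000000000"

execution = api.create_execution(BORUTA_SCRIPT, {
    "inputs": [
        ["dataset-id", TRUCK_DATASET],  # assumed input name
        ["min-gain", 0.0],              # illustrative value
        ["max-runs", 20]                # illustrative value
    ]})
api.ok(execution)  # wait until the execution finishes
print(execution["resource"])  # the outputs can then be inspected in the dashboard
```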
After 50 minutes, Boruta selects the following fields as important:
"cn_000", "bj_000", "az_000", "al_000", "am_0", "bt_000", "ci_000", "ag_001", "ag_003", "aq_000", "ag_002", "ck_000", "bu_000", "cn_004", "ay_009", "cj_000", "cs_002", "dn_000", "ba_005", "ee_005", "ap_000", "az_001", "ay_003", "cc_000", "bb_000", "ee_007", "ay_005", "cn_001", "ee_000"
Boruta marked 29 fields as important, 18 of which were also returned by Recursive Feature Elimination, as seen in the previous post. The ensemble built from the new, filtered dataset has a phi coefficient of 0.84, while the ensemble that uses the original dataset had a phi coefficient of 0.824. Boruta achieved a more accurate model!
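For reference, the phi coefficient (the Matthews correlation coefficient for binary problems) measures the correlation between predicted and actual classes and ranges from -1 to 1. As a quick sketch of how it can be computed outside BigML, with made-up example labels:

```python
from sklearn.metrics import matthews_corrcoef

# Made-up labels, just to illustrate how the phi coefficient is computed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print(matthews_corrcoef(y_true, y_pred))  # phi coefficient in [-1, 1]
```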
As we have seen, Boruta can be very useful when we have no idea of the optimal number of features, or when we suspect that some features are not contributing at all. Boruta discards only the fields that are useless for the model, so we remove features without hurting model performance. In the third post of this series, we will cover the third script: Best First Feature Selection. Don’t miss it!