Today’s post is the fifth in our series of blog posts about BigML’s unique implementation of Principal Component Analysis (PCA), the latest resource added to our platform. PCA is a different type of task in the Data Preparation phase described in the CRISP-DM methodology, one that involves creating a new dataset based on an existing one.
As mentioned in BigML’s previous release, data preparation is a key part of any Machine Learning project, where a large number of operations are often required to get the best out of your data. Now that PCA is available in the BigML Dashboard, the API, and also WhizzML and the Bindings for automation, you will be able to transform your data and, this time, achieve dimensionality reduction by decreasing the number of features in your dataset. Let’s dive in to learn how to automate BigML PCA with WhizzML and our Python Bindings. If you are new to WhizzML and would like to start automating your own Machine Learning workflows, we invite you to read this blog post to get started.
Creating a PCA
First of all, we are going to create a PCA from an existing dataset. We are assuming we want to reduce the number of features of this dataset translating it from its original form to another with fewer dimensions. The WhizzML code to do just that without specifying any parameter in the configuration looks like this:
;; creates a PCA with default configuration
(define my-pca (create-pca {"dataset" "dataset/5bcbd2b5421aa9560d000000"}))
The equivalent code using the BigML Python Bindings is:
from bigml.api import BigML

api = BigML()
my_pca = api.create_pca("dataset/5bcbd2b5421aa9560d000000")
This is the simplest way to create a PCA from a dataset, but there are some parameters that you can configure as needed. For instance, let’s see how to set the value that replaces missing numeric values when creating a PCA:
;; creates a PCA setting the default numeric values
(define my-pca (create-pca {"dataset" "dataset/5bcbd2b5421aa9560d000000" "default_numeric_value" "median"}))
And the equivalent code in our Python Bindings:
from bigml.api import BigML

api = BigML()
args = {"default_numeric_value": "median"}
my_pca = api.create_pca("dataset/5bcbd2b5421aa9560d000000", args)
This has been a simple example of how to add arguments during the PCA creation. You could similarly set many other values, for instance, the name of the new resource. For a complete list of the parameters available for PCA configuration please check the API documentation.
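To make the effect of a "median" default more concrete, here is a minimal local sketch of the idea: missing entries in a numeric field are replaced by the median of the values that are present. This only illustrates the concept and is not BigML's implementation; the helper name fill_missing_with_median and the sample values are hypothetical.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a median with no observed values")
    fill = median(present)
    return [fill if v is None else v for v in values]


# Hypothetical numeric field with two missing entries.
ages = [31, None, 25, 40, None]
filled_ages = fill_missing_with_median(ages)
print(filled_ages)
```

With the platform option, this replacement is handled for you when the PCA is created, so there is no need to preprocess the dataset yourself.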
Creating a new projection
We have seen how PCA translates the data from one space to another, which is why we talk about “projections”. So, let’s assume we have a set of inputs and we want to apply the result of the PCA to them. Following the proper syntax, our set of data should be passed as input_data, which is an object with pairs of field IDs and values in WhizzML.
;; creates a projection for the input data
(define my-projection (create-projection {"pca" "pca/5bcbd2b5421aa9560d000001" "input_data" {"000000" 3 "000001" "London"}}))
And the equivalent code for the Python Bindings passes a dictionary, where the key is the field ID (or the field name) and the value is the value of the field.
from bigml.api import BigML

api = BigML()
input_data = {"000000": 3, "000001": "London"}
my_projection = api.create_projection("pca/5bcbd2b5421aa9560d000001", input_data)
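If you are curious about what a projection computes, the following sketch shows the core idea for purely numeric inputs: center the point on the field means and take its dot product with each component vector. It is a simplified illustration, not the platform's code, which also has to deal with categorical fields such as "London" above. The function name project, the means, and the component vectors are all made up for the example.

```python
def project(point, means, components):
    """Center a point and take its dot product with each component vector."""
    centered = [x - m for x, m in zip(point, means)]
    return [
        sum(c_i * x_i for c_i, x_i in zip(component, centered))
        for component in components
    ]


# Hypothetical 2-D example: the components are the two diagonals,
# scaled to unit length.
s = 2 ** -0.5
means = [2.0, 3.0]
components = [[s, s], [s, -s]]

# The point sits exactly along the first diagonal from the mean,
# so all of its offset lands on the first component.
projection = project([3.0, 4.0], means, components)
print(projection)
```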
Creating batch projections
Once you have created your PCA, it’s very likely you’ll want to apply the same transformation that we applied in the example above to different data, and not just to one instance but to a whole set of them. That’s what we call a batch projection. Creating one takes at least two mandatory arguments: the PCA that was previously created and the dataset that we want to project.
;; creates the projection of a new set of data
(define my-batchprojection (create-batchprojection {"pca" "pca/5bcbd2b5421aa9560d000001" "dataset" "dataset/5bcbd2b5421aa9560d000003"}))
The equivalent code for the Python Bindings is:
from bigml.api import BigML

api = BigML()
my_pca = "pca/5bcbd2b5421aa9560d000001"
my_new_dataset = "dataset/5bcbd2b5421aa9560d000003"
my_batch_projection = api.create_batch_projection(my_pca, my_new_dataset)
The result of our example will contain as features all the fields from the original dataset plus the projected principal components. You can also adapt the output of the API call to your needs. Please check the API documentation to see the available options.
What about dimensionality reduction?
PCA allows you to reduce the number of dimensions in a dataset by creating new features that best describe the variance in your data. The algorithm yields a set of Principal Components that matches the dimensionality of the original dataset, and the new features are conveniently sorted according to the Percent Variance Explained of the original data. With this information, we can choose to eliminate a fraction of the Principal Component fields in order to reduce the number of features while preserving the maximum amount of useful information.
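To see where the Percent Variance Explained comes from, here is a small self-contained sketch for a two-field numeric sample. It builds the sample covariance matrix, takes its eigenvalues in closed form (possible because the matrix is a symmetric 2x2), and divides each by their sum. This is a textbook illustration under simplifying assumptions, not BigML's implementation; the function name and the data are hypothetical.

```python
from math import sqrt


def percent_variance_explained_2d(xs, ys):
    """Return the share of total variance captured by each of the two
    principal components of a 2-D sample, largest first."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Hypothetical sample in which the two fields are almost perfectly correlated.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ratios = percent_variance_explained_2d(xs, ys)
print(ratios)
```

Because the two fields move together almost perfectly, nearly all of the variance ends up in the first component, which is exactly the situation where dropping the remaining components loses very little information.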
To help you do that, two key parameters allow you to select how many components of the PCA you want to use in the new mapping space. Those parameters are:
max_components represents the integer number of components you want to employ in your new dataset.
variance_threshold determines what percentage of the total variance in the original dataset you’d like to capture in the new space.
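The selection logic behind these two parameters can be sketched locally as follows. Given each component's share of the total variance, sorted from largest to smallest, the function returns how many leading components to keep. Note the assumptions: the function name and the sample shares are hypothetical, and taking the smaller count when both limits are given is our own choice for this sketch rather than documented platform behavior.

```python
def components_to_keep(explained, max_components=None, variance_threshold=None):
    """Return how many leading components to keep.

    `explained` lists each component's share of total variance, sorted
    from largest to smallest.
    """
    keep = len(explained)
    if variance_threshold is not None:
        cumulative = 0.0
        for index, share in enumerate(explained, start=1):
            cumulative += share
            if cumulative >= variance_threshold:
                keep = index
                break
    if max_components is not None:
        # Assumption for this sketch: when both limits are set, keep the
        # smaller number of components.
        keep = min(keep, max_components)
    return keep


# Hypothetical shares of variance for four components.
explained = [0.60, 0.25, 0.10, 0.05]
by_threshold = components_to_keep(explained, variance_threshold=0.9)
by_count = components_to_keep(explained, max_components=2)
print(by_threshold, by_count)
```

Here the first two components cover only 85% of the variance, so a 90% threshold needs a third one, while max_components simply caps the count.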
For example, let’s suppose that you want to explain at least 90% of the variance of your data with the components. The WhizzML code will be as follows:
;; creates a batch projection that explains 90% of the variance
(define my-batchprojection (create-batchprojection {"pca" "pca/5bcbd2b5421aa9560d000001" "dataset" "dataset/5bcbd2b5421aa9560d000003" "variance_threshold" 0.9}))
On the other hand, if we used the Python Bindings to code this creation, the equivalent code would be:
from bigml.api import BigML

api = BigML()
my_pca = "pca/5bcbd2b5421aa9560d000001"
my_new_dataset = "dataset/5bcbd2b5421aa9560d000003"
args = {"variance_threshold": 0.9}
my_batch_projection = api.create_batch_projection(my_pca, my_new_dataset, args)
These two parameters are the most significant ones, but there are many other parameters that can be set for the batch projection creation. Check the complete list here.
Finally, feel free to check out the set of bindings that BigML offers for most popular programming languages, such as Java or Node.js.
Want to know more about PCA?
If you have any questions or you would like to learn more about how PCA works, please visit the release page and reserve your spot for the upcoming webinar about Principal Component Analysis on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!