As part of our PCA release, we have published a series of blog posts, including a use case and a demonstration of the BigML Dashboard. In this installment, we shift our focus to implementing Principal Component Analysis (PCA) with the BigML REST API. PCA is a powerful data transformation technique and unsupervised Machine Learning method that can be used for data visualization and dimensionality reduction.
The first step in any BigML workflow using the API is setting up authentication. To proceed, you must first set the BIGML_USERNAME and BIGML_API_KEY environment variables, both of which are available on your account page. Once authentication is set up, you can execute the rest of this workflow.
export BIGML_USERNAME=my_name
export BIGML_API_KEY=13245
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"
Data sources can be uploaded to BigML in many different ways, so this step should be appropriately adapted to your data with the help of the API documentation. Here we will create our data source using a local file downloaded from Kaggle.
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@mobile.csv
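Every creation call returns a JSON document whose “resource” field holds the new resource’s ID, which the subsequent calls in this workflow need. As a sketch (the response body below is a hypothetical, abbreviated example; real responses contain many more fields), the ID can be captured like this:

```shell
# Hypothetical, abbreviated response body; a real source-creation
# response contains many more fields than shown here.
RESPONSE='{"resource": "source/4f603fe203ce89bb2d000000", "status": {"code": 1}}'

# Extract the resource ID with Python's standard-library JSON parser.
SOURCE_ID=$(echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["resource"])')
echo "$SOURCE_ID"
```

In practice you would pipe the output of the curl call itself into the parser instead of the canned string.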
This particular dataset has a target variable called “price_range”. Using the API, we can easily update its field type.
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
  -X PUT \
  -H 'content-type: application/json' \
  -d '{"fields": {"price_range": {"optype": "categorical"}}}'
In BigML, sources need to be processed into datasets.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/4f603fe203ce89bb2d000000"}'
Because we will want to evaluate a model trained on PCA-derived features, we need to split the dataset into training and test sets. Here we allocate 80% for training and 20% for testing, as indicated by the “sample_rate” parameter.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054", "sample_rate": 0.8, "seed": "myseed"}'
By setting the “out_of_bag” parameter to true, we select exactly the rows that were not sampled when creating the training set, giving us an independent test set.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"origin_dataset": "dataset/59c153eab95b3905a3000054", "sample_rate": 0.8, "out_of_bag": true, "seed": "myseed"}'
Our datasets are now prepared for PCA. The principal components obtained from PCA are linear combinations of the original variables. If the data will be used for supervised learning later, it is critical not to include the target variable in the PCA; otherwise, information about the target would leak into the derived covariate fields. As such, we create a PCA using all fields except “price_range”, via the “excluded_fields” parameter.
curl "https://bigml.io/pca?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/59c153eab95b3905a3000054", "excluded_fields": ["price_range"]}'
Next, we use the newly created PCA resource to perform a batch projection on both the training and test sets, making sure that the principal components are added as fields in both newly created datasets.
curl "https://bigml.io/batchprojection?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"pca": "pca/5423625af0a5ea3eea000028", "dataset": "dataset/59c153eab95b3905a3000054", "all_fields": true, "output_dataset": true}'
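The same call is then repeated for the test set. The dataset ID below is a placeholder: substitute the ID returned when the out-of-bag test dataset was created.

```shell
# Placeholder: replace <your-test-dataset-id> with the ID returned
# by the out_of_bag dataset creation step above.
curl "https://bigml.io/batchprojection?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"pca": "pca/5423625af0a5ea3eea000028", "dataset": "<your-test-dataset-id>", "all_fields": true, "output_dataset": true}'
```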
After that, training a logistic regression model on the training set to predict the “price_range” class is straightforward.
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af59f9cc7736e6b33005697", "objective_field": "price_range"}'
Once ready, evaluate the model using the test set. BigML will provide multiple classification metrics, some of which may be more relevant than others for your use case.
curl "https://bigml.io/evaluation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af5a69cb95b39787700036f", "logisticregression": "logisticregression/5af5af5db95b3978820001e0"}'
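Once the evaluation finishes, a GET request on its resource ID returns the metrics. As a sketch, the JSON below is a hypothetical, abbreviated response; the field layout (“result” containing per-model metrics such as “accuracy”) reflects our reading of the evaluation schema, so check the API documentation for the full structure.

```shell
# Fetch the evaluation (ID is a placeholder for the one returned above):
#   curl "https://bigml.io/evaluation/<your-evaluation-id>?$BIGML_AUTH"
# Hypothetical, abbreviated response body for illustration:
EVAL='{"result": {"model": {"accuracy": 0.955, "average_f_measure": 0.954}}}'

# Pull out a single metric with Python's standard-library JSON parser.
ACCURACY=$(echo "$EVAL" | python3 -c 'import sys, json; print(json.load(sys.stdin)["result"]["model"]["accuracy"])')
echo "Accuracy: $ACCURACY"
```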
Our final blog posts for this release will include additional tutorials on how to automate PCAs with WhizzML and the BigML Python Bindings. For further questions and reading, please remember to visit the release page and join the release webinar on Thursday, December 20, 2018. Attendance is FREE of charge, but space is limited so register soon!