Scikit-learn Machine Learning in Python
We are living in the world of data, where data in getting generated every second throughout the world. Data Science has become a critical component of numerous industries and domains due to the increasing availability of data and the need to derive valuable insights from it. As data continues to grow in volume and complexity, data science’s importance will only increase in the coming years. Machine Learning and Deep Learning concepts are gaining wide range of attraction in today’s era. Scikit-learn, also known as Sklearn, is a famous open-source machine learning library for Python, Scikit learn being one of the most important library, is widely used to handle Machine learning algorithms for Data preprocessing, Feature Selection, Model Training, Model Evaluation and more. It offers a comprehensive set of functionalities that make it easier for developers and researchers to implement and experiment with machine learning techniques. We will discuss it more in this blog. What is Scikit Learn? It is an open-source library used for handling both supervised and unsupervised Machine Learning algorithms. It offers a powerful toolkit to streamline Machine Learning workflows and thus helping in solving real-world problems effectively. 5 Key Features of Scikit Learn 1. Comprehensive algorithms: It offers a wide collection of supervised and unsupervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), k-means clustering etc. 2. Preprocessing and feature engineering: Scikit-learn provides various data preprocessing techniques, such as MinMaxScalar for Normalization, StandardScalar for standardization, OneHotEncoder for encoding categorical variables, SimpleImputer for handling missing values and many more. It also offers feature selection and extraction methods to improve model performance. 3. Model Building and Evaluation: Several Machine Learning models are imported using Sklearn library, some of them are Tree based models, Ensemble models, Linear models, SVM based models etc. For Evaluation purposes, Scikit learn is providing metrics such as accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC). It also supports cross-validation and hyperparameter tuning for optimizing model parameters. 4. Integration with other libraries: Scikit-learn integrates with other Python libraries used in data science, such as pandas for data manipulation, matplotlib for visualization, and NumPy for basic numerical calculations and SciPy for advanced numerical calculations. 5. Easy-to-use API: Scikit-learn has a consistent and intuitive API, making it accessible for both beginners and experienced machine learning practitioners. Data Availability in Scikit Learning: Some of the data sets are available in this library for direct use. These data sets can be used for practising, exploring various data cleaning techniques, feature selection techniques, model evaluation techniques using scikit learn. Following Data Sets are available in this library: 1. Iris Dataset: It consists of measurements of sepals and petals of three different species of Iris flowers (Setosa, Versicolor, and Virginica). We need to classify these flowers based on measured values. 2. Boston Housing Dataset: The dataset contains information of housing prices in Boston. It contains 13 attributes and a numerical output variable. It can be used for regression problems. 3. Wine Dataset: The dataset contains chemical analysis of wine. It contains the details such as alcohol content, malic acid, and colour intensity. This dataset is used for classification purpose. 4. Breast Cancer Dataset: Digitized images of breast mass is present in the dataset. The attributes include various characteristics of the cell nuclei, and the task is to classify tumours as benign or malignant. 5. Digit Dataset: The Digit dataset consists of images of handwritten digits (0 to 9). The task is to classify the digit represented by each image. In addition to these, scikit-learn also provides utilities to load and work with other popular datasets, such as the California housing dataset, MNIST dataset, and more. How to load a particular dataset? import sklearn from sklearn import datasets #datasets imported from sklearn boston_data = datasets.load_boston() ##boston dataset got loaded ##data is getting saved in variable a and b. a = boston_data.data print(a.shape); #(506, 13) b = boston_data.target print(b.shape); #(506,) #we have printed shape Sklearn and Preprocessing: Scikit-learn provides a wide range of preprocessing techniques to prepare and transform your data before feeding it into machine learning algorithms. These preprocessing methods can help with data cleaning, normalization, scaling, encoding categorical variables, handling missing values, and more. Here are some commonly used preprocessing techniques in scikit-learn: Data Cleaning: Handling Missing Values: Simple Imputer can be used to replace missing values into different strategy (mean,median,most frequent etc). import numpy as np import pandas as pd a=pd.DataFrame({‘A’:[2,3,4],’B’: [3,2,np.NaN]}) a.isna() from sklearn.impute import SimpleImputer mean_imputer = SimpleImputer(missing_values=np.nan, strategy=’mean’) a[“B”] = pd.DataFrame(mean_imputer.fit_transform(a[[“B”]])) BEFORE IMPUTATION – A B 0 False False 1 False False 2 False True AFTER IMPUTATION – A B 0 False False 1 False False 2 False False Outlier Detection: Scikit-learn offers various methods, such as the EllipticEnvelope and IsolationForest, for detecting and handling outliers in your data. Feature Scaling: We are taking a sample dataset of WindTurbine Failure in the code. data=pd.read_csv(‘cleaned_data.csv’) from sklearn.preprocessing import MinMaxScaler data_x=data.loc[:,data.columns!=’Failure_status’] mms=MinMaxScaler() data_x_norm=mms.fit_transform(data_x) data_x_norm=pd.DataFrame(data_x_norm,columns=data_x.columns) As MinMaxScalar is used, all the values are lying within the range of 0 to 1. Encoding Categorical Variables: One-Hot Encoding: The OneHotEncoder class encodes categorical features into binary vectors. Label Encoding: The Label Encoder class converts categorical labels into integer values. purpose purpose furniture/appliances 4 furniture/appliances 4 car0 0 car 1 car 1 car 1 car 1 car 1 business 3 business 3 car 1 car0 0 car 1 car 1 car0 0 car 1 business 3 business 3 car0 0 furniture/appliances 4 car 1 furniture/appliances 4 car0 0 As you can see before doing encoding, data were present in categorical format and after doing label encoding, data converted to numerical format. Feature Engineering using Scikit Learn: In simple words Feature Engineering is all about creating new features by working on existing features. This requires domain knowledge to get a new feature which could be increasing the accuracy of the model. Dimensionality Reduction: Principal Component Analysis (PCA): This decreases the number of columns by extracting