18CSE479T - Statistical Machine Learning UNIT 2 & 3 (4 MARKS)
4M:
State the assumptions in linear regression model
Linear regression rests on the following assumptions; if they are violated, the model's results are unreliable:
The dependent variable should be a linear combination of the independent variables
No autocorrelation in error terms
No or little multicollinearity
Error terms should be homoscedastic
Errors should have zero mean and be normally distributed
Explain the Random forest algorithm
The algorithm involves 2 phases:
1st phase: create the random forest by combining N decision trees
2nd phase: make predictions using each tree built in the 1st phase and combine them by voting
Working of the Random Forest Algorithm:
Step 1: select k random data points from the training set
Step 2: build a decision tree associated with the selected data points
Step 3: choose the number N of decision trees that you want to build
Step 4: repeat steps 1 and 2 until N trees have been built
Step 5: for a new data point, get the prediction from each decision tree and assign the point to the category that wins the majority of votes
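The steps above map directly onto scikit-learn's RandomForestClassifier; a minimal sketch on the Iris dataset (the dataset and parameter choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 decision trees; each is trained on a bootstrap sample
# (random points from the training set, as in steps 1-4)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Step 5: each tree votes, and the majority class is returned
print("accuracy:", forest.score(X_test, y_test))
```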
Discuss on the curse of dimensionality
The KNN classifier is simple and can work quite well, provided it is given a good distance metric and has enough labeled training data.
However, the main problem with KNN classifiers is that they do not work well with high dimensional inputs.
The poor performance in high dimensional settings is due to the curse of dimensionality
The curse of dimensionality refers to the challenges and limitations encountered when dealing with data in high dimensional feature spaces
As the number of features increases, the volume of the space grows exponentially, leading to sparse data distribution
Example: imagine a dataset with 5000 data points and applying 5-NN. In one dimension, we need to cover a fraction 5/5000 = 0.001 of the range on average to capture the 5 nearest neighbors
In d dimensions, we need to cover an edge length of (0.001)^(1/d) along each axis, which quickly approaches 1 as d grows, so the "nearest" neighbors are no longer local
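The edge-length calculation above is easy to check numerically, assuming data spread uniformly over a unit hypercube:

```python
# Fraction of the data we want: 5 neighbors out of 5000 points
f = 5 / 5000  # 0.001

# Edge length of the hypercube needed to capture that fraction:
# already about half the axis at d = 10, nearly the whole axis at d = 100
for d in (1, 2, 10, 100):
    print(d, round(f ** (1 / d), 3))
```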
State the Conditional probability theory
A conditional probability is the probability of one event given that another event has occurred: P(A|B) = P(A and B) / P(B), provided P(B) > 0
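A small worked example, counting outcomes for two fair dice (the events chosen here are purely illustrative):

```python
from itertools import product

# All 36 outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# Event A: the total equals 7; event B: the first die shows 4
A = {o for o in outcomes if sum(o) == 7}
B = {o for o in outcomes if o[0] == 4}

p_B = len(B) / len(outcomes)
p_A_and_B = len(A & B) / len(outcomes)

# P(A|B) = P(A and B) / P(B)
print(p_A_and_B / p_B)  # 1/6 ≈ 0.1667
```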
Describe the Laplace estimator
In practice, some words may never have appeared for a specific category in the training data but appear at a later stage, which drives the entire Naive Bayes probability product to zero
For example, if a word W3 has a frequency count of 0, its estimated probability is 0, and multiplying by it converts the entire product to 0
To avoid this situation, the Laplace estimator adds a small number to each count in the frequency table, which ensures that every feature has a nonzero probability of occurring. The Laplace estimator is usually set to 1
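A minimal sketch of Laplace smoothing on a toy frequency table (the words and counts are made up for illustration):

```python
# Word counts for one category, with one unseen word
counts = {"free": 13, "win": 7, "hello": 0}
total = sum(counts.values())

# Without smoothing, "hello" gets probability 0 and zeroes out
# the whole Naive Bayes product
raw = {w: c / total for w, c in counts.items()}

# Laplace estimator: add 1 to every count, and add the vocabulary
# size to the denominator so the probabilities still sum to 1
alpha = 1
vocab = len(counts)
smoothed = {w: (c + alpha) / (total + alpha * vocab) for w, c in counts.items()}

print(raw["hello"], smoothed["hello"])  # 0.0 vs a small nonzero value
```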
Discuss the importance of the Variable importance plot.
The variable importance plot is a valuable tool in statistical modeling and machine learning
It quantifies the impact of individual features on the model’s performance
It helps us understand which features contribute most to the model’s predictions
Why is it important?
Feature selection
Model interpretability
Model evaluation
Business insights
Feature engineering
Model comparison
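One common way to obtain the scores behind such a plot is a random forest's impurity-based feature importances; a minimal sketch with scikit-learn (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Each score is the mean impurity decrease contributed by that feature;
# the scores sum to 1, so sorting them ranks the features directly
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Plotting these scores as a horizontal bar chart gives the variable importance plot itself.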
Information value criteria of logistic regression
Information value(IV):
This is very useful in the preliminary filtering of variables prior to including them in the model
It eliminates many variables in a first pass, prior to fitting the model, since the final model would typically contain only about 10 variables
Akaike Information Criteria(AIC):
This measures the relative quality of a statistical model for a given set of data
It is a trade-off between goodness of fit and model complexity (bias versus variance)
When comparing two models, the model with the lower AIC is preferred