18CSE479T - Statistical Machine Learning UNIT 2 & 3 (4 MARKS)



State the assumptions in linear regression model 

Linear regression rests on the following assumptions; if they are violated, the model's estimates and inferences are not reliable:

  • The dependent variable should be a linear combination of the independent variables

  • No autocorrelation in error terms

  • No or little multicollinearity

  • Error terms should be homoscedastic

  • Errors should have zero mean and be normally distributed
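As a quick illustration (a minimal sketch using NumPy with made-up simulated data, not part of any standard derivation), an ordinary least-squares fit that includes an intercept yields residuals with zero mean, matching the last assumption above:

```python
import numpy as np

# Simulate data that satisfies the assumptions: a linear combination of
# the inputs plus zero-mean, constant-variance Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
true_coef = np.array([2.0, -1.5])
y = X @ true_coef + 3.0 + rng.normal(0, 1, size=200)

# Add an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ coef
print(round(float(residuals.mean()), 6))  # ~0: residuals of an OLS fit with intercept
```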


Explain the Random forest algorithm

Random forest works in two phases:

  • 1st phase: build the random forest by training N decision trees, each on a random subset of the training data

  • 2nd phase: for a new input, collect the prediction of every tree built in the 1st phase and combine them by majority vote

Working of the Random Forest Algorithm:

  • Step 1: select K random data points from the training set

  • Step 2: build a decision tree on the selected data points

  • Step 3: choose the number N of decision trees you want to build

  • Step 4: repeat steps 1 & 2 until N trees have been built

  • Step 5: for a new data point, get the prediction of each decision tree and assign the point to the category that wins the majority vote
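The steps above can be sketched in plain Python (a toy illustration, not a production implementation: the "trees" here are one-level decision stumps, and the data and names are made up):

```python
import random
from collections import Counter

def majority(labels):
    # Most common label in a list, or None if the list is empty
    return Counter(labels).most_common(1)[0][0] if labels else None

def build_stump(sample):
    # One-level "decision tree": pick the feature/threshold pair that
    # classifies the most points of this bootstrap sample correctly
    best = None
    for f in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[f]
            left = [lbl for xx, lbl in sample if xx[f] <= t]
            right = [lbl for xx, lbl in sample if xx[f] > t]
            correct = 0
            if left:
                correct += Counter(left).most_common(1)[0][1]
            if right:
                correct += Counter(right).most_common(1)[0][1]
            if best is None or correct > best[0]:
                best = (correct, f, t, majority(left), majority(right))
    _, f, t, lpred, rpred = best
    return lambda x: lpred if x[f] <= t else rpred

def random_forest(data, n_trees=7, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # Step 1: random points (bootstrap)
        trees.append(build_stump(sample))          # Steps 2-4: build N trees
    def predict(x):                                # Step 5: majority vote over trees
        return Counter(tree(x) for tree in trees).most_common(1)[0][0]
    return predict

# Toy data: the class is decided by the first feature (value < 5 -> class 0)
data = [((i, i % 3), 0 if i < 5 else 1) for i in range(10)]
predict = random_forest(data)
print(predict((2, 0)), predict((8, 1)))
```

Each tree sees a different bootstrap sample, so individual trees can be wrong; the majority vote in step 5 is what makes the ensemble robust.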


Discuss on the curse of dimensionality

  • The KNN classifier is simple and can work quite well, provided it is given a good distance metric and has enough labeled training data. 

  • However, the main problem with KNN classifiers is that they do not work well with high dimensional inputs. 

  • The poor performance in high dimensional settings is due to the curse of dimensionality

  • The curse of dimensionality refers to the challenges and limitations encountered when dealing with data in high dimensional feature spaces

  • As the number of features increases, the volume of the space grows exponentially, leading to sparse data distribution

  • Example: imagine 5000 training points spread uniformly over the unit interval and apply 5-NN - in 1D we need to travel a distance of only 5/5000 = 0.001 on average to capture the 5 nearest neighbours

  • In d dimensions, the edge of the cube that captures the same fraction of points grows to (0.001)^(1/d), which approaches 1 as d increases, so the "nearest" neighbours are no longer local
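How quickly locality is lost can be checked directly (a small sketch using the 0.001 fraction from the example above):

```python
# Fraction of the data we want as neighbours (5 out of 5000 points, as above)
p = 5 / 5000
for d in (1, 2, 10, 100):
    edge = p ** (1 / d)  # edge length of the cube holding that fraction of points
    print(d, round(edge, 3))
# 1 -> 0.001, 2 -> 0.032, 10 -> 0.501, 100 -> 0.933: the cube nearly fills the space
```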



State the Conditional probability theory

  • A conditional probability is the probability of one event, given that another event has occurred: P(A | B) = P(A and B) / P(B), provided P(B) > 0
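A worked example (a hypothetical two-dice setup, computed by brute-force enumeration of the sample space):

```python
from fractions import Fraction

# All 36 equally likely outcomes of rolling two dice
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
B = [o for o in outcomes if o[0] == 3]          # event B: first die shows 3
A_and_B = [o for o in B if sum(o) == 8]         # ...and the sum is 8 (event A too)

# P(A | B) = P(A and B) / P(B)
p = Fraction(len(A_and_B), len(outcomes)) / Fraction(len(B), len(outcomes))
print(p)  # (1/36) / (6/36) = 1/6
```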


Describe the Laplace estimator

  • In practice, a word may never appear in the training data for a specific category and then appear in a new document, which makes its estimated probability 0; since naive Bayes multiplies the per-word probabilities, the entire calculation becomes 0

  • For example, if the count for word W3 is 0 for a category, every posterior computed with it collapses to 0, regardless of the other words

  • To avoid this, the Laplace estimator adds a small number to each of the counts in the frequency table, which ensures that each feature has a nonzero probability of occurring. The Laplace estimator is usually set to 1
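A minimal sketch of add-one (Laplace) smoothing; the vocabulary and counts below are made up for illustration:

```python
from collections import Counter

def word_probs(counts, vocab, alpha=1):
    # Add alpha to every count; the denominator grows by alpha * |vocab|
    # so the smoothed probabilities still sum to 1
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

vocab = ["free", "win", "meeting", "lottery"]
spam_counts = Counter({"free": 3, "win": 2, "lottery": 1})  # "meeting" never seen

probs = word_probs(spam_counts, vocab)
print(probs["meeting"])  # 0.1, not 0: the zero count no longer wipes out the product
```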


Discuss the importance of the Variable importance plot.

  • The variable importance plot is a valuable tool in statistical modeling and machine learning

  • It quantifies the impact of individual features on the model’s performance

  • It helps us understand which features contribute most to the model’s predictions

  • It is important for the following reasons:

  • Feature selection

  • Model interpretability

  • Model evaluation

  • Business insights

  • Feature engineering

  • Model comparison
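One common way to compute such importances is permutation importance: shuffle one feature at a time and measure how much the model's error grows. A minimal sketch with simulated data and a plain least-squares model (all data and names are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Feature 0 dominates, feature 1 matters a little, feature 2 is irrelevant
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
base_mse = np.mean((y - X @ coef) ** 2)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between feature j and y
    importance.append(np.mean((y - Xp @ coef) ** 2) - base_mse)

print(np.argmax(importance))  # 0: shuffling the dominant feature hurts the most
```

Plotting these values as a bar chart, sorted in decreasing order, gives the variable importance plot discussed above.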


Information value criteria of logistic regression

Information value(IV):

  • This is very useful in the preliminary filtering of variables prior to including them in the model

  • It helps eliminate the bulk of candidate variables in a first step, before fitting the model, since the number of variables present in the final model would be only about 10

Akaike Information Criteria(AIC):

  • This measures the relative quality of a statistical model for a given set of data

  • It trades off goodness of fit against model complexity (bias versus variance)

  • When comparing two models, the one with the lower AIC value is preferred
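The formula AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximized likelihood, makes the comparison concrete (the numbers below are hypothetical):

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical comparison: model B spends one extra parameter but fits better
aic_a = aic(k=3, log_likelihood=-120.0)
aic_b = aic(k=4, log_likelihood=-115.0)
print(aic_a, aic_b)                   # 246.0 238.0
print("B" if aic_b < aic_a else "A")  # the lower-AIC model is preferred
```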

