Dimensionality reduction: one way to deal with big datasets by decreasing their size.
Now, about the last image above, I ask: what might PCA component 1 and component 2 on the X and Y axes respectively actually be?
They are the eigenvectors of the covariance matrix, i.e. the directions along which the data varies most.
So, how do we decide which one is principal component 1 and which is principal component 2?
It is by variance: PC1 is the line through the data that captures the most variance, which is the same as the line minimizing the sum of squared Euclidean distances from the points (Euclidean meaning the shortest, straight-line distance); PC2 is the next such direction, orthogonal to PC1.
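Here is a minimal NumPy sketch of that ordering (the data is made up for illustration): the eigenvector of the covariance matrix with the largest eigenvalue is PC1, the next-largest is PC2.

```python
import numpy as np

# Hypothetical toy data: 100 samples, 2 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Center the data, then take the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigen-decompose: eigenvectors are the principal directions,
# eigenvalues are the variance captured along each direction.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort by eigenvalue, largest first: PC1 is the direction with
# the most variance, PC2 the next (orthogonal) direction.
order = np.argsort(eigvals)[::-1]
print("variance along PC1, PC2:", eigvals[order])
print("PC1 direction:", eigvecs[:, order[0]])
```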
Hopefully no math headache; code is below. Still, it is good to understand the math process.
So, here we see visually how PCA reduces data. NOTE: dimensionality reduction is not simply dropping columns, as I had thought; PCA builds new axes out of combinations of the original columns.
If that is so, why is only principal component 1 shown here, while in the stratum example two PCA components were built and plotted? Fix this loophole in my understanding; the n_components sketch below covers it.
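A minimal scikit-learn sketch (toy data, shapes assumed) showing that the number of components you see is simply the `n_components` you ask PCA for: two gives a 2-D scatter like the stratum plot, one gives only PC1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 5 original features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Ask for 2 components -> each sample becomes a (PC1, PC2) pair,
# which is what a 2-D PCA scatter plot draws.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (200, 2)

# Ask for 1 component -> only PC1, a 1-D view of the same data.
X_1d = PCA(n_components=1).fit_transform(X)
print(X_1d.shape)  # (200, 1)
```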
_PS: Heatmaps, t-SNE plots, and multi-dimensional scaling are alternatives to this dimensionality-reduction method.
(YouTube: Zach Star, Josh Starmer)_
Nothing to do with the above context; the posts below are about stratification. Stratification means breaking a population down into smaller subsets (samples) such that, based on certain criteria/features, each subset reflects roughly the same distribution of those criteria/features.
E.g., if I have more red balls (about 90%) and fewer green balls (about 10%), then in each smaller bowl I put red and green balls in a similar ratio; see the split sketch below. Stratification should be done when the population is very disproportionately distributed.
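A minimal sketch of stratified splitting with scikit-learn's `train_test_split` (the 90/10 ball data is made up to match the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 90% red (0), 10% green (1) balls.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # dummy feature, just ids

# stratify=y keeps the 90/10 ratio in both train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("train green ratio:", y_tr.mean())  # ~0.10
print("test  green ratio:", y_te.mean())  # ~0.10
```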
another context - xgboostA2Z
Data leakage in models fools you so badly that you are proud, about to yell that the model is 90% accurate, this and that;
but in fact that 90% accuracy was not because the ML accurately traced the underlying function, i.e. the equation mapping X to Y.
I will give a simple example. Let's say we are trying to predict whether Joe says yes or no when we ask him to eat something,
based on X:
- x1: are we asking him to eat pizza or rice?
- x2: are we asking him when he is alone or when he is in a group with other people?
- x3: what he has already answered, in True or False, to 'Will you devour some food?'
Then Y (Joe's yes/no answer to eating something) =
f('askingHimEatPizzaOrRice', 'AreWeAskingHimAloneOrGroup', 'HisAnsofWillYouDevourSomeFood')
If we train this equation on the training data's X and Y, then this ML model will give 100% accuracy on the test data.
WHY? Because whether he will eat or not is already pre-answered by whether he will devour some food. This type of leakage is leaking the correct prediction, i.e. the ground truth Y, into the X of the training data.
So, knowing whether a feature can be an X and not a Y, and pre-checking whether a candidate X is in fact just an aliased Y, is very important. This is part of feature engineering / feature selection; the sketch below shows the effect.
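A minimal sketch of that leakage (all data here is made up): x3 is literally an alias of Y, so the model looks perfect until the aliased feature is removed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
pizza_or_rice = rng.integers(0, 2, n)   # x1
alone_or_group = rng.integers(0, 2, n)  # x2
y = rng.integers(0, 2, n)               # Joe's actual yes/no
will_devour = y                         # x3: an alias of y -> leakage!

X = np.column_stack([pizza_or_rice, alone_or_group, will_devour])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

leaky = DecisionTreeClassifier().fit(X_tr, y_tr)
print("with leaky x3:", leaky.score(X_te, y_te))  # ~1.0

clean = DecisionTreeClassifier().fit(X_tr[:, :2], y_tr)
print("without x3:  ", clean.score(X_te, y_te))   # ~0.5, chance level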
Btw, our 2019 mistake, as far as I could infer, was leaking information from the future into the past; it was time-series data.
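A minimal sketch of avoiding that future-into-past leak, assuming rows are in time order: scikit-learn's `TimeSeriesSplit` only ever trains on the past and tests on the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series: row order = time order.
X = np.arange(100).reshape(-1, 1)

# TimeSeriesSplit never puts later rows in train and earlier
# rows in test, so the model cannot peek at the future.
for tr_idx, te_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to", tr_idx.max(),
          "-> test", te_idx.min(), "to", te_idx.max())
```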
Other leakage causes are:
- leaking test data into training data (btw, who would make such a silly mistake, haha); see the scaler sketch after this list
- any of the above-mentioned faults hiding in third-party data joined to the training set: an understandable mistake
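A minimal sketch of the first cause in a subtler form (toy data): fitting a preprocessor such as `StandardScaler` on all the data before splitting quietly leaks test-set statistics into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_tr, X_te = train_test_split(X, random_state=0)

# WRONG: fitting the scaler on all data lets test-set
# statistics leak into the training pipeline.
scaler_leaky = StandardScaler().fit(X)

# RIGHT: fit on train only, then apply the same transform
# to the held-out test set.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
```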
another context:
The algorithms above are all present in the scikit-learn library, which I stick to, so I don't care about alternative libraries like Keras, PyTorch, etc.
Everything is already written in the compiler itself anyway. Read it patiently, Vivek.