# Day 36

## Jul 26, 2017 · 701 words

## Normalization strikes again

I *thought* I had found the solution to my normalization problem yesterday, since the plot looked much better. Today I tried running the *whole* plot (with the residuals and everything), where I saw my first red flag - the residuals were not displaying properly. After some time debugging, I noticed that my regression’s `predict` function was returning different values for the same inputs depending on how many samples you passed in - not good!

When using the `normalize` parameter this behavior does not occur. It makes sense why the issue is happening right now - I’m normalizing along the columns, so as I add more columns (one for each input) they will all have to scale down accordingly. As to why I didn’t notice this before, it’s likely because I plot my regression line as 100 uniform samples, which is close to the 114 samples I used to create the model.
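A minimal sketch of how this kind of dependence can arise, assuming the scaling factor is computed from whichever batch is passed in. The toy arrays and the use of `normalize(..., axis=0)` are illustrative assumptions, not necessarily the exact setup in my pipeline:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two batches that share their first three inputs (made-up values).
x_small = np.array([[1.0], [2.0], [3.0]])
x_large = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# axis=0 scales the whole column to unit L2 norm, so the scale factor
# depends on every sample in the batch, not just the one being scaled.
small_scaled = normalize(x_small, axis=0)
large_scaled = normalize(x_large, axis=0)

print(small_scaled[0, 0])  # 1 / sqrt(14)
print(large_scaled[0, 0])  # 1 / sqrt(55)
```

The same input `1.0` ends up with a different scaled value in each batch, so any model downstream would see different features for it.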

To get to the bottom of this I went to the scikit-learn source code and started a scavenger hunt for how the `normalize` parameter is implemented. It led me to this `_preprocess_data` function in the parent `linear_model` class. Maybe I could manually call this function to perform the appropriate normalization procedure for the Huber regression?

I copied the Ridge regression’s call to `_preprocess_data`:

```
self._preprocess_data(
    X, y, self.fit_intercept, self.normalize, self.copy_X,
    sample_weight=sample_weight)
```

Then I replaced the references to `self` with the variable for my Ridge object and `X` and `y` with my actual data variables, and it spat out some tuples so I hope it worked. I then tried the same process with a Huber estimator, and at first it gave an error because the estimator has no `copy_X` attribute. So for the attributes that don’t exist for Huber estimators I went to the Ridge documentation and replaced them with Ridge’s default values (since I just want to replicate Ridge’s behavior). Ridge’s default for `copy_X` is `True`, which matches the `_preprocess_data` default, so I can actually just delete that argument altogether. Finally I forced `normalize=True`. It successfully returned the same tuple as the Ridge object did. Remember that this is just a preprocessing step, so there’s no reason for the two estimators to return different values yet.

The items it is returning are `X`, `y`, `X_offset`, `y_offset`, and `X_scale`. Unfortunately the Huber estimator doesn’t use those parameters, so I’d have to somehow implement them manually. This seems like far too complex a solution for what should be a simple problem. I’ve posted a help thread on a stats forum asking for a recommended solution so I don’t waste time trying to solve this on my own.
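For reference, here is a rough NumPy sketch of what those five values appear to be when both `fit_intercept` and `normalize` are true, and how they would be used to map a fit back to the original units. This is based on my reading of the scikit-learn source at the time, so treat the details as an assumption. The helper name `preprocess_like_normalize` is hypothetical, the data is synthetic, and plain least squares stands in for the estimator:

```python
import numpy as np

def preprocess_like_normalize(X, y):
    """Center X and y, then scale each centered X column to unit L2 norm."""
    X_offset = X.mean(axis=0)
    y_offset = y.mean()
    X_centered = X - X_offset
    X_scale = np.linalg.norm(X_centered, axis=0)
    X_scale[X_scale == 0] = 1.0  # leave constant columns unscaled
    return X_centered / X_scale, y - y_offset, X_offset, y_offset, X_scale

# Synthetic, noise-free data: y = 3x + 2
rng = np.random.default_rng(0)
X = rng.normal(size=(114, 1))
y = 3.0 * X[:, 0] + 2.0

Xp, yp, X_offset, y_offset, X_scale = preprocess_like_normalize(X, y)

# Fit on the preprocessed data without an intercept...
coef_scaled, *_ = np.linalg.lstsq(Xp, yp, rcond=None)

# ...then undo the scaling and recover the intercept in original units.
coef = coef_scaled / X_scale
intercept = y_offset - X_offset @ coef
print(coef, intercept)
```

The last two lines are the part a Huber estimator would not do for me, which is why this route felt like too much manual bookkeeping.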

## The real solution(?)

I’m hesitant now to claim to have the “solution”, but I got a response to my post that seems to mostly fix my issue. It comes down to how you define normalization. In some contexts it means scaling a vector to unit length (as the `Normalizer()` object and `normalize()` function do), but in this case it’s referring to scaling values to have unit variance and a mean of zero.
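The two meanings can be seen side by side in a small sketch (the toy matrix is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 12.0]])

# Normalizer rescales each row (sample) to unit L2 length.
row_unit = Normalizer().fit_transform(X)

# StandardScaler rescales each column (feature) to zero mean and unit variance.
col_std = StandardScaler().fit_transform(X)

print(np.linalg.norm(row_unit, axis=1))          # row lengths
print(col_std.mean(axis=0), col_std.std(axis=0))  # column means and stds
```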

Therefore I should be putting `StandardScaler()` in my pipeline rather than `Normalizer()`. It doesn’t *completely* replicate the output of the `normalize` parameter but it’s pretty close and can be adjusted by the estimator’s parameters.

Now I get consistent predictions regardless of how many items I predict on.
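A minimal sketch of that check, assuming a `StandardScaler` plus `HuberRegressor` pipeline. The data is synthetic and the sample counts only mirror the 114 and 100 mentioned earlier:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(114, 1))
y = 2.5 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=114)

# The scaler learns its mean and variance once, during fit.
model = make_pipeline(StandardScaler(), HuberRegressor(epsilon=1.35))
model.fit(X, y)

# Predict on 100 uniform samples, then on just the first five of them.
X_new = np.linspace(0, 10, 100).reshape(-1, 1)
pred_all = model.predict(X_new)
pred_few = model.predict(X_new[:5])

print(np.allclose(pred_all[:5], pred_few))
```

Because the scaling statistics are frozen at fit time, the size of the batch passed to `predict` no longer matters.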

Here are some plots comparing the Huber regression to Ridge. Note that they look more similar at a higher $\epsilon$ value.

And here’s how the residual distribution looks:

It may look similar to the Ridge plot, but if you throw in some outliers you’ll quickly see the difference.
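A quick sketch of that outlier experiment, on synthetic data with ten deliberately corrupted points (the data, the outlier size, and the default estimator settings are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(114, 1)), axis=0)
y_clean = 2.0 * X[:, 0] + 1.0
y = y_clean + rng.normal(scale=0.5, size=114)
y[-10:] += 50.0  # corrupt the ten points with the largest x

huber = make_pipeline(StandardScaler(), HuberRegressor()).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge()).fit(X, y)

# Measure how far each fit drifts from the true line on the clean points.
inliers = slice(0, 104)
huber_err = np.mean(np.abs(huber.predict(X[inliers]) - y_clean[inliers]))
ridge_err = np.mean(np.abs(ridge.predict(X[inliers]) - y_clean[inliers]))
print(f"Huber: {huber_err:.2f}  Ridge: {ridge_err:.2f}")
```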

#### Robust Scalers

As another follow-up to my post, I learned that there is a `RobustScaler` that, as the name implies, uses a more robust method for scaling based on interquartile ranges. I also could potentially make my own scaler using the scaled MAD as a unit of variance. From my brief testing, these would take some debugging to implement properly (so far they make the fit worse).
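A small sketch of both ideas on a toy column with one outlier. The 1.4826 factor is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data. The manual MAD scaling here is an illustration, not a finished scaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

# RobustScaler defaults: subtract the median, divide by the interquartile range.
iqr_scaled = RobustScaler().fit_transform(x)

# Manual alternative: subtract the median, divide by the scaled MAD.
median = np.median(x, axis=0)
mad = np.median(np.abs(x - median), axis=0)
mad_scaled = (x - median) / (1.4826 * mad)

print(iqr_scaled.ravel())
print(mad_scaled.ravel())
```

In both cases the outlier barely affects how the other four points are scaled, which is the appeal over a mean and variance based scaler.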

#### To do

Tomorrow I will be catching up with Anil at the library. I’ve made decent progress with picking an estimator; next on my agenda is to focus on cross validation.