I have trained a neural network with a multi-layer GRU in it. I did not use L1/L2 regularization, only gradient descent. I used the PyTorch implementation nn.GRU(1024, 1024, 4, 0.1). After training I checked the weight matrices and found some strange effects in hh_l0_in.


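For reference, this is roughly how I look at the per-gate statistics. The snippet below is only a minimal sketch: an untrained GRU of the same shape stands in for the trained model, and the r/z/n gate ordering of weight_hh_l0 is taken from the PyTorch docs.

```python
import torch.nn as nn

# Sketch only: a freshly initialized GRU with the same shape stands in
# for the trained model loaded elsewhere.
gru = nn.GRU(1024, 1024, num_layers=4, dropout=0.1)

hidden = gru.hidden_size
w_hh = gru.weight_hh_l0  # shape (3 * hidden, hidden), gate blocks stacked as r, z, n
blocks = {
    "hh_l0_ir": w_hh[0 * hidden:1 * hidden],  # reset gate
    "hh_l0_iz": w_hh[1 * hidden:2 * hidden],  # update gate
    "hh_l0_in": w_hh[2 * hidden:3 * hidden],  # new/candidate gate
}
for name, w in blocks.items():
    print(f"{name} avg: {w.mean().item():.6g} std: {w.std().item():.6g}")
```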
For the learnable hidden-hidden weights, the standard deviation of the secondary diagonal is a lot higher than in the rest of the matrix. The standard deviations of the weight matrices for the first layer:
- ih_l0_ir
- avg: 0.0005793942, std: 0.27631217
- avg: 0.008843938, std: 0.28335693
- avg: 2.674491e-05, std: 0.08714065
- avg: -0.00070823624, std: 0.087942585
- avg: 4.8848284e-05, std: 0.12259285, ske: 0.024917802381963283, kur: 7.325522493002966, hsk: 27.256551562628268, hta: 1860.4869170125573
- avg: 0.0023218004, std: 0.12732618, ske: 2.5480260834968385, kur: 16.072225736307036, hsk: 103.89389710637231, hta: 785.1604778707374
- avg: 0.0003275321, std: 0.30212235
- avg: 0.07846515, std: 2.5184803
- avg: -4.97445e-05, std: 0.2345184
- avg: -0.03676583, std: 2.0746834
- avg: -0.01897307, std: 0.707518
- avg: -19.582924, std: 1.5654982
The other layers show similar effects; however, the standard deviation of the output layer is different:
- hh_l3_inDiag
- avg: -18.329617, std: 8.657197
The hh_lX_in matrices also have an average of ~-20. Is this an effect of the input data, or is it normal for a GRU to have this larger standard deviation on the secondary diagonal? What causes this effect?
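For completeness, the Diag numbers are the statistics of the diagonal of the candidate-gate hidden-hidden block compared to its off-diagonal entries. A rough sketch of that computation (again only illustrative: an untrained GRU stands in for the trained model, the r/z/n ordering is assumed, and I take the main diagonal here):

```python
import torch
import torch.nn as nn

# Sketch only: replace this with the actual trained model.
gru = nn.GRU(1024, 1024, num_layers=4, dropout=0.1)

hidden = gru.hidden_size
w_n = gru.weight_hh_l0[2 * hidden:3 * hidden]  # candidate-gate block, shape (hidden, hidden)

# Main diagonal; for the anti-diagonal, flip first with torch.flip(w_n, dims=[1]).
diag = torch.diagonal(w_n)
off_diag = w_n[~torch.eye(hidden, dtype=torch.bool)]

print(f"hh_l0_inDiag avg: {diag.mean().item():.6g} std: {diag.std().item():.6g}")
print(f"off-diagonal avg: {off_diag.mean().item():.6g} std: {off_diag.std().item():.6g}")
```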
