Training aborts during an epoch: "Sizes of tensors must match except in dimension 1. Expected size 1 but got size 28 for tensor number 2 in the list" #1188

Open
linxy-1992 opened this issue Aug 21, 2024 · 0 comments


linxy-1992 commented Aug 21, 2024

In some cases, when training a torch model, training aborts suddenly after a few epochs with a tensor dimension mismatch error, and it is not clear what causes it. The problem occurs on both CPU and GPU.
While working with the package, I found that avoiding repeated assignments to the same variable name when writing a module reduces how often this happens.
For example:
library(torch)

convnext_block <- nn_module(
  initialize = function(dim, dropout_p = 0.1, layer_scale_init_value = 1e-6) {
    self$conv <- nn_conv2d(dim, dim, kernel_size = 7, padding = 3, groups = dim)  # depthwise conv
    self$ln <- nn_layer_norm(dim)
    self$linear1 <- nn_linear(dim, dim * 4)
    self$gelu <- nn_gelu()
    self$linear2 <- nn_linear(dim * 4, dim)
    # self$gamma <- nn_parameter(layer_scale_init_value * torch_ones(1, 1, 1, dim))  # layer scale (disabled)
    self$dropout <- nn_dropout(dropout_p)
  },
  forward = function(xcxb1) {
    xcxbresid <- xcxb1
    xcxb2 <- self$conv(xcxb1)
    xcxb3 <- xcxb2$permute(c(1, 3, 4, 2))  # NCHW -> NHWC so layer norm acts on channels
    xcxb4 <- self$ln(xcxb3)
    xcxb5 <- self$linear1(xcxb4)
    xcxb6 <- self$gelu(xcxb5)
    xcxb7 <- self$linear2(xcxb6)
    # xcxb7 <- xcxb7 * self$gamma  # layer scale (disabled)
    xcxb8 <- xcxb7$permute(c(1, 4, 2, 3))  # NHWC -> NCHW
    torch_add(self$dropout(xcxb8), xcxbresid)  # residual connection
  }
)
n <- convnext_block(64)
n
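As a quick sanity check, the block should preserve the input shape (the 28x28 spatial size here is arbitrary, chosen only for illustration):

x <- torch_randn(1, 64, 28, 28)  # NCHW input; 28x28 is arbitrary
y <- n(x)
y$shape                          # 1 64 28 28 -- shape is preserved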
In ordinary programming it would be fine to reuse a single name such as 'xcxb', but when the neural-network layers are stacked, reusing the same name leads to a tensor dimension error during training; a sketch of that reused-name style follows below. Using distinct names such as xcxb1, xcxb2, xcxb3, ... reduces the errors to a large extent, but does not eliminate them entirely.
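For contrast, here is a minimal sketch of the reused-name style described above, i.e. the forward method of the same block rewritten to recycle one variable (illustrative only; it would replace the forward shown earlier):

forward = function(xcxb) {
  resid <- xcxb
  xcxb <- self$conv(xcxb)
  xcxb <- xcxb$permute(c(1, 3, 4, 2))  # NCHW -> NHWC
  xcxb <- self$ln(xcxb)
  xcxb <- self$linear1(xcxb)
  xcxb <- self$gelu(xcxb)
  xcxb <- self$linear2(xcxb)
  xcxb <- xcxb$permute(c(1, 4, 2, 3))  # NHWC -> NCHW
  torch_add(self$dropout(xcxb), resid)
}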
Finally, I found that the aborts can be avoided by calling rm(), gc(), and cuda_empty_cache() during training, as follows:
model <- model$to(device = device)
optimizer <- optim_adam(model$parameters, lr = 0.001)  # create the optimizer once, outside the loop, so Adam's state is kept across epochs
for (epoch in 1:100) {
  model$train()  # set to training mode
  coro::loop(for (b in ministdlta) {
    optimizer$zero_grad()
    output <- model(b[[1]]$to(device = device))
    loss <- nnf_multilabel_soft_margin_loss(output, b[[2]]$to(device = device))
    loss$backward()
    optimizer$step()
  })
  rm(list = c("b", "output", "loss"))  # drop R-side references to the last batch
  gc()                                 # let R collect the dropped tensors
  cuda_empty_cache()                   # return cached GPU memory to the allocator
}
This will largely avoid tensor dimension errors during training.
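A small refinement, sketched under the assumption that the cache flush only matters on GPU (cuda_is_available() is part of the torch package), keeps the same loop runnable on CPU-only machines:

gc()  # always collect dropped R references
if (cuda_is_available()) {
  cuda_empty_cache()  # flush the CUDA caching allocator only when a GPU is present
}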
I suspect this problem arises because the underlying data-handling code of the torch package has a bug involving the physical addresses of the data.

Translated with DeepL.com (free version)
