In the U-Net architecture, what is the main purpose of the skip connections between the encoder and decoder parts?
Think about how the network keeps details from the input image while reconstructing the output.
Skip connections in U-Net pass detailed spatial information from the encoder to the decoder. This helps the decoder recover fine details lost during downsampling, improving segmentation results.
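A minimal sketch of one skip connection, assuming (for illustration) 64-channel encoder features at 64x64 and 128-channel bottleneck features at 32x32; the decoder upsamples and then concatenates the saved encoder feature map along the channel axis:

```python
import torch
import torch.nn as nn

# Saved encoder feature map (before pooling) and coarse decoder input.
encoder_features = torch.randn(1, 64, 64, 64)
decoder_input = torch.randn(1, 128, 32, 32)

# Transposed convolution doubles the spatial size: 32x32 -> 64x64.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
upsampled = up(decoder_input)  # shape (1, 64, 64, 64)

# The skip connection: concatenate along the channel dimension,
# reintroducing fine spatial detail from the encoder.
merged = torch.cat([upsampled, encoder_features], dim=1)  # (1, 128, 64, 64)
```

The concatenated tensor then passes through the decoder block's convolutions, which can combine coarse semantic features with fine spatial detail.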
Given an input tensor of shape (batch_size=1, channels=3, height=128, width=128), what is the output shape after one encoder block in a U-Net that applies two 3x3 convolutions (padding=1, stride=1) followed by a 2x2 max pooling (stride=2)?
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU()
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.conv(x)       # spatial size preserved (padding=1)
        p = self.pool(x)       # height and width halved
        return x, p            # x is kept for the skip connection

x = torch.randn(1, 3, 128, 128)
block = EncoderBlock(3, 64)
features, pooled = block(x)
output_shape = pooled.shape
Remember that padding keeps spatial size after convolution, and max pooling halves height and width.
Each 3x3 convolution with padding=1 preserves the 128x128 spatial size. The 2x2 max pooling with stride 2 then halves both height and width to 64x64, and the block outputs the 64 channels set in its constructor, so the pooled output shape is (1, 64, 64, 64).
For a U-Net model designed to perform binary segmentation (classifying each pixel as foreground or background), which activation function is most appropriate to use in the final layer?
Consider the output as a probability for each pixel belonging to the foreground class.
Sigmoid activation outputs a value between 0 and 1 for each pixel independently, which can be interpreted as the probability of that pixel belonging to the foreground class. Softmax is intended for mutually exclusive multi-class outputs, while ReLU and Tanh do not produce valid probabilities.
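A short sketch of such a final layer, assuming (hypothetically) 64 channels coming out of the last decoder block; a 1x1 convolution produces one logit per pixel, and sigmoid maps it to a probability:

```python
import torch
import torch.nn as nn

# Assumed: last decoder block outputs 64 channels at full resolution.
final_conv = nn.Conv2d(64, 1, kernel_size=1)   # one logit per pixel
features = torch.randn(1, 64, 128, 128)

probs = torch.sigmoid(final_conv(features))    # values in (0, 1)
mask = (probs > 0.5).float()                   # threshold to a binary mask
```

In practice, training often uses `nn.BCEWithLogitsLoss` on the raw logits for numerical stability, applying sigmoid only at inference time.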
Which of the following formulas correctly computes the Dice coefficient for evaluating the overlap between predicted and ground truth segmentation masks?
Dice coefficient measures similarity by doubling the intersection over the sum of sizes.
The Dice coefficient is Dice(A, B) = 2|A ∩ B| / (|A| + |B|): twice the size of the intersection divided by the sum of the sizes of the predicted and ground truth sets, measuring the overlap between the two masks. It equals 1 for perfect overlap and 0 for no overlap.
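The formula can be sketched directly on binary mask tensors (the small `eps` term, an assumption here, guards against division by zero when both masks are empty):

```python
import torch

def dice_coefficient(pred, target, eps=1e-7):
    """2 * |A ∩ B| / (|A| + |B|) for binary masks of the same shape."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = torch.tensor([[1, 1, 0], [0, 1, 0]])
target = torch.tensor([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(pred, target)  # intersection 2, sizes 3 and 3 -> 2*2/6 ≈ 0.667
```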
In a U-Net decoder block, a concatenation of the upsampled feature map and the corresponding encoder feature map fails with a dimension mismatch error. Given that the encoder feature map has shape (batch_size, 64, 64, 64) and the upsampled decoder feature map has shape (batch_size, 64, 65, 65), what is the most likely cause?
Check how upsampling changes height and width compared to encoder features.
Upsampling sometimes produces spatial sizes off by one pixel due to rounding. This causes mismatch when concatenating with encoder features of fixed size. Adjusting upsampling or cropping fixes this.
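Both fixes can be sketched with the shapes from the question (encoder features 64x64, upsampled decoder features 65x65); this is an illustrative snippet, not a full decoder block:

```python
import torch
import torch.nn.functional as F

enc = torch.randn(1, 64, 64, 64)   # encoder feature map
dec = torch.randn(1, 64, 65, 65)   # upsampled decoder map, off by one

# Fix 1: interpolate the decoder map to match the encoder size exactly.
dec_resized = F.interpolate(dec, size=enc.shape[2:],
                            mode="bilinear", align_corners=False)
merged = torch.cat([dec_resized, enc], dim=1)  # (1, 128, 64, 64)

# Fix 2 (as in the original U-Net paper): center-crop the larger map.
dh = dec.shape[2] - enc.shape[2]
dw = dec.shape[3] - enc.shape[3]
dec_cropped = dec[:, :,
                  dh // 2 : dh // 2 + enc.shape[2],
                  dw // 2 : dw // 2 + enc.shape[3]]
merged2 = torch.cat([dec_cropped, enc], dim=1)  # (1, 128, 64, 64)
```

Choosing input sizes divisible by 2 at every downsampling level (e.g. powers of two like 128 or 256) avoids the mismatch in the first place.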