ChatGPT와 함께 원 샷 러닝 배워보기 - 2 (restnet과 데이터 수집 방법)

seaofwest 2023. 2. 13. 12:59

이전에 pre-trained된 CNN을 사용하기로 결론을 냈고, 기존에 비슷한 구현을 했던 게시글에서 resnet-101을 썼다고 했기 때문에 우선 resnet-101 쪽으로 방향을 잡고 이어가 보았다.

- Can you give more details about resnet?

ResNet (Residual Network) is a popular deep learning architecture that was introduced in 2015 by Microsoft Research. It was designed to tackle the problem of vanishing gradients in deep neural networks, which can make it difficult to train very deep networks.

The key innovation in ResNet is the use of residual connections, which allow information to bypass one or more layers of the network and be directly passed to the output. The residual connections make it possible to train very deep networks because they allow the gradients to flow more easily through the network during backpropagation, avoiding the vanishing gradients problem.

A ResNet architecture consists of multiple residual blocks, each of which contains several convolutional and activation layers. In each residual block, the input to the block is added to the output of the block, creating a residual connection. This allows the network to learn the residual function, which represents the difference between the input and output of the block.

The ResNet architecture also includes batch normalization and skip connections, which help to improve the stability and accuracy of the network. Batch normalization helps to reduce the internal covariate shift, which occurs when the distribution of activations changes during training, by normalizing the activations in each layer. Skip connections allow information to bypass one or more layers, which can help to reduce the vanishing gradients problem and improve the network's performance.

ResNet has been very successful in a wide range of computer vision tasks, such as image classification, object detection, and semantic segmentation. The architecture is widely used as a baseline for new research, and many variations of ResNet have been proposed to further improve its performance.

- What is resnet-101

ResNet-101 is a specific variation of the ResNet (Residual Network) architecture. The number "101" in the name refers to the number of layers in the network. ResNet-101 has 101 layers, including the input and output layers, making it one of the deeper ResNet architectures.

ResNet-101 was introduced in the original ResNet paper and was trained on the ImageNet dataset, which contains over a million labeled images. The network achieved state-of-the-art results on the ImageNet classification task and demonstrated the ability of ResNets to effectively learn deep representations.

Like other ResNet architectures, ResNet-101 uses residual connections to tackle the problem of vanishing gradients and make it possible to train very deep networks. The residual connections allow information to bypass one or more layers of the network, making it possible to train networks with many layers without sacrificing accuracy.

ResNet-101 is widely used as a pre-trained model in computer vision tasks, such as transfer learning, fine-tuning, and feature extraction. The pre-trained weights can be used to initialize the network and provide a good starting point for further training on new datasets or for fine-tuning for specific tasks.

뭐라뭐라뭐라 하는데 이해할 수가 없었다. 추가적으로 구글링을 통해서 resnet에 대해서 찾아보았지만, 그렇구나~ 하고 넘기긴 해도 확실히 이해는 못한 것 같다. 일단 뭐든 성능이 좋다고 하니까 일단 써보면 나중에 이해할 수 있지 않을까?

CNN파트는 이것으로 넘기고 다음 내용으로 resnet을 통해 도출된 2가지 값들을 비교해서 유사도를 반환하는 부분이다.

- What loss function should I use in siamese network?

In a Siamese Neural Network, the most commonly used loss function is the contrastive loss function. The contrastive loss function is used to train the network to compare two input images and predict whether they are similar or dissimilar.

The contrastive loss function is defined as the following:

L = (1 - y) * d^2 + y * max(0, margin - d)^2

where d is the Euclidean distance between the features extracted from the two images, y is the label indicating whether the images are similar (y = 1) or dissimilar (y = 0), and margin is a hyperparameter that controls the decision boundary between similar and dissimilar images.

The contrastive loss function tries to minimize the distance between similar images and maximize the distance between dissimilar images. The decision boundary is controlled by the margin hyperparameter, which determines the minimum distance between dissimilar images. The contrastive loss function is designed to be differentiable, which allows the network to be trained end-to-end using gradient descent optimization.

In practice, the contrastive loss function is often used to train the CNN component of the Siamese Network, while the fully connected component is trained using a cross-entropy loss function. The combination of the two loss functions allows the network to learn both the image similarity and the class label predictions.

학습에 필요한 loss function을 물어보았다. 구글링을 통해 주로 Constrative loss function을 쓴다는 것은 알고 있었기 때문에 해당 식을 이해하는 데에 집중해보았다.

- In one shot learning, there is usually one image per one class, right?

답변 생략

- Then, if I have one image per class, how can I get a pair of images that are similar for training?

In one-shot learning, it is typical to use a few images from each class for training, not just one image per class. During training, you can form pairs of similar images by taking two images from the same class, and pairs of dissimilar images by taking one image from each of two different classes.

For example, let's say you have N classes and M images per class. You can form N * (M-1) / 2 pairs of similar images, and N * (N-1) / 2 pairs of dissimilar images. The total number of image pairs is N * M * (N-1) / 2.

Each pair of images is passed through the Siamese Neural Network, and the network outputs a similarity score. The similarity score is compared to the ground truth label (either 1 for similar images or 0 for dissimilar images), and the error is backpropagated through the network to update the weights.

In this way, the network is trained on a large number of image pairs, and learns to recognize similar and dissimilar images based on their feature representations. During testing, you can evaluate the network's performance by comparing the similarity score for a pair of test images to the ground truth label.

동일한 클래스의 이미지의 경우에는 1의 레이블을 주라고 했지만, 원샷 러닝에서는 클래스의 이미지가 1개 밖에 없어서 이것을 어떻게 물어봤지만, 클래스에 M개의 이미지가 있는 경우를 가정하고 이상한 답을 내놨다.

- How about I only have one image per class? I need one shot learning, not few shot learning

In the case where you only have one image per class, you would still use the Siamese Neural Network architecture for one-shot learning. However, you would need to use a different approach for generating training data.

One common approach is to augment the single image for each class to generate multiple variations of the same image. For example, you could apply random rotations, translations, scaling, and other transformations to each image to generate multiple versions of the same image. Then, you can form pairs of similar images by taking two augmented versions of the same image, and pairs of dissimilar images by taking one augmented version of one image and one augmented version of a different image.

Another approach is to use a pre-trained feature extractor to extract features from each image, and then compare the features instead of the images themselves. The feature extractor could be a Convolutional Neural Network (CNN), for example. During training, you would form pairs of similar features by taking two features from the same image, and pairs of dissimilar features by taking one feature from each of two different images.

Regardless of the approach, the goal is still to form a large number of image pairs, and train the Siamese Neural Network to recognize similar and dissimilar images based on their feature representations. During testing, you would evaluate the network's performance by comparing the similarity score for a pair of test images to the ground truth label.

원 샷 러닝에 대해서 답을 달라고 하니 드디어 2가지 해결책을 제시해 줬다. 첫째는 이미지 하나로 여러 가지 이미지들을 만들어 내는 image augmentation이고, 둘째는 pre-trained된 extractor를 사용하여 이미지 대신 비교한다는 것이다. 나의 경우에는 image augmentation이 적절한 방법인 것 같아서 이에 대해서 더 알아보았다.

- How can I make augmented image easily?

There are several ways to easily augment images in order to generate multiple variations of the same image. Here are a few commonly used techniques:

Random Rotations: You can rotate the image by a random angle, which will introduce variations in the orientation of the image.
Random Translations: You can translate the image by a random amount in the x and y direction, which will introduce variations in the position of the image.
Random Scaling: You can scale the image by a random factor, which will introduce variations in the size of the image.
Flipping: You can flip the image horizontally or vertically, which will introduce variations in the reflection of the image.
Color Jittering: You can add random noise to the image, which will introduce variations in the color of the image.

There are many tools and libraries that make it easy to augment images, including the Python libraries OpenCV and Pillow, as well as libraries for other programming languages like MATLAB and R.

When augmenting images, it is important to make sure that the augmentations are random and varied enough to create a diverse set of images that can effectively train the model. You may also need to experiment with different levels of augmentation in order to find the best balance between augmenting enough to provide diverse training data, but not so much that it creates unrealistic or unrepresentative images.

여러 번의 질문을 통해서 샴 네트워크를 구현하기 위해서는 1. image augmentation을 사용하여 클래스 마다 1개 밖에 없는 이미지만으로 데이터 셋을 생성하고, 이를 2. 기존에 pre-trained된 CNN인 resnet-101을 사용하여 벡터로 변환한 후 3. contrasitive loss fuunction을 loss function으로 사용하여 학습을 진행해야한다는 결론을 내게 되었다.

이후에 이런 저런 개선 방향들이 많긴 하지만, 우선 이 3가지에 집중하여 코드를 만들어보겠다.