Using the new ViT from TF Hub 🎉

I just tried out the new Vision Transformer (ViT) on TensorFlow Hub by adding two lines of code in the Custom component. Below is the full code, but the only lines I added were:

import tensorflow_hub as hub

and

input_ = hub.KerasLayer("https://tfhub.dev/sayakpaul/vit_s16_classification/1")(input_)

NOTE: I had to install tensorflow_hub (pip install tensorflow_hub) in the same environment where PerceptiLabs is installed.

import tensorflow_hub as hub


class LayerCustom_LayerCustom_1Keras(tf.keras.layers.Layer, PerceptiLabsVisualizer):
    def call(self, inputs, training=True):
        """ Runs the input tensor through the pre-trained ViT model from TF Hub """
        input_ = inputs['input']

        # Apply the ViT-S/16 classification model from TensorFlow Hub
        input_ = hub.KerasLayer("https://tfhub.dev/sayakpaul/vit_s16_classification/1")(input_)

        output = preview = input_

        self._outputs = {
            'output': output,
            'preview': output,
        }

        return self._outputs

    def get_config(self):
        """Any variables belonging to this layer that should be rendered in the frontend.

        Returns:
            A dictionary with tensor names for keys and picklable values.
        """
        return {}

    @property
    def visualized_trainables(self):
        """ Returns two tf.Variables (weights, biases) to be visualized in the frontend """
        return tf.constant(0), tf.constant(0)


class LayerCustom_LayerCustom_1(Tf2xLayer):
    def __init__(self):
        super().__init__(
            keras_class=LayerCustom_LayerCustom_1Keras
        )
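If you want to sanity-check the Hub layer outside of PerceptiLabs first, here's a minimal standalone Keras sketch (this is not part of the component code above; the 224x224 input size and the preprocessing are assumptions, so check the expected input range on the model's TF Hub page before training on your own data):

import tensorflow as tf
import tensorflow_hub as hub

# Assumed input size for ViT-S/16: 224x224 RGB images
inputs = tf.keras.Input(shape=(224, 224, 3))

# Same Hub handle as in the Custom component
vit = hub.KerasLayer("https://tfhub.dev/sayakpaul/vit_s16_classification/1")
outputs = vit(inputs)  # logits from the pre-trained classification head

model = tf.keras.Model(inputs, outputs)
model.summary()

# Quick smoke test with a random image batch
dummy = tf.random.uniform((1, 224, 224, 3))
print(model(dummy).shape)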

Just to give this a little more context, here's the Papers with Code page for the Vision Transformer.

TL;DR: here's the abstract of the original paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale:

While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very well on image classification tasks. When pre-trained on large amounts of
data and transferred to multiple mid-sized or small image recognition benchmarks
(ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent
results compared to state-of-the-art convolutional networks while requiring substantially
fewer computational resources to train.

(Emphasis added)
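To make the "sequences of image patches" idea concrete, here's a small sketch (mine, not from the paper or the post) that splits a 224x224 image into 16x16 patches with tf.image.extract_patches; the resulting sequence of flattened patches is what a ViT then embeds and feeds to the Transformer:

import tensorflow as tf

image = tf.random.uniform((1, 224, 224, 3))  # one dummy RGB image

# Extract non-overlapping 16x16 patches -> shape (1, 14, 14, 16*16*3)
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)

# Flatten into a sequence of patch tokens -> shape (1, 196, 768)
patches = tf.reshape(patches, (1, 14 * 14, 16 * 16 * 3))
print(patches.shape)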
