How to use a pre-trained model from HuggingFace
Demo with a pre-trained image to text model
HuggingFace is getting quite popular these days. It has been called the GitHub of AI/ML. It has plenty of ready-to-use models available on its website, and if you have a useful model of your own, you should consider uploading it there too.
In this blog, we are going to explore HuggingFace a little bit, pick a model, and use it in our app.
We are going to choose an image-to-text model and build a React app that takes an image as input, sends it to the model, and gets back a sentence that describes the image.
You can find the code for the app in this repo.
Are you ready? Let's get started.
What is HuggingFace?
HuggingFace is an open-source community for all things machine learning, best known for the large language models that are available for everyone to use.
It is a growing repository of ML models built by the open-source community. The models are hosted by HuggingFace, which provides several ways to access them.
Since all the models are hosted for public use, we can pick any model we need depending on our use case and plug it into our application.
Exploring models
Since the website hosts a lot of models, let's see how to choose a good one for our use case.
On the HuggingFace website's top navigation bar, click on 'Models'.
Since we are interested in an image-to-text model, select the chip that says 'Image-To-Text'.
The right panel now shows only Image-To-Text models. Feel free to click through a few of them to see how they differ and which one suits you best.
What I did was choose one of the models with a high download count and plenty of likes.
As you can see in the image above, there are models from Microsoft, Salesforce, Keras-io, etc. I picked Salesforce/blip-image-captioning-large, which has over a million downloads.
When you click on a model, you can see a description of how it was trained, the dataset that was used, and the Spaces that are already using it.
You can also browse the model's code files in the 'Files and versions' tab.
Okay, we have chosen our model, let's see how we can use it.
Using the model
HuggingFace offers several ways to use a model. On the model page, we can see the different options when we click the Deploy button.
The 'Inference API' method is what we are going to use today, as it is the fastest way to build a prototype or a small app. It does not require any new hardware; the model runs on HuggingFace's own infrastructure.
As you can see, the model is already exposed as an API endpoint that accepts an image and responds with text. This way, we don't need our own server backend to host the model.
We are going to build a simple React app that takes the user's input image, sends it to the HuggingFace server using this API, gets text back, and shows the text to the user.
There is one piece we need before we can call the API: the authorization token that goes in the headers of the request. I'll show you how to get the token in the next section.
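To make the request shape concrete, here is a minimal sketch of calling the endpoint directly. The URL follows HuggingFace's https://api-inference.huggingface.co/models/<model-id> pattern for the model we picked; the captionImage helper and its parameters are my own names for illustration, not official sample code:

// Minimal sketch: send a raw image blob to the Inference API and read the caption back.
const MODEL_URL = 'https://api-inference.huggingface.co/models/Salesforce/blip-image-captioning-large';

async function captionImage(imageBlob, token) {
  const response = await fetch(MODEL_URL, {
    method: 'POST',
    headers: {
      Accept: 'application/json',
      Authorization: `Bearer ${token}`, // the access token from the next section
      'Content-Type': imageBlob.type,
    },
    body: imageBlob,
  });
  // The API responds with an array like [{ generated_text: '...' }].
  const data = await response.json();
  return data[0]?.generated_text;
}

The React code later in this post does the same thing with axios; the headers and the response shape are identical.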
I'd also recommend trying the 'Spaces - Deploy as a Gradio app in one click' option, which deploys a simple app interface where you can upload an image and get the caption back.
Getting the token
On the Inference API pop-up, click on the token dropdown and click on 'New access token'.
On the Access Tokens page, click on 'New Token'.
On the create token pop-up, give a name and click on 'Generate a token'.
Once the token is created, click the copy button and save your token somewhere safe.
Building an app interface
For this demo, I'm going to assume that you are familiar with basic React app development, so I'm going to directly jump into code.
In this app prototype, we are going to provide the user with three ways to interact with the model:
Upload an image from the computer and get an image caption.
Open the camera and upload a live image to get a caption.
Explore a gallery of preloaded images and get captions on click.
Let's create a new app using Next.js (a React framework) with the following command:
npx create-next-app
Once the basic template is ready, let's create a new component for uploading an image from the computer.
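Before the JSX, here is a rough skeleton of what the component needs: the axios import (install it with npm install axios, or swap in fetch), the API constants, and the state hooks. The names API_URL and API_KEY, and reading the token from an environment variable, are my own choices for this sketch rather than something the template generates:

import { useState } from 'react';
import axios from 'axios';

// Inference API endpoint for the model we picked earlier.
const API_URL = 'https://api-inference.huggingface.co/models/Salesforce/blip-image-captioning-large';
// Assumption: the access token lives in an environment variable instead of being hard-coded.
const API_KEY = process.env.NEXT_PUBLIC_HF_API_KEY;

export default function UploadImage() {
  const [selectedFile, setSelectedFile] = useState(null);
  const [caption, setCaption] = useState('');
  const [isLoading, setIsLoading] = useState(false);
  // ...the handlers from the next snippets go here, followed by the JSX below
}

Keep in mind that anything prefixed with NEXT_PUBLIC_ ends up in the client bundle, so this is fine for a demo but you wouldn't want to ship a real token to production this way.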
return (
  <div className='upload'>
    <h1>1. Upload Image</h1>
    <h4>Upload image file and get explanation</h4>
    <input type="file" accept="image/*" onChange={handleFileChange} />
    <button onClick={handleUpload}>Get Explanation</button>
    <div className='explanation-container'>
      {isLoading
        ? (<div className="loader" />)
        : caption && <p>Explanation: {caption}</p>
      }
    </div>
  </div>
);
Create a new function that handles the uploaded image. In this function, we call the model's endpoint with the image and the token.
const handleFileChange = (event) => {
  setSelectedFile(event.target.files[0]);
};

const handleUpload = async () => {
  if (!selectedFile) {
    return;
  }
  try {
    setIsLoading(true);
    // POST the raw image file to the Inference API; the response is
    // an array like [{ generated_text: '...' }].
    const response = await axios.post(API_URL, selectedFile, {
      headers: {
        Accept: 'application/json',
        Authorization: `Bearer ${API_KEY}`,
        'Content-Type': selectedFile.type
      },
    });
    setCaption(response.data[0]['generated_text']);
  } catch (error) {
    console.error('Error uploading file:', error);
  } finally {
    setIsLoading(false);
  }
};
That's it, our first use case is ready. Let's build the camera option next.
Create a new div that shows the camera buttons.
<div className='camera-options'>
  <h1>2. Live Explanation</h1>
  <h4>Capture image from camera and get explanation</h4>
  <button onClick={startCamera}>Open Camera options</button>
  {cameraActive && (
    <div className="camera-container">
      <video autoPlay playsInline></video>
      <br />
      <button onClick={handleCapture}>Start Camera</button>
      <button onClick={handleUpload}>Upload</button>
      <button onClick={stopCamera}>Stop Camera</button>
    </div>
  )}
</div>
Create functions to handle opening and closing the camera.
const [responseText, setResponseText] = useState('');
const [isLoading, setIsLoading] = useState(false);
const [cameraActive, setCameraActive] = useState(false);

const startCamera = () => {
  setCameraActive(true);
};

const stopCamera = () => {
  setCameraActive(false);
  const videoElement = document.querySelector('video');
  if (videoElement.srcObject) {
    // Stop every track on the stream so the camera light turns off.
    const stream = videoElement.srcObject;
    const tracks = stream.getTracks();
    tracks.forEach(track => track.stop());
    videoElement.srcObject = null;
  }
};
Next, create a function that starts the camera stream, and another that captures a frame and sends it to the model to get the caption.
const handleCapture = async () => {
  if (!cameraActive) return;
  try {
    // Ask the browser for camera access and stream it into the <video> element.
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const videoElement = document.querySelector('video');
    videoElement.srcObject = stream;
  } catch (error) {
    console.error('Error accessing camera:', error);
  }
};

const handleUpload = async () => {
  if (!cameraActive) return;
  try {
    // Draw the current video frame onto a canvas so it can be exported as an image blob.
    const videoElement = document.querySelector('video');
    const canvas = document.createElement('canvas');
    canvas.width = videoElement.videoWidth;
    canvas.height = videoElement.videoHeight;
    canvas.getContext('2d').drawImage(videoElement, 0, 0, canvas.width, canvas.height);
    canvas.toBlob(async (blob) => {
      try {
        setIsLoading(true);
        const response = await axios.post(API_URL, blob, {
          headers: {
            Accept: 'application/json',
            Authorization: `Bearer ${API_KEY}`,
            'Content-Type': blob.type
          },
        });
        setResponseText(response.data[0]['generated_text']);
      } catch (error) {
        console.error('Error capturing and uploading image:', error);
      } finally {
        // Reset the loader whether the request succeeded or failed.
        setIsLoading(false);
      }
    }, 'image/jpeg');
  } catch (error) {
    console.error('Error capturing image:', error);
  }
};
All done, we already have two options for the user: one that uploads an image from the computer and another that uses the live camera.
The third option is a bonus, but let's build it anyway.
I grabbed some free stock images from unsplash.com and saved them in the public folder of the app.
Let's create a component that displays this gallery of images.
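The component keeps the filenames of those stock images in a simple array and reuses the same responseText and isLoading state pattern as the camera section. The filenames below are placeholders for whatever you actually saved, so treat this as a sketch:

// Assumed filenames for the stock images saved under public/images/.
const imagePaths = ['beach.jpg', 'city.jpg', 'mountains.jpg'];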
<h1>3. Image Gallery</h1>
<h4>Click on any image to get explanation</h4>
<div>
  {isLoading
    ? (<div className="loader" />)
    : responseText && <p>Explanation: {responseText}</p>
  }
</div>
<div className="image-gallery">
  {imagePaths.map((imagePath, index) => (
    <img
      key={index}
      src={`./images/${imagePath}`}
      alt={`${index + 1}`}
      className="gallery-image"
      onClick={() => handleImageClick(imagePath)}
    />
  ))}
</div>
On clicking any image, let's call the model with the image as input and get a caption.
const handleImageClick = async (imagePath) => {
  try {
    // Abort the request if it takes too long (API_TIMEOUT is a timeout in milliseconds).
    const controller = new AbortController();
    const imageBlob = await fetch(`./images/${imagePath}`).then((response) => response.blob());
    const timeoutId = setTimeout(() => {
      controller.abort();
      setResponseText('Request timed out. Please try again later.');
    }, API_TIMEOUT);
    setIsLoading(true);
    const response = await axios.post(API_URL, imageBlob, {
      headers: {
        Accept: 'application/json',
        Authorization: `Bearer ${API_KEY}`,
        'Content-Type': imageBlob.type
      },
      // Pass the abort signal inside the axios config so the timeout can cancel the request.
      signal: controller.signal,
    });
    clearTimeout(timeoutId);
    setResponseText(response.data[0]['generated_text']);
  } catch (error) {
    console.error('Error uploading image:', error);
  } finally {
    setIsLoading(false);
  }
};
That'll do it. We now have three ways to use the model.
Let's see how it all looks together.
By the way, I hosted this React app on my Google Cloud account so you can play with it yourself.
Here is the public link to the app.
Conclusion
So what do you think? We just learned how to use a pre-trained model from HuggingFace and built a real-life app out of it.
I think this is what makes HuggingFace valuable. Imagine having to host your own model on a server: keeping it up and running would take a good amount of time and money.
So if you have a production-ready model, perhaps you should consider hosting it on HuggingFace.
That's all for today. Thank you for reading and I hope you found this blog helpful. Please do leave a like to show some encouragement.
See you at the next one, cheers!
Uday