Working with Kaggle notebooks is super easy, thanks to the beautifully designed layout and notebook options. It is a great free tool for every data scientist. In this blog, we are going to explore how to deal with external Python libraries inside a Kaggle notebook.
Let's get started.
What is a Kaggle Notebook?
It's a Jupyter Notebook on the cloud. When I say cloud, in this case, it's on the Kaggle servers. If you have ever created a .ipynb file on your local machine, you already know that it's a file that mixes code, output and notes, and that it's used mainly for exploratory analysis of data.
A Jupyter Notebook is very useful because it contains code blocks that can be executed independently and without having to run the whole script each time.
For example, in the screenshot above, you can see two code blocks (a.k.a. cells). In the first block, I initialised two variables, x and y, each to a value of 10. I executed this first code block once.
In the second code block, I'm performing a mathematical sum operation on the variables x and y and printing the output (20).
Now with a Jupyter Notebook, I'm able to execute the second code block any number of times without having to execute the first code block again, because x and y stay in memory once the first block has run. This isn't possible with a normal Python script file (.py) because you'd have to execute the whole script to re-run any line of code within it. These code cells are very useful for exploring data sets, visualising data and building machine-learning models because a lot of trial and error is required, and re-running the whole script each time is very inefficient.
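Since the screenshot may not be visible to everyone, here is a small sketch of the same two cells. The comments mark where one cell ends and the next begins; the name total is my own choice for illustration.

```python
# Cell 1: run this once. x and y now live in the notebook's memory.
x = 10
y = 10

# Cell 2: this can be re-run on its own as many times as you like,
# because it only reads the values that Cell 1 already created.
total = x + y
print(total)  # 20
```

If you change x in the first cell, you need to run that cell again before the second cell picks up the new value.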
A Kaggle Notebook is a Jupyter Notebook which can be created and run on Kaggle servers, which means you don't have to use your local machine's resources like RAM and CPU. This is especially advantageous because some of the machine-learning tasks are compute-intensive and require high-end infrastructure. Thanks to the guys at Google, we can get that compute power from their servers and all we need is a web browser like Google Chrome.
How does a Kaggle Notebook work?
To understand how a Kaggle Notebook works, we need to understand a concept called virtualisation.
Virtualisation is the concept of creating isolated environments that can run on top of almost any host machine. Each environment carries the libraries, applications, kernels etc. that it needs for its own operations.
Think of it as a box within a box. The small box is happily doing its own thing and it can be moved around to other big boxes without disturbing the contents of the small box. These small boxes are also known as containers because they contain what's required for them to survive.
So when we create a new Kaggle Notebook, it's a Jupyter Notebook file that is on one such container inside a Kaggle server. Inside this container are the basic things required to run a Kaggle Notebook such as Python files, basic libraries, a new notebook file etc.
The details of where the container is and what's available inside are hidden from us, and we don't need to know all those details anyway. What is important is to be aware that the notebook is just a file running on some remote computer somewhere in the world.
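If you are curious, you can still peek at a few basics of the machine your notebook is running on. The snippet below is only a sketch that uses Python's standard library; the function name describe_environment is something I made up for this example, and the values it prints will differ from one session to another.

```python
import os
import platform
import sys


def describe_environment():
    """Collect a few basic facts about the machine running this code."""
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.system(),
        "working_directory": os.getcwd(),
        "python_executable": sys.executable,
    }


# Print each fact on its own line.
for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running the same cell on your local machine and inside a Kaggle Notebook is a quick way to see that the two are different computers.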
What's available with a Kaggle Notebook?
As a data explorer, you would expect Google to get the basics in place while building Kaggle and they certainly did.
Within the container that carries this Kaggle notebook are some preinstalled libraries along with the Python language itself.
In fact, when you create a new Notebook, the first block already imports two of the most used Python libraries for data studies, NumPy and Pandas.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
There are some other built-in libraries available with the Notebook. Below is a list of some of them.
What does it mean that they're built-in?
It simply means that you don't have to install them separately to use them like you'd normally do on a local machine.
You can import these packages directly into the notebook and they will work.
In the above screenshot, you can see that I imported the TensorFlow library and, in the next code block, I can access the library's features. In other words, there are no errors.
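If you want to check whether a library is built in without actually importing it, something like the sketch below should work. It uses importlib.util.find_spec from the standard library; the helper name is_available is my own, and the list of names is just an example.

```python
import importlib.util


def is_available(module_name):
    """Return True if Python can find the module in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Check a few of the libraries mentioned in this post.
for name in ["numpy", "pandas", "tensorflow"]:
    status = "available" if is_available(name) else "not installed"
    print(f"{name}: {status}")
```

The result depends entirely on the environment you run it in, so the output on your local machine may differ from what you see on Kaggle.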
But when I try to import a non-built-in library such as yfinance (a library for Yahoo Finance data), for example, the Notebook throws an error saying that the module could not be found.
This is happening because the Kaggle team didn't include the library in the container. It's understandable because we can't put all the Python libraries in every new container. That would make the container very large and, frankly, it would be a waste because no one needs all libraries all the time.
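To see the failure without a red error message stopping your notebook, you can wrap the import in a try/except. This is only a sketch: try_import is a helper name I invented, and it relies on the fact that Python raises ModuleNotFoundError when a module is missing.

```python
import importlib


def try_import(module_name):
    """Import a module by name, or return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        print(f"Could not import {module_name!r}: {error}")
        return None


yf = try_import("yfinance")
if yf is None:
    print("yfinance is not in this environment yet, so it has to be installed first.")
```

On a fresh container where yfinance is missing, the function returns None instead of stopping the notebook with an error.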
Luckily, Kaggle doesn't stop us from installing any external libraries onto it. We are free to install any Python library and use it within the Notebook.
How? Let's see that next.
How to add external libraries?
There are two ways to add external libraries to a Kaggle Notebook.
Through the Terminal/Console
Through the Cells
Through the Terminal/Console
On a local machine, to install a Python library, we normally go to the terminal and execute the following command.
pip install <package_name>
We can do the same thing on a Kaggle Container. All we need to do is locate the terminal and execute the command.
As you can see in the screenshot, it's currently located at the bottom left of any Notebook. Click on it to access the terminal.
The terminal is also called the Console, so please don't let the two names confuse you. As you can see from the above screenshot, there is a place to enter a command, and that is where we execute the pip command.
The output looks something like this after the installation is done.
Now we can execute our code block again and see if the "yfinance" library is imported.
As you would expect, it now works.
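If you'd like a second confirmation beyond the import, you can ask Python which version of the package got installed. The sketch below uses importlib.metadata from the standard library (Python 3.8 or newer); installed_version is a name I picked for the helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string after a successful install, or None otherwise.
print("yfinance version:", installed_version("yfinance"))
```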
However, there is another, simpler way to do this too. And that's directly from the cells.
Through the Cells
To install a library through a code block (cell) itself, you can simply use the following piece of code.
!pip install yfinance
Notice the ! at the beginning of the command. It tells the notebook to pass the line to the shell instead of running it as Python code.
As you can see, this has the same effect as the command in the terminal and now the package can be used.
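One small annoyance with putting the install line in a cell is that it runs every time you re-run the notebook. A possible workaround, sketched below under my own assumptions, is a helper that only calls pip when the library is actually missing. The name ensure_installed and its parameters are hypothetical; it simply runs the same pip command through Python's subprocess module.

```python
import importlib
import importlib.util
import subprocess
import sys


def ensure_installed(package, import_name=None):
    """Install a package with pip only if it cannot be imported yet.

    package is the name pip knows; import_name is the name used in the
    import statement, for the cases where the two differ.
    Returns True if an install was run, False if nothing was needed.
    """
    module_name = import_name or package
    if importlib.util.find_spec(module_name) is not None:
        return False  # already available, so skip pip entirely

    # Same effect as typing "pip install <package>" in the terminal.
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    importlib.invalidate_caches()  # let Python notice the new files
    return True


# json ships with Python, so no install is triggered here.
print(ensure_installed("json"))  # False
```

In a notebook you would call ensure_installed("yfinance") at the top and then import yfinance as usual in the next cell.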
So, in this blog post, we understood what a Kaggle Notebook is, and what's within a Kaggle container.
We also learnt about the built-in libraries and finally, we understood the different ways to install and use any external library.
I hope you found this helpful. If you did, please leave a like; I'd highly appreciate it.