Creating an Image Dataset for Machine Learning: A Python Guide

  1. Web Scraping for Image Collection
    1. Identify Target Websites
    2. Install Necessary Libraries
    3. Retrieve HTML Content
    4. Parse HTML and Extract Image URLs
    5. Download Images
  2. Collect Images from Public Datasets
    1. Identify Relevant Datasets
    2. Download the Dataset
    3. Preprocess and Organize Images
    4. Verify and Label Images
  3. Image Augmentation for Diversity
    1. Random Rotation
    2. Random Scaling
    3. Random Flipping
  4. Labeling Images
    1. Manual Labeling
    2. Semi-Automatic Labeling
    3. Crowdsourcing
  5. Splitting the Dataset
    1. Training, Validation, and Testing Sets
  6. Resize and Normalize Images
    1. Consistency and Performance
  7. Save Dataset in Compatible Format
    1. Choosing the Right Format
    2. Organizing the Dataset
  8. Document the Dataset Creation Process
    1. Create a Step-by-Step Guide
    2. Include Code Snippets
    3. Explain Data Collection Methodology
  9. Regularly Update and Maintain the Dataset
    1. Keep It Relevant and Accurate
    2. Documentation Updates

Web Scraping for Image Collection

Identify Target Websites

Before starting with web scraping, identify websites that host images relevant to your machine learning task. Look for sites with clear, structured HTML where images are easily accessible.
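
One practical part of vetting a candidate site is checking whether its robots.txt permits automated access to the image paths. The sketch below uses only the standard library. The robots.txt text and the example.com URLs are made-up placeholders, and is_allowed is a hypothetical helper name; in practice you would read the site's real robots.txt and its terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

gallery_ok = is_allowed(EXAMPLE_ROBOTS, "https://example.com/gallery/cat.jpg")
private_ok = is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/cat.jpg")
print(gallery_ok, private_ok)
```

A site that disallows the paths you need is usually better replaced by another source than worked around.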

Install Necessary Libraries

To scrape images, you need libraries such as BeautifulSoup and requests. Install these using pip:

pip install beautifulsoup4 requests

These tools will help you retrieve and parse HTML content.

Retrieve HTML Content

Use the requests library to fetch the HTML content of the target webpage. This content will be parsed to extract image URLs.

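import requests

url = ''  # set this to the address of the page you want to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404
html_content = response.content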

Parse HTML and Extract Image URLs

Parse the retrieved HTML content using BeautifulSoup to locate and extract image URLs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
images = soup.find_all('img')

# Some <img> tags have no src attribute; skip them instead of raising KeyError
image_urls = [img['src'] for img in images if img.get('src')]
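
The src values found on a page are often relative paths, inline data URIs, or duplicates, so they usually need cleaning before download. The following is a minimal sketch using only the standard library. The helper name clean_image_urls, the extension list, and the example.com URLs are assumptions for illustration; filtering by extension will also drop image URLs that carry no file extension.

```python
from urllib.parse import urljoin, urlparse

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def clean_image_urls(page_url, raw_srcs):
    """Resolve src values against page_url, keep likely images, drop duplicates."""
    seen = set()
    cleaned = []
    for src in raw_srcs:
        if not src or src.startswith("data:"):
            continue  # skip missing values and inline data URIs
        absolute = urljoin(page_url, src)
        path = urlparse(absolute).path.lower()
        if not path.endswith(IMAGE_EXTENSIONS):
            continue  # skip icons, scripts, and other non-image targets
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


example = clean_image_urls(
    "https://example.com/gallery/index.html",
    ["cat.jpg", "/static/dog.png", None, "cat.jpg", "logo.svg"],
)
print(example)
```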

Download Images

Loop through the list of image URLs to download and save the images locally.

import os

os.makedirs('images', exist_ok=True)

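for idx, url in enumerate(image_urls):
    # Assumes each URL is absolute; resolve relative src values against the page URL first
    img_data = requests.get(url, timeout=10).content
    with open(f'images/img_{idx}.jpg', 'wb') as handler:
        handler.write(img_data)  # save the raw bytes to disk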
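
Saving every file with a .jpg suffix mislabels PNG or GIF downloads. One possible refinement, sketched here with the standard library, is to keep the extension found in the URL and fall back to a default when there is none. The helper name local_filename, the zero-padded naming scheme, and the example URLs are assumptions for illustration.

```python
import os
from urllib.parse import urlparse

KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def local_filename(url, idx, default_ext=".jpg"):
    """Build a name such as img_0003.png, keeping the URL's own extension."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = default_ext  # no usable extension in the URL path
    return f"img_{idx:04d}{ext}"


print(local_filename("https://example.com/a/cat.PNG?size=large", 3))
print(local_filename("https://example.com/image?id=7", 12))
```

The extension in a URL is only a hint, so the file contents may still need checking after download.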

Collect Images from Public Datasets

Identify Relevant Datasets

Public datasets like ImageNet, COCO, and Open Images Dataset are valuable resources. Choose a dataset that matches your machine learning task.

Download the Dataset

Most public datasets provide download links or APIs to fetch images. For instance, ImageNet makes its data available for download to users who register and accept its terms of access.

Preprocess and Organize Images

After downloading, organize the images into directories based on their labels for easy access during training.
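
A label-per-directory layout can be produced with a few lines of standard-library code. This sketch assumes you already have a mapping from file name to label (for example, parsed from the dataset's annotation file). The function name organize_by_label and the placeholder file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def organize_by_label(source_dir, target_dir, labels):
    """Copy each file named in labels into target_dir/<label>/."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    for filename, label in labels.items():
        class_dir = target_dir / label
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_dir / filename, class_dir / filename)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (raw / name).touch()
    organize_by_label(
        raw, Path(tmp) / "dataset", {"a.jpg": "cat", "b.jpg": "dog", "c.jpg": "cat"}
    )
    layout = sorted(
        p.relative_to(tmp).as_posix() for p in (Path(tmp) / "dataset").rglob("*.jpg")
    )

print(layout)
```

Copying rather than moving keeps the raw download intact, at the cost of extra disk space.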

Verify and Label Images

Ensure that all images are correctly labeled and verify their quality. This step is crucial for accurate model training.
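
A cheap first quality check is to confirm that each file really starts with an image signature, since failed downloads are often HTML error pages saved under an image name. The sketch below covers only JPEG and PNG and uses the standard library. The helper name detect_format and the placeholder files are assumptions; a matching signature does not prove the whole file decodes, which would need an image library.

```python
import tempfile
from pathlib import Path

# Leading bytes ("magic numbers") of two common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def detect_format(path):
    """Return 'jpeg' or 'png' based on the file's first bytes, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


# Placeholder files: one with a PNG signature, one that is really HTML.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.png"
    bad = Path(tmp) / "bad.jpg"
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"placeholder payload")
    bad.write_bytes(b"<html>404 Not Found</html>")
    results = {p.name: detect_format(p) for p in (good, bad)}

print(results)
```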

Image Augmentation for Diversity

Random Rotation

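Apply random rotations to images to increase the variety in your dataset, which can help your model generalize better. With rotation_range=40, each image is rotated by a random angle of up to 40 degrees in either direction.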

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=40)

Random Scaling

Scale images randomly to create variations in size, which can help the model handle objects at different scales. With zoom_range=0.2, the zoom factor is drawn from the range 0.8 to 1.2.

datagen = ImageDataGenerator(zoom_range=0.2)

Random Flipping

Flip images horizontally and vertically to create mirrored variants, adding more diversity to the dataset. Enable vertical flips only when an upside-down image is still a valid example of its class.

datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
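
To see what random flipping does to the pixel data, it can help to write it out on a plain array. This is an illustrative NumPy sketch, not the Keras implementation. The function name random_flip and the probability parameter p are assumptions, and the image is assumed to be a height x width x channels array.

```python
import numpy as np


def random_flip(image, rng, p=0.5):
    """Flip an H x W x C array horizontally and/or vertically, each with probability p."""
    if rng.random() < p:
        image = image[:, ::-1, :]  # mirror left-right
    if rng.random() < p:
        image = image[::-1, :, :]  # mirror top-bottom
    return image


rng = np.random.default_rng(seed=0)
tiny = np.arange(2 * 3 * 3).reshape(2, 3, 3)  # stand-in for a very small image
flipped = random_flip(tiny, rng)
print(flipped.shape)
```

The flipped array keeps the original shape, so it can be fed to the same model input as the source image.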

Labeling Images

Manual Labeling

Label images manually if you have a small dataset. This method is time-consuming, but it gives you direct control over label quality.

Semi-Automatic Labeling

Use tools that combine automation and human oversight to label large datasets more efficiently.
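
One common shape for this is to let an existing model propose labels and send only the uncertain cases to a person. The sketch below assumes predictions arrive as (filename, label, confidence) tuples from some model you already have. The function name route_predictions, the 0.9 threshold, and the sample values are all hypothetical.

```python
def route_predictions(predictions, threshold=0.9):
    """Split (filename, label, confidence) tuples into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for filename, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((filename, label))
        else:
            review.append((filename, label))  # a person confirms or corrects these
    return accepted, review


# Made-up predictions, for illustration only.
sample = [("a.jpg", "cat", 0.98), ("b.jpg", "dog", 0.55), ("c.jpg", "cat", 0.91)]
accepted, review = route_predictions(sample)
print(accepted)
print(review)
```

Spot-checking a sample of the auto-accepted labels is a reasonable way to confirm that the threshold is strict enough.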


Crowdsourcing

Use platforms such as Amazon Mechanical Turk to distribute the labeling task among many people.
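
When several people label the same image, their answers need to be combined into one label. A simple approach is majority voting with a minimum agreement level, sketched below. The function name aggregate_votes and the 0.6 agreement threshold are assumptions, and real platforms may offer their own aggregation tools.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Return the majority label, or None if agreement is below min_agreement."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


print(aggregate_votes(["cat", "cat", "dog"]))
print(aggregate_votes(["cat", "dog"]))
```

Images that come back as None can be sent out for more votes or reviewed manually.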

Splitting the Dataset

Training, Validation, and Testing Sets

Divide your dataset into three parts: training (70%), validation (15%), and testing (15%). This ensures that your model is trained, validated, and tested on different subsets.

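from sklearn.model_selection import train_test_split

# data is the full list of samples (for example, image file paths)
# First hold out 15% of all samples for testing
train, test = train_test_split(data, test_size=0.15, random_state=42)
# The remaining 85% must give up 15% of the original total, i.e. 0.15 / 0.85 of what is left
train, val = train_test_split(train, test_size=0.15 / 0.85, random_state=42)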
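
When splitting in two stages, the second fraction applies to what remains after the first, so taking 15% of the whole out of the remaining 85% needs a fraction of 0.15 / 0.85 (about 0.176). A single-pass split avoids that arithmetic. The sketch below uses only the standard library and is not the scikit-learn implementation; the function name split_dataset and the seed value are assumptions.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle items and split into train/val/test lists; test gets the remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

Checking the three lengths after any split is a quick way to confirm the proportions match what the documentation claims.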

Resize and Normalize Images

Consistency and Performance

Resize all images to a standard size and normalize pixel values to a range of 0 to 1 for consistent input into the machine learning model.

from keras.preprocessing.image import img_to_array, load_img

img = load_img('image.jpg', target_size=(224, 224))
img = img_to_array(img) / 255.0
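
The effect of the division by 255 can be checked on a plain array. This sketch uses NumPy with randomly generated pixel values standing in for a decoded 224 x 224 RGB image; the variable names are illustrative only.

```python
import numpy as np

# Stand-in for a decoded 224 x 224 RGB image with 8-bit pixel values.
rng = np.random.default_rng(seed=0)
pixels = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Convert to float before dividing so values land in the range 0 to 1.
normalized = pixels.astype(np.float32) / 255.0

print(normalized.shape, normalized.dtype)
print(float(normalized.min()), float(normalized.max()))
```

Storing the result as float32 uses four times the memory of the original 8-bit data, so normalizing at load time rather than on disk is often preferable.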

Save Dataset in Compatible Format

Choosing the Right Format

Save your dataset in a format your framework can load efficiently, such as TFRecord for TensorFlow pipelines or HDF5 for array data used with Keras.

Organizing the Dataset

Ensure that the dataset is well-organized, with a clear directory structure and naming conventions for easy access.

Document the Dataset Creation Process

Create a Step-by-Step Guide

Document each step of your dataset creation process to ensure reproducibility and clarity.

1. Data Collection: Describe the sources and methods used to collect images.
2. Data Preprocessing: Detail how images were resized, normalized, and augmented.
3. Data Annotation: Explain the labeling process.
4. Data Splitting: Outline how the dataset was divided into training, validation, and testing sets.
5. Code: Include code snippets for each step.

Include Code Snippets

Provide clear and concise code examples for each step of the process, enabling others to replicate your work.

# Example for resizing and normalizing images
from keras.preprocessing.image import img_to_array, load_img

def preprocess_image(file_path):
    img = load_img(file_path, target_size=(224, 224))
    img = img_to_array(img) / 255.0
    return img

Explain Data Collection Methodology

Detail the methodologies used to collect and verify images, ensuring transparency and accuracy.
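
One way to make the methodology verifiable is to keep a small provenance record for every image. The sketch below builds such a record with the standard library. The field names, the function name provenance_record, and all sample values (URL, date, bytes) are made-up placeholders rather than a standard schema.

```python
import hashlib
import json
from datetime import date


def provenance_record(filename, source_url, image_bytes, label, collected_on):
    """Describe where an image came from and what it contained when collected."""
    return {
        "filename": filename,
        "source_url": source_url,
        "sha256": hashlib.sha256(image_bytes).hexdigest(),  # detects later changes
        "label": label,
        "collected_on": collected_on.isoformat(),
    }


# Placeholder values, for illustration only.
record = provenance_record(
    "img_0001.jpg",
    "https://example.com/gallery/cat.jpg",
    b"placeholder bytes",
    "cat",
    date(2024, 1, 15),
)
print(json.dumps(record, indent=2))
```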

Regularly Update and Maintain the Dataset

Keep It Relevant and Accurate

Regularly update your dataset with new images to keep it current and relevant to evolving machine learning tasks.
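
To tell whether an update actually changed the dataset, you can compute a fingerprint over its files and record it with each release. This is a minimal standard-library sketch. The function name dataset_fingerprint and the placeholder files are assumptions, and the function reads each file fully into memory, which may be slow for very large datasets.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Hash relative paths and file contents under root, in sorted order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cat").mkdir()
    (Path(tmp) / "cat" / "a.jpg").write_bytes(b"first image")
    before = dataset_fingerprint(tmp)
    unchanged = dataset_fingerprint(tmp)
    (Path(tmp) / "cat" / "b.jpg").write_bytes(b"new image")
    after = dataset_fingerprint(tmp)

print(before == unchanged, before == after)
```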

Documentation Updates

Keep the documentation up-to-date with any changes in the dataset or methodology to maintain accuracy and usability.

By following these steps, you can create a robust and diverse image dataset for your machine learning projects using Python: one that is well-organized, properly labeled, and suitable for training high-performing models.
