make_classification_check_data: Synthetic Classification Dataset

The make_classification_check_data function generates a synthetic dataset for validating classification models within the Machine Gnostics framework. This function creates simple blob-based clusters using Gaussian distributions, serving as a reliable "hello world" test for classification algorithms.


Overview

This utility simplifies the creation of classification datasets by generating clusters of points centered around randomly positioned centroids.

  • Method: Gaussian blobs with configurable separability.
  • Purpose: Unit testing models, verifying pipeline integrity, and demonstrating basic classification capabilities.
  • Customization: Easily adjust the number of samples, features, classes, and task difficulty (separability).
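
The blob-based approach described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the library's implementation: the function name `sketch_blobs`, the round-robin label assignment, the unit-variance noise, and the way centroids are drawn are all hypothetical. Only the parameter names and defaults are taken from this page.

```python
import numpy as np

def sketch_blobs(n_samples=30, n_features=2, n_classes=2, separability=2.0, seed=42):
    """Hypothetical stand-in for illustration; NOT the library's code."""
    rng = np.random.default_rng(seed)
    # One random centroid per class, pushed apart by the separability multiplier.
    centers = rng.normal(size=(n_classes, n_features)) * separability
    # Round-robin labels so every class receives roughly the same number of samples.
    y = np.arange(n_samples) % n_classes
    # Each point is its class centroid plus unit-variance Gaussian noise.
    X = centers[y] + rng.normal(size=(n_samples, n_features))
    return X, y

X, y = sketch_blobs(n_samples=50, n_classes=3)
print(X.shape)       # (50, 2)
print(np.unique(y))  # [0 1 2]
```

Fixing the seed makes the sketch reproducible: calling it twice with the same arguments yields identical arrays.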

Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Total number of data points to generate. | 30 |
| n_features | int | Number of input features (dimensions) per sample. | 2 |
| n_classes | int | Number of distinct classes (labels). | 2 |
| separability | float | Distance multiplier for class centers. Higher values = easier separation. | 2.0 |
| seed | int | Random seed for reproducibility. | 42 |
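
To make the role of `separability` concrete, the sketch below assumes it acts as a plain multiplier on randomly drawn base centroids, which is one reading of "distance multiplier for class centers". The helpers `class_centers` and `center_gap` are hypothetical and exist only for this illustration; the library may compute centers differently.

```python
import numpy as np

def class_centers(n_classes, n_features, separability, seed=42):
    # Hypothetical helper: random base centroids scaled by the multiplier.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_classes, n_features)) * separability

def center_gap(centers):
    # Euclidean distance between the first two class centroids.
    return float(np.linalg.norm(centers[0] - centers[1]))

gap_hard = center_gap(class_centers(2, 2, separability=1.0))
gap_easy = center_gap(class_centers(2, 2, separability=4.0))

# With the same seed, the gap between centroids grows in proportion
# to the multiplier, while the noise around each centroid would not.
print(gap_easy / gap_hard)
```

Under this assumption, raising `separability` widens the gap between clusters relative to their spread, which is why higher values give an easier classification task.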

Returns

| Return | Type | Description |
|---|---|---|
| X | numpy.ndarray | Input feature array of shape (n_samples, n_features). |
| y | numpy.ndarray | Target label array of shape (n_samples,). |

Example Usage

```python
from machinegnostics.datasets import make_classification_check_data
import numpy as np

# Generate a 3-class dataset with 50 samples
X, y = make_classification_check_data(n_samples=50, n_classes=3)

print(f"X shape: {X.shape}")
print(f"Unique classes: {np.unique(y)}")
# Output:
# X shape: (50, 2)
# Unique classes: [0 1 2]
```
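
Because the dataset is meant for quick pipeline checks, a simple baseline is often enough to confirm that features and labels line up. The sketch below uses a nearest-centroid rule written in plain NumPy. The function `nearest_centroid_accuracy` is a hypothetical helper, not part of Machine Gnostics, and the small hand-made arrays stand in for generated data so that the snippet runs on its own.

```python
import numpy as np

def nearest_centroid_accuracy(X, y):
    """Fit a nearest-centroid rule on (X, y) and return its training accuracy."""
    classes = np.unique(y)
    # Mean feature vector of each class.
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Distance from every sample to every centroid: shape (n_samples, n_classes).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Predict the class whose centroid is closest.
    y_pred = classes[dists.argmin(axis=1)]
    return float((y_pred == y).mean())

# Hand-made stand-in data: two well-separated clusters.
X_demo = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]])
y_demo = np.array([0, 0, 1, 1])

acc = nearest_centroid_accuracy(X_demo, y_demo)
print(acc)  # 1.0
```

The same helper can be given the `X` and `y` returned by `make_classification_check_data`. On well-separated blobs, an accuracy close to 1.0 is a reasonable expectation, and a much lower value suggests a problem in the surrounding pipeline rather than in the data.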

Author: Nirmal Parmar