LightGBM in ML Python - ML Experiment: Train & Evaluate

Experiment - LightGBM
Problem: You are using LightGBM to classify whether a patient has a disease based on medical data.
Current Metrics: Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.45
Issue: The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to above 85% while keeping training accuracy below 92%.
You can only change LightGBM hyperparameters related to regularization and tree complexity.
Do not change the dataset or feature set.
Solution
ML Python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Create dataset for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# Set parameters with regularization to reduce overfitting
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,  # cap on leaves per tree (31 is the LightGBM default; lower it to simplify further)
    'max_depth': 5,   # limit tree depth
    'min_data_in_leaf': 20,  # avoid small leaves
    'feature_fraction': 0.8,  # use 80% features per tree
    'bagging_fraction': 0.8,  # use 80% data per iteration
    'bagging_freq': 1,        # perform bagging every iteration
    'lambda_l1': 0.5,         # L1 regularization
    'lambda_l2': 0.5,         # L2 regularization
    'verbose': -1
}

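# Train model; early stopping is passed as a callback because lgb.train()
# no longer accepts early_stopping_rounds / verbose_eval in LightGBM 4.x
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[val_data],
    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)],
)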

# Predict and evaluate
train_pred = model.predict(X_train)
val_pred = model.predict(X_val)

# Convert probabilities to binary predictions
train_pred_labels = (train_pred > 0.5).astype(int)
val_pred_labels = (val_pred > 0.5).astype(int)

train_acc = accuracy_score(y_train, train_pred_labels) * 100
val_acc = accuracy_score(y_val, val_pred_labels) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Set 'num_leaves' to 31 (LightGBM's default) to cap tree complexity; lower it further if the train/validation gap remains.
Set 'max_depth' to 5 to prevent very deep trees.
Added 'min_data_in_leaf' of 20 to avoid overfitting on small data splits.
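Used 'feature_fraction' and 'bagging_fraction' at 0.8 so that each tree sees a random 80% of the features and each iteration a random 80% of the rows ('bagging_freq' of 1 enables the row sampling), which makes the trees less correlated.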
Added L1 and L2 regularization with 'lambda_l1' and 'lambda_l2' set to 0.5.
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.45

After: Training accuracy 90.5%, Validation accuracy 86.3%, Training loss 0.18, Validation loss 0.32

Adding regularization and limiting tree complexity narrows the gap between training and validation performance. The model gives up some training accuracy (98% to 90.5%) but generalizes better to new data (validation accuracy 75% to 86.3%).
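The solution script prints accuracy only, while the metrics above also quote a loss. A minimal sketch of how both could be computed from predicted probabilities follows, assuming the quoted loss is binary log loss (the 'binary_logloss' metric in the params). The helper names and the sample numbers are hypothetical and chosen only for illustration.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]
hesitant = [0.6, 0.4, 0.6, 0.6]

# Both prediction sets classify every sample correctly,
# but the hesitant one is penalized with a higher loss.
print(accuracy(y_true, confident), binary_log_loss(y_true, confident))
print(accuracy(y_true, hesitant), binary_log_loss(y_true, hesitant))
```

In the solution script the same helpers could be applied to (y_train, train_pred) and (y_val, val_pred) to watch the loss gap alongside the accuracy gap.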
Bonus Experiment
Try using early stopping with a larger number of boosting rounds and tune the learning rate to further improve validation accuracy.
💡 Hint
Lower the learning rate (e.g., 0.01) and increase boosting rounds (e.g., 500) with early stopping to allow the model to learn slowly and avoid overfitting.

Practice

(1/5)
1. What is the main purpose of LightGBM in machine learning?
easy
A. To preprocess data by scaling features
B. To build fast and accurate decision tree models
C. To perform image recognition using neural networks
D. To cluster data points without labels

Solution

  1. Step 1: Understand LightGBM's role

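    LightGBM is a gradient boosting framework that builds ensembles of decision trees quickly and accurately.
  2. Step 2: Compare with other options

    Options A, C, and D describe other machine learning tasks (feature scaling, neural-network image recognition, clustering) that are not what LightGBM is for.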
  3. Final Answer:

    To build fast and accurate decision tree models -> Option B
  4. Quick Check:

    LightGBM purpose = fast, accurate trees [OK]
Hint: LightGBM is known for fast tree models [OK]
Common Mistakes:
  • Confusing LightGBM with neural networks
  • Thinking LightGBM is for data scaling
  • Assuming LightGBM does clustering
2. Which of the following is the correct way to import LightGBM in Python?
easy
A. import lightgbm as lgb
B. import LightGBM
C. from lightgbm import LightGBM
D. import lgbm

Solution

  1. Step 1: Recall LightGBM import syntax

    The standard way is to import the lowercase package name with the conventional alias: import lightgbm as lgb.
  2. Step 2: Check other options

    Options B, C, and D are incorrect because they use wrong module names or syntax.
  3. Final Answer:

    import lightgbm as lgb -> Option A
  4. Quick Check:

    Standard import = import lightgbm as lgb [OK]
Hint: Use lowercase 'lightgbm' and alias 'lgb' [OK]
Common Mistakes:
  • Using capital letters in import
  • Trying to import non-existent submodules
  • Using wrong alias names
3. What will be the output of this code snippet?
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
train_data = lgb.Dataset(X_train, label=y_train)
params = {'objective': 'multiclass', 'num_class': 3, 'verbose': -1}
model = lgb.train(params, train_data, num_boost_round=10)
preds = model.predict(X_test)
preds_labels = preds.argmax(axis=1)
print(accuracy_score(y_test, preds_labels))
medium
A. An exception because of wrong parameter names
B. A list of predicted class labels
C. A syntax error due to missing import
D. A float value between 0 and 1 representing accuracy

Solution

  1. Step 1: Understand the code flow

    The code trains a LightGBM multiclass model on iris data and predicts test labels, then calculates accuracy.
  2. Step 2: Identify output type

    The print statement outputs accuracy_score, which is a float between 0 and 1.
  3. Final Answer:

    A float value between 0 and 1 representing accuracy -> Option D
  4. Quick Check:

    accuracy_score output = float between 0 and 1 [OK]
Hint: Accuracy score prints float between 0 and 1 [OK]
Common Mistakes:
  • Confusing predicted labels with accuracy output
  • Expecting a list instead of a float
  • Thinking code has syntax errors
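To see why the final print shows a single float rather than a list, it can help to trace the shapes. The sketch below uses a small hand-written probability matrix as a stand-in for what model.predict() returns with a multiclass objective (one row per sample, one column per class). The numbers are hypothetical and accuracy is computed with NumPy instead of accuracy_score.

```python
import numpy as np

# Hypothetical probabilities for 4 samples and 3 classes (each row sums to 1)
preds = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.7, 0.2],
    [0.3, 0.6, 0.1],
])
y_test = np.array([0, 2, 1, 2])

# argmax over the class axis turns each probability row into one label
preds_labels = preds.argmax(axis=1)

# Accuracy is the fraction of matching labels: a single float in [0, 1]
accuracy = float((preds_labels == y_test).mean())
print(preds.shape, preds_labels.tolist(), accuracy)  # (4, 3) [0, 2, 1, 1] 0.75
```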
4. Identify the error in this LightGBM training code:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'objective': 'binary'}
model = lgb.train(params, train_data, num_round=100)
medium
A. The 'objective' value 'binary' is invalid
B. The Dataset object is missing 'feature_name' argument
C. The parameter 'num_round' should be 'num_boost_round'
D. The import statement is incorrect

Solution

  1. Step 1: Check LightGBM training parameters

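    lgb.train() takes the number of boosting rounds as the keyword argument 'num_boost_round'. 'num_round' is not a keyword of lgb.train(), so passing it raises a TypeError; it is only recognized as an alias when placed inside the params dictionary.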
  2. Step 2: Verify other parts

    'binary' is a valid objective, 'feature_name' is optional, and import is correct.
  3. Final Answer:

    The parameter 'num_round' should be 'num_boost_round' -> Option C
  4. Quick Check:

    Correct parameter name = num_boost_round [OK]
Hint: Use 'num_boost_round' for training rounds [OK]
Common Mistakes:
  • Using 'num_round' instead of 'num_boost_round'
  • Thinking 'binary' objective is invalid
  • Adding unnecessary parameters
5. You want to improve LightGBM model accuracy on a classification task. Which combination of actions is best?
hard
A. Increase num_boost_round and tune learning_rate
B. Decrease num_boost_round and remove categorical features
C. Use default parameters without tuning
D. Train with fewer data samples to reduce overfitting

Solution

  1. Step 1: Understand model tuning

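    Allowing more boosting rounds while tuning the learning rate (typically a lower rate with more rounds, checked against a validation set) lets the model fit the signal gradually.
  2. Step 2: Evaluate other options

    Cutting rounds can underfit and dropping categorical features discards information; default parameters are only a starting point; training on fewer samples gives the model less to learn from and tends to worsen overfitting.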
  3. Final Answer:

    Increase num_boost_round and tune learning_rate -> Option A
  4. Quick Check:

    Tuning rounds and learning rate improves accuracy [OK]
Hint: Tune rounds and learning rate for better accuracy [OK]
Common Mistakes:
  • Reducing training data to fix overfitting
  • Ignoring categorical features
  • Not tuning parameters at all