Logistic Regression and Regularized Logistic Regression Applied to Estimating Probabilities

 

Background

Problem 1. Maximization of the log-likelihood function

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Solution in MATLAB Toolbox

Problem 2. Maximization of the log-likelihood function minus additional regularization term

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Solution in MATLAB Toolbox

Problem 3. Maximization of the log-likelihood function subject to constraint on cardinality

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Solution in MATLAB Toolbox

Problem 4. 4-fold Cross-validation for maximization of the log-likelihood function

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Solution in MATLAB Toolbox

 

Background

This case study finds an optimal estimate of the cesarean section (CS) rate in a population. The risk of difficult labor is described by a mathematical model that depends on measurable demographic factors. We use regular (“plain vanilla”) logistic regression and a regularized logistic regression to evaluate the effects of demographic factors on the probability of CS. This case study considers 6 primary factors: age, height, weight, maternal weight gain, gestational age, and birth weight.

We performed 4-fold cross-validation for Problem 1. The optimization problem was solved four times. In each run, ¾ of the data served as the in-sample dataset, on which the model was calibrated. The model's performance was then tested on the remaining ¼ of the data, the out-of-sample dataset.

 

Problem 1

Maximization of the log-likelihood function (“plain vanilla” logistic regression).

 

Simplified Problem Statement

 

maximize                                      

 logexp_sum            

                                             

Value:                                        

 logistic      

 

where

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

logistic = calculates values of logistic function for every observation (scenario)
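
To make the two functions above concrete, here is a minimal Python sketch of a logistic function and a log-likelihood for binary labels. It is an illustration under stated assumptions, not PSG code: the names `logistic` and `mean_log_likelihood`, the optional intercept, the averaging over observations, and the tiny synthetic dataset are all choices made for this example. The source does not state how PSG normalizes logexp_sum.

```python
import numpy as np

def logistic(X, coef, intercept=0.0):
    """Logistic function value (estimated probability) for every row of X."""
    z = X @ coef + intercept
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_likelihood(X, y, coef, intercept=0.0):
    """Average log-likelihood of labels y in {0, 1} under the logistic model.

    Averaging over observations is an assumption made for this sketch.
    """
    z = X @ coef + intercept
    # y*z - log(1 + exp(z)); logaddexp keeps the computation numerically stable
    return np.mean(y * z - np.logaddexp(0.0, z))

# Tiny synthetic example (not the case-study data)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
coef = np.zeros(2)

probabilities = logistic(X, coef)
value = mean_log_likelihood(X, y, coef)
print(probabilities)  # every probability is 0.5 when all coefficients are zero
print(value)          # equals -log(2) at zero coefficients
```

Maximizing the log-likelihood means searching for the coefficients that make this value as large as possible; the fitted coefficients are then passed to the logistic function to obtain a probability for every observation.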

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

Number of Variables     6
Number of Scenarios     12,690
Objective Value         -0.495793
Solving Time (sec)      0.08

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with riskprog and riskconstrprog PSG subroutines (General (Text) Format of PSG in MATLAB):

Description (riskprog and riskconstrprog)

 

Input Files to run CS:

 MATLAB code (.txt file)
 Data (.zip file with .mat file)

 

Solution in MATLAB Toolbox

 

Description

 

Input Files to run CS:

Data (.zip file with .mat file)

 

 

Problem 2

Maximization of the log-likelihood function minus additional regularization term (regularized logistic regression).

 

Simplified Problem Statement

 

maximize                                      

 logexp_sum            

 -polynom_abs          

                                             

Value:                                        

 logistic          

 

where

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

polynom_abs = regularization term (Polynomial Absolute function of the regression coefficients)

logistic = calculates values of logistic function for every observation (scenario)
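
The regularized objective can be sketched in Python as the log-likelihood minus a weighted sum of absolute coefficient values. This is a hedged illustration: the function names, the averaging of the log-likelihood, and the penalty weights of 0.01 are assumptions for this example and are not taken from the case-study problem statement.

```python
import numpy as np

def mean_log_likelihood(X, y, coef, intercept=0.0):
    """Average log-likelihood of labels y in {0, 1} under the logistic model."""
    z = X @ coef + intercept
    return np.mean(y * z - np.logaddexp(0.0, z))

def regularized_objective(X, y, coef, intercept, penalty_weights):
    """Log-likelihood minus a weighted sum of absolute coefficient values."""
    penalty = np.sum(penalty_weights * np.abs(coef))
    return mean_log_likelihood(X, y, coef, intercept) - penalty

# Tiny synthetic example (not the case-study data)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
coef = np.array([2.0, -1.0])
weights = np.array([0.01, 0.01])  # hypothetical penalty weights

plain = mean_log_likelihood(X, y, coef)
penalized = regularized_objective(X, y, coef, 0.0, weights)
print(plain, penalized)  # penalized is lower by 0.01 * (|2| + |-1|) = 0.03
```

Because the penalty is never negative, the regularized objective at any coefficient vector is at most the plain log-likelihood there, which pulls the optimal coefficients toward zero.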

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

Number of Variables     6
Number of Scenarios     12,690
Objective Value         -0.498204
Solving Time (sec)      0.05

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with riskprog and riskconstrprog PSG subroutines (General (Text) Format of PSG in MATLAB):

Description (riskprog and riskconstrprog)

 

Input Files to run CS:

 MATLAB code (.txt file)
 Data (.zip file with .m and .mat files)

 

Solution in MATLAB Toolbox

 

Description

 

Input Files to run CS:

Data (.zip file with .mat file)

 

 

Problem 3

Maximization of the log-likelihood function subject to constraint on cardinality.

 

Simplified Problem Statement

 

maximize                                      

 logexp_sum          

Constraint: <= 4

 cardn        

Solver: precision = 9                        

                                             

Value:                                        

 logistic  

 

where

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

cardn = cardinality function

logistic = calculates values of logistic function for every observation (scenario)
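
A cardinality constraint limits how many coefficients may be nonzero. With only 6 variables, one simple way to illustrate it is an exhaustive search over all feature subsets of size at most 4, fitting a logistic regression on each subset. This sketch does not describe how PSG solves the problem: the exhaustive search, the Newton fitting routine, the absence of an intercept, and the synthetic data are all assumptions made for this example.

```python
import itertools
import numpy as np

def cardinality(coef, tol=1e-9):
    """Number of coefficients whose magnitude exceeds tol."""
    return int(np.sum(np.abs(coef) > tol))

def mean_log_likelihood(X, y, coef):
    """Average log-likelihood of labels y in {0, 1} (no intercept)."""
    z = X @ coef
    return np.mean(y * z - np.logaddexp(0.0, z))

def fit_logistic(X, y, steps=25):
    """Newton iterations for unregularized logistic regression (no intercept)."""
    coef = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ coef)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        # Tiny diagonal term guards against a singular Hessian
        coef += np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return coef

def best_subset(X, y, max_card):
    """Exhaustive search over all feature subsets of size <= max_card."""
    n_features = X.shape[1]
    best_value, best_coef = -np.inf, np.zeros(n_features)
    for size in range(1, max_card + 1):
        for subset in itertools.combinations(range(n_features), size):
            cols = list(subset)
            coef = np.zeros(n_features)
            coef[cols] = fit_logistic(X[:, cols], y)
            value = mean_log_likelihood(X, y, coef)
            if value > best_value:
                best_value, best_coef = value, coef
    return best_value, best_coef

# Synthetic data with 6 features (not the case-study data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
true_coef = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_coef)))).astype(float)

value, coef = best_subset(X, y, max_card=4)
print(cardinality(coef), value)
```

For 6 variables and a limit of 4, the search visits 56 subsets, so enumeration is cheap here; for larger problems a dedicated solver would be needed.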

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

Number of Variables     6
Number of Scenarios     12,690
Objective Value         -0.497135
Solving Time (sec)      0.35

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with riskprog and riskconstrprog PSG subroutines (General (Text) Format of PSG in MATLAB):

Description (riskprog and riskconstrprog)

 

Input Files to run CS:

 MATLAB code (.txt file)
 Data (.zip file with .m and .mat files)

 

Solution in MATLAB Toolbox

 

Description

 

Input Files to run CS:

Data (.zip file with .mat file)

 

 

Problem 4

4-fold cross-validation (4 in-sample datasets and 4 out-of-sample datasets) for maximization of the log-likelihood function.

 

Simplified Problem Statement

 

4-fold crossvalidation

Maximize logexp_sum

 

Value:

logistic (function Logistic on the in-sample data)

logistic (function Logistic on the out-of-sample data)

 

where

crossvalidation(N,Matrix) = matrix operation splits input Matrix into N pairs of complementary sub-matrices

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

logistic = calculates values of logistic function for every observation (scenario)
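
The splitting step can be sketched in Python as follows. This is an illustration only: the function below borrows the name `crossvalidation` for readability, but it is not the PSG operation, and the assumption that folds are contiguous blocks of rows is made for this example (the source does not say how rows are assigned to folds).

```python
import numpy as np

def crossvalidation(n_folds, matrix):
    """Split matrix rows into n_folds (in-sample, out-of-sample) pairs.

    Each fold is used once as the out-of-sample part; the remaining
    rows form the complementary in-sample part.
    """
    folds = np.array_split(np.arange(matrix.shape[0]), n_folds)
    pairs = []
    for k in range(n_folds):
        out_idx = folds[k]
        in_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pairs.append((matrix[in_idx], matrix[out_idx]))
    return pairs

# 10 synthetic rows (not the case-study data)
data = np.arange(20).reshape(10, 2)
pairs = crossvalidation(4, data)
for in_sample, out_sample in pairs:
    print(in_sample.shape[0], out_sample.shape[0])
```

The model is then calibrated on each in-sample matrix, and the logistic function is evaluated on both the in-sample and the out-of-sample matrix of the same pair.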

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

For one problem in Cross-validation:

 

 

                        Dataset1   Dataset2   Dataset3   Dataset4
Number of Variables     6          6          6          6
Number of Scenarios     9,517      9,517      9,517      9,517
Objective Value         -0.496     -0.495     -0.498     -0.494
Solving Time (sec)      0.15       0.18       0.05       0.08

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with tbpsg_run function (PSG MATLAB Toolbox):

Description (tbpsg_run)

 

Input Files to run CS:

 MATLAB code (.txt file)
 Data (.zip file with .m and .mat files)

 

Solution in MATLAB Toolbox

 

Description

 

Input Files to run CS:

Data (.zip file with .mat file)