Penalized ordinal outcome models were developed to model high dimensional data with ordinal outcomes. One option is the penalized stereotype logit, which includes nonlinear combinations of parameter estimates; nevertheless, optimization algorithms that assume linearity and function convexity have been applied to fit this model. In this study, the application of the adaptive moment estimation (Adam) optimizer, which is suited to nonlinear optimization, to the elastic net penalized stereotype logit model is proposed. The proposed model is compared to the L1 penalized ordinalgmifs stereotype model. Both methods were applied to simulated and real data, with non-Hodgkin lymphoma (NHL) cancer subtypes as the outcome, and the results are presented and discussed.

Many research studies seek to predict related outcomes given a set of independent variables or to quantify the relationship between them. In certain instances, the outcome of interest is ordinal. Ordinal variables are defined as having distinct ordered levels; however, the distance between the levels cannot be ascertained. An example of an ordinal variable is cancer stage. Take, for instance, testicular seminoma, a germ cell tumor arising from the germ cells of the testes [

1) Tumor stage 1, cancer has not spread beyond the testicle.

2) Tumor stage 2, cancer has spread to the blood or lymphatic vessels.

3) Tumor stage 3, cancer has spread beyond the lymphatic and blood vessels to the spermatic cord.

4) Tumor stage 4, cancer has spread beyond previously mentioned areas to other parts of the body.

The ordering of categories is evident. The aim of statistical and machine learning models is to quantify the relationship between covariates and the associated outcome, so that one can predict the outcome variable and assess the relationship between the two with statistical significance. The range of ordinal outcome models includes cumulative logit, proportional odds model, adjacent-category logit [

In addition, we now live in an era of high dimensional data, and massive amounts of information are being collected [

Penalized ordinal outcome models were developed to analyze high dimensional data with ordinal outcomes. Some of these modeling schemes are glmnetcr [

This study investigates the extension of a previously developed elastic net penalized stereotype logit [

For a given observation $i$ (there are a total of $n$ observations), denote the outcome vector $y_i$ as $(y_{i1}, y_{i2}, \cdots, y_{iJ})$, where $y_{ij} = 1$ if the outcome for that observation is in the $j$th category and all other entries are set to 0. There are $J$ possible outcomes. Denote the vector $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})'$ as the covariate vector consisting of $p$ values. The log of the information entropy, based on a multinomial distribution, is represented as

$$L(\theta \mid y, x) = \frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{J-1} y_{ij}\,\theta_{ij} + \log \pi_J(x_i)\right] \quad (1)$$

where

$$\theta_{ij} = \log \frac{\pi_j(x_i)}{\pi_J(x_i)}, \quad (2)$$

and

$$\pi_j(x_i) = \frac{e^{\theta_{ij}}}{\sum_{j'=1}^{J} e^{\theta_{ij'}}}. \quad (3)$$

The log of the odds ratio, with level $J$ being the reference level, $\theta_{ij}$, is represented as $\alpha_j + \phi_j\{x_i'\beta\}$. Therefore, the $\pi_j(x_i)$ are now modeled as

$$\frac{\exp(\alpha_j + \phi_j x_i'\beta)}{1 + \sum_{j'=1}^{J-1} \exp(\alpha_{j'} + \phi_{j'} x_i'\beta)}. \quad (4)$$

This representation is known as the stereotype logit [

$$L(\beta, \alpha, \phi \mid y, x) = \frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{J-1} y_{ij}\left(\alpha_j + \phi_j\{x_i'\beta\}\right) - \log\left(1 + \sum_{j=1}^{J-1} e^{\alpha_j + \phi_j\{x_i'\beta\}}\right)\right]. \quad (5)$$
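As a concrete illustration, the log entropy of Equation (5) can be computed in a few lines. The study's implementation is in R; the following Python/NumPy sketch is illustrative only, and the array shapes (Y as an n × (J−1) indicator matrix with the reference level J as the all-zero row, X as n × p) are assumptions.

```python
import numpy as np

def stereotype_log_entropy(Y, X, alpha, phi, beta):
    """Log of the information entropy under the stereotype
    parameterization of Equation (5). Y: (n, J-1) one-hot indicators
    (reference level J is the all-zero row); X: (n, p) covariates."""
    # theta_ij = alpha_j + phi_j * (x_i' beta), shape (n, J-1)
    theta = alpha + np.outer(X @ beta, phi)
    # per observation: sum_j y_ij * theta_ij - log(1 + sum_j exp(theta_ij))
    return np.mean(np.sum(Y * theta, axis=1)
                   - np.log1p(np.sum(np.exp(theta), axis=1)))
```

With all parameters at zero, every category probability is $1/J$, so the value reduces to $-\log J$, a quick sanity check on the formula.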

We take the log of the information entropy, with a stereotype logit parameterization, and add an elastic net penalty [

$$\iota = \frac{\lambda}{2n}\sum_{k=1}^{p}\left(\varsigma \beta_k^2 + (1 - \varsigma)|\beta_k|\right), \quad (6)$$

where $0 < \lambda < \infty$ can vary and $n$ is the sample size of the dataset. For this study, $\varsigma$ was set to 0.5. The goal of the elastic net is to penalize large values of the parameter estimates, forcing their magnitude to decrease in proportion to their size. During optimization, the system is therefore forced to shrink the parameter estimates when finding an optimal solution.

Based on the log of the information entropy, we are concerned with finding estimates for parameters, β ^ , α ^ , and ϕ ^ such that

$$(\hat{\beta}, \hat{\alpha}, \hat{\phi}) = \operatorname*{arg\,max}_{\beta, \alpha, \phi} L(\beta, \alpha, \phi \mid y, x) \quad (7)$$

where $\hat{\alpha}$ denotes the vector of length $J-1$ containing the intercepts for the $J-1$ logits and $\hat{\phi}$ denotes the vector of length $J-1$ containing the intensity parameters. In addition, minimizing the negative log entropy is equivalent to maximizing the log entropy, and we will work with the negative representation. Therefore, after imposing the elastic net penalty, we are concerned with finding parameter estimates such that:

$$(\hat{\beta}, \hat{\alpha}, \hat{\phi}) = \operatorname*{arg\,min}_{\beta, \alpha, \phi}\left\{-L(\beta, \alpha, \phi \mid y, x) + \frac{\lambda}{2n}\sum_{k=1}^{p}\left(\varsigma \beta_k^2 + (1 - \varsigma)|\beta_k|\right)\right\} \quad (8)$$
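The penalized objective of Equation (8) is simply the negated log entropy plus the penalty of Equation (6). A minimal Python/NumPy sketch (the paper's implementation is in R; the function names here are assumptions for illustration):

```python
import numpy as np

def elastic_net_penalty(beta, lam, n, varsigma=0.5):
    # Equation (6): (lambda / 2n) * sum_k( varsigma*beta_k^2 + (1 - varsigma)*|beta_k| )
    return lam / (2 * n) * np.sum(varsigma * beta ** 2
                                  + (1 - varsigma) * np.abs(beta))

def penalized_objective(log_entropy_value, beta, lam, n, varsigma=0.5):
    # Equation (8): negative log entropy plus the elastic net penalty,
    # to be minimized over (beta, alpha, phi).
    return -log_entropy_value + elastic_net_penalty(beta, lam, n, varsigma)
```

Only $\beta$ is penalized; the intercepts $\alpha$ and intensity parameters $\phi$ do not enter the penalty term.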

In machine learning, for any given model there is usually a hyperparameter set: a set of parameters that is not optimized over but whose choice of values affects the final solution. If an optimization procedure has multiple hyperparameters, there is still no established method for selecting them so as to best optimize the function with respect to the parameters [. Below, the partial derivatives of the log entropy with respect to $\alpha_j$, $\beta_k$, and $\phi_j$ are presented.

$$\frac{\partial L}{\partial \alpha_j} = \frac{1}{n}\sum_{i=1}^{n}(y_{ij} - \pi_{ij}) \quad (9)$$

$$\frac{\partial L}{\partial \beta_k} = \frac{1}{n}\left(\sum_{i=1}^{n} x_{ik}\sum_{j=1}^{J-1}\phi_j(y_{ij} - \pi_{ij}) - \lambda\left(\varsigma\beta_k + (1 - \varsigma)\operatorname{sign}(\beta_k)/2\right)\right) \quad (10)$$

$$\frac{\partial L}{\partial \phi_j} = \frac{1}{n}\sum_{i=1}^{n}(y_{ij} - \pi_{ij})\sum_{k=1}^{p} x_{ik}\beta_k \quad (11)$$
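Equations (9)-(11) vectorize naturally, since every term is a sum over observations of a residual $y_{ij} - \pi_{ij}$. A hedged Python/NumPy sketch (the paper's implementation is in R; shapes and names here are assumptions):

```python
import numpy as np

def gradients(Y, X, alpha, phi, beta, lam=0.001, varsigma=0.5):
    """Partial derivatives of Equations (9)-(11). The elastic net
    adjustment enters only the beta gradient. Y: (n, J-1) indicators,
    X: (n, p)."""
    n = len(Y)
    theta = alpha + np.outer(X @ beta, phi)              # (n, J-1)
    expt = np.exp(theta)
    # pi_ij with the reference-category normalization of Equation (4)
    pi = expt / (1.0 + expt.sum(axis=1, keepdims=True))
    R = Y - pi                                           # residuals y_ij - pi_ij
    d_alpha = R.mean(axis=0)                             # Equation (9)
    d_beta = (X.T @ (R @ phi)                            # Equation (10)
              - lam * (varsigma * beta
                       + (1 - varsigma) * np.sign(beta) / 2)) / n
    d_phi = (R.T @ (X @ beta)) / n                       # Equation (11)
    return d_alpha, d_beta, d_phi
```

Note that `R @ phi` computes $\sum_j \phi_j (y_{ij} - \pi_{ij})$ per observation, so `X.T @ (R @ phi)` recovers the double sum in Equation (10).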

Denote the full parameter set $\beta$, $\alpha$, and $\phi$ as $\psi$. The partial derivatives are vectorized (placed into one vector) and are represented by a derivative vector, denoted $\nabla L(\psi)$.

The implemented Adam algorithm [

1) Initialize $m$ and $s$ to have all zero entries; these vectors are of length $p + 2(J-1)$.

2) Initialize $\psi$ using He Initialization [

3) Compute $\nabla L(\psi)$.

4) $m \leftarrow \nu_1 m + (1 - \nu_1)\nabla L(\psi)$.

5) $s \leftarrow \tau_2 s + (1 - \tau_2)\nabla L(\psi)^2$, where the square is taken element-wise.

6) $\psi \leftarrow \psi - \eta\, m \div (\sqrt{s} + \varepsilon)$, where the division is element-wise.

7) Repeat steps 3 through 6 until $|L(\psi \mid y, x)_{i+1} - L(\psi \mid y, x)_i| < \zeta$, where $i$ references the iteration number, or until a prespecified number of iterations is reached.

The vectors $m$ and $s$ contain the exponentially decaying averages of $\nabla L(\psi)$ and $\nabla L(\psi)^2$. For the He initialization [

$$\operatorname{Norm}(0, 1) \times \sqrt{2/p}, \quad (12)$$

where $\operatorname{Norm}(0, 1)$ denotes randomly generated values from a normal distribution with mean 0 and standard deviation 1, and $p$ is the number of covariates in the dataset. The hyperparameter set consists of $\nu_1$, $\tau_2$, $\eta$, $\varepsilon$, $\zeta$, and $\varsigma$. For this study, after considering a small range of candidate values, $\nu_1$ was set to 0.5, $\tau_2$ to 0.8, $\eta$ to 0.008, $\varepsilon$ to 1E-7, $\zeta$ to 1E-5, $\lambda$ to 0.001, and $\varsigma$ to 0.5. Steps three through six are repeated until a specified number of iterations is reached (800 for this study) or until $|L(\psi \mid y, x)_{i+1} - L(\psi \mid y, x)_i| < \zeta$. In applying this algorithm, we need to include an adjustment for the elastic net penalty. When taking derivatives with respect to $\beta$, we adjust these functions by subtracting the derivatives of the elastic net penalty; this is not done for $\alpha$ or $\phi$. As a result, when computing the derivative for each $\beta_k$, where $k$ indexes the covariate, we subtract from that derivative the term $(\lambda/n)(\varsigma\beta_k + (1 - \varsigma)\operatorname{sign}(\beta_k)/2)$, which yields the derivatives for each $\beta$ subject to the elastic net penalty. Thus, at each iteration, in addition to modifying $\beta$ by subtracting a function of its derivative, we also shrink the parameters by a factor of $\lambda(\varsigma\beta_k + (1 - \varsigma)\operatorname{sign}(\beta_k)/2)$. This method was implemented in the R programming environment [
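The update loop of steps 1 through 7 can be sketched generically as follows. The study's implementation is in R; this Python version is illustrative, the function names are assumptions, and `grad_fn` is assumed to return the gradient of the (penalized, negated) objective being minimized.

```python
import numpy as np

def adam_minimize(grad_fn, psi0, loss_fn, eta=0.008, nu1=0.5, tau2=0.8,
                  eps=1e-7, zeta=1e-5, max_iter=800):
    """Adam updates as in steps 1-7; nu1 and tau2 are the decay rates
    for the first- and second-moment vectors m and s."""
    psi = np.asarray(psi0, dtype=float).copy()
    m = np.zeros_like(psi)                       # step 1: first moments
    s = np.zeros_like(psi)                       # step 1: second moments
    prev = loss_fn(psi)
    for _ in range(max_iter):
        g = grad_fn(psi)                         # step 3
        m = nu1 * m + (1 - nu1) * g              # step 4
        s = tau2 * s + (1 - tau2) * g ** 2       # step 5 (element-wise square)
        psi -= eta * m / (np.sqrt(s) + eps)      # step 6 (element-wise division)
        cur = loss_fn(psi)                       # step 7: convergence check
        if abs(prev - cur) < zeta:
            break
        prev = cur
    return psi
```

Because the per-coordinate step is normalized by $\sqrt{s}$, the effective step size stays near $\eta$ regardless of the raw gradient scale, which is what makes the method robust for non-convex objectives such as this one.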

For the proposed model, the standard errors of our parameter estimates are computed using a bootstrapping pairs design [

The bootstrap-t method is used to construct confidence intervals, which are of the form

$$\left[\hat{\beta}_k - \hat{t}^{(1-\alpha)} \times se(\hat{\beta}_k),\ \hat{\beta}_k + \hat{t}^{(1-\alpha)} \times se(\hat{\beta}_k)\right] \quad (13)$$

where

$$se(\hat{\beta}_k) = \sqrt{V(\hat{\beta}_k)/B} \quad (14)$$

with V ( β ^ k ) being defined as

$$V(\hat{\beta}_k) = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\beta}_{(\cdot)k} - \hat{\beta}^{*}_{k,b}\right)^2 \quad (15)$$

where $k = 1, 2, \cdots, p$ and

$$\hat{\beta}_{(\cdot)k} = \frac{1}{B}\sum_{b=1}^{B}\hat{\beta}^{*}_{k,b}, \quad (16)$$

where $\hat{\beta}^{*}_{k,b}$ denotes the estimate of $\hat{\beta}_k$ from the $b$th bootstrap resampled dataset. In addition, $\hat{t}^{(\alpha)}$ is chosen from the empirical distribution of the $Z^{*}(b)$ such that

$$\sum_{b=1}^{B} \mathbf{1}\left\{Z^{*}(b) \le \hat{t}^{(\alpha)}\right\}/B = \alpha, \quad (17)$$

where Z * ( b ) is defined as

$$Z^{*}(b) = \frac{\hat{\beta}^{*}_{k,b} - \hat{\beta}_{(\cdot)k}}{se(\hat{\beta}_{(\cdot)k})} \quad (18)$$
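Equations (13)-(18) can be assembled into a small routine for one coefficient. This Python sketch is illustrative (the study used R); it assumes the standard error is computed as $\sqrt{V/B}$ and that $\hat{t}^{(1-\alpha)}$ is taken as an empirical quantile of the $Z^{*}(b)$ values.

```python
import numpy as np

def bootstrap_t_ci(beta_hat, boot_estimates, alpha=0.05):
    """Bootstrap-t interval for a single coefficient, given its B
    bootstrap estimates beta*_{k,b} and the point estimate beta_hat."""
    boot = np.asarray(boot_estimates, dtype=float)
    B = len(boot)
    center = boot.mean()                               # Equation (16)
    V = np.sum((center - boot) ** 2) / (B - 1)         # Equation (15)
    se = np.sqrt(V / B)                                # Equation (14)
    z = (boot - center) / se                           # Equation (18)
    t_hat = np.quantile(z, 1 - alpha)                  # Equation (17)
    return (beta_hat - t_hat * se,                     # Equation (13)
            beta_hat + t_hat * se)
```

In practice the routine would be applied to each of the $p$ coefficients using the $B = 200$ bootstrap refits described above.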

The R programming environment [

The simulation procedure used is the same as previously presented [

The proposed methodology and the ordinalgmifs method, with the option probability.model = "Stereotype", were applied to the simulated data. The goal was to compare two implementations of penalized stereotype logit models.

Tables 2-5 present the parameter estimates for the 10 significant parameters based on the proposed method. For all simulated datasets, the 10 non-significant parameters (not shown in the tables) had a maximum absolute estimate of 0.04, that is, close to 0. For the ordinalgmifs method, all non-significant parameters had estimates of exactly 0. The proposed method thus selects the significant parameters that are truly related to the outcome while setting estimates of the non-significant parameters close to 0, whereas ordinalgmifs sets them to 0. The confidence intervals are somewhat narrow (~0.5) for the parameter estimates of the significant covariates.

| Dataset (Correlation Type) | Proposed Method Accuracy (%) | Proposed Method Execution Time (Seconds) | Ordinalgmifs Accuracy (%) | Ordinalgmifs Execution Time (Seconds) |
|---|---|---|---|---|
| Compound Symmetric | 96.1 (1.37)* | 23.63 (17.50)* | 93.51 (2.1)* | 187.38 (50.28)* |
| First Order Autoregressive | 96.52 (1.46)* | 17.90 (3.23)* | 94.13 (2.19)* | 183.59 (39.30)* |
| Toeplitz | 96.24 (1.33)* | 18.21 (3.89)* | 93.01 (2.46)* | 200.06 (55.93)* |
| Unstructured | 96.49 (1.32)* | 17.21 (0.13)* | 93.95 (2.1)* | 166.98 (41.73)* |

For the two methods, accuracy and execution times were compared using a two-sided Welch's two-sample t test, with a significance level of 0.05. "*" indicates a statistically significant difference.

| Truly Important Variable | Parameter Estimate | 95% Confidence Interval |
|---|---|---|
| V1 | −1.548 | (−1.725, −1.372) |
| V2 | −1.574 | (−1.754, −1.394) |
| V3 | −1.62 | (−1.805, −1.434) |
| V4 | −1.629 | (−1.816, −1.443) |
| V5 | −1.55 | (−1.727, −1.373) |
| V6 | 1.625 | (1.44, 1.81) |
| V7 | 1.618 | (1.433, 1.803) |
| V8 | 1.639 | (1.452, 1.826) |
| V9 | 1.701 | (1.506, 1.896) |
| V10 | 1.691 | (1.497, 1.884) |

| Truly Important Variable | Parameter Estimate | 95% Confidence Interval |
|---|---|---|
| V1 | 1.497 | (1.317, 1.676) |
| V2 | 1.466 | (1.29, 1.642) |
| V3 | 1.416 | (1.246, 1.585) |
| V4 | 1.531 | (1.347, 1.715) |
| V5 | 1.446 | (1.273, 1.62) |
| V6 | −1.525 | (−1.709, −1.341) |
| V7 | −1.533 | (−1.716, −1.349) |
| V8 | −1.537 | (−1.723, −1.352) |
| V9 | −1.587 | (−1.778, −1.397) |
| V10 | −1.533 | (−1.717, −1.35) |

| Truly Important Variable | Parameter Estimate | 95% Confidence Interval |
|---|---|---|
| V1 | 1.466 | (1.296, 1.636) |
| V2 | 1.522 | (1.347, 1.698) |
| V3 | 1.485 | (1.313, 1.657) |
| V4 | 1.521 | (1.346, 1.697) |
| V5 | 1.507 | (1.334, 1.681) |
| V6 | −1.579 | (−1.761, −1.396) |
| V7 | −1.547 | (−1.726, −1.368) |
| V8 | −1.569 | (−1.75, −1.388) |
| V9 | −1.582 | (−1.765, −1.399) |
| V10 | −1.588 | (−1.771, −1.405) |

| Truly Important Variable | Parameter Estimate | 95% Confidence Interval |
|---|---|---|
| V1 | −1.479 | (−1.653, −1.305) |
| V2 | −1.602 | (−1.774, −1.43) |
| V3 | −1.63 | (−1.808, −1.453) |
| V4 | −1.708 | (−1.895, −1.521) |
| V5 | −1.7 | (−1.887, −1.514) |
| V6 | 1.729 | (1.539, 1.919) |
| V7 | 1.703 | (1.516, 1.889) |
| V8 | 1.654 | (1.468, 1.84) |
| V9 | 1.806 | (1.607, 2.005) |
| V10 | 1.623 | (1.442, 1.804) |

The data came from a study titled “Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets” [

The raw data, DLBCL-A: data set and DLBCL-A: class labels, were downloaded from http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi [

| Gene Name | Definition | Parameter Estimate | 95% Confidence Interval |
|---|---|---|---|
| TCF7 | transcription factor 7 | −2.666 | (−2.845, −2.486) |
| GSTM2 | glutathione S-transferase mu 2 | −2.629 | (−2.813, −2.444) |
| ITGB7 | integrin subunit beta 7 | −2.598 | (−2.756, −2.439) |
| EEF1A1 | eukaryotic translation elongation factor 1 alpha 1 | −2.59 | (−2.839, −2.342) |
| LOC220594 | NA | −2.572 | (−2.736, −2.408) |
| DEK | DEK proto-oncogene | 2.57 | (2.446, 2.694) |
| ITGAL | integrin subunit alpha L | −2.561 | (−2.717, −2.405) |
| BIN1 | bridging integrator 1 | −2.556 | (−2.704, −2.409) |
| RPL21 | ribosomal protein L21 | −2.503 | (−2.671, −2.335) |
| RBPSUH | recombination signal binding protein for immunoglobulin kappa J region | −2.503 | (−2.65, −2.356) |
| NCOA1 | nuclear receptor coactivator 1 | −2.494 | (−2.648, −2.34) |
| MYCBP2 | MYC binding protein 2, E3 ubiquitin protein ligase | −2.47 | (−2.614, −2.326) |
| A2M | alpha-2-macroglobulin | −2.416 | (−2.595, −2.238) |
| IL10RA | interleukin 10 receptor subunit alpha | −2.413 | (−2.564, −2.262) |
| SLC25A5 | solute carrier family 25 member 5 | −2.383 | (−2.539, −2.227) |
| CCL21 | C-C motif chemokine ligand 21 | −2.378 | (−2.53, −2.227) |
| KPNB1 | karyopherin subunit beta 1 | −2.377 | (−2.536, −2.217) |
| COL9A2 | collagen type IX alpha 2 chain | −2.374 | (−2.552, −2.196) |
| RPS21 | ribosomal protein S21 | −2.363 | (−2.524, −2.202) |
| ACP1 | acid phosphatase 1 | −2.361 | (−2.501, −2.221) |

There are multiple hyperparameters in the elastic net constrained stereotype logit; optimal values for these must be explored, as this has the potential to improve variable selection and classification capabilities. This is usually accomplished with a grid search and remains an open problem in machine learning when multiple hyperparameters are present [

A bootstrap resampling procedure was used to estimate the 95% confidence intervals. The main drawback is the computational time required to produce the confidence intervals, with 200 additional models being fit. It may be advisable to derive a closed-form estimate of the parameter variance matrix [

Although the stereotype logit is considered by many to be a generalized linear model, it is not. As such, an optimal solution may not exist, or there may be inflection points; as a result, different starting values may yield different solutions. In this study, applying the method to a given dataset did not exhibit a great deal of variation in results, and the results of the applied bootstrap procedure confirm this. To address this, we applied the variable initialization scheme proposed by He [

A model for the elastic net penalized stereotype logit, with optimization provided by the Adam optimizer, was proposed for analyzing ordinal outcome data. The proposed method was applied to simulated and NHL data, and the results were reported. For the simulated data, variable selection was perfect, and only significant variables had parameter estimates not close to 0. Classification accuracy ranged from 96.1% to 96.52% on the test datasets. For the NHL data, 73% of observations were correctly classified, and the 20 topmost genes in terms of absolute value of the coefficient of variation were presented. Our evaluation study shows that the proposed method outperforms the ordinalgmifs penalized stereotype logit model; no comparison could be made on the NHL data analysis, as the ordinalgmifs implementation of the stereotype logit was not able to produce parameter estimates. This manuscript is an extension of previous work [

First and foremost, I would like to thank God from whom all blessings flow. I would also like to express special thanks to Timothy Wysocki, the Co-director of the Center for Healthcare Delivery Science, at Nemours Children’s Specialty Care for allowing me to work on this project.

The author declares no conflicts of interest regarding the publication of this paper.

Williams, A.A.A. (2019) Ordinal Outcome Modeling: The Application of the Adaptive Moment Estimation Optimizer to the Elastic Net Penalized Stereotype Logit. Journal of Data Analysis and Information Processing, 7, 14-27. https://doi.org/10.4236/jdaip.2019.71002