Complete results and additional material for the article “UnPART: PART without the 'partial' condition of it”

2018-11-05

 

This page contains the full tables related to the work presented in the article:

Igor Ibarguren, Jesús M. Pérez, Javier Muguerza, Ibai Gurrutxaga and Olatz Arbelaitz.  
"UnPART: PART without the 'partial' condition of it". Information Sciences (2018). Vol. 465, pp 505-522.

doi:10.1016/j.ins.2018.07.022

First, we present the tables with the characteristics of the 96 datasets used in this study, divided into three contexts.

Then, for each of the evaluation measures, we include the full tables of results for the different proposed PART-like algorithms and their base decision trees, companion figures for the statistical significance, and pair-wise p-value tables.

 

All the tables of results can be downloaded as an Excel document or as a CSV file.

 

Content

1. Datasets characteristics.

2. Results for the C4.5-based and CHAID*-based algorithms for all 6 metrics over each dataset context.

 

Index of Tables

Table 1. Description of standard datasets.

Table 2. Description of imbalanced datasets.

Table 3. Kappa values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets.

Table 4. GM values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets.

Table 5. GM values for the C4.5-based and CHAID*-based algorithms for each of the 33 SMOTE-preprocessed imbalanced datasets.

Table 6. p-values adjusted with the Bergman-Hommel post-hoc procedure for the C4.5-based and CHAID*-based algorithms for the Kappa and GM measures.

Table 7. AUC values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets.

Table 8. AUC values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets.

Table 9. AUC values for the C4.5-based and CHAID*-based algorithms for each of the 33 SMOTE-preprocessed imbalanced datasets.

Table 10. p-values adjusted with the Bergman-Hommel post-hoc procedure for the C4.5-based and CHAID*-based algorithms for the AUC measure.

Table 15. Length values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets.

Table 16. Length values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets.

Table 17. Length values for the C4.5-based and CHAID*-based algorithms for each of the 33 SMOTE-preprocessed imbalanced datasets.

Table 18. p-values adjusted with the Bergman-Hommel post-hoc procedure for the C4.5-based and CHAID*-based algorithms for the Length measure.

Table 11. Number of Rules multiplied by Length values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets.

Table 12. Number of Rules multiplied by Length values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets.

Table 13. Number of Rules values multiplied by Length for the C4.5-based and CHAID*-based algorithms for each of the 33 SMOTE-preprocessed imbalanced datasets.

Table 14. p-values adjusted with the Bergman-Hommel post-hoc procedure for the C4.5-based and CHAID*-based algorithms for the Number of Rules multiplied by Length.

Table 19. Time values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets.

Table 20. Time values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets.

Table 21. Time values for the C4.5-based and CHAID*-based algorithms for each of the 33 SMOTE-preprocessed imbalanced datasets.

Table 22. p-values adjusted with the Bergman-Hommel post-hoc procedure for the C4.5-based and CHAID*-based algorithms for the Time measure.

 

Index of Figures

Figure 1. Friedman Aligned Ranks for the C4.5-based and CHAID*-based algorithms for the Kappa and GM measures.

Figure 2. Friedman Aligned Ranks for the C4.5-based and CHAID*-based algorithms for the AUC measure.

Figure 4. Friedman Aligned Ranks for the C4.5-based and CHAID*-based algorithms for the Length measure.

Figure 3. Friedman Aligned Ranks for the C4.5-based and CHAID*-based algorithms for the Number of Rules multiplied by Length.

Figure 5. Friedman Aligned Ranks for the C4.5-based and CHAID*-based algorithms for the Time measure.

 

1. Datasets characteristics

This section contains the tables with the characteristics of the 96 datasets from the KEEL repository used in this study. First we present the datasets from the first (Standard) context and then those from the second (Imbalanced) context. The SMOTE-preprocessed datasets have the same characteristics as the datasets in Table 2, but with the minority class oversampled until it reaches the size of the majority class.
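As a small illustration of the relationship between Table 2 and the SMOTE-preprocessed context, the sketch below computes the minority-class percentage of a two-class dataset and the sizes that result from oversampling the minority class up to the majority class size. The function names are hypothetical, and the sketch works on whole-dataset sizes only; it does not generate synthetic examples and makes no claim about how the oversampling was applied within the article's validation scheme.

```python
def minority_percentage(min_size: int, maj_size: int) -> float:
    """Percentage of examples that belong to the minority class."""
    return 100.0 * min_size / (min_size + maj_size)


def smote_sizes(min_size: int, maj_size: int) -> dict:
    """Class sizes after oversampling the minority class to the majority size.

    Assumption: oversampling stops exactly when both classes have the same size.
    """
    if min_size > maj_size:
        raise ValueError("min_size must not exceed maj_size")
    return {
        "synthetic": maj_size - min_size,  # new minority examples needed
        "minority": maj_size,              # minority size after oversampling
        "majority": maj_size,              # majority class is left untouched
        "total": 2 * maj_size,
    }


# Ecoli2 row from Table 2: 52 minority and 284 majority examples.
print(round(minority_percentage(52, 284), 2))
print(smote_sizes(52, 284))
```

For the Ecoli2 row, the first call reproduces the imbalance value listed in Table 2, and the second shows that 232 additional minority examples would be needed to balance the classes.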

 


Table 1. Description of standard datasets.

| Data set | #Atts. | #Examples | #Classes | %min | %maj | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|---|---|
| lymphography | 18 | 148 | 4 | 1.36% | 54.73% | 2 | 81 |
| ecoli | 7 | 336 | 8 | 0.6% | 42.56% | 2 | 143 |
| car | 6 | 1728 | 4 | 3.77% | 70.03% | 65 | 1210 |
| nursery | 8 | 1296 | 5 | 0.08% | 33.34% | 1 | 432 |
| cleveland | 13 | 297 | 5 | 4.38% | 53.88% | 13 | 160 |
| zoo | 17 | 101 | 7 | 3.97% | 40.6% | 4 | 41 |
| glass | 9 | 214 | 6 | 4.21% | 35.52% | 9 | 76 |
| flare | 10 | 1066 | 6 | 4.04% | 31.06% | 43 | 331 |
| abalone | 8 | 418 | 22 | 0.24% | 16.51% | 1 | 69 |
| balance | 4 | 625 | 3 | 7.84% | 46.08% | 49 | 288 |
| dermatology | 33 | 358 | 6 | 5.59% | 31.01% | 20 | 111 |
| hepatitis | 19 | 80 | 2 | 16.25% | 83.75% | 13 | 67 |
| newthyroid | 5 | 215 | 3 | 13.96% | 69.77% | 30 | 150 |
| haberman | 3 | 306 | 2 | 26.48% | 73.53% | 81 | 225 |
| breast | 9 | 277 | 2 | 29.25% | 70.76% | 81 | 196 |
| german | 20 | 1000 | 2 | 30% | 70% | 300 | 700 |
| wisconsin | 9 | 630 | 2 | 34.61% | 65.4% | 218 | 412 |
| contraceptive | 9 | 1473 | 3 | 22.61% | 42.71% | 333 | 629 |
| tictactoe | 9 | 958 | 2 | 34.66% | 65.35% | 332 | 626 |
| pima | 8 | 768 | 2 | 34.9% | 65.11% | 268 | 500 |
| magic | 10 | 1902 | 2 | 35.13% | 64.88% | 668 | 1234 |
| wine | 13 | 178 | 3 | 26.97% | 39.89% | 48 | 71 |
| bupa | 6 | 345 | 2 | 42.03% | 57.98% | 145 | 200 |
| heart | 13 | 270 | 2 | 44.45% | 55.56% | 120 | 150 |
| australian | 14 | 690 | 2 | 44.5% | 55.51% | 307 | 383 |
| crx | 15 | 653 | 2 | 45.33% | 54.68% | 296 | 357 |
| vehicle | 18 | 846 | 4 | 23.53% | 25.77% | 199 | 218 |
| penbased | 16 | 1100 | 10 | 9.55% | 10.46% | 105 | 115 |
| ring | 20 | 740 | 2 | 49.6% | 50.41% | 367 | 373 |
| iris | 4 | 150 | 3 | 33.34% | 33.34% | 50 | 50 |
| Mean | 11.77 | 638.93 | 4.27 | 21% | 50% | 139 | 319.93 |
| Median | 9.5 | 521.5 | 3 | 23% | 54% | 73 | 209 |

 

 

Table 2. Description of imbalanced datasets.

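| Data set | #Atts. | #Examples | Imbalance (% minority class) | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|
| Abalone19 | 8 | 4174 | 0.77% | 32 | 4142 |
| Yeast6 | 8 | 1484 | 2.49% | 37 | 1447 |
| Yeast5 | 8 | 1484 | 2.96% | 44 | 1440 |
| Yeast4 | 8 | 1484 | 3.43% | 51 | 1433 |
| Yeast2vs8 | 8 | 482 | 4.15% | 20 | 462 |
| Glass5 | 9 | 214 | 4.2% | 9 | 205 |
| Abalone9vs18 | 8 | 731 | 5.65% | 41 | 690 |
| Glass4 | 9 | 214 | 6.07% | 13 | 201 |
| Ecoli4 | 7 | 336 | 6.74% | 23 | 313 |
| Glass2 | 9 | 214 | 8.78% | 19 | 195 |
| Vowel0 | 13 | 988 | 9.01% | 89 | 899 |
| Page-blocks0 | 10 | 5472 | 10.23% | 560 | 4912 |
| Ecoli3 | 7 | 336 | 10.88% | 37 | 299 |
| Yeast3 | 8 | 1484 | 10.98% | 163 | 1321 |
| Glass6 | 9 | 214 | 13.55% | 29 | 185 |
| Segment0 | 19 | 2308 | 14.26% | 329 | 1979 |
| Ecoli2 | 7 | 336 | 15.48% | 52 | 284 |
| New-thyroid1 | 5 | 215 | 16.28% | 35 | 180 |
| New-thyroid2 | 5 | 215 | 16.89% | 36 | 179 |
| Ecoli1 | 7 | 336 | 22.92% | 77 | 259 |
| Vehicle0 | 18 | 846 | 23.64% | 200 | 646 |
| Glass0123vs456 | 9 | 214 | 23.83% | 51 | 163 |
| Haberman | 3 | 306 | 27.42% | 84 | 222 |
| Vehicle1 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle2 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle3 | 18 | 846 | 28.37% | 240 | 606 |
| Yeast1 | 8 | 1484 | 28.91% | 429 | 1055 |
| Glass0 | 9 | 214 | 32.71% | 70 | 144 |
| Iris0 | 4 | 150 | 33.33% | 50 | 100 |
| Pima | 8 | 768 | 34.84% | 268 | 500 |
| Ecoli0vs1 | 7 | 220 | 35% | 77 | 143 |
| Wisconsin | 9 | 683 | 35% | 239 | 444 |
| Glass1 | 9 | 214 | 35.51% | 76 | 138 |
| Mean | 9.39 | 919.94 | 17.61% | 120 | 799.94 |
| Median | 8 | 482 | 15.48% | 52 | 444 |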

 


 

2. Results for the C4.5-based and CHAID*-based algorithms for all 6 metrics over each dataset context.

This section includes the full tables of results for the C4.5-based and CHAID*-based algorithms (UnPART, BFPART, PART, and C4.5/CHAID*) for the six performance metrics used in the study: Kappa, GM, AUC, Number of Rules, Length, and Time. For Length, the unit of measurement is the number of decisions per rule or tree branch. Computational cost (Time) is measured in milliseconds. Numbers in bold indicate the best value for that particular dataset. In these tables we treat C4.5 as the reference for all algorithms: it is the base for UnPART_C45, BFPART_C45, and PART_C45, and the CHAID* algorithm also takes elements from C4.5. Cells with a gray background indicate algorithms performing better than C4.5.
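For readers less familiar with two of these measures, the following sketch shows how Kappa and GM can be computed from classification counts. It assumes the usual definitions: Cohen's kappa computed from a confusion matrix, and GM as the geometric mean of the true positive rate and the true negative rate of a two-class problem. The function names are hypothetical and this is not the code used to produce the tables.

```python
from math import sqrt


def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: proportion of examples on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    if expected == 1.0:
        return 0.0  # degenerate case: a single class both actual and predicted
    return (observed - expected) / (1.0 - expected)


def geometric_mean(tp, fn, tn, fp):
    """GM = sqrt(TPR * TNR) for a two-class problem (assumed definition)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(tpr * tnr)


# Illustrative counts only; they are not taken from the experiments.
print(cohen_kappa([[40, 10], [5, 45]]))
print(geometric_mean(tp=0, fn=32, tn=4142, fp=0))
```

Under this definition, a classifier that never recognises a minority example has a true positive rate of zero, so its GM is zero regardless of how well it handles the majority class. This is one way a value of .0000 can arise in the GM tables.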

The figures used in the article to present the statistical significance results only show whether the differences are significant at the .05 level; they do not show the actual p-values. To complement those figures, we include tables with all the pair-wise p-values for the Bergman-Hommel test. In these tables, the algorithms are ordered by their Friedman Aligned Rank, from lowest to highest.

We also include a figure for each global analysis of statistical significance performed in the article. Unlike the figures in the article, these graphically represent the differences between the Friedman Aligned Ranks of the algorithms.
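The Friedman Aligned Ranks mentioned above can be sketched as follows. This is a minimal illustration under the usual description of the procedure: each value is aligned by subtracting the mean of its dataset, all aligned values are ranked jointly (rank 1 for the best, ties sharing the average rank), and each algorithm receives the mean of its ranks. The function name and the `higher_is_better` parameter are hypothetical, and the sketch does not compute the test statistic or the post-hoc p-values.

```python
def friedman_aligned_ranks(scores, higher_is_better=True):
    """Mean aligned rank per algorithm.

    scores: one row per dataset, one column per algorithm.
    """
    n_datasets = len(scores)
    n_algorithms = len(scores[0])
    aligned = []  # (sort key, algorithm index); smaller key means better
    for row in scores:
        location = sum(row) / n_algorithms  # dataset mean used for alignment
        for alg, value in enumerate(row):
            diff = round(value - location, 10)  # rounding keeps exact ties tied
            aligned.append((-diff if higher_is_better else diff, alg))

    # Rank all aligned observations jointly, averaging ranks over ties.
    order = sorted(range(len(aligned)), key=lambda i: aligned[i][0])
    ranks = [0.0] * len(aligned)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and aligned[order[end + 1]][0] == aligned[order[start]][0]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1

    totals = [0.0] * n_algorithms
    for (_, alg), rank in zip(aligned, ranks):
        totals[alg] += rank
    return [total / n_datasets for total in totals]


# First three rows of Table 3, C4.5-based columns: UnPART, BFPART, PART, C4.5.
kappa_rows = [
    [.5593, .5758, .4688, .5367],  # lymphography
    [.7359, .6986, .7289, .7030],  # ecoli
    [.9033, .9068, .8894, .7986],  # car
]
print(friedman_aligned_ranks(kappa_rows))
```

With only three datasets the output is purely illustrative and does not reproduce the ranks shown in the figures. For measures where smaller values are preferable, such as Length or Time, `higher_is_better=False` reverses the ranking direction.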

2.1 Results for Kappa and GM measures

 

Table 3. Kappa values for the C4.5-based and CHAID*-based algorithms for each of the 30 standard datasets

 

| Data set | UnPART (C4.5-based) | BFPART (C4.5-based) | PART (C4.5-based) | C4.5 | UnPART (CHAID*-based) | BFPART (CHAID*-based) | PART (CHAID*-based) | CHAID* |
|---|---|---|---|---|---|---|---|---|
| lymphography | .5593 | .5758 | .4688 | .5367 | .5354 | .5764 | .6374 | .5304 |
| ecoli | .7359 | .6986 | .7289 | .7030 | .6981 | .7065 | .6814 | .6805 |
| car | .9033 | .9068 | .8894 | .7986 | .8924 | .8900 | .8971 | .9076 |
| nursery | .8424 | .8482 | .8753 | .8341 | .8478 | .8477 | .8947 | .8684 |
| cleveland | .2789 | .2187 | .2444 | .2257 | .2738 | .2390 | .2366 | .2256 |
| zoo | .9215 | .9215 | .9215 | .9215 | .9218 | .9218 | .9218 | .9398 |
| glass | .5394 | .5526 | .6076 | .5494 | .5085 | .5085 | .5124 | .5391 |
| flare | .6668 | .6751 | .6470 | .6676 | .6707 | .6559 | .6613 | .6893 |
| abalone | .1035 | .0918 | .0683 | .0962 | .1314 | .1326 | .1497 | .1533 |
| balance | .6447 | .6447 | .6494 | .5922 | .6175 | .6209 | .6262 | .5996 |
| dermatology | .9226 | .9226 | .9226 | .9045 | .9335 | .9335 | .9300 | .9383 |
| hepatitis | .2520 | .2520 | .5485 | .1115 | .1330 | .0899 | .2285 | .1908 |
| newthyroid | .8795 | .8795 | .8525 | .8519 | .8618 | .8618 | .8618 | .8686 |
| haberman | .1186 | .1186 | .0924 | .1521 | .0265 | .0265 | .0265 | .0265 |
| breast | .2418 | .2418 | .2300 | .2330 | .2970 | .2970 | .2663 | .2970 |
| german | .3124 | .3243 | .3333 | .3049 | .3158 | .3150 | .2705 | .2714 |
| wisconsin | .8474 | .8474 | .8236 | .8515 | .8699 | .8556 | .8907 | .8410 |
| contraceptive | .2816 | .2641 | .2356 | .2845 | .2815 | .2896 | .2778 | .2909 |
| tictactoe | .7803 | .7573 | .8430 | .6770 | .6817 | .6789 | .7847 | .7605 |
| pima | .3814 | .3814 | .4078 | .4175 | .4051 | .4051 | .3893 | .4043 |
| magic | .5186 | .5190 | .4806 | .5183 | .5034 | .4979 | .4871 | .5266 |
| wine | .8969 | .8969 | .8971 | .9222 | .8795 | .8795 | .8966 | .8797 |
| bupa | .2867 | .2701 | .2594 | .3124 | .2613 | .2613 | .2424 | .2444 |
| heart | .5703 | .5384 | .5239 | .5636 | .6225 | .6537 | .6537 | .5489 |
| australian | .7156 | .7063 | .6742 | .6886 | .6845 | .6851 | .6681 | .6837 |
| crx | .7371 | .7184 | .6651 | .7194 | .7354 | .7194 | .7116 | .7172 |
| vehicle | .6421 | .6248 | .6452 | .6185 | .5789 | .5707 | .5969 | .5902 |
| penbased | .8818 | .8889 | .8737 | .8838 | .8331 | .8311 | .8355 | .8373 |
| ring | .7702 | .7702 | .7623 | .7135 | .7482 | .7420 | .7420 | .7482 |
| iris | .9100 | .9100 | .9000 | .9000 | .9200 | .9200 | .9200 | .9200 |
| Mean | .6047 | .5989 | .6024 | .5851 | .5890 | .5871 | .5966 | .5906 |
| Median | .6557 | .6599 | .6482 | .6431 | .6466 | .6548 | .6575 | .6400 |

This table can be downloaded as an Excel document or as a CSV file by clicking on the following links xls and csv.

 

 

Table 4. GM values for the C4.5-based and CHAID*-based algorithms for each of the 33 imbalanced datasets

 

| Data set | UnPART (C4.5-based) | BFPART (C4.5-based) | PART (C4.5-based) | C4.5 | UnPART (CHAID*-based) | BFPART (CHAID*-based) | PART (CHAID*-based) | CHAID* |
|---|---|---|---|---|---|---|---|---|
| abalone19 | .0000 | .0000 | .0000 | .0000 | .0000 | .0000 | .0000 | .0000 |
| yeast6 | .5660 | .5660 | .7333 | .5660 | .1302 | .1302 | .2359 | .1302 |
| yeast5 | .8846 | .8742 | .8850 | .8612 | .7967 | .7957 | .8601 | .7715 |
| yeast4 | .4731 | .4731 | .4239 | .4208 | .3844 | .3844 | .2326 | .3844 |
| yeast-2_vs_8 | .3137 | .1704 | .1704 | .1723 | .7274 | .7274 | .7264 | .7274 |
| glass5 | .8787 | .8787 | .8787 | .8804 | .9876 | .9876 | .9876 | .9876 |
| abalone9-18 | .4173 | .3897 | .4157 | .3882 | .2983 | .2983 | .2930 | .2975 |
| glass4 | .5836 | .5836 | .5836 | .5769 | .3397 | .3397 | .3397 | .3030 |
| ecoli4 | .7791 | .7791 | .7786 | .7777 | .7821 | .7821 | .8405 | .8123 |
| glass2 | .4597 | .4597 | .4597 | .4427 | .0000 | .0000 | .0000 | .0000 |
| vowel0 | .9149 | .9149 | .9207 | .9683 | .8552 | .9112 | .9194 | .8722 |
| page-blocks0 | .9136 | .8944 | .8932 | .9225 | .8879 | .9001 | .9096 | .9002 |
| ecoli3 | .7227 | .7227 | .7290 | .6773 | .7003 | .7003 | .7003 | .6749 |
| yeast3 | .8567 | .8567 | .8097 | .8479 | .8537 | .8537 | .8552 | .8670 |
| glass6 | .7940 | .7940 | .7874 | .7940 | .7946 | .7946 | .7946 | .7920 |
| segment0 | .9858 | .9858 | .9893 | .9814 | .9794 | .9796 | .9867 | .9829 |
| ecoli2 | .8041 | .8041 | .7963 | .8497 | .7564 | .7564 | .7564 | .7564 |
| new-thyroid1 | .9394 | .9394 | .9394 | .9460 | .9529 | .9529 | .9529 | .9503 |
| new-thyroid2 | .9206 | .9206 | .9206 | .9327 | .9328 | .9328 | .9328 | .9355 |

ecoli1

.8590