Additional material for the article “CTCHAID: extending the application of the consolidation methodology”

09/11/2015

1  Data set characteristics

The tables in this section summarize the characteristics of each data set used in the article. Table 1 refers to standard data sets while Table 2 refers to imbalanced data sets.

 

| Data set | #Atts. | #Examples | #Classes | %min | %maj | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|---|---|
| lymphography | 18 | 148 | 4 | 1.36% | 54.73% | 2 | 81 |
| ecoli | 7 | 336 | 8 | 0.6% | 42.56% | 2 | 143 |
| car | 6 | 1728 | 4 | 3.77% | 70.03% | 65 | 1210 |
| nursery | 8 | 1296 | 5 | 0.08% | 33.34% | 1 | 432 |
| cleveland | 13 | 297 | 5 | 4.38% | 53.88% | 13 | 160 |
| zoo | 17 | 101 | 7 | 3.97% | 40.6% | 4 | 41 |
| glass | 9 | 214 | 6 | 4.21% | 35.52% | 9 | 76 |
| flare | 10 | 1066 | 6 | 4.04% | 31.06% | 43 | 331 |
| abalone | 8 | 418 | 22 | 0.24% | 16.51% | 1 | 69 |
| balance | 4 | 625 | 3 | 7.84% | 46.08% | 49 | 288 |
| dermatology | 33 | 358 | 6 | 5.59% | 31.01% | 20 | 111 |
| hepatitis | 19 | 80 | 2 | 16.25% | 83.75% | 13 | 67 |
| newthyroid | 5 | 215 | 3 | 13.96% | 69.77% | 30 | 150 |
| haberman | 3 | 306 | 2 | 26.48% | 73.53% | 81 | 225 |
| breast | 9 | 277 | 2 | 29.25% | 70.76% | 81 | 196 |
| german | 20 | 1000 | 2 | 30% | 70% | 300 | 700 |
| wisconsin | 9 | 630 | 2 | 34.61% | 65.4% | 218 | 412 |
| contraceptive | 9 | 1473 | 3 | 22.61% | 42.71% | 333 | 629 |
| tictactoe | 9 | 958 | 2 | 34.66% | 65.35% | 332 | 626 |
| pima | 8 | 768 | 2 | 34.9% | 65.11% | 268 | 500 |
| magic | 10 | 1902 | 2 | 35.13% | 64.88% | 668 | 1234 |
| wine | 13 | 178 | 3 | 26.97% | 39.89% | 48 | 71 |
| bupa | 6 | 345 | 2 | 42.03% | 57.98% | 145 | 200 |
| heart | 13 | 270 | 2 | 44.45% | 55.56% | 120 | 150 |
| australian | 14 | 690 | 2 | 44.5% | 55.51% | 307 | 383 |
| crx | 15 | 653 | 2 | 45.33% | 54.68% | 296 | 357 |
| vehicle | 18 | 846 | 4 | 23.53% | 25.77% | 199 | 218 |
| penbased | 16 | 1100 | 10 | 9.55% | 10.46% | 105 | 115 |
| ring | 20 | 740 | 2 | 49.6% | 50.41% | 367 | 373 |
| iris | 4 | 150 | 3 | 33.34% | 33.34% | 50 | 50 |
| Mean | 11.77 | 638.93 | 4.27 | 21% | 50% | 139 | 319.93 |
| Median | 9.5 | 521.5 | 3 | 23% | 54% | 73 | 209 |

Table 1. Description of standard datasets.

 

| Data set | #Atts. | #Examples | Imbalance | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|
| Abalone19 | 8 | 4174 | 0.77% | 32 | 4142 |
| Yeast6 | 8 | 1484 | 2.49% | 37 | 1447 |
| Yeast5 | 8 | 1484 | 2.96% | 44 | 1440 |
| Yeast4 | 8 | 1484 | 3.43% | 51 | 1433 |
| Yeast2vs8 | 8 | 482 | 4.15% | 20 | 462 |
| Glass5 | 9 | 214 | 4.2% | 9 | 205 |
| Abalone9vs18 | 8 | 731 | 5.65% | 41 | 690 |
| Glass4 | 9 | 214 | 6.07% | 13 | 201 |
| Ecoli4 | 7 | 336 | 6.74% | 23 | 313 |
| Glass2 | 9 | 214 | 8.78% | 19 | 195 |
| Vowel0 | 13 | 988 | 9.01% | 89 | 899 |
| Page-blocks0 | 10 | 5472 | 10.23% | 560 | 4912 |
| Ecoli3 | 7 | 336 | 10.88% | 37 | 299 |
| Yeast3 | 8 | 1484 | 10.98% | 163 | 1321 |
| Glass6 | 9 | 214 | 13.55% | 29 | 185 |
| Segment0 | 19 | 2308 | 14.26% | 329 | 1979 |
| Ecoli2 | 7 | 336 | 15.48% | 52 | 284 |
| New-thyroid1 | 5 | 215 | 16.28% | 35 | 180 |
| New-thyroid2 | 5 | 215 | 16.89% | 36 | 179 |
| Ecoli1 | 7 | 336 | 22.92% | 77 | 259 |
| Vehicle0 | 18 | 846 | 23.64% | 200 | 646 |
| Glass0123vs456 | 9 | 214 | 23.83% | 51 | 163 |
| Haberman | 3 | 306 | 27.42% | 84 | 222 |
| Vehicle1 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle2 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle3 | 18 | 846 | 28.37% | 240 | 606 |
| Yeast1 | 8 | 1484 | 28.91% | 429 | 1055 |
| Glass0 | 9 | 214 | 32.71% | 70 | 144 |
| Iris0 | 4 | 150 | 33.33% | 50 | 100 |
| Pima | 8 | 768 | 34.84% | 268 | 500 |
| Ecoli0vs1 | 7 | 220 | 35% | 77 | 143 |
| Wisconsin | 9 | 683 | 35% | 239 | 444 |
| Glass1 | 9 | 214 | 35.51% | 76 | 138 |
| Mean | 9.39 | 919.94 | 17.61% | 120 | 799.94 |
| Median | 8 | 482 | 15.48% | 52 | 444 |

Table 2. Description of imbalanced datasets.

2  Subsample numbers by data set to achieve the selected coverage values

The tables in this section show the number of subsamples required to reach a coverage value of 99%. Table 3 refers to standard datasets and Table 4 refers to imbalanced datasets.

For standard datasets, the MinCover column gives the minimum number of examples of each class, as stated by the rule and exceptions in the methodology section of the article. Data sets whose per-class subsample size is enforced by MinCover rather than by the size of the minority class are set in bold.

For imbalanced data sets preprocessed with SMOTE, only the total number of examples and the size of the minority class differ from the data sets without preprocessing: the minority class is oversampled with SMOTE until it has the same size as the majority class.
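The N_S values listed for a coverage of 99% can be related to the subsample class sizes with a short calculation. The sketch below assumes that coverage is modelled as 1 - (1 - r)^N_S, where r is the fraction of the training sample's majority class drawn into one subsample. This model and the function name `subsamples_for_coverage` are assumptions made for illustration, not a formula quoted from this material. The tables never list fewer than three subsamples, which suggests a lower bound that the sketch does not apply.

```python
import math


def subsamples_for_coverage(class_size, maj_class_size, coverage=0.99):
    """Smallest N with 1 - (1 - r)**N >= coverage, where r = class_size / maj_class_size.

    Assumed model (hypothetical, for illustration only): every subsample draws
    class_size examples at random from the maj_class_size majority-class
    examples of the training sample.
    """
    if not 0 < class_size <= maj_class_size:
        raise ValueError("class_size must be in (0, maj_class_size]")
    if not 0 < coverage < 1:
        raise ValueError("coverage must be strictly between 0 and 1")
    r = class_size / maj_class_size
    if r == 1.0:
        # A single subsample already contains the whole majority class.
        return 1
    return math.ceil(math.log(1 - coverage) / math.log(1 - r))


# german: 480-example subsample, 2 classes -> 240 per class; majority class of 560
print(subsamples_for_coverage(240, 560))   # 9
# lymphography: 12-example subsample, 4 classes -> 3 per class; majority class of 66
print(subsamples_for_coverage(3, 66))      # 99
# Abalone19: 52-example subsample, 2 classes -> 26 per class; majority class of 3314
print(subsamples_for_coverage(26, 3314))   # 585
```

Under the stated assumption, these three calls agree with the N_S values tabulated for german, lymphography and Abalone19.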

 

 

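| Data set | Original: Size | Original: #Class | Original: %Min | Training sample: Size | Training sample: Min. Class Size | MinCover[1] | Training sample: Maj. Class Size | Subsample: Size | Subsample: % of Maj. Class | Coverage 99%: N_S |
|---|---|---|---|---|---|---|---|---|---|---|
| lymphography | 148 | 4 | 1.36% | 119 | 2 | 2 | 66 | 12 | 4.55% | 99 |
| ecoli | 336 | 8 | 0.6% | 269 | 2 | 3 | 115 | 48 | 5.22% | 86 |
| car | 1728 | 4 | 3.77% | 1383 | 53 | 14 | 969 | 212 | 5.47% | 82 |
| nursery | 1296 | 5 | 0.08% | 1037 | 1 | 11 | 346 | 105 | 6.07% | 74 |
| cleveland | 297 | 5 | 4.38% | 238 | 11 | 3 | 129 | 55 | 8.53% | 52 |
| zoo | 101 | 7 | 3.97% | 81 | 4 | 1 | 33 | 28 | 12.13% | 36 |
| glass | 214 | 6 | 4.21% | 172 | 8 | 2 | 62 | 48 | 12.91% | 34 |
| flare | 1066 | 6 | 4.04% | 853 | 35 | 9 | 265 | 210 | 13.21% | 33 |
| abalone | 418 | 22 | 0.24% | 335 | 1 | 4 | 56 | 154 | 12.5% | 35 |
| balance | 625 | 3 | 7.84% | 500 | 40 | 5 | 231 | 120 | 17.32% | 25 |
| dermatology | 358 | 6 | 5.59% | 287 | 17 | 3 | 89 | 102 | 19.11% | 22 |
| hepatitis | 80 | 2 | 16.25% | 64 | 11 | 1 | 54 | 22 | 20.38% | 21 |
| newthyroid | 215 | 3 | 13.96% | 172 | 24 | 2 | 120 | 72 | 20% | 21 |
| haberman | 306 | 2 | 26.48% | 245 | 65 | 3 | 181 | 130 | 35.92% | 11 |
| breast | 277 | 2 | 29.25% | 222 | 65 | 3 | 158 | 130 | 41.14% | 9 |
| german | 1000 | 2 | 30% | 800 | 240 | 8 | 560 | 480 | 42.86% | 9 |
| wisconsin | 630 | 2 | 34.61% | 504 | 175 | 6 | 330 | 350 | 53.04% | 7 |
| contraceptive | 1473 | 3 | 22.61% | 1179 | 267 | 12 | 504 | 801 | 52.98% | 7 |
| tictactoe | 958 | 2 | 34.66% | 767 | 266 | 8 | 502 | 532 | 52.99% | 7 |
| pima | 768 | 2 | 34.9% | 615 | 215 | 7 | 401 | 430 | 53.62% | 6 |
| magic | 1902 | 2 | 35.13% | 1522 | 535 | 16 | 988 | 1070 | 54.15% | 6 |
| wine | 178 | 3 | 26.97% | 143 | 39 | 2 | 58 | 117 | 67.25% | 5 |
| bupa | 345 | 2 | 42.03% | 276 | 116 | 3 | 160 | 232 | 72.5% | 4 |
| heart | 270 | 2 | 44.45% | 216 | 96 | 3 | 120 | 192 | 80% | 3 |
| australian | 690 | 2 | 44.5% | 552 | 246 | 6 | 307 | 492 | 80.14% | 3 |
| crx | 653 | 2 | 45.33% | 523 | 238 | 6 | 286 | 476 | 83.22% | 3 |
| vehicle | 846 | 4 | 23.53% | 677 | 160 | 7 | 175 | 640 | 91.43% | 3 |
| penbased | 1100 | 10 | 9.55% | 880 | 84 | 9 | 92 | 840 | 91.31% | 3 |
| ring | 740 | 2 | 49.6% | 592 | 294 | 6 | 299 | 588 | 98.33% | 3 |
| iris[2] | 150 | 3 | 33.34% | 120 | 40 | 2 | 40 | 66 | 55% | 6 |
| Mean | 638.94 | 4.27 | 22% | 511.44 | 111.67 | 5.57 | 256.54 | 291.8 | 43% | 24 |
| Median | 521.5 | 3 | 23.07% | 417.5 | 59 | 4.5 | 167.5 | 173 | 42% | 9 |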

Table 3. Subsample numbers for standard datasets to achieve a coverage of 99%.

 

 

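| Data set | Original: Size | Original: %Min | Training sample: Size | Training sample: Min. Class Size | Training sample: Maj. Class Size | Subsample: Size | Subsample: % of Maj. Class | Coverage 99%: N_S |
|---|---|---|---|---|---|---|---|---|
| Abalone19 | 4174 | 0.77 | 3340 | 26 | 3314 | 52 | 0.78% | 585 |
| Yeast6 | 1484 | 2.49 | 1188 | 30 | 1158 | 60 | 2.59% | 176 |
| Yeast5 | 1484 | 2.96 | 1189 | 36 | 1153 | 72 | 3.12% | 146 |
| Yeast4 | 1484 | 3.43 | 1188 | 41 | 1147 | 82 | 3.57% | 127 |
| Yeast2vs8 | 482 | 4.15 | 387 | 17 | 370 | 34 | 4.59% | 98 |
| Glass5 | 214 | 4.2 | 173 | 8 | 165 | 16 | 4.85% | 93 |
| Abalone9vs18 | 731 | 5.65 | 586 | 34 | 552 | 68 | 6.16% | 73 |
| Glass4 | 214 | 6.07 | 172 | 11 | 161 | 22 | 6.83% | 66 |
| Ecoli4 | 336 | 6.74 | 270 | 19 | 251 | 38 | 7.57% | 59 |
| Glass2 | 214 | 8.78 | 173 | 16 | 157 | 32 | 10.19% | 43 |
| Vowel0 | 988 | 9.01 | 792 | 72 | 720 | 144 | 10% | 44 |
| Page-blocks0 | 5472 | 10.23 | 4378 | 448 | 3930 | 896 | 11.4% | 39 |
| Ecoli3 | 336 | 10.88 | 270 | 30 | 240 | 60 | 12.5% | 35 |
| Yeast3 | 1484 | 10.98 | 1188 | 131 | 1057 | 262 | 12.39% | 35 |
| Glass6 | 214 | 13.55 | 173 | 24 | 149 | 48 | 16.11% | 27 |
| Segment0 | 2308 | 14.26 | 1848 | 264 | 1584 | 528 | 16.67% | 26 |
| Ecoli2 | 336 | 15.48 | 270 | 42 | 228 | 84 | 18.42% | 23 |
| New-thyroid1 | 215 | 16.28 | 173 | 29 | 144 | 58 | 20.14% | 21 |
| New-thyroid2 | 215 | 16.89 | 173 | 30 | 143 | 60 | 20.98% | 20 |
| Ecoli1 | 336 | 22.92 | 270 | 62 | 208 | 124 | 29.81% | 14 |
| Vehicle0 | 846 | 23.64 | 677 | 160 | 517 | 320 | 30.95% | 13 |
| Glass0123vs456 | 214 | 23.83 | 172 | 41 | 131 | 82 | 31.3% | 13 |
| Haberman | 306 | 27.42 | 246 | 68 | 178 | 136 | 38.2% | 10 |
| Vehicle1 | 846 | 28.37 | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Vehicle2 | 846 | 28.37 | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Vehicle3 | 846 | 28.37 | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Yeast1 | 1484 | 28.91 | 1188 | 344 | 844 | 688 | 40.76% | 9 |
| Glass0 | 214 | 32.71 | 172 | 56 | 116 | 112 | 48.28% | 7 |
| Iris0 | 150 | 33.33 | 121 | 40 | 81 | 80 | 49.38% | 7 |
| Pima | 768 | 34.84 | 616 | 215 | 401 | 430 | 53.62% | 6 |
| Ecoli0vs1 | 220 | 35 | 177 | 62 | 115 | 124 | 53.91% | 6 |
| Wisconsin | 683 | 35 | 548 | 192 | 356 | 384 | 53.93% | 6 |
| Glass1 | 214 | 35.51 | 172 | 61 | 111 | 122 | 54.95% | 6 |
| Mean | 919.94 | 17.61 | 737.09 | 96.61 | 640.48 | 193.21 | 24.04% | 56 |
| Median | 482 | 15.48 | 387 | 42 | 356 | 84 | 18.42% | 23 |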

Table 4. Subsample numbers for imbalanced datasets to achieve a coverage of 99%.

4  Average results achieved by the GBML algorithms, classical algorithms, CTC45 and CTCHAID

 

| Algorithm | Accuracy | Kappa |
|---|---|---|
| XCS | 77.81±4.12 | 58.66±8.74 |
| SIA | 74.65±3.58 | 52.37±6.96 |
| OCEC | 70.42±4.67 | 48.40±8.10 |
| GAssist | 77.78±3.71 | 59.53±7.31 |
| Oblique-DT | 76.58±3.34 | 57.79±6.47 |
| CART | 73.91±3.91 | 51.97±7.44 |
| AQ | 67.77±5.18 | 42.40±9.71 |
| CN2 | 72.80±3.51 | 44.84±7.46 |
| C4.5 | 77.93±3.19 | 58.51±5.82 |
| C4.5-Rules | 76.59±4.10 | 57.17±7.79 |
| Ripper | 73.96±3.24 | 56.43±7.12 |
| CHAID* | 78.58±3.08 | 59.75±6.41 |
| CTC45 | 77.69±3.49 | 58.16±6.96 |
| CTCHAID | 76.45±3.77 | 56.84±7.08 |

Table 5. Average performance values for all algorithms on standard datasets.

 

| Algorithm | GM |
|---|---|
| UCS | 64.92±16.55 |
| SIA | 69.92±9.49 |
| OCEC | 70.88±10.38 |
| GAssist | 67.58±9.69 |
| Oblique-DT | 75.81±8.55 |
| CART | 69.72±11.51 |
| AQ | 59.81±8.72 |
| CN2 | 45.97±12.09 |
| C4.5 | 73.49±9.52 |
| C4.5-Rules | 75.21±10.23 |
| Ripper | 79.34±8.98 |
| CHAID* | 69.10±8.76 |
| CTC45 | 79.99±8.02 |
| CTCHAID | 80.93±7.70 |

Table 6. Average performance values for all algorithms on imbalanced datasets.

 

| Algorithm | GM |
|---|---|
| XCS | 84.92±5.69 |
| SIA | 81.79±6.56 |
| CORE | 76.19±8.77 |
| GAssist | 83.69±7.26 |
| DT-GA | 80.61±7.66 |
| CART | 70.55±11.69 |
| AQ | 52.52±7.61 |
| CN2 | 63.61±10.39 |
| C4.5 | 80.68±6.09 |
| C4.5-Rules | 81.79±7.36 |
| Ripper | 79.48±10.56 |
| CHAID* | 80.65±6.14 |
| CTC45 | 83.54±6.84 |
| CTCHAID | 82.99±6.99 |

Table 7. Average performance values for all algorithms on imbalanced datasets preprocessed with SMOTE.

 

For the sake of replicability we publish the average results obtained by CTC45 and CTCHAID on all three classification contexts, dataset by dataset.

 



[1] Minimum number of examples to cover from each class in any subsample (2% of the training sample size).

[2] The iris data set is an exception. It is already balanced. Subsamples are smaller than usual.