Additional material for the article “CTCHAID: extending the application of the consolidation
methodology”
The tables in this section summarize the characteristics of each data
set used in the article. Table 1 refers to standard data sets while Table 2 refers to imbalanced data sets.

| Data set | #Atts. | #Examples | #Classes | %min | %maj | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|---|---|
| lymphography | 18 | 148 | 4 | 1.36% | 54.73% | 2 | 81 |
| ecoli | 7 | 336 | 8 | 0.6% | 42.56% | 2 | 143 |
| car | 6 | 1728 | 4 | 3.77% | 70.03% | 65 | 1210 |
| nursery | 8 | 1296 | 5 | 0.08% | 33.34% | 1 | 432 |
| cleveland | 13 | 297 | 5 | 4.38% | 53.88% | 13 | 160 |
| zoo | 17 | 101 | 7 | 3.97% | 40.6% | 4 | 41 |
| glass | 9 | 214 | 6 | 4.21% | 35.52% | 9 | 76 |
| flare | 10 | 1066 | 6 | 4.04% | 31.06% | 43 | 331 |
| abalone | 8 | 418 | 22 | 0.24% | 16.51% | 1 | 69 |
| balance | 4 | 625 | 3 | 7.84% | 46.08% | 49 | 288 |
| dermatology | 33 | 358 | 6 | 5.59% | 31.01% | 20 | 111 |
| hepatitis | 19 | 80 | 2 | 16.25% | 83.75% | 13 | 67 |
| newthyroid | 5 | 215 | 3 | 13.96% | 69.77% | 30 | 150 |
| haberman | 3 | 306 | 2 | 26.48% | 73.53% | 81 | 225 |
| breast | 9 | 277 | 2 | 29.25% | 70.76% | 81 | 196 |
| german | 20 | 1000 | 2 | 30% | 70% | 300 | 700 |
| wisconsin | 9 | 630 | 2 | 34.61% | 65.4% | 218 | 412 |
| contraceptive | 9 | 1473 | 3 | 22.61% | 42.71% | 333 | 629 |
| tictactoe | 9 | 958 | 2 | 34.66% | 65.35% | 332 | 626 |
| pima | 8 | 768 | 2 | 34.9% | 65.11% | 268 | 500 |
| magic | 10 | 1902 | 2 | 35.13% | 64.88% | 668 | 1234 |
| wine | 13 | 178 | 3 | 26.97% | 39.89% | 48 | 71 |
| bupa | 6 | 345 | 2 | 42.03% | 57.98% | 145 | 200 |
| heart | 13 | 270 | 2 | 44.45% | 55.56% | 120 | 150 |
| australian | 14 | 690 | 2 | 44.5% | 55.51% | 307 | 383 |
| crx | 15 | 653 | 2 | 45.33% | 54.68% | 296 | 357 |
| vehicle | 18 | 846 | 4 | 23.53% | 25.77% | 199 | 218 |
| penbased | 16 | 1100 | 10 | 9.55% | 10.46% | 105 | 115 |
| ring | 20 | 740 | 2 | 49.6% | 50.41% | 367 | 373 |
| iris | 4 | 150 | 3 | 33.34% | 33.34% | 50 | 50 |
| Mean | 11.77 | 638.93 | 4.27 | 21% | 50% | 139 | 319.93 |
| Median | 9.5 | 521.5 | 3 | 23% | 54% | 73 | 209 |

Table 1. Description of standard datasets.

| Data set | #Atts. | #Examples | Imbalance | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|
| Abalone19 | 8 | 4174 | 0.77% | 32 | 4142 |
| Yeast6 | 8 | 1484 | 2.49% | 37 | 1447 |
| Yeast5 | 8 | 1484 | 2.96% | 44 | 1440 |
| Yeast4 | 8 | 1484 | 3.43% | 51 | 1433 |
| Yeast2vs8 | 8 | 482 | 4.15% | 20 | 462 |
| Glass5 | 9 | 214 | 4.2% | 9 | 205 |
| Abalone9vs18 | 8 | 731 | 5.65% | 41 | 690 |
| Glass4 | 9 | 214 | 6.07% | 13 | 201 |
| Ecoli4 | 7 | 336 | 6.74% | 23 | 313 |
| Glass2 | 9 | 214 | 8.78% | 19 | 195 |
| Vowel0 | 13 | 988 | 9.01% | 89 | 899 |
| Page-blocks0 | 10 | 5472 | 10.23% | 560 | 4912 |
| Ecoli3 | 7 | 336 | 10.88% | 37 | 299 |
| Yeast3 | 8 | 1484 | 10.98% | 163 | 1321 |
| Glass6 | 9 | 214 | 13.55% | 29 | 185 |
| Segment0 | 19 | 2308 | 14.26% | 329 | 1979 |
| Ecoli2 | 7 | 336 | 15.48% | 52 | 284 |
| New-thyroid1 | 5 | 215 | 16.28% | 35 | 180 |
| New-thyroid2 | 5 | 215 | 16.89% | 36 | 179 |
| Ecoli1 | 7 | 336 | 22.92% | 77 | 259 |
| Vehicle0 | 18 | 846 | 23.64% | 200 | 646 |
| Glass0123vs456 | 9 | 214 | 23.83% | 51 | 163 |
| Haberman | 3 | 306 | 27.42% | 84 | 222 |
| Vehicle1 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle2 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle3 | 18 | 846 | 28.37% | 240 | 606 |
| Yeast1 | 8 | 1484 | 28.91% | 429 | 1055 |
| Glass0 | 9 | 214 | 32.71% | 70 | 144 |
| Iris0 | 4 | 150 | 33.33% | 50 | 100 |
| Pima | 8 | 768 | 34.84% | 268 | 500 |
| Ecoli0vs1 | 7 | 220 | 35% | 77 | 143 |
| Wisconsin | 9 | 683 | 35% | 239 | 444 |
| Glass1 | 9 | 214 | 35.51% | 76 | 138 |
| Mean | 9.39 | 919.94 | 17.61% | 120 | 799.94 |
| Median | 8 | 482 | 15.48% | 52 | 444 |

Table 2. Description of imbalanced datasets.
The tables in this section show the number of subsamples for a coverage
value of 99%. Table 3 refers to standard datasets and Table 4 refers to imbalanced datasets.
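The relation between the coverage value and the number of subsamples (N_S) can be checked numerically. The sketch below assumes that coverage is the probability that a given majority-class example is drawn into at least one of N_S independent subsamples, and that each subsample takes a fixed number of examples per class. The function name and this reading of coverage are assumptions; the methodology section of the article remains the authoritative definition. Under that assumption, the percentage column next to the subsample size corresponds to the per-class subsample size divided by the majority class size of the training sample.

```python
import math


def subsamples_for_coverage(per_class_size, majority_size, coverage=0.99):
    """Smallest N_S such that 1 - (1 - r)**N_S >= coverage,
    where r = per_class_size / majority_size (assumed definition)."""
    if not 0 < per_class_size <= majority_size:
        raise ValueError("per_class_size must be in (0, majority_size]")
    if not 0 < coverage < 1:
        raise ValueError("coverage must be strictly between 0 and 1")
    ratio = per_class_size / majority_size
    if ratio == 1.0:
        # A single subsample already contains the whole majority class.
        return 1
    return math.ceil(math.log(1 - coverage) / math.log(1 - ratio))


# Training-sample figures for three imbalanced datasets listed in this section.
print(subsamples_for_coverage(26, 3314))  # Abalone19 -> 585
print(subsamples_for_coverage(30, 1158))  # Yeast6    -> 176
print(subsamples_for_coverage(61, 111))   # Glass1    -> 6
```

For standard datasets with very high ratios (vehicle, penbased, ring) Table 3 reports an N_S of 3 where this formula gives 2, so the article presumably applies further rules that are not modelled here.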
For standard datasets the MinCover column
represents the minimum number of examples of each class as stated by the rule
and exceptions in the methodology section of the article. The data sets where
the size of classes in the subsamples is enforced by the MinCover, as opposed to
the size of the minority class, are highlighted in bold.
For imbalanced data sets preprocessed with
SMOTE, only the total example number and the size of the minority class change
from the data sets without the preprocessing. In these data sets the minority
class has been oversampled with SMOTE until it has the same size as the
majority class.
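As a small worked example of the size change described above, the sketch below recomputes the dataset sizes after oversampling, assuming the minority class is grown to exactly the size of the majority class. The helper name is hypothetical and it only performs the arithmetic; it does not run SMOTE itself.

```python
def sizes_after_balancing(minority_size, majority_size):
    """Return the class sizes once the minority class has been
    oversampled to match the majority class (assumed 1:1 target)."""
    if minority_size <= 0 or majority_size < minority_size:
        raise ValueError("expected 0 < minority_size <= majority_size")
    new_minority = majority_size
    total = new_minority + majority_size
    return {
        "total": total,
        "minority": new_minority,
        "majority": majority_size,
        "pct_min": 100.0 * new_minority / total,
    }


# Abalone19 as listed in Table 2: 32 minority and 4142 majority examples.
print(sizes_after_balancing(32, 4142))
# {'total': 8284, 'minority': 4142, 'majority': 4142, 'pct_min': 50.0}
```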

| Data set | Original Size | Original #Class | Original %Min | Training sample Size | Training sample Min. Class Size | Training sample MinCover[1] | Training sample Maj. Class Size | Subsample Size | Subsample % | Coverage 99% N_S |
|---|---|---|---|---|---|---|---|---|---|---|
| lymphography | 148 | 4 | 1.36% | 119 | 2 | 2 | 66 | 12 | 4.55% | 99 |
| ecoli | 336 | 8 | 0.6% | 269 | 2 | 3 | 115 | 48 | 5.22% | 86 |
| car | 1728 | 4 | 3.77% | 1383 | 53 | 14 | 969 | 212 | 5.47% | 82 |
| nursery | 1296 | 5 | 0.08% | 1037 | 1 | 11 | 346 | 105 | 6.07% | 74 |
| cleveland | 297 | 5 | 4.38% | 238 | 11 | 3 | 129 | 55 | 8.53% | 52 |
| zoo | 101 | 7 | 3.97% | 81 | 4 | 1 | 33 | 28 | 12.13% | 36 |
| glass | 214 | 6 | 4.21% | 172 | 8 | 2 | 62 | 48 | 12.91% | 34 |
| flare | 1066 | 6 | 4.04% | 853 | 35 | 9 | 265 | 210 | 13.21% | 33 |
| abalone | 418 | 22 | 0.24% | 335 | 1 | 4 | 56 | 154 | 12.5% | 35 |
| balance | 625 | 3 | 7.84% | 500 | 40 | 5 | 231 | 120 | 17.32% | 25 |
| dermatology | 358 | 6 | 5.59% | 287 | 17 | 3 | 89 | 102 | 19.11% | 22 |
| hepatitis | 80 | 2 | 16.25% | 64 | 11 | 1 | 54 | 22 | 20.38% | 21 |
| newthyroid | 215 | 3 | 13.96% | 172 | 24 | 2 | 120 | 72 | 20% | 21 |
| haberman | 306 | 2 | 26.48% | 245 | 65 | 3 | 181 | 130 | 35.92% | 11 |
| breast | 277 | 2 | 29.25% | 222 | 65 | 3 | 158 | 130 | 41.14% | 9 |
| german | 1000 | 2 | 30% | 800 | 240 | 8 | 560 | 480 | 42.86% | 9 |
| wisconsin | 630 | 2 | 34.61% | 504 | 175 | 6 | 330 | 350 | 53.04% | 7 |
| contraceptive | 1473 | 3 | 22.61% | 1179 | 267 | 12 | 504 | 801 | 52.98% | 7 |
| tictactoe | 958 | 2 | 34.66% | 767 | 266 | 8 | 502 | 532 | 52.99% | 7 |
| pima | 768 | 2 | 34.9% | 615 | 215 | 7 | 401 | 430 | 53.62% | 6 |
| magic | 1902 | 2 | 35.13% | 1522 | 535 | 16 | 988 | 1070 | 54.15% | 6 |
| wine | 178 | 3 | 26.97% | 143 | 39 | 2 | 58 | 117 | 67.25% | 5 |
| bupa | 345 | 2 | 42.03% | 276 | 116 | 3 | 160 | 232 | 72.5% | 4 |
| heart | 270 | 2 | 44.45% | 216 | 96 | 3 | 120 | 192 | 80% | 3 |
| australian | 690 | 2 | 44.5% | 552 | 246 | 6 | 307 | 492 | 80.14% | 3 |
| crx | 653 | 2 | 45.33% | 523 | 238 | 6 | 286 | 476 | 83.22% | 3 |
| vehicle | 846 | 4 | 23.53% | 677 | 160 | 7 | 175 | 640 | 91.43% | 3 |
| penbased | 1100 | 10 | 9.55% | 880 | 84 | 9 | 92 | 840 | 91.31% | 3 |
| ring | 740 | 2 | 49.6% | 592 | 294 | 6 | 299 | 588 | 98.33% | 3 |
| iris[2] | 150 | 3 | 33.34% | 120 | 40 | 2 | 40 | 66 | 55% | 6 |
| Mean | 638.94 | 4.27 | 22% | 511.44 | 111.67 | 5.57 | 256.54 | 291.8 | 43% | 24 |
| Median | 521.5 | 3 | 23.07% | 417.5 | 59 | 4.5 | 167.5 | 173 | 42% | 9 |

Table 3. Subsample numbers for standard datasets to
achieve a coverage of 99%.

| Data set | Original Size | Original %Min | Training sample Size | Training sample Min. Class Size | Training sample Maj. Class Size | Subsample Size | Subsample % | Coverage 99% N_S |
|---|---|---|---|---|---|---|---|---|
| Abalone19 | 4174 | 0.77% | 3340 | 26 | 3314 | 52 | 0.78% | 585 |
| Yeast6 | 1484 | 2.49% | 1188 | 30 | 1158 | 60 | 2.59% | 176 |
| Yeast5 | 1484 | 2.96% | 1189 | 36 | 1153 | 72 | 3.12% | 146 |
| Yeast4 | 1484 | 3.43% | 1188 | 41 | 1147 | 82 | 3.57% | 127 |
| Yeast2vs8 | 482 | 4.15% | 387 | 17 | 370 | 34 | 4.59% | 98 |
| Glass5 | 214 | 4.2% | 173 | 8 | 165 | 16 | 4.85% | 93 |
| Abalone9vs18 | 731 | 5.65% | 586 | 34 | 552 | 68 | 6.16% | 73 |
| Glass4 | 214 | 6.07% | 172 | 11 | 161 | 22 | 6.83% | 66 |
| Ecoli4 | 336 | 6.74% | 270 | 19 | 251 | 38 | 7.57% | 59 |
| Glass2 | 214 | 8.78% | 173 | 16 | 157 | 32 | 10.19% | 43 |
| Vowel0 | 988 | 9.01% | 792 | 72 | 720 | 144 | 10% | 44 |
| Page-blocks0 | 5472 | 10.23% | 4378 | 448 | 3930 | 896 | 11.4% | 39 |
| Ecoli3 | 336 | 10.88% | 270 | 30 | 240 | 60 | 12.5% | 35 |
| Yeast3 | 1484 | 10.98% | 1188 | 131 | 1057 | 262 | 12.39% | 35 |
| Glass6 | 214 | 13.55% | 173 | 24 | 149 | 48 | 16.11% | 27 |
| Segment0 | 2308 | 14.26% | 1848 | 264 | 1584 | 528 | 16.67% | 26 |
| Ecoli2 | 336 | 15.48% | 270 | 42 | 228 | 84 | 18.42% | 23 |
| New-thyroid1 | 215 | 16.28% | 173 | 29 | 144 | 58 | 20.14% | 21 |
| New-thyroid2 | 215 | 16.89% | 173 | 30 | 143 | 60 | 20.98% | 20 |
| Ecoli1 | 336 | 22.92% | 270 | 62 | 208 | 124 | 29.81% | 14 |
| Vehicle0 | 846 | 23.64% | 677 | 160 | 517 | 320 | 30.95% | 13 |
| Glass0123vs456 | 214 | 23.83% | 172 | 41 | 131 | 82 | 31.3% | 13 |
| Haberman | 306 | 27.42% | 246 | 68 | 178 | 136 | 38.2% | 10 |
| Vehicle1 | 846 | 28.37% | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Vehicle2 | 846 | 28.37% | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Vehicle3 | 846 | 28.37% | 678 | 193 | 485 | 386 | 39.79% | 10 |
| Yeast1 | 1484 | 28.91% | 1188 | 344 | 844 | 688 | 40.76% | 9 |
| Glass0 | 214 | 32.71% | 172 | 56 | 116 | 112 | 48.28% | 7 |
| Iris0 | 150 | 33.33% | 121 | 40 | 81 | 80 | 49.38% | 7 |
| Pima | 768 | 34.84% | 616 | 215 | 401 | 430 | 53.62% | 6 |
| Ecoli0vs1 | 220 | 35% | 177 | 62 | 115 | 124 | 53.91% | 6 |
| Wisconsin | 683 | 35% | 548 | 192 | 356 | 384 | 53.93% | 6 |
| Glass1 | 214 | 35.51% | 172 | 61 | 111 | 122 | 54.95% | 6 |
| Mean | 919.94 | 17.61% | 737.09 | 96.61 | 640.48 | 193.21 | 24.04% | 56 |
| Median | 482 | 15.48% | 387 | 42 | 356 | 84 | 18.42% | 23 |

Table 4. Subsample numbers for imbalanced datasets to
achieve a coverage of 99%.

| Algorithm | Accuracy | Kappa |
|---|---|---|
| XCS | 77.81±4.12 | 58.66±8.74 |
| SIA | 74.65±3.58 | 52.37±6.96 |
| OCEC | 70.42±4.67 | 48.40±8.10 |
| GAssist | 77.78±3.71 | 59.53±7.31 |
| Oblique-DT | 76.58±3.34 | 57.79±6.47 |
| CART | 73.91±3.91 | 51.97±7.44 |
| AQ | 67.77±5.18 | 42.40±9.71 |
| CN2 | 72.80±3.51 | 44.84±7.46 |
| C4.5 | 77.93±3.19 | 58.51±5.82 |
| C4.5-Rules | 76.59±4.10 | 57.17±7.79 |
| Ripper | 73.96±3.24 | 56.43±7.12 |
| CHAID* | 78.58±3.08 | 59.75±6.41 |
| CTC45 | 77.69±3.49 | 58.16±6.96 |
| CTCHAID | 76.45±3.77 | 56.84±7.08 |

Table 5. Average performance values for all algorithms
on standard datasets.

| Algorithm | GM |
|---|---|
| UCS | 64.92±16.55 |
| SIA | 69.92±9.49 |
| OCEC | 70.88±10.38 |
| GAssist | 67.58±9.69 |
| Oblique-DT | 75.81±8.55 |
| CART | 69.72±11.51 |
| AQ | 59.81±8.72 |
| CN2 | 45.97±12.09 |
| C4.5 | 73.49±9.52 |
| C4.5-Rules | 75.21±10.23 |
| Ripper | 79.34±8.98 |
| CHAID* | 69.10±8.76 |
| CTC45 | 79.99±8.02 |
| CTCHAID | 80.93±7.70 |

Table 6. Average performance values for all algorithms
on imbalanced datasets.
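Table 6 reports GM without defining it in this material. Assuming GM stands for the geometric mean of the true positive rate and the true negative rate, a common measure for two-class imbalanced problems, it could be computed from confusion-matrix counts as sketched below. The function name and the example counts are purely illustrative and are not taken from the article.

```python
import math


def geometric_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR).
    Assumed meaning of GM; returns a value in [0, 1]."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Illustrative counts: 10 minority examples, 100 majority examples.
gm = geometric_mean(tp=8, fn=2, tn=81, fp=19)
print(f"{100 * gm:.2f}")  # 80.50
```

Under this definition a classifier that always predicts the majority class scores 0, which is why the measure is often preferred over accuracy when classes are imbalanced.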

| Algorithm | GM |
|---|---|
| XCS | 84.92±5.69 |
| SIA | 81.79±6.56 |
| CORE | 76.19±8.77 |
| GAssist | 83.69±7.26 |
| DT-GA | 80.61±7.66 |
| CART | 70.55±11.69 |
| AQ | 52.52±7.61 |
| CN2 | 63.61±10.39 |
| C4.5 | 80.68±6.09 |
| C4.5-Rules | 81.79±7.36 |
| Ripper | 79.48±10.56 |
| CHAID* | 80.65±6.14 |
| CTC45 | 83.54±6.84 |
| CTCHAID | 82.99±6.99 |

Table 7. Average performance values for all algorithms
on imbalanced datasets preprocessed with SMOTE.
For the
sake of replicability we publish the average results obtained by CTC45 and
CTCHAID on all three classification contexts, dataset by dataset.