FeatureFinder

FeatureFinder is designed to give comprehensive and accurate sets of features which can be used in modelling, either to build a new model or to enhance and diagnose an existing model. Both methods are available through a single function, FindFeatures. The following give examples for each method.

Identifying Features for a model target

If you have not yet built a model, findFeatures can be used to identify promising features. A typical modelling scenario involves a table consisting of a set of predictors {x_i} and a model target y.

Identifying Features for a model residual

If you already have a model, and want to find features to further improve it, findFeatures can be (repeatedly) run using residuals. A typical modelling scenario involves a table consisting of a set of predictors {x_i} and a model residual r = y − p, where p is a prediction from a previously-fit model and y is the model target.

Scanning over partitions of the data

The function generates a decision tree for the entire table, as well as decision trees for every possible subset of the table. Subsets are defined using factor-valued columns in the data. Fators can either be user-defined or already included as predictors. The more factors that are created in the data, the more partitions that will be tested and so it is helpful to create a comprehensive set of factors for this purpose. One easy way to create factors is to bin each predictor into 10 bands and create a factor for each. Factor labels should be prefixed with a string character so that they are interpreted as factors and not numerics, for example “s1”, “s2” rather than “1”,“2” labels.

The feature-finding process

For the case where no model has been fitted yet, we simply define r = y, and for the case when a model has been fitted already we use the residual r = y − p. We supply a single table consisting of all predictors together with residual = r, actual = y, expected = p and call the findFeatures function.

Each decision tree will consist of an rpart tree as shown in the following example. Leaves are labelled with a residual and a leaf volume n. Nodes are also labelled with the cut-rule for the node, and these are used to identify the leaves. The code scans each leaf for cases with sufficiently high volume and a residual value which exceeds a user-specified threshold. These parameters are outlined in help(findFeatures). When a leaf meeting the criteria is found, it is printed in the txt file for the partition being scanned. These text files can then be manually or automatically parsed and included in models as required. Often a leaf will be a clue rather than the final form of a feature, and so manual inspection can be of assistance.

The summary of residual nodes according to user-specified criteria for residual value and leaf volume will be generated in txt files (for example treesAll.txt and allfactors\.[partitionvariable]factor.txt). These contain a summary of each significant term with its definition, volume and other parameters as shown:

Partition variable (categorical factor)
The level of the partition variable
Residual
Leaf volume vs partition volume (percentage)
Leaf volume
Partition volume
Expected value average in leaf
Actual value average in leaf
Residual (checkval)
Leaf definition within this partition of the partition variable

In the examples, partitioning enables significant leaves to be found for each partition, although the full dataset does not yield leaves in the fitted tree. This illustrates the benefits of the partitioning technique.

Standard dataset

Once features are found, they can be customised and added to the model as shown:

library(featurefinder)
data(mpgdata)
data=mpgdata

# define some categorical factors here, for use in partition scanning. Define as many as desired.
data$transfactor=paste("trans",as.matrix(data$trans),sep="")
data$transfactor=as.factor(data$transfactor)

# define data dimensions
n=dim(data)[1] # total dimension
nn=floor(dim(data)[1]/2) # split point for training and test
data=as.data.frame(data)
nm=names(data)
nm[8]='y' ## select a column to be the target of the model
names(data)=nm

data0=data # retain full dataset
data=data[c("manufacturer","displ","year","transfactor","y") ] # select a subset for our first model

firstmodel=lm(formula=y ~ .,data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

## 
## Call:
## lm(formula = y ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9945 -0.9118 -0.0586  0.8697 13.0827 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -132.89410   89.52660  -1.484 0.139214    
## manufacturerchevrolet         3.71471    0.99010   3.752 0.000228 ***
## manufacturerdodge             0.32889    0.82052   0.401 0.688960    
## manufacturerford              1.53629    0.87105   1.764 0.079247 .  
## manufacturerhonda             5.02595    0.94628   5.311 2.79e-07 ***
## manufacturerhyundai           0.80977    0.86008   0.942 0.347537    
## manufacturerjeep              1.45035    1.09554   1.324 0.187002    
## manufacturerland rover       -1.63853    1.31376  -1.247 0.213725    
## manufacturerlincoln           1.09111    1.59582   0.684 0.494907    
## manufacturermercury           0.76484    1.34193   0.570 0.569323    
## manufacturernissan            1.83283    0.86219   2.126 0.034700 *  
## manufacturerpontiac           3.04917    1.26452   2.411 0.016763 *  
## manufacturersubaru            1.23808    0.89394   1.385 0.167547    
## manufacturertoyota            2.01600    0.71210   2.831 0.005095 ** 
## manufacturervolkswagen        2.69745    0.69932   3.857 0.000153 ***
## displ                        -2.39932    0.20642 -11.623  < 2e-16 ***
## year                          0.07908    0.04459   1.773 0.077627 .  
## transfactortransauto(l3)     -0.32843    2.02987  -0.162 0.871620    
## transfactortransauto(l4)     -2.32296    1.20724  -1.924 0.055696 .  
## transfactortransauto(l5)     -2.82813    1.17031  -2.417 0.016531 *  
## transfactortransauto(l6)     -2.05385    1.47940  -1.388 0.166530    
## transfactortransauto(s4)     -0.84376    1.79665  -0.470 0.639112    
## transfactortransauto(s5)     -1.49049    1.67264  -0.891 0.373906    
## transfactortransauto(s6)     -1.86670    1.23773  -1.508 0.133029    
## transfactortransmanual(m5)   -1.41601    1.19345  -1.186 0.236786    
## transfactortransmanual(m6)   -1.91586    1.18808  -1.613 0.108353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.216 on 208 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7289 
## F-statistic: 26.06 on 25 and 208 DF,  p-value: < 2.2e-16

# drop terms that are not significant and refit model
data$manufacturerchevrolet=(data$manufacturer=='chevrolet')
data$manufacturerford=(data$manufacturer=='ford')
data$manufacturerhonda=(data$manufacturer=='honda')
data$manufacturernissan=(data$manufacturer=='nissan')
data$manufacturerpontiac=(data$manufacturer=='pontiac')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$manufacturervolkswagen=(data$manufacturer=='volkswagen')
#data$displ
data$transfactortransautol4=(data$transfactor=='transauto(l4)')
data$transfactortransautol5=(data$transfactor=='transauto(l5)')
firstmodel=lm(formula=y ~ manufacturerchevrolet+
                manufacturerford+
                manufacturerhonda+
                manufacturernissan+
                manufacturerpontiac+
                manufacturertoyota+
                manufacturervolkswagen+
                displ+
                year
                #transfactortransautol4
                #transfactortransautol5
               , data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4844 -1.0884 -0.0168  1.0811 13.4786 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -153.23187   66.81224  -2.293 0.022748 *  
## manufacturerchevroletTRUE     3.37717    0.60323   5.598 6.29e-08 ***
## manufacturerfordTRUE          1.19666    0.52566   2.276 0.023762 *  
## manufacturerhondaTRUE         4.28807    0.82778   5.180 4.94e-07 ***
## manufacturernissanTRUE        1.88155    0.66320   2.837 0.004971 ** 
## manufacturerpontiacTRUE       2.70460    1.03239   2.620 0.009401 ** 
## manufacturertoyotaTRUE        1.61596    0.45521   3.550 0.000469 ***
## manufacturervolkswagenTRUE    2.20946    0.52363   4.220 3.56e-05 ***
## displ                        -2.59037    0.14857 -17.435  < 2e-16 ***
## year                          0.08878    0.03337   2.660 0.008375 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.247 on 224 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7212 
## F-statistic: 67.98 on 9 and 224 DF,  p-value: < 2.2e-16

CSVPath=tempdir()
data1=cbind(data0,expected, actual, residual)
fcsv=paste(CSVPath,"/mpgdata.csv",sep="")
write.csv(data1[(nn+1):(length(data1$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=20,
         treeSummaryMinBucket=30,
         useSubDir=FALSE,
         tempDirFolderName="mpg")

## [1] 20
## [1] 30
## [1] 0
## [1] 0
## [1] TRUE
## [1] 20
## [1] "Doing level 1: ALL"
## [1] 20

## 
##  node number: 2 
##    root
##    model=4runner 4wd,gti,land cruiser wagon 4wd,passat,pathfinder 4wd,range rover,tiburon,toyota tacoma 4wd
## 
##  node number: 6 
##    root
##    model=altima,camry,camry solara,corolla,forester awd,grand cherokee 4wd,grand prix,impreza awd,jetta,maxima,mountaineer 4wd,navigator 2wd,new beetle
##    hwy< 28.5
## 
##  node number: 7 
##    root
##    model=altima,camry,camry solara,corolla,forester awd,grand cherokee 4wd,grand prix,impreza awd,jetta,maxima,mountaineer 4wd,navigator 2wd,new beetle
##    hwy>=28.5
## [1] 2
## [1] "RESIDUAL:: ALL:ALL :: 0.283(45.3%: 53 of 117 in tree, E=16.1, A=16.5, residual=0.405) :: model=altima,camry,camry solara,corolla,forester awd,grand cherokee 4wd,grand prix,impreza awd,jetta,maxima,mountaineer 4wd,navigator 2wd,new beetle and hwy< 28.5"
## [1] 1

# potential terms identified in residual scan
# RESIDUAL: ALL,ALL,0.575,34.2,40,117,16,16.6,0.575,model< 16.5 and hwy< 28.5 and 
#                                                   manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru
# RESIDUAL: fl,r,1.11,38.4,33,86,20.1,21.2,1.11,hwy>=26.5
# RESIDUAL: class,compact,0.816,100,32,32,0,0,0,NA and root
# RESIDUAL: manufacturer,toyota,7.14e-13,100,34,34,0,0,0,NA and root
# RESIDUAL: trans,manual(m5),0.566,100,36,36,0,0,0,NA and root
# RESIDUAL: transfactor,transmanual(m5),0.566,100,36,36,0,0,0,NA and root

# add terms to dataset and refit
data$hwy=data0$hwy
data$fl=data0$fl
data$model=as.numeric(as.factor(data0$model))
data$model16hwy28manufacturer=(data$model< 16.5) & (data$hwy< 28.5)&(data$manufacturer=="jeep"|data$manufacturer=="lincoln"|data$manufacturer=="mercury"|data$manufacturer=="nissan"|data$manufacturer=="pontiac"|data$manufacturer=="subaru")
data$flr_hwy26=(data$fl=="r") & (data$hwy>=26.5)
data$transfactortransmanualm5=(data$transfactor=='transmanual(m5)')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$classcompact=(data0$class=='compact')
data$flr=(data$fl=='r')
secondmodel=lm(formula=y ~ manufacturerchevrolet+
                            manufacturerford+
                            manufacturerhonda+
                            manufacturernissan+
                            manufacturerpontiac+
                            manufacturertoyota+
                            manufacturervolkswagen+
                            displ+
                            year+
                            # new terms
                            #model16hwy28manufacturer+
                            flr_hwy26+
                            transfactortransmanualm5+
                            manufacturertoyota
                            #classcompact+
                            #flr
               , data=data)
expected=predict(secondmodel,data)

summary(firstmodel)

## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4844 -1.0884 -0.0168  1.0811 13.4786 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -153.23187   66.81224  -2.293 0.022748 *  
## manufacturerchevroletTRUE     3.37717    0.60323   5.598 6.29e-08 ***
## manufacturerfordTRUE          1.19666    0.52566   2.276 0.023762 *  
## manufacturerhondaTRUE         4.28807    0.82778   5.180 4.94e-07 ***
## manufacturernissanTRUE        1.88155    0.66320   2.837 0.004971 ** 
## manufacturerpontiacTRUE       2.70460    1.03239   2.620 0.009401 ** 
## manufacturertoyotaTRUE        1.61596    0.45521   3.550 0.000469 ***
## manufacturervolkswagenTRUE    2.20946    0.52363   4.220 3.56e-05 ***
## displ                        -2.59037    0.14857 -17.435  < 2e-16 ***
## year                          0.08878    0.03337   2.660 0.008375 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.247 on 224 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7212 
## F-statistic: 67.98 on 9 and 224 DF,  p-value: < 2.2e-16

summary(secondmodel)

## 
## Call:
## lm(formula = y ~ manufacturerchevrolet + manufacturerford + manufacturerhonda + 
##     manufacturernissan + manufacturerpontiac + manufacturertoyota + 
##     manufacturervolkswagen + displ + year + flr_hwy26 + transfactortransmanualm5 + 
##     manufacturertoyota, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0734 -1.2143 -0.0734  0.8966 13.5541 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -136.72051   67.18819  -2.035 0.043051 *  
## manufacturerchevroletTRUE       2.97267    0.59233   5.019 1.07e-06 ***
## manufacturerfordTRUE            0.99000    0.50859   1.947 0.052852 .  
## manufacturerhondaTRUE           3.75090    0.80683   4.649 5.72e-06 ***
## manufacturernissanTRUE          1.60842    0.64179   2.506 0.012923 *  
## manufacturerpontiacTRUE         2.22541    1.01005   2.203 0.028604 *  
## manufacturertoyotaTRUE          1.10726    0.45313   2.444 0.015322 *  
## manufacturervolkswagenTRUE      2.18570    0.50406   4.336 2.20e-05 ***
## displ                          -2.28853    0.15787 -14.497  < 2e-16 ***
## year                            0.07985    0.03356   2.379 0.018196 *  
## flr_hwy26TRUE                   1.60708    0.43107   3.728 0.000245 ***
## transfactortransmanualm5TRUE    0.70155    0.36450   1.925 0.055546 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.161 on 222 degrees of freedom
## Multiple R-squared:  0.7544, Adjusted R-squared:  0.7423 
## F-statistic:    62 on 11 and 222 DF,  p-value: < 2.2e-16

# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")

## These 1 new feature(s) were added to your data ('df'):
## [1] "df$auto_feat1 <- as.numeric(df$model %in% c('altima','camry','camry solara','corolla','forester awd','grand cherokee 4wd','grand prix','impreza awd','jetta','maxima','mountaineer 4wd','navigator 2wd','new beetle and hwy< 28.5'))"
## df$auto_feat1 <- as.numeric(df$model %in% c('altima','camry','camry solara','corolla','forester awd','grand cherokee 4wd','grand prix','impreza awd','jetta','maxima','mountaineer 4wd','navigator 2wd','new beetle and hwy< 28.5'))

head(dataWithNewFeatures)

##   manufacturer model displ year cyl      trans drv  y hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f 18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f 21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f 20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f 21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f 16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f 18  26  p compact
##       transfactor auto_feat1
## 1   transauto(l5)          0
## 2 transmanual(m5)          0
## 3 transmanual(m6)          0
## 4   transauto(av)          0
## 5   transauto(l5)          0
## 6 transmanual(m5)          0

# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# http://www.public.iastate.edu/~hofmann/data_in_r_sortable.html

The adjusted R-squared has improved from 0.721 to 0.745 as a result of the newly found features.

A more challenging dataset

A more challenging dataset is stock index data, where the target is to predict future relative movements of two indices, DAX and SMI. FindFeatures is able to identify features which can be added to the model as shown:

library(featurefinder)
data=futuresdata
data$SMIfactor=paste("smi",as.matrix(data$SMIfactor),sep="")
n=length(data$DAX)
nn=floor(length(data$DAX)/2)

# Can we predict the relative movement of DAX and SMI?
data$y=data$DAX*0 # initialise the target to 0
data$y[1:(n-1)]=((data$DAX[2:n])-(data$DAX[1:(n-1)]))/
  (data$DAX[1:(n-1)])-(data$SMI[2:n]-(data$SMI[1:(n-1)]))/(data$SMI[1:(n-1)])

# Fit a simple model
firstmodel=lm(formula=y ~ DAX+SMI+
                            #CAC+
                            FTSE,
                            #SMIfactorsmi1
                            data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
data0=data
data=cbind(data,expected, actual, residual)

CSVPath=tempdir()
fcsv=paste(CSVPath,"/futuresdata.csv",sep="")
write.csv(data[(nn+1):(length(data$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=30,
         treeSummaryMinBucket=50,
         useSubDir=FALSE,
         tempDirFolderName="futures")

## [1] 30
## [1] 50
## [1] 0
## [1] 0
## [1] TRUE
## [1] 20
## [1] "Doing level 1: ALL"
## [1] 30

## 
##  node number: 1 
##    root
## [1] 1

newfeat1=((data$SMIfactor=="smi0") & (data$CAC < 2253) & (data$CAC< 1998) & (data$CAC>=1882)) * 1.0
newfeat2=((data$SMIfactor=="smi1") & (data$SMI < 7837) & (data$SMI >= 7499)) * 1.0
newfeatures=cbind(newfeat1, newfeat2) # create columns for the newly found features
datanew=cbind(data0,newfeatures)

secondmodel=lm(formula=y ~ DAX+SMI+
                           #CAC+
                           FTSE+
                           #SMIfactorsmi1+
                           newfeat1+newfeat2,
                data=datanew[,])
expectednew=predict(secondmodel,datanew)

require(Metrics)
OriginalRMSE = rmse(data$y,expected)
NewRMSE = rmse(data$y,expectednew)

print(paste("OriginalRMSE = ",OriginalRMSE))

## [1] "OriginalRMSE =  0.00756638670538071"

print(paste("NewRMSE = ",NewRMSE))

## [1] "NewRMSE =  0.00752004575897434"

summary(firstmodel)

## 
## Call:
## lm(formula = y ~ DAX + SMI + FTSE, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.047277 -0.004239  0.000183  0.004355  0.031747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.099e-03  2.372e-03   2.994 0.002794 ** 
## DAX         -4.765e-06  1.289e-06  -3.697 0.000224 ***
## SMI          5.354e-06  1.318e-06   4.064 5.03e-05 ***
## FTSE        -3.722e-06  1.345e-06  -2.768 0.005690 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007575 on 1856 degrees of freedom
## Multiple R-squared:  0.009349,   Adjusted R-squared:  0.007747 
## F-statistic: 5.838 on 3 and 1856 DF,  p-value: 0.0005745

summary(secondmodel)

## 
## Call:
## lm(formula = y ~ DAX + SMI + FTSE + newfeat1 + newfeat2, data = datanew[, 
##     ])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046980 -0.004204  0.000177  0.004306  0.032340 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.962e-03  2.359e-03   2.951  0.00320 ** 
## DAX         -5.724e-06  1.302e-06  -4.395 1.17e-05 ***
## SMI          5.711e-06  1.314e-06   4.346 1.46e-05 ***
## FTSE        -3.458e-06  1.347e-06  -2.567  0.01034 *  
## newfeat1     1.217e-03  4.266e-04   2.853  0.00438 ** 
## newfeat2     4.580e-03  1.220e-03   3.753  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007532 on 1854 degrees of freedom
## Multiple R-squared:  0.02145,    Adjusted R-squared:  0.01881 
## F-statistic: 8.126 on 5 and 1854 DF,  p-value: 1.345e-07

# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
head(dataWithNewFeatures)

##       DAX    SMI    CAC   FTSE SMIfactor            y
## 1 1628.75 1678.1 1772.8 2443.6      smi0 -0.015480678
## 2 1613.63 1688.5 1750.5 2460.2      smi0  0.001450780
## 3 1606.51 1678.6 1718.0 2448.2      smi0  0.005767910
## 4 1621.04 1684.1 1708.1 2470.4      smi0 -0.003261110
## 5 1618.16 1686.6 1723.1 2484.7      smi0  0.004227839
## 6 1610.61 1671.6 1714.3 2466.8      smi0  0.005744589

The newly discovered features are statistically significant, with improved adjusted R-squared, lower residual errors and improved F-statistic and p-value.