--- title: "FeatureFinder" author: "Richard Davis" date: "`r Sys.Date()`" output: html_document: toc: yes vignette: > %\VignetteIndexEntry{FeatureFinder} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\ImageOutputResolu --- FeatureFinder is designed to give comprehensive and accurate sets of features which can be used in modelling, either to build a new model or to enhance and diagnose an existing model. Both methods are available through a single function, FindFeatures. The following give examples for each method. ## Identifying Features for a model target If you have not yet built a model, findFeatures can be used to identify promising features. A typical modelling scenario involves a table consisting of a set of predictors $\{x_i\}$ and a model target $y$. ## Identifying Features for a model residual If you already have a model, and want to find features to further improve it, findFeatures can be (repeatedly) run using residuals. A typical modelling scenario involves a table consisting of a set of predictors $\{x_i\}$ and a model residual $r = y - p$, where $p$ is a prediction from a previously-fit model and $y$ is the model target. ## Scanning over partitions of the data The function generates a decision tree for the entire table, as well as decision trees for every possible subset of the table. Subsets are defined using factor-valued columns in the data. Fators can either be user-defined or already included as predictors. The more factors that are created in the data, the more partitions that will be tested and so it is helpful to create a comprehensive set of factors for this purpose. One easy way to create factors is to bin each predictor into 10 bands and create a factor for each. Factor labels should be prefixed with a string character so that they are interpreted as factors and not numerics, for example "s1", "s2" rather than "1","2" labels. ## The feature-finding process For the case where no model has been fitted yet, we simply define $r=y$, and for the case when a model has been fitted already we use the residual $r = y - p$. We supply a single table consisting of all predictors together with $residual=r, actual=y,expected=p$ and call the findFeatures function. Each decision tree will consist of an rpart tree as shown in the following example. Leaves are labelled with a residual and a leaf volume $n$. Nodes are also labelled with the cut-rule for the node, and these are used to identify the leaves. The code scans each leaf for cases with sufficiently high volume and a residual value which exceeds a user-specified threshold. These parameters are outlined in help(findFeatures). When a leaf meeting the criteria is found, it is printed in the txt file for the partition being scanned. These text files can then be manually or automatically parsed and included in models as required. Often a leaf will be a clue rather than the final form of a feature, and so manual inspection can be of assistance. ```{r, echo=FALSE, out.width="850px"} knitr::include_graphics("./vignettefigures/vignettefigure.png") ``` The summary of residual nodes according to user-specified criteria for residual value and leaf volume will be generated in txt files (for example treesAll.txt and allfactors\\.[partitionvariable]factor.txt). 
## The feature-finding process

For the case where no model has been fitted yet, we simply define $r = y$; for the case where a model has already been fitted, we use the residual $r = y - p$. We supply a single table consisting of all predictors together with the columns $residual = r$, $actual = y$ and $expected = p$, and call the findFeatures function.

Each decision tree consists of an rpart tree, as shown in the following example. Leaves are labelled with a residual and a leaf volume $n$. Nodes are also labelled with the cut-rule for the node, and these are used to identify the leaves. The code scans each leaf for cases with sufficiently high volume and a residual value exceeding a user-specified threshold; these parameters are outlined in help(findFeatures). When a leaf meeting the criteria is found, it is printed in the txt file for the partition being scanned. These text files can then be parsed, manually or automatically, and the resulting features included in models as required. Often a leaf will be a clue rather than the final form of a feature, so manual inspection can be of assistance.

```{r, echo=FALSE, out.width="850px"}
knitr::include_graphics("./vignettefigures/vignettefigure.png")
```

The summary of residual nodes meeting the user-specified criteria for residual value and leaf volume is written to txt files (for example treesAll.txt and allfactors.[partitionvariable]factor.txt). These files contain a summary of each significant term with its definition, volume and other parameters, in the following order (a parsing sketch follows the list):

* Partition variable (categorical factor)
* The level of the partition variable
* Residual
* Leaf volume vs partition volume (percentage)
* Leaf volume
* Partition volume
* Expected value average in leaf
* Actual value average in leaf
* Residual (checkval)
* Leaf definition within this partition of the partition variable
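A minimal parsing sketch, assuming each summary line uses the comma-separated layout above; the helper name is illustrative and the example line is taken from the mpg scan output shown later. Because the leaf definition may itself contain commas, only the first nine fields are split:

```{r, eval=FALSE}
# Illustrative parser for one RESIDUAL summary line (not part of the package).
parseResidualLine=function(line) {
  body =sub(".*RESIDUAL:\\s*", "", line)
  parts=strsplit(body, ",")[[1]]
  list(partitionVariable  = parts[1],
       level              = parts[2],
       residual           = as.numeric(parts[3]),
       leafVsPartitionPct = as.numeric(parts[4]),
       leafVolume         = as.numeric(parts[5]),
       partitionVolume    = as.numeric(parts[6]),
       expectedAvg        = as.numeric(parts[7]),
       actualAvg          = as.numeric(parts[8]),
       checkval           = as.numeric(parts[9]),
       definition         = paste(parts[-(1:9)], collapse=","))
}
parseResidualLine("RESIDUAL: fl,r,1.11,38.4,33,86,20.1,21.2,1.11,hwy>=26.5")
```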
In the examples, partitioning enables significant leaves to be found within individual partitions that the tree fitted to the full dataset does not reveal. This illustrates the benefit of the partitioning technique.

## Standard dataset

Once features are found, they can be customised and added to the model as shown:

```{r, echo=-2, fig.show='hold', fig.width=7, include=TRUE, warning=FALSE, message=FALSE}
library(featurefinder)

# mpg cars example
data(mpgdata)
data=mpgdata

# define some categorical factors for use in partition scanning; define as many as desired
data$transfactor=paste("trans",as.matrix(data$trans),sep="")
data$transfactor=as.factor(data$transfactor)

# define data dimensions
n=dim(data)[1]           # total dimension
nn=floor(dim(data)[1]/2) # split point for training and test

data=as.data.frame(data)
nm=names(data)
nm[8]='y'                # select a column to be the target of the model
names(data)=nm
data0=data               # retain full dataset
data=data[c("manufacturer","displ","year","transfactor","y")] # subset for our first model

firstmodel=lm(formula=y ~ .,data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

# drop terms that are not significant and refit the model
data$manufacturerchevrolet=(data$manufacturer=='chevrolet')
data$manufacturerford=(data$manufacturer=='ford')
data$manufacturerhonda=(data$manufacturer=='honda')
data$manufacturernissan=(data$manufacturer=='nissan')
data$manufacturerpontiac=(data$manufacturer=='pontiac')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$manufacturervolkswagen=(data$manufacturer=='volkswagen')
#data$displ
data$transfactortransautol4=(data$transfactor=='transauto(l4)')
data$transfactortransautol5=(data$transfactor=='transauto(l5)')
firstmodel=lm(formula=y ~
  manufacturerchevrolet+
  manufacturerford+
  manufacturerhonda+
  manufacturernissan+
  manufacturerpontiac+
  manufacturertoyota+
  manufacturervolkswagen+
  displ+
  year
  #transfactortransautol4
  #transfactortransautol5
  , data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

CSVPath=tempdir()
data1=cbind(data0,expected,actual,residual)
fcsv=paste(CSVPath,"/mpgdata.csv",sep="")
write.csv(data1[(nn+1):(length(data1$y)),],file=fcsv,row.names=FALSE)
exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars, factorToNumericList,
                     treeGenerationMinBucket=20, treeSummaryMinBucket=30,
                     useSubDir=FALSE, tempDirFolderName="mpg")

# potential terms identified in the residual scan:
# RESIDUAL: ALL,ALL,0.575,34.2,40,117,16,16.6,0.575,model< 16.5 and hwy< 28.5 and
#   manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru
# RESIDUAL: fl,r,1.11,38.4,33,86,20.1,21.2,1.11,hwy>=26.5
# RESIDUAL: class,compact,0.816,100,32,32,0,0,0,NA and root
# RESIDUAL: manufacturer,toyota,7.14e-13,100,34,34,0,0,0,NA and root
# RESIDUAL: trans,manual(m5),0.566,100,36,36,0,0,0,NA and root
# RESIDUAL: transfactor,transmanual(m5),0.566,100,36,36,0,0,0,NA and root

# add terms to dataset and refit
data$hwy=data0$hwy
data$fl=data0$fl
data$model=as.numeric(as.factor(data0$model))
data$model16hwy28manufacturer=(data$model< 16.5) & (data$hwy< 28.5) &
  (data$manufacturer=="jeep"    | data$manufacturer=="lincoln" |
   data$manufacturer=="mercury" | data$manufacturer=="nissan"  |
   data$manufacturer=="pontiac" | data$manufacturer=="subaru")
data$flr_hwy26=(data$fl=="r") & (data$hwy>=26.5)
data$transfactortransmanualm5=(data$transfactor=='transmanual(m5)')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$classcompact=(data0$class=='compact')
data$flr=(data$fl=='r')
secondmodel=lm(formula=y ~
  manufacturerchevrolet+
  manufacturerford+
  manufacturerhonda+
  manufacturernissan+
  manufacturerpontiac+
  manufacturertoyota+
  manufacturervolkswagen+
  displ+
  year+
  # new terms
  #model16hwy28manufacturer+
  flr_hwy26+
  transfactortransmanualm5+
  manufacturertoyota
  #classcompact+
  #flr
  , data=data)
expected=predict(secondmodel,data)
summary(firstmodel)
summary(secondmodel)

# append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
head(dataWithNewFeatures)

# data sources:
# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# http://www.public.iastate.edu/~hofmann/data_in_r_sortable.html
```

```{r, echo=FALSE, fig.show='hold', fig.width=7, include=FALSE, message=FALSE, warning=FALSE}
# move example files to a subfolder
unlink("mpgexample", recursive=TRUE)
dir.create("mpgexample", showWarnings = FALSE)
file.copy(Sys.glob("./*.Rdata"), "./mpgexample/")
file.copy(Sys.glob("./*.png"), "./mpgexample/", recursive=TRUE)
file.copy(Sys.glob("./*.txt"), "./mpgexample/", recursive=TRUE)
unlink(Sys.glob("./*.Rdata"), recursive=FALSE)
unlink(Sys.glob("./*.png"), recursive=FALSE)
unlink(Sys.glob("./*.txt"), recursive=FALSE)
```

The adjusted R-squared has improved from 0.721 to 0.745 as a result of the newly found features.

## A more challenging dataset

A more challenging dataset is stock index data, where the target is to predict the future relative movement of two indices, DAX and SMI.
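Concretely, the target constructed in the code below is the difference between the one-step-ahead relative movements (returns) of the two indices:

$$y_t = \frac{\mathrm{DAX}_{t+1}-\mathrm{DAX}_t}{\mathrm{DAX}_t} - \frac{\mathrm{SMI}_{t+1}-\mathrm{SMI}_t}{\mathrm{SMI}_t}$$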
findFeatures is able to identify features which can be added to the model, as shown:

```{r, echo=-2, fig.show='hold', fig.width=7, include=TRUE, message=FALSE, warning=FALSE}
library(featurefinder)

data(futuresdata)
data=futuresdata
data$SMIfactor=paste("smi",as.matrix(data$SMIfactor),sep="")
n=length(data$DAX)
nn=floor(length(data$DAX)/2)

# Can we predict the relative movement of DAX and SMI?
data$y=data$DAX*0 # initialise the target to 0
data$y[1:(n-1)]=((data$DAX[2:n])-(data$DAX[1:(n-1)]))/(data$DAX[1:(n-1)])-
  (data$SMI[2:n]-(data$SMI[1:(n-1)]))/(data$SMI[1:(n-1)])

# fit a simple model
firstmodel=lm(formula=y ~
  DAX+SMI+
  #CAC+
  FTSE,
  #SMIfactorsmi1
  data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
data0=data
data=cbind(data,expected,actual,residual)

CSVPath=tempdir()
fcsv=paste(CSVPath,"/futuresdata.csv",sep="")
write.csv(data[(nn+1):(length(data$y)),],file=fcsv,row.names=FALSE)
exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars, factorToNumericList,
                     treeGenerationMinBucket=30, treeSummaryMinBucket=50,
                     useSubDir=FALSE, tempDirFolderName="futures")

# create columns for the newly found features
newfeat1=((data$SMIfactor=="smi0") & (data$CAC < 2253) & (data$CAC< 1998) &
  (data$CAC>=1882)) * 1.0
newfeat2=((data$SMIfactor=="smi1") & (data$SMI < 7837) & (data$SMI >= 7499)) * 1.0
newfeatures=cbind(newfeat1, newfeat2)
datanew=cbind(data0,newfeatures)
secondmodel=lm(formula=y ~
  DAX+SMI+
  #CAC+
  FTSE+
  #SMIfactorsmi1+
  newfeat1+newfeat2,
  data=datanew)
expectednew=predict(secondmodel,datanew)

require(Metrics)
OriginalRMSE = rmse(data$y,expected)
NewRMSE = rmse(data$y,expectednew)
print(paste("OriginalRMSE = ",OriginalRMSE))
print(paste("NewRMSE = ",NewRMSE))
summary(firstmodel)
summary(secondmodel)

# append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
head(dataWithNewFeatures)
```

The newly discovered features are statistically significant, with an improved adjusted R-squared, lower residual errors, and an improved F-statistic and p-value.

```{r, echo=FALSE, fig.show='hold', fig.width=7, include=FALSE, message=FALSE, warning=FALSE}
# move example files to a subfolder
unlink("daxexample", recursive=TRUE)
dir.create("daxexample", showWarnings = FALSE)
file.copy(Sys.glob("./*.Rdata"), "./daxexample/")
file.copy(Sys.glob("./*.png"), "./daxexample/", recursive=TRUE)
file.copy(Sys.glob("./*.txt"), "./daxexample/", recursive=TRUE)
unlink(Sys.glob("./*.Rdata"), recursive=FALSE)
unlink(Sys.glob("./*.png"), recursive=FALSE)
unlink(Sys.glob("./*.txt"), recursive=FALSE)
```
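Because findFeatures scans residuals, the process can be repeated: the residuals of the refit model become the target of the next scan. A minimal sketch (not evaluated), assuming the objects from the futures example above are still in scope; the file name futuresdata2.csv and folder name futures2 are illustrative:

```{r, eval=FALSE}
# Second pass: scan the residuals of secondmodel for further features.
residual2=datanew$y-expectednew
data2=cbind(datanew, expected=expectednew, actual=datanew$y, residual=residual2)
fcsv2=paste(CSVPath,"/futuresdata2.csv",sep="")
write.csv(data2[(nn+1):(length(data2$y)),],file=fcsv2,row.names=FALSE)
findFeatures(outputPath="NoPath", fcsv2, exclusionVars, factorToNumericList,
             treeGenerationMinBucket=30, treeSummaryMinBucket=50,
             useSubDir=FALSE, tempDirFolderName="futures2")
```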