Question:
I'm pretty new to programming and R, so I'm really lost with this problem:
I need to do a Kruskal-Wallis analysis for a large number of numerical variables with respect to different categorical variables and obtain a significance value for each of the numerical variables. My data is more or less like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate (...)
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
(...)
Nunatak, Slope, Altitude and Depth are the categorical variables, and the rest (Fluoride…) are the numerical ones.
To avoid having to repeat:
kruskal.test("Factor to analyze 1"~"Categorization variable 1", data=env_fact)
as many times as I have variables, a colleague helped me create a 'for' loop like the following:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
if(my.variables[i] == 'Categorical_var') {
next
} else {
kruskal.test(env_fact[,i], env_fact$Categorical_var)
}
}
However, we did not manage to write code that extracts the test values for each of the numerical variables (my.variables) that we analyze. We only obtained a single significance value, as if the analysis had been carried out on all the numerical variables at once.
Any idea how to modify this small piece of code so that the Kruskal-Wallis results for each numerical variable I need to analyze are shown on screen or saved to an output?
Thank you very much in advance
Answer:
First, to make my answer reproducible, we load the data you gave as an example into a data.frame:
env_fact <- read.table(text="Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008", sep=',', header=TRUE, stringsAsFactors=FALSE)
From what you show, you want to apply the Kruskal-Wallis test using a formula. For a single pair of variables it would look like this:
kruskal.test(Fluoride ~ Slope, data=env_fact)
In this example Fluoride ~ Slope is the formula: kruskal.test() expects the numerical response on the left and the grouping (categorical) variable on the right. (Slope is used here because Nunatak takes a single value in this ten-row excerpt.) To run all the tests, which I understand are between each categorical variable and each numerical variable, we need to build this formula dynamically. as.formula() creates a formula from a string, so we can write as.formula("Fluoride ~ Slope"); note that what we pass is a string.
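As a small sketch of that step, the formula can be built from two variable names held in strings. The names num_var and cat_var are illustrative and do not come from the question; reformulate() is a base R helper (in the stats package) that does the same job as as.formula(paste(...)).

```r
# Illustrative names (assumptions, not taken from the question's code)
num_var <- "Fluoride"  # numerical response
cat_var <- "Slope"     # grouping variable

# Option 1: paste the pieces into a string, then convert it
f1 <- as.formula(paste(num_var, "~", cat_var))

# Option 2: reformulate() builds the formula from term labels and a response
f2 <- reformulate(cat_var, response = num_var)

print(f1)
print(f2)
```

Either object can be passed as the first argument of kruskal.test() together with data=env_fact.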
That said, we first define the two groups of variables:
categorical_vars <- c('Nunatak','Slope','Altitude','Depth')
numerical_vars <- c('Fluoride','Acetate','Formiate','Chloride','Nitrate')
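With a large number of numerical columns, typing numerical_vars by hand gets tedious. A minimal sketch of an alternative, assuming every column other than Sample and the four categorical ones is numerical (the three-row data frame below is only a stand-in for env_fact so the snippet runs on its own):

```r
# Stand-in for env_fact with the same column layout (first three example rows)
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034")

categorical_vars <- c("Nunatak", "Slope", "Altitude", "Depth")

# Everything that is neither the sample ID nor a categorical variable
numerical_vars <- setdiff(names(env_fact), c("Sample", categorical_vars))

# Sanity check: the selected columns really are numeric
stopifnot(all(vapply(env_fact[numerical_vars], is.numeric, logical(1))))

print(numerical_vars)
```

If the real data contains other non-numerical columns, they would need to be added to the exclusion list.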
Next, we loop over categorical_vars and, in each iteration, loop over numerical_vars. It is also convenient to save the output of each test in a list, so that the results can be accessed later:
kret <- list()
for (cat_var in categorical_vars) {
  # kruskal.test() needs at least two groups; in this excerpt Nunatak is always 1
  if (length(unique(env_fact[[cat_var]])) < 2) next
  for (num_var in numerical_vars) {
    # numerical response on the left, grouping variable on the right
    f <- as.formula(paste(num_var, '~', cat_var))
    kret[[deparse(f)]] <- kruskal.test(f, data=env_fact)
  }
}
This (a) runs the test between each categorical and each numerical variable, skipping any categorical variable that has a single level (kruskal.test() stops with an error when all observations are in the same group), and (b) leaves the result of each test in the kret list. To access the results later, you can do it by index:
kret[2]
$`Acetate ~ Slope`
Kruskal-Wallis rank sum test
data: Acetate by Slope
Kruskal-Wallis chi-squared = 1.5714, df = 1, p-value = 0.21
Or, since we named each element with its formula, we can access it directly by name:
kret['Nitrate ~ Slope']
$`Nitrate ~ Slope`
Kruskal-Wallis rank sum test
data: Nitrate by Slope
Kruskal-Wallis chi-squared = 2.3364, df = 1, p-value = 0.1264
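Since the goal is a significance value for each numerical variable, it can help to collect the pieces of every result into one table. Each object returned by kruskal.test() is a list of class "htest" with the components statistic, parameter (the degrees of freedom) and p.value. The sketch below is one possible way to do this, not the only one. It rebuilds the tests itself so that it runs on its own, with the numerical variable on the left of the formula and the grouping variable on the right; the names tests, results, num_var and cat_var are illustrative. Note also that R does not auto-print values inside a for loop, which is why the loop in the question shows nothing unless each result is stored or wrapped in print().

```r
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008")

categorical_vars <- c("Nunatak", "Slope", "Altitude", "Depth")
numerical_vars <- c("Fluoride", "Acetate", "Formiate", "Chloride", "Nitrate")

tests <- list()
for (cat_var in categorical_vars) {
  # kruskal.test() needs at least two groups; Nunatak is constant in this excerpt
  if (length(unique(env_fact[[cat_var]])) < 2) next
  for (num_var in numerical_vars) {
    f <- as.formula(paste(num_var, "~", cat_var))
    tests[[deparse(f)]] <- kruskal.test(f, data = env_fact)
  }
}

# One row per test: statistic, degrees of freedom and p-value
results <- data.frame(
  test        = names(tests),
  chi_squared = vapply(tests, function(res) as.numeric(res$statistic), numeric(1)),
  df          = vapply(tests, function(res) as.numeric(res$parameter), numeric(1)),
  p_value     = vapply(tests, function(res) res$p.value, numeric(1)),
  row.names   = NULL
)

print(results)
```

The resulting data.frame can be sorted or filtered by p_value, or written to a file with write.csv(results, "kruskal_results.csv", row.names = FALSE), where the file name is only an example.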