r - Check if one sample exists in the sample space (From PCA or other cluster analysis) -
i have 200 50 matrix, here 200 means 200 compounds (row) , 50 means 50 independent varialbes (column), , use 200 * 50 matrix cluster analysis (e.g. k-mean etc.), can plot show distributions these 2000 compounds.
my question when have new compound, have same 50 independent variable 200 * 50 matrix, how can test if new compound located in cluster space?
thanks.
edit: plz note not need find element in data.frame. think first step cluster data (for example, using pca , plot(pca1, pca2)), test if new record located in plot or out. like picture, (2) belongs cluster , (1) not belong cluster space, this.
here simple solution:
step1: setup data
set.seed(1) refdata <- data.frame(matrix(runif(200*50),nrow=200)) newrec01 <- refdata[11,] # record exists in data newrec02 <- runif(50) # record not exist in data
step2: testing:
true %in% sapply(1:nrow(refdata),function(i) all(newrec01 == refdata[i,])) true %in% sapply(1:nrow(refdata),function(i) all(newrec02 == refdata[i,]))
if needed can package in function:
checknewrec <- function(refdata, newrec) { true %in% sapply(1:nrow(refdata),function(i) all(newrec == refdata[i,])) } checknewrec(refdata, newrec01) checknewrec(refdata, newrec02)
edit: based on new input below, try following:
prep: code comments:
<- rbind(refdata, newrec02) pca <- prcomp(all) pca1 <- pca$x[, 1] pca2 <- pca$x[, 2] pca1.in <- pca1[-length(pca1)] pca2.in <- pca2[-length(pca2)]
now need define cluster in way. simplicity, lets assume single cluster.
step1: find out centroid of refdata:
cent <- c(mean(pca1.in),mean(pca2.in))
step2: find out distance of data points center of refdata:
ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2
step3: need choose cut off distance center beyond new incoming record considered "outside" cluster. simplicity, taking dec
ision @ 95th % quantile:
dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1))
step4: decision has been made on classification of newrec
, can plot it:
plot(pca1, pca2) points(pca1[length(pca1)], pca2[length(pca2)], col = ifelse(dec, "red", "green"),pch="x")
additionally, verify our dec
ision, lets plot errors, , see newrec
fall!!
hist(ssq, main="error histogram",xlab="square error") points(pca1[length(pca1)], pca2[length(pca2)], col = ifelse(dec, "red", "green"),pch="x") text(pca1[length(pca1)], pca2[length(pca2)],labels="new rec",col="red",pos=3)
hope helps!!
Comments
Post a Comment