Lists of term-frequency pairs into a matrix in R
I have a large data set in the following format: on each line there is one document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2 bword:4 cword:20 fword:1 etc...
E.g., in the first document, "aword" occurs 3 times. I want to create a little search engine in which documents (in the same format) matching a query are ranked; I thought of using tf-idf and the tm package (based on a tutorial that requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data as they are, unless the format is alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and then somehow get the paired values into an array, and run a loop that populates a matrix (pre-filled with zeroes: matrix(0, x, y)) by fetching the appropriate values from the word:freq pairs (would that be a sound way to construct the matrix?). This way of converting does not seem like the right way to do it, though: the lists keep getting more and more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
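For the record, here is a minimal sketch of the zero-matrix-and-loop idea described above, assuming the two-document sample data from the top of the question; all object names here are illustrative and not from the original post:

## sample lines, as they would come back from scan()
docs <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

## split each line into word:freq tokens, then split each token on ":"
doclist <- strsplit(docs, "[[:space:]]+")
pairs   <- lapply(doclist, strsplit, split = ":", fixed = TRUE)

## collect all distinct words and pre-fill a zero matrix
words <- unique(sub(":.*$", "", unlist(doclist)))
m <- matrix(0, nrow = length(words), ncol = length(docs),
            dimnames = list(words, paste0("doc", seq_along(docs))))

## populate the matrix document by document
for (j in seq_along(pairs)) {
  for (p in pairs[[j]]) {
    m[p[1], j] <- as.numeric(p[2])
  }
}
m

This works, but the explicit double loop gets slow for many documents; the answer below avoids it.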
What I (think I) need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I would then convert into a TermDocumentMatrix to get started with the tutorial. I have a feeling I'm missing something obvious here, but I cannot find it because I don't know what these things are called (I've been googling all day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list to matrix" etc.).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution is obvious or doable with built-in functions: what is the actual term for the format described above, where there are term:frequency pairs on each line, and each line is a document?
Here's an approach that gets you the output you might want:
## sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

## split on spaces and colons
b <- strsplit(x, "\\s+|:")

## add names to the list to represent the source document
b <- setNames(b, paste0("document", seq_along(b)))

## put everything into a long matrix
out <- do.call(rbind, lapply(seq_along(b), function(x)
  cbind(document = names(b)[x],
        matrix(b[[x]], ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("word", "count"))))))

## convert to a data.frame
out <- data.frame(out)
out
#    document  word count
# 1 document1 aword     3
# 2 document1 bword     2
# 3 document1 cword    15
# 4 document1 dword     2
# 5 document2 bword     4
# 6 document2 cword    20
# 7 document2 fword     1

## make sure the count column is a number
out$count <- as.numeric(as.character(out$count))

## use xtabs to get the output you want
xtabs(count ~ word + document, out)
#        document
# word    document1 document2
#   aword         3         0
#   bword         2         4
#   cword        15        20
#   dword         2         0
#   fword         0         1
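If the end goal is the TermDocumentMatrix from the tutorial, the contingency table above can, as far as I know, be handed to tm's as.TermDocumentMatrix after coercing it to a plain matrix. This step was not part of the original answer and is only a sketch (assuming a reasonably recent tm version):

library(tm)

## the xtabs result is a table; coerce it to a plain matrix first
tab <- xtabs(count ~ word + document, out)
tdm <- as.TermDocumentMatrix(as.matrix(tab), weighting = weightTf)
inspect(tdm)

From there you can continue with the tutorial (e.g. switch the weighting to weightTfIdf when building the matrix for ranking).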
Note: this answer was edited to use matrices in the creation of "out", to minimize the number of calls to read.table, which was a major bottleneck with bigger data.
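On the same theme of bigger data: the dense xtabs table itself can become the next bottleneck, since it materializes every zero. A sparse alternative, not part of the original answer and only a sketch, is to build a simple_triplet_matrix (from the slam package that tm is built on) directly from the long data.frame:

library(slam)

## map words and documents to integer indices
wf <- factor(out$word)
df <- factor(out$document)

## build a sparse term-document matrix without materializing the zeros
stm <- simple_triplet_matrix(as.integer(wf), as.integer(df), out$count,
                             dimnames = list(levels(wf), levels(df)))

## this sparse matrix can then be passed to tm's as.TermDocumentMatrix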