Lists of term-frequency pairs into a matrix in R


I have a large data set in the following format: on each line there is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:

    aword:3 bword:2 cword:15 dword:2
    bword:4 cword:20 fword:1
    etc...

E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is create a little search engine, where documents (in the same format) matching a query are ranked. I thought about using TF-IDF and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data, unless the format is truly alien and cannot be converted).

What I've tried so far is to import the lines and split them:

    docs <- scan("data.txt", what="", sep="\n")
    doclist <- strsplit(docs, "[[:space:]]+")

I figured I could then put this in a loop:

    doclist2 <- strsplit(doclist, ":", fixed=TRUE)

and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that be a good idea for constructing the matrix?). But this way of converting does not seem like the right way to do it: the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.

What I (think I) would need in the end is a matrix like this:

            doc1 doc2 doc3 doc4 ...
    aword   3    0    0    0
    bword   2    4    0    0
    cword   15   20   0    0
    dword   2    0    0    0
    fword   0    1    0    0
    ...

which I could then convert to a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc.).

What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution is too obvious or doable with built-in functions: what is the actual term for the format I described above, where there are term:frequency pairs on a line, and each line is a document?

Here's an approach that gets you the output you say you might want:

    ## Sample data
    x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

    ## Split on spaces and colons
    b <- strsplit(x, "\\s+|:")

    ## Add names to your list to represent the source document
    b <- setNames(b, paste0("document", seq_along(b)))

    ## Put it together into a long matrix
    out <- do.call(rbind, lapply(seq_along(b), function(x)
      cbind(document = names(b)[x],
            matrix(b[[x]], ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("word", "count"))))))

    ## Convert to a data.frame
    out <- data.frame(out)
    out
    #    document  word count
    # 1 document1 aword     3
    # 2 document1 bword     2
    # 3 document1 cword    15
    # 4 document1 dword     2
    # 5 document2 bword     4
    # 6 document2 cword    20
    # 7 document2 fword     1

    ## Make sure the counts column is a number
    out$count <- as.numeric(as.character(out$count))

    ## Use xtabs to get the output you want
    xtabs(count ~ word + document, out)
    #        document
    # word    document1 document2
    #   aword         3         0
    #   bword         2         4
    #   cword        15        20
    #   dword         2         0
    #   fword         0         1

Note: This answer was edited to use matrices in the creation of "out" to minimize the number of calls to read.table, which would be a major bottleneck with bigger data.
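
For comparison, the pre-filled zero matrix idea from the question can also be made to work in base R without an explicit loop, by filling cells through a two-column (row, column) index matrix. This is only a sketch under some assumptions: the names `tokens`, `doc_id`, `terms` and `tdm` are made up for illustration, words are assumed not to contain colons themselves, and each word is assumed to appear at most once per line (a repeated word would overwrite the earlier count, whereas xtabs would sum them).

```r
## Sample data in the question's format: one document per line
## (for a real file, something like x <- readLines("data.txt") should do)
x <- c("aword:3 bword:2 cword:15 dword:2",
       "bword:4 cword:20 fword:1")

## Split each line into its "word:freq" tokens
tokens <- strsplit(x, "[[:space:]]+")

## Flatten the list, remembering which document each token came from
doc_id <- rep(seq_along(tokens), lengths(tokens))
pairs  <- unlist(tokens)
words  <- sub(":.*$", "", pairs)              # part before the colon
counts <- as.numeric(sub("^.*:", "", pairs))  # part after the colon

## Pre-fill a terms-by-documents matrix with zeroes
terms <- sort(unique(words))
tdm <- matrix(0, nrow = length(terms), ncol = length(tokens),
              dimnames = list(terms, paste0("doc", seq_along(tokens))))

## Fill the cells in one step using (row, column) index pairs
tdm[cbind(match(words, terms), doc_id)] <- counts
tdm
```

The result is a plain numeric matrix with terms as row names and documents as column names. Because it is dense, it may become impractical when there are many distinct terms and many documents.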

