Combining rows of unequal length into a matrix or data.frame

In preparation for Utrecht University’s Mplus summer school, I am developing some R functions for running and plotting mixture models. Doing so requires a bit of data wrangling, because Mplus output is essentially plain text. Moreover, the formatting can differ between sections. One section consisted of rows of different lengths, because the row labels were printed only once for a block of rows, instead of being repeated for each row. I came up with a fast solution to rbind rows of different lengths, and to repeat the labels across lines. First, here is an example of the raw text output from Mplus:

mplus_output
##  [1] "FINAL CLASS COUNTS AND PROPORTIONS FOR EACH LATENT CLASS VARIABLE"
##  [2] "BASED ON THE ESTIMATED MODEL"                                     
##  [3] ""                                                                 
##  [4] "  Latent Class"                                                   
##  [5] "    Variable    Class"                                            
##  [6] ""                                                                 
##  [7] "    C1             1       525.87720          0.26294"            
##  [8] "                   2       749.33826          0.37467"            
##  [9] "                   3       724.78455          0.36239"            
## [10] "    C2             1       391.52097          0.19576"            
## [11] "                   2      1132.59033          0.56630"            
## [12] "                   3       475.88873          0.23794"            
## [13] ""

When I try to parse this block into a matrix, the first column causes a bit of trouble, because the latent variable names (C1, C2) are printed only on the first row of each block, presumably for readability.

# Select lines with numerical output for parsing
numberLines <- grep("^\\s*([a-zA-Z0-9]+)?(\\s+[0-9\\.-]{1,}){1,}$", mplus_output, perl=TRUE)
# Split each numerical line into elements by whitespace
parsedlines <- strsplit(trimws(mplus_output[numberLines]), "\\s+")
# Try to rbind the rows into a matrix
do.call(rbind, parsedlines)
## Warning in (function (..., deparse.level = 1) : number of columns of result
## is not a multiple of vector length (arg 2)
##      [,1] [,2]         [,3]        [,4]     
## [1,] "C1" "1"          "525.87720" "0.26294"
## [2,] "2"  "749.33826"  "0.37467"   "2"      
## [3,] "3"  "724.78455"  "0.36239"   "3"      
## [4,] "C2" "1"          "391.52097" "0.19576"
## [5,] "2"  "1132.59033" "0.56630"   "2"      
## [6,] "3"  "475.88873"  "0.23794"   "3"

This yields a warning, and a matrix with shifted values: rbind recycles the elements of the shorter rows to fill the fourth column. Below is my code that addresses this problem. The steps are simple:

  1. Count the length of each line
  2. Pad the shorter lines with NA on the left side
  3. Rbind again

# Count the length of each line
line_lengths <- sapply(parsedlines, length)
# Pad shorter lines with NA on the left side
parsedlines[which(line_lengths != max(line_lengths))] <- 
  lapply(parsedlines[which(line_lengths != max(line_lengths))], function(x){
    c(NA, x)
    })

output <- do.call(rbind, parsedlines)
output
##      [,1] [,2] [,3]         [,4]     
## [1,] "C1" "1"  "525.87720"  "0.26294"
## [2,] NA   "2"  "749.33826"  "0.37467"
## [3,] NA   "3"  "724.78455"  "0.36239"
## [4,] "C2" "1"  "391.52097"  "0.19576"
## [5,] NA   "2"  "1132.59033" "0.56630"
## [6,] NA   "3"  "475.88873"  "0.23794"
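
The padding step above adds exactly one NA, which is enough here because the short lines are missing only the first element. A minimal sketch of how the same idea might be generalised to lines that are short by any number of elements is shown below. The helper name `rbind_left_padded` and the example input are assumptions made for illustration, not part of the original parser.

```r
# Hypothetical helper (an illustration, not the original parser code):
# left-pad every vector with NA up to the length of the longest vector,
# then bind the padded vectors together as rows of a matrix.
rbind_left_padded <- function(rows) {
  n <- vapply(rows, length, integer(1))
  padded <- Map(function(x, k) c(rep(NA, k), x), rows, max(n) - n)
  do.call(rbind, padded)
}

# Illustrative input: the last row is short by three elements
rows <- list(
  c("C1", "1", "525.87720", "0.26294"),
  c("2", "749.33826", "0.37467"),
  c("0.36239")
)
padded_matrix <- rbind_left_padded(rows)
padded_matrix
```

Note that this only makes sense when the missing elements are always the leftmost ones, as they are in this section of the Mplus output.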

As we see, this yields a nice matrix. There is one remaining problem, namely that the latent variable names (C1, C2) are only printed once. I would like these values to be repeated for every row pertaining to that latent variable. My steps for this are as follows:

  1. Figure out how many times each name should be repeated
  2. Repeat the name

For this, I used a combination of diff(), to identify the distance between successive non-missing entries of the first column, and inverse.rle(), a function that expands a run-length encoded vector into a full vector.

# How many repeats each?
lengths <- diff(c(which(!is.na(output[,1])), (nrow(output)+1)))
# Confirm that each entry should be repeated 3 times
lengths
## [1] 3 3
# Values to repeat are simply the non-missing values in the first column
values <- output[,1][complete.cases(output[,1])]
values
## [1] "C1" "C2"

output[,1] <- inverse.rle(list(lengths = lengths, values = values))
output
##      [,1] [,2] [,3]         [,4]     
## [1,] "C1" "1"  "525.87720"  "0.26294"
## [2,] "C1" "2"  "749.33826"  "0.37467"
## [3,] "C1" "3"  "724.78455"  "0.36239"
## [4,] "C2" "1"  "391.52097"  "0.19576"
## [5,] "C2" "2"  "1132.59033" "0.56630"
## [6,] "C2" "3"  "475.88873"  "0.23794"
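
Since the matrix is still all character, a natural next step is a data.frame with typed columns. The sketch below also shows an alternative way to repeat the names, using cumsum() as an index instead of inverse.rle(). The helper name `fill_down` and the column names (`variable`, `class`, `count`, `proportion`) are assumptions chosen for illustration; Mplus itself only labels the first two columns.

```r
# Hypothetical helper: carry the last non-missing value forward.
# cumsum(present) gives, for each position, the index of the most recent
# non-missing value. This assumes the first element is not NA.
fill_down <- function(x) {
  present <- !is.na(x)
  x[present][cumsum(present)]
}

# The padded matrix from above, rebuilt here so the example is self-contained
output <- rbind(
  c("C1", "1", "525.87720",  "0.26294"),
  c(NA,   "2", "749.33826",  "0.37467"),
  c(NA,   "3", "724.78455",  "0.36239"),
  c("C2", "1", "391.52097",  "0.19576"),
  c(NA,   "2", "1132.59033", "0.56630"),
  c(NA,   "3", "475.88873",  "0.23794")
)

# Column names are illustrative assumptions, not Mplus labels
class_counts <- data.frame(
  variable   = fill_down(output[, 1]),
  class      = as.integer(output[, 2]),
  count      = as.numeric(output[, 3]),
  proportion = as.numeric(output[, 4]),
  stringsAsFactors = FALSE
)
str(class_counts)
```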

One important advantage of this approach is that it’s quite fast, about 10x faster than other syntax I found on the internet, which only partially solved my problem (i.e., only combined vectors of unequal length into a matrix). This makes it especially suitable for inclusion in a parser that reads large blocks of free text into a tabular format. I hope this may be of some use to others!
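
To check the speed on your own machine, something like the following sketch could be used. It wraps the padding steps in a function (the name `pad_and_bind` is an assumption for illustration) and times it on a larger, artificially repeated input. No timings are shown here, because they depend on the machine and on the alternative being compared against.

```r
# Build a larger illustrative input by repeating three parsed rows
block <- list(
  c("C1", "1", "525.87720", "0.26294"),
  c("2", "749.33826", "0.37467"),
  c("3", "724.78455", "0.36239")
)
big <- rep(block, 10000)

# Hypothetical wrapper around the padding steps from this post
pad_and_bind <- function(rows) {
  n <- vapply(rows, length, integer(1))
  short <- n != max(n)
  rows[short] <- lapply(rows[short], function(x) c(NA, x))
  do.call(rbind, rows)
}

# Time a single run; results vary by machine
timing <- system.time(result <- pad_and_bind(big))
print(timing)
```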

About Caspar van Lissa
Interdisciplinary social scientist and data science dilettante.