Max Desiatov

Max DesiatovI'm Max Desiatov, a software consultant building mobile and backend apps. My interests include coding in Swift and TypeScript, product design, video games and music.

Blog

Hire Me

Easy parsing of Excel spreadsheet format with Swift's Codable protocols

12 November, 2018

Back in the 90s and early 00s, when Google Docs didn’t exist yet and devices from Apple weren’t as popular, the primary way to create, edit and share documents and spreadsheets was Microsoft Office and its proprietary .doc and .xls binary formats. At that time working with these files outside of Microsoft’s ecosystem was a world of pain, there were multiple attempts to reverse engineer these formats, but none of them were good enough.

Microsoft Excel 2010 running on Windows 7

Image from en.wikipedia

Fortunately, later Microsoft transitioned towards using an XML-based format: XLSX. Even now I occasionally stumble upon a service that yields a file like this instead of a CSV. There is a plenty of apps that open these files without Microsoft Office installed, even standard QuickLook on iOS and macOS can give you a preview for it.

But previews and apps are only ok for manual operations. What if there’s a need for automation? For example, my specific use case is to transform an Excel spreadsheet using only specific rows and columns into a CSV file with a subset of columns from the original file. If you’re like me and would like to stay in the Swift ecosystem, there are no great options. There’s an Objective-C library that at the moment of publishing this article hasn’t been updated in more than 2 years. There’s another C library that can create and write XLSX files, but as far as I know, can’t read one.

And implementing support for a file format is a huge and complex task, right? There’s some compressed data involved, multiple entities to parse and there’s a ton of variation in different files which is hard to support?

Easy peasy, it’s just an archive of XML files!

Turns out, an XLSX file is a simple zip archive, you can poke at its raw content with a simple terminal command:

$ unzip file.xlsx
Archive:  file.xlsx
  inflating: _rels/.rels             
  inflating: docProps/core.xml       
  inflating: docProps/app.xml        
  inflating: xl/workbook.xml         
  inflating: xl/_rels/workbook.xml.rels  
  inflating: xl/theme/theme1.xml     
  inflating: xl/worksheets/sheet1.xml  
  inflating: xl/sharedStrings.xml    
  inflating: xl/styles.xml           
  inflating: [Content_Types].xml 

When there’s a need to get data out of a single worksheet, there are only 3 files to parse:

  1. _rels/.rels – this XML file stores paths to “documents” contained within the main file.xlsx archive
  2. xl/_rels/workbook.xml.rels – this file referenced from _rels/.rels contains paths of worksheets that documents in the archive have
  3. xl/worksheets/sheet1.xml – this is where worksheet data lives: rows, columns and cells.

What does “Codable” mean in Swift?

There are multiple ways to implement parsers for different formats. JavaScript, for example, comes with a standard JSON.parse function, which is able to parse JSON, but for any other format, you’re on your own. All 3rd-party libraries for JavaScript like to invent their own parsing APIs from scratch.

This was even worse not so long ago with the first few versions of Swift when we not only had ad-hoc APIs for other formats like XML but dozens of JSON parsing libraries as well. 😅 But I’m so happy to deal with Swift nowadays after it gained a standard API for almost any serialization format you’d need.

The main concepts in this serialization API in Swift are the Codable typealias:

typealias Codable = Decodable & Encodable

and the Decodable/Encodable protocols:

protocol Decodable {
  init(from decoder: Decoder) throws
}

protocol Encodable {
  func encode(to encoder: Encoder) throws
}

There are also Encoder and Decoder protocols mentioned above, but they don’t really matter in the day-to-day use of Codable. The most wonderful thing here is that Swift compiler is able to derive an implementation of Codable for your types automatically, as long as Codable implementation is available on types of properties that you use. And obviously, Codable is already implemented on all primitive types like Bool, Int, String etc.

Even better, implementation of Codable on your types can be reused for multiple formats if needed. For example Foundation module provides standard JSONEncoder and JSONDecoder types that allow serialising your data in and out of JSON in a standard way. But for our case, we only need a 3rd-party implementation of this for XML, like XMLCoder.1

“Codable” in practice

As shown earlier, our first step towards reading data from an Excel file is parsing _rels/.rels in the archive, which looks like this:

<Relationships 
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship 
  Id="rId1" 
  Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" 
  Target="docProps/core.xml"/>

  <Relationship 
  Id="rId2" 
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" 
  Target="docProps/app.xml"/>

  <Relationship 
  Id="rId3" 
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
  Target="xl/workbook.xml"/>

< /Relationships>

The most primitive node we’d like to handle in a type-safe manner is Type of the Relationship, let’s create an enum for it (I’ve replaced long schema urls with [...] for brevity):

enum SchemaType: String, Codable {
  case officeDocument = "[...]/officeDocument"
  case extendedProperties = "[...]/extended-properties"
  case coreProperties = "[...]/core-properties"
  case worksheet = "[...]/worksheet"
  case sharedStrings = "[...]/sharedStrings"
  case styles = "[...]/styles"
  case theme = "[...]/theme"
}

A complete version is available on GitHub.

Note the Codable conformance on this enum, we don’t need to implement anything for it manually as it will be generated automatically by the compiler for us. The next step is the Relationship node itself:

struct Relationship: Codable {
  let id: String
  let type: SchemaType
  let target: String
}

And then it’s only the root Relationships node left:

struct Relationships: Codable {
  let items: [Relationship]

  enum CodingKeys: String, CodingKey {
    case items = "relationship"
  }
}

Note the CodingKeys enum here conforming to the CodingKey protocol. It allows us to specify a different name for a property that contains the decoded data. In our case, there are multiple Relationship children nodes within the root Relationships node. It’s better when these nodes are stored in a property named items rather than a default relationship property you’d have to create without CodingKeys.

The main property we’re interested in here though is the target on a Relationship node with type officeDocument, it contains a path to the document. But also, remember that all of this XML is stored within a zip archive, how do we unzip it in the first place?

Unzipping an archive in memory and parsing the contents

A great indicator of Swift ecosystem maturity is the number of good libraries you can find for the most popular tasks when working on production apps. And it’s pretty good, well, except for XLSX libraries until now. 😂 But at the moment we also need to uncompress a zip archive in memory and it looks like ZIPFoundation is the best implementation that supports both Linux and Apple platforms.

Let’s create the main entry point for the public API of our XLSX library:

public struct XLSXFile {
}

The main type we need to work in ZIPFoundation is Archive, let’s add it as a private property and also create an instance of XMLCoder to reuse later:

public struct XLSXFile {
  public let filepath: String
  private let archive: Archive
  private let decoder: XMLDecoder

  public init?(filepath: String) {
    let archiveURL = URL(fileURLWithPath: filepath)

    guard let archive = Archive(url: archiveURL, accessMode: .read) else {
      return nil
    }

    self.archive = archive
    self.filepath = filepath

    decoder = XMLDecoder()
  }
}

Now we also need a generic way to parse an XML file from the archive, let’s add a function on XLSXFile and also a helper Error subtype for it. This parseEntry function should take a path like _rels/.rels and a type of representation of a root node and return an instance of that type or throw an error:

public enum XLSXReaderError: Error {
  case archiveEntryNotFound
}

public struct XLSXFile {
  // ...
  private let archive: Archive
  private let decoder: XMLDecoder

  /// Parse a file within `archive` at `path`. Parsing result is
  /// an instance of `type`.
  func parseEntry<T: Decodable>(_ path: String, 
                                _ type: T.Type) throws -> T {
    guard let entry = archive[path] else {
      throw XLSXReaderError.archiveEntryNotFound
    }

    var result: T?

    _ = try archive.extract(entry) {
      result = try decoder.decode(type, from: $0)
    }

    return result!
  }
}

And now we can use this function to get the list of documents within the archive. Here .convertFromCapitalized setting specifies we’d like to convert capitalised attribute names to lowercase properties so that Target becomes target, Type becomes type and Id is mapped to id.

public struct XLSXFile {
  // ...

  /// Return an array of paths to relationships of type `officeDocument`
  func parseDocumentPaths() throws -> [String] {
    decoder.keyDecodingStrategy = .convertFromCapitalized

    return try parseEntry("_rels/.rels", Relationships.self).items
      .filter { $0.type == .officeDocument }
      .map { $0.target }
  }
}

Parsing a worksheet

I omit a completely boring 20-line function parseWorksheetPaths where we parse a similar relationships file to get paths to the actual worksheets. While this function is going to be used later, it’s very similar to parseDocumentPaths we’ve just implemented above. Instead, let’s focus on a much more interesting model type Worksheet:

public struct Worksheet: Codable {
  public let sheetPr: SheetPr?
  public let dimension: WorksheetDimension
  public let sheetViews: SheetViews
  public let sheetFormatPr: SheetFormatPr
  public let cols: Cols
  public let sheetData: SheetData
  public let mergeCells: MergeCells?
}

This type directly represents a worksheet XML you’ll be able to find in most common XLSX files. Basically, the rest of the source code of the parsing library is struct declarations of the property types you find on the root Worksheet XML node and the rest of the types match its descendant nodes. For example, one such type is Cell:

public struct Cell: Codable, Equatable {
  public let reference: String
  public let type: String?
  public let s: String?
  public let inlineString: InlineString?
  public let formula: String?
  public let value: String?

  enum CodingKeys: String, CodingKey {
    case formula = "f"
    case value = "v"
    case inlineString = "is"
    case reference = "r"
    case type = "t"
    case s
  }
}

And sorry I don’t know what exactly the attribute s stands for ¯\_(ツ)_/¯. It’s still not easy to find detailed documentation on all of the attributes, but I’ve been able to provide sensible names to most of them.

For some reason worksheet XML nodes and attributes aren’t capitalised, so we need to reset the decoding strategy for parsing them to .useDefaultKeys. This will leave node and attribute names lowercase as they are.

public struct XLSXFile {
  // ...
  private let decoder: XMLDecoder

  /// Parse a worksheet at a given path contained in this XLSX file.
  public func parseWorksheet(at path: String) throws -> Worksheet {
    decoder.keyDecodingStrategy = .useDefaultKeys

    return try parseEntry(path, Worksheet.self)
  }
}

Using the library

Now we’ve got the library working, let’s read a simple file:

guard let file = XLSXFile(filepath: "./file.xlsx") else {
  fatalError("XLSX file corrupted or does not exist")
}

for path in try file.parseWorksheetPaths() {
  let ws = try file.parseWorksheet(at: path)
  for row in ws.sheetData.rows {
    for c in row.cells {
      print(c)
    }
  }
}

This will print every cell in every worksheet in an XLSX file provided. And that’s mostly all there is to it. 🙌

Summary

It seems like there are no cross-platform libraries for parsing XLSX available to use with Swift. Turns out, implementing file format support from scratch isn’t that hard with the great Codable API available in Swift standard library. XLSX files are actually zip archives with multiple XML files within. In this article, we had a look at the code that decompresses an archive in memory and parses the XML files to get cell data for any given worksheet.

I’ve published this as CoreXLSX package available to use with CocoaPods and Swift Package Manager. I hope this library saves you some time when you need to read something from an XLSX file in an automated way. Or maybe you could use it as an example for implementing other file formats in your apps and libraries. Either way, I’d appreciate your feedback on Twitter or Mastodon and would be very happy if you star the repository and create issues and pull requests at GitHub. 🤩


  1. This library is a hard fork of a great XMLParsing package by Shawn Moore. Unfortunately, it looks like the original repository doesn’t accept PRs for known issues and hasn’t been updated in some time. Some of these issues were blocking my XLSX parsing implementation. To make matters worse, CocoaPods doesn’t allow dependencies on forks with the same name as original libraries, so I had to create a fork with a new name. I wish I could merge my contributions back. 😔