Back in the 90s and early 00s, when Google Docs didn’t exist yet and devices
from Apple weren’t as popular, the primary way to create, edit and share
documents and spreadsheets was Microsoft Office and its proprietary .doc
and
.xls
binary formats. At that time working with these files outside of
Microsoft’s ecosystem was a world of pain, there were multiple attempts to
reverse engineer these formats, but none of them were good enough.
Fortunately, later Microsoft transitioned towards using an XML-based format: XLSX. Even now I occasionally stumble upon a service that yields a file like this instead of a CSV. There is a plenty of apps that open these files without Microsoft Office installed, even standard QuickLook on iOS and macOS can give you a preview for it.
But previews and apps are only ok for manual operations. What if there’s a need for automation? For example, my specific use case is to transform an Excel spreadsheet using only specific rows and columns into a CSV file with a subset of columns from the original file. If you’re like me and would like to stay in the Swift ecosystem, there are no great options. There’s an Objective-C library that at the moment of publishing this article hasn’t been updated in more than 2 years. There’s another C library that can create and write XLSX files, but as far as I know, can’t read one.
And implementing support for a file format is a huge and complex task, right? There’s some compressed data involved, multiple entities to parse and there’s a ton of variation in different files which is hard to support?
Turns out, an XLSX file is a simple zip archive, you can poke at its raw content with a simple terminal command:
$ unzip file.xlsx
Archive: file.xlsx
inflating: _rels/.rels
inflating: docProps/core.xml
inflating: docProps/app.xml
inflating: xl/workbook.xml
inflating: xl/_rels/workbook.xml.rels
inflating: xl/theme/theme1.xml
inflating: xl/worksheets/sheet1.xml
inflating: xl/sharedStrings.xml
inflating: xl/styles.xml
inflating: [Content_Types].xml
When there’s a need to get data out of a single worksheet, there are only 3 files to parse:
_rels/.rels
– this XML file stores paths to “documents” contained within the
main file.xlsx
archivexl/_rels/workbook.xml.rels
– this file referenced from _rels/.rels
contains paths of worksheets that documents in the archive havexl/worksheets/sheet1.xml
– this is where worksheet data lives: rows,
columns and cells.There are multiple ways to implement parsers for different formats. JavaScript,
for example, comes with a standard
JSON.parse
function, which is able to parse JSON, but for any other format, you’re on your
own. All 3rd-party libraries for JavaScript like to invent their own parsing
APIs from scratch.
This was even worse not so long ago with the first few versions of Swift when we not only had ad-hoc APIs for other formats like XML but dozens of JSON parsing libraries as well. 😅 But I’m so happy to deal with Swift nowadays after it gained a standard API for almost any serialization format you’d need.
The main concepts in this serialization API in Swift are the Codable
typealias:
typealias Codable = Decodable & Encodable
and the Decodable
/Encodable
protocols:
protocol Decodable {
init(from decoder: Decoder) throws
}
protocol Encodable {
func encode(to encoder: Encoder) throws
}
There are also Encoder
and Decoder
protocols mentioned above, but they
don’t really matter in the day-to-day use of Codable
. The most wonderful thing
here is that Swift compiler is able to derive an implementation of Codable
for your types automatically, as long as Codable
implementation is available
on types of properties that you use. And obviously, Codable
is already
implemented on all primitive types like Bool
, Int
, String
etc.
Even better, implementation of Codable
on your types can be reused for
multiple formats if needed. For example Foundation
module provides standard
JSONEncoder
and
JSONDecoder
types that allow serialising your data in and out of JSON in a standard way.
But for our case, we only need a 3rd-party implementation of this for XML, like
XMLCoder
.1
As shown earlier, our first step towards reading data from an Excel file is
parsing _rels/.rels
in the archive, which looks like this:
<Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship
Id="rId1"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/>
<Relationship
Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/>
<Relationship
Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="xl/workbook.xml"/>
< /Relationships>
The most primitive node we’d like to handle in a type-safe manner is Type
of the Relationship
, let’s create an enum
for it (I’ve replaced long schema
urls with [...]
for brevity):
enum SchemaType: String, Codable {
case officeDocument = "[...]/officeDocument"
case extendedProperties = "[...]/extended-properties"
case coreProperties = "[...]/core-properties"
case worksheet = "[...]/worksheet"
case sharedStrings = "[...]/sharedStrings"
case styles = "[...]/styles"
case theme = "[...]/theme"
}
A complete version is available on GitHub.
Note the Codable
conformance on this enum
, we don’t need to implement
anything for it manually as it will be generated automatically by the compiler
for us. The next step is the Relationship
node itself:
struct Relationship: Codable {
let id: String
let type: SchemaType
let target: String
}
And then it’s only the root Relationships
node left:
struct Relationships: Codable {
let items: [Relationship]
enum CodingKeys: String, CodingKey {
case items = "relationship"
}
}
Note the CodingKeys
enum here conforming to the
CodingKey
protocol. It allows us to specify a different name for a property that contains
the decoded data. In our case, there are multiple Relationship
children nodes
within the root Relationships
node. It’s better when these nodes are stored
in a property named items
rather than a default relationship
property
you’d have to create without CodingKeys
.
The main property we’re interested in here though is the target
on a
Relationship
node with type officeDocument
, it contains a path to the
document. But also, remember that all of this XML is stored within a zip
archive, how do we unzip it in the first place?
A great indicator of Swift ecosystem maturity is the number of good libraries you can find for the most popular tasks when working on production apps. And it’s pretty good, well, except for XLSX libraries until now. 😂 But at the moment we also need to uncompress a zip archive in memory and it looks like ZIPFoundation is the best implementation that supports both Linux and Apple platforms.
Let’s create the main entry point for the public API of our XLSX library:
public struct XLSXFile {
}
The main type we need to work in ZIPFoundation
is Archive
, let’s
add it as a private property and also create an instance of XMLCoder
to reuse
later:
public struct XLSXFile {
public let filepath: String
private let archive: Archive
private let decoder: XMLDecoder
public init?(filepath: String) {
let archiveURL = URL(fileURLWithPath: filepath)
guard let archive = Archive(url: archiveURL, accessMode: .read) else {
return nil
}
self.archive = archive
self.filepath = filepath
decoder = XMLDecoder()
}
}
Now we also need a generic way to parse an XML file from the archive, let’s
add a function on XLSXFile
and also a helper Error
subtype for it.
This parseEntry
function should take a path like _rels/.rels
and a type
of representation of a root node and return an instance of that type or throw
an error:
public enum XLSXReaderError: Error {
case archiveEntryNotFound
}
public struct XLSXFile {
// ...
private let archive: Archive
private let decoder: XMLDecoder
/// Parse a file within `archive` at `path`. Parsing result is
/// an instance of `type`.
func parseEntry<T: Decodable>(_ path: String,
_ type: T.Type) throws -> T {
guard let entry = archive[path] else {
throw XLSXReaderError.archiveEntryNotFound
}
var result: T?
_ = try archive.extract(entry) {
result = try decoder.decode(type, from: $0)
}
return result!
}
}
And now we can use this function to get the list of documents within
the archive. Here .convertFromCapitalized
setting specifies we’d like
to convert capitalised attribute names to lowercase properties so that Target
becomes target
, Type
becomes type
and Id
is mapped to id
.
public struct XLSXFile {
// ...
/// Return an array of paths to relationships of type `officeDocument`
func parseDocumentPaths() throws -> [String] {
decoder.keyDecodingStrategy = .convertFromCapitalized
return try parseEntry("_rels/.rels", Relationships.self).items
.filter { $0.type == .officeDocument }
.map { $0.target }
}
}
I omit a completely boring 20-line function parseWorksheetPaths
where we parse
a similar relationships file to get paths to the actual worksheets. While this
function is going to be used later, it’s very similar to parseDocumentPaths
we’ve just implemented above. Instead, let’s focus on a much more interesting
model type Worksheet
:
public struct Worksheet: Codable {
public let sheetPr: SheetPr?
public let dimension: WorksheetDimension
public let sheetViews: SheetViews
public let sheetFormatPr: SheetFormatPr
public let cols: Cols
public let sheetData: SheetData
public let mergeCells: MergeCells?
}
This type directly represents a worksheet XML you’ll be able to find in most
common XLSX files. Basically, the rest of the source code of the parsing
library is struct
declarations of the property types you find on the root
Worksheet
XML node and the rest of the types match its descendant nodes.
For example, one such type is Cell
:
public struct Cell: Codable, Equatable {
public let reference: String
public let type: String?
public let s: String?
public let inlineString: InlineString?
public let formula: String?
public let value: String?
enum CodingKeys: String, CodingKey {
case formula = "f"
case value = "v"
case inlineString = "is"
case reference = "r"
case type = "t"
case s
}
}
And sorry I don’t know what exactly the attribute s
stands for ¯\_(ツ)_/¯
.
It’s still not easy to find detailed documentation on all of the attributes, but
I’ve been able to provide sensible names to most of them.
For some reason worksheet XML nodes and attributes aren’t capitalised, so we
need to reset the decoding strategy for parsing them to .useDefaultKeys
. This
will leave node and attribute names lowercase as they are.
public struct XLSXFile {
// ...
private let decoder: XMLDecoder
/// Parse a worksheet at a given path contained in this XLSX file.
public func parseWorksheet(at path: String) throws -> Worksheet {
decoder.keyDecodingStrategy = .useDefaultKeys
return try parseEntry(path, Worksheet.self)
}
}
Now we’ve got the library working, let’s read a simple file:
guard let file = XLSXFile(filepath: "./file.xlsx") else {
fatalError("XLSX file corrupted or does not exist")
}
for path in try file.parseWorksheetPaths() {
let ws = try file.parseWorksheet(at: path)
for row in ws.sheetData.rows {
for c in row.cells {
print(c)
}
}
}
This will print every cell in every worksheet in an XLSX file provided. And that’s mostly all there is to it. 🙌
It seems like there are no cross-platform libraries for parsing XLSX available
to use with Swift. Turns out, implementing file format support from scratch
isn’t that hard with the great
Codable
API available in Swift standard library. XLSX files are actually zip archives
with multiple XML files within. In this article, we had a look at the code that
decompresses an archive in memory and parses the XML files to get cell data for
any given worksheet.
I’ve published this as CoreXLSX
package available to use with CocoaPods and Swift
Package Manager. I hope this library saves
you some time when you need to read something from an XLSX file in an automated
way. Or maybe you could use it as an example for implementing other file formats
in your apps and libraries. Either way, I’d appreciate your feedback on
Twitter or
Mastodon and would be very happy if you star
the repository and create issues and pull requests at GitHub. 🤩
This library is a hard fork of a great
XMLParsing
package by Shawn
Moore. Unfortunately, it looks like the
original repository doesn’t accept PRs for known issues and hasn’t been
updated in some time.
Some of these
issues
were blocking my XLSX parsing implementation. To make matters worse, CocoaPods
doesn’t allow dependencies on forks with the same name as original libraries,
so I had to create a fork with a new name. I wish I could merge my
contributions back. 😔
If you enjoyed this article, please consider becoming a sponsor. My goal is to produce content like this and to work on open-source projects full time, every single contribution brings me closer to it! This also benefits you as a reader and as a user of my open-source projects, ensuring that my blog and projects are constantly maintained and improved.