Various ways to parse raw data with MKDataScanner

Recently I did some more work with raw data and missed a simple tool like NSScanner to do the job. So I created MKDataScanner, which is NSScanner for NSData (byte by byte) and for raw file content. It's not a big deal, just a tool I want to use to improve my codebase. Support for several data providers with differently streamed file content is what interests me most. That led me to this blog post about how to read large files on iOS.

in memory

Mobile devices like the iPhone have a limited amount of memory available to a single application (process). Still, when I want to work with a file I can load the whole file into memory and scan it from a single NSData object (NSData memory is organized as a single contiguous block). This may not be the best way to deal with file content, especially for large files, but it is certainly the simplest:

NSData *fileContent = [NSData dataWithContentsOfFile:@"/path/file.data"];

Nonetheless, when I know that my file is small, this is the fastest possible way to randomly access its content. There is no need to engage more sophisticated methods just to process a few bytes of data.
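
To make "one contiguous block, random access" concrete, here is a minimal sketch in plain C. It is not how NSData or MKDataScanner is implemented; `load_file` and `scan_u32_be` are hypothetical helper names made up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: loads the whole file into one contiguous heap block.
// Returns NULL on failure; the caller frees the returned buffer.
uint8_t *load_file(const char *path, size_t *out_len) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    uint8_t *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc(size > 0 ? (size_t)size : 1);
            if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
                free(buf); // short read: treat as failure
                buf = NULL;
            }
            if (buf != NULL) *out_len = (size_t)size;
        }
    }
    fclose(f);
    return buf;
}

// Hypothetical helper: reads a big-endian 32-bit value at any offset.
// Returns 0 on success, -1 if the read would run past the end of the buffer.
int scan_u32_be(const uint8_t *buf, size_t len, size_t offset, uint32_t *out) {
    if (offset > len || len - offset < 4) return -1;
    *out = ((uint32_t)buf[offset] << 24) | ((uint32_t)buf[offset + 1] << 16) |
           ((uint32_t)buf[offset + 2] << 8) | (uint32_t)buf[offset + 3];
    return 0;
}
```

Once the bytes are in memory, jumping to any offset is just pointer arithmetic, which is why this approach is hard to beat for small files.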

mmap

For large files a better approach is to map the file from disk into virtual memory. This can be done by passing a reading option. A mapped file uses virtual memory techniques to avoid copying pages of the file into memory until they are actually needed.

NSData *fileContent = [NSData dataWithContentsOfFile:@"/path/file.data" options:NSDataReadingMappedAlways error:nil];

Because of file mapping restrictions, this method should only be used if the file is guaranteed to exist for the duration of the data object’s existence. (NSData Class Reference)

It looks like mapped files are unmapped when a low-memory warning is sent to the application, so I should remember to handle this scenario on my own: either stop using the data object or reinitialize it.
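
For comparison, this is roughly what file mapping looks like one level down, with the POSIX mmap call. It is only a sketch of the technique, not a claim about what NSData does internally; `map_file` and `unmap_file` are hypothetical names.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: maps a whole file read-only.
// Returns NULL on failure (including an empty file, which cannot be mapped).
const unsigned char *map_file(const char *path, size_t *out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    // Pages are loaded lazily, on first access, not at mmap time.
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return NULL;

    *out_len = (size_t)st.st_size;
    return p;
}

// Releases a mapping returned by map_file.
void unmap_file(const unsigned char *p, size_t len) {
    munmap((void *)p, len);
}
```

The returned pointer can be indexed like an ordinary buffer, but the file has to stay in place while the mapping is in use, which is the same restriction the NSData documentation warns about.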

stream

Another way is to scan the stream in chunks using NSInputStream. It's easy to use, and memory usage should be low because only one block of data is kept in memory at a time, not the whole file. The NSStream family of classes is considered a lower-level interface; I could use NSFileHandle instead, for instance (Apple actually encourages it as the more abstract interface).

To read from a stream, first create and open it:

NSInputStream *stream = [NSInputStream inputStreamWithURL:fileURL];
[stream open];

Now read and process the data. The current location in the stream advances with every read. I can move forward but not backward (a stream is linear access); to go back I have to reset the stream (close and re-open it).

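uint8_t buffer[4096]; // chunk buffer; the size is an arbitrary choice
NSInteger result = [stream read:buffer maxLength:sizeof(buffer)];
if (result < 0) {
    // read failed, details are in stream.streamError
    return;
} else if (result == 0) {
    // EOF
} else {
    // result is the number of bytes actually read
    NSData *data = [NSData dataWithBytes:buffer length:(NSUInteger)result];
}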

Get the current position like this:

NSNumber *value = [stream propertyForKey:NSStreamFileCurrentOffsetKey];
NSInteger position = value.integerValue;

Close the stream when done:

[stream close];
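
The same open, read-in-chunks, check-offset, close cycle can be sketched in plain C with stdio. This illustrates the pattern only, not NSInputStream itself; `sum_bytes_in_chunks` is a hypothetical name, and summing the bytes stands in for whatever processing each chunk really needs.

```c
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: reads the file in fixed-size chunks, sums every byte,
// and reports the final offset. Returns 0 on success, -1 on error.
int sum_bytes_in_chunks(const char *path, unsigned long *out_sum, long *out_offset) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;

    unsigned char buffer[4096]; // only this block is held in memory
    unsigned long sum = 0;
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, f)) > 0) {
        for (size_t i = 0; i < n; i++) sum += buffer[i]; // process the chunk
    }

    int failed = ferror(f);  // distinguishes a read error from EOF
    long offset = ftell(f);  // current position, here the end of the file
    fclose(f);
    if (failed || offset < 0) return -1;

    *out_sum = sum;
    *out_offset = offset;
    return 0;
}
```

Memory use stays at the size of the buffer no matter how large the file is, at the cost of giving up cheap random access.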

Remember there is also NSFileHandle:

For sockets, pipes, and devices, you can use a file handle object to monitor the device and process data asynchronously.

dispatch i/o

Dispatch I/O came with Grand Central Dispatch and looks very GCD'ish. The interface is asynchronous by default, so I need a semaphore if I want to make it synchronous.

Dispatch I/O channels are the preferred way to read and write files because they give you direct control over when file operations occur but still allow you to process the data asynchronously on a dispatch queue. (Apple)

I want fast random access to any part of the file, so I create an I/O channel of type DISPATCH_IO_RANDOM:

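// Argument order is (type, path, oflag, mode, queue, cleanup_handler); mode is only used when creating a file.
dispatch_io_t io = dispatch_io_create_with_path(DISPATCH_IO_RANDOM, "/path/file.data", O_RDONLY, 0, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), NULL);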

Then read until the file reaches EOF.

If the handler is submitted with the done parameter set to YES, an empty data object, and an error code of 0, it means that the channel reached the end of the file.

Another way to detect the end of the file is to compare the current offset with the actual file size.

dispatch_io_read(io, 0, SIZE_MAX, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(bool done, dispatch_data_t data, int error) {
    // process the data and stop when done.
});
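
The offset-versus-size comparison mentioned above can be sketched with fstat. This is an assumption about how one might do it, not something Dispatch I/O requires: `is_at_eof` is a hypothetical helper, and with a dispatch channel the offset would have to be tracked by hand, for example by adding up dispatch_data_get_size(data) in the read handler.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: returns 1 if `offset` is at or past the end of the file
// open on `fd`, 0 if data remains, -1 if the file size cannot be read.
int is_at_eof(int fd, off_t offset) {
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    return offset >= st.st_size ? 1 : 0;
}
```

The size is read at call time, so the answer can go stale if another process appends to the file in the meantime.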

Remember to close the channel:

dispatch_io_close(io, 0); // pass DISPATCH_IO_STOP instead of 0 to cancel pending operations

MKDataScanner

MKDataScanner is available on GitHub. See the project README for details. Share if you like it. Star if you like it. Send pull requests if you want to.

@krzyzanowskim