PicoStorage : C++ Lightweight Structured Storage Library

Introduction

Structured Storage allows you to have a hierarhical organisation, similar to a file system, inside a single file. Example applications of structured storage are:

What is PicoStorage : C++ Lightweight Structured Storage

PicoStorage is an implementation of the Structured Storage concept. PicoStorage allows you to organize and access data that is stored in a single filesystem file hirarchically, in directories and files. The distinctive traits of PicoStorage are:

Interface Overview

Example 1: here we open a storage from an OS file, and get a reference to its root directory; we create a new file in an existing directory, and write a text to the file.
#include "Storage.hpp"

int main() {
  Dir root    = Storage::openFile("store.lss", EXISTING_OR_CREATE_INIT);
  Dir dir1    = root->openDir("dir1", EXISTING);  
  File myFile = root->openFile("myFile", CREATE);
  const char txt[] = "hello world";
  myFile->write(txt, sizeof(txt));
  return 0;
}
Note: The main concepts are Dir and File. Dir represents a directory. You use it for opening / creating, copying and moving files or sub-directories. You use a File for reading and writing the file, and for obtaining/setting the file size and the file pointer.

                  As you see in the example, you don't need to use new or delete. Dir and File are smart pointers: they behave mostly like ordinary pointers, but you don't have to deallocate (delete) them.

                  Example 2: enumerating the content of a directory.

#include "Storage.hpp"
#include <stdio.h>

int main() {
  Dir root    = Storage::openFile("store.lss", EXISTING_OR_CREATE_INIT);
  Dir someDir = root->path("foo/bar/myDir", EXISTING);  
  for (DirIt it = someDir.begin(); !it->atEnd(); it->next()) {
    fprintf("%s\n", it->getName());
  }
  return 0;
}

Interface

typedef std::vector<std::string> SegmentPath;

class BaseDir;
typedef SmartPtr<BaseDir> Dir;

class BaseDirIt;
typedef SmartPtr<BaseDirIt> DirIt;

class BaseFile;
typedef SmartPtr<BaseFile> File;

class BaseDir : public RefCount {
public:
    virtual File openFile(const std::string &name, 
                          OpenMode mode) = 0;

    virtual Dir openDir(const std::string &name, 
                        OpenMode mode) = 0;

    Dir path(const std::string &pathName, OpenMode mode);
    Dir path(const SegmentPath &pathName, OpenMode mode);

    void copy(Dir dstDir);
    void move(Dir dstDir);

    bool copy(const std::string &srcName, Dir dstDir, const std::string &dstName);
    bool move(const std::string &srcName, Dir dstDir, const std::string &dstName);

    virtual void remove(const std::string &name) = 0;

    virtual DirIt begin() = 0;
};

class BaseFile : public RefCount {
public:
    static const uint64 kMaxFileSize = 1LL << 48;

    virtual uint32 read(void *buf, uint32 nBytes) = 0;
    virtual uint32 pread(void *buf, uint32 nBytes, uint64 pos) = 0;

    virtual void write(const void *buf, uint32 nBytes) = 0;
    virtual void pwrite(const void *buf, uint32 nBytes, uint64 pos) = 0;

    virtual uint64 seek(int64 pos, int whence = S_SET) = 0;
    virtual uint64 tell() = 0;

    virtual uint64 getSize() = 0;
    virtual void setSize(uint64 size) = 0;

    uint64 copyRange(uint64 nBytes, uint64 srcPos, uint64 dstPos);
    
    uint64 copyRange(File dst, uint64 nBytes, 
                     uint64 srcPos, uint64 dstPos);

    void copyTo(File dst);
};

class BaseDirIt : public RefCount {
public:
    bool operator==(DirIt it);
    virtual const std::string &getName() = 0;
    virtual bool isDir() = 0;
    virtual void next() = 0;
    virtual bool atEnd() = 0;
};

Design and Implementation

The idea is to have a simple and minimalist but efficient hiearchical storage layer. More complex functionality (e.g. Unix-like access rights, MS properties, different categories of files, etc) can later be built upon this layer.

The persistent file format

This is the specification of the file format used by PicoStorage. This is how PicoStorage organizes data in the single filesystem file in order to support multiple files and directories inside it.

                  The storage file format is platform-independent and indianess-independent. This allows to move the storage file between different computers. The numerical values are stored as big-endian (aka. network byte order).

                 

Header 8 bytes. Contains a magic (signature) number (4 bytes), and the file format version number (4 bytes).
Inodes table size The number of entries in the inodes table. (4 bytes).
Inodes table A sequence of inode entries. The number of entries is given by the Inodes table size (previous field). Each inode entry 12 bytes. The first inode entry (with index 0) is never used. See below for details.
Allocated areas and empty areas (slices). Each allocated slice is referenced by exactly one inode entry. The empty slices are not referenced.

                  An inode entry identifies a slice. A slice is a contigous sequence of bytes in the storage file. Slices can't overlap. A slice must start after the inodes table, and must be completely contained in the storage file.

                  An inode entry:

                 

Start offset the offset inside the storage file of the first byte of the slice. (8 bytes)
Len the length in bytes of the slice. (4 bytes). (The slice is a contigous range of bytes)

                  PicoStorage supports big files (>2GB), in the sense that the storage file (the underlying filesystem file) can be larger than 2GB. The Start offset of a slice is 8bytes in order to allow slices to be allocated after the 2GB or 4GB offset. But the length of a slice uses 4bytes, which means that the length of a file inside the storage is limited to 4GB.

                  We use the notation (<offset>, <len>) for an inode entry and to denote a slice.
The value (0, 0) for an inode means that the inode is not allocated. It means that the inode is not in use and can be allocated when somebody requests an inode.
The value (1, 0) for an inode is special, and means an inode that is allocated, but points to a slice with 0 lenght. Any slice that has 0 lenght must use this special value. Note that this value (1, 0) does not conflict with any ordinary inode value, because it has a start offset of 1, which is before the inodes table; but all ordinary slices start after the end of the inodes table.
So all normal slices have length of at least 1. They can't overlap. Two normal slices can not start at the same offset.

                  We call inode number the index of an inode in the inode table. The inode number 0 is never used. The size of the inode table is limited to 2^31, so we can have at most 2^31 - 1 inodes (because 0 isn't used).

File

A file is associated with a slice. A file is identified by a reference to the storage and the inode number of the slice. Thus the intrinsec memory overhead of an open file is mainly: a pointer to the storage class, the inode number (4bytes), and the file pointer position (8bytes).

                  Because a file corresponds to a slice, and a slice is contigous, a file is always contigous. There is no file fragmentation, because a file is always composed of a unique contigous fragment. When a file needs to grow (typically, when writing data at the end, or past the end of the file) its slice needs to grow. If there's no empty (available) space just after the current end of the slice, the slice is moved to a new position where it has the space needed to grow. This slice move implies a copy in the underlying storage file of a lenght equal to the current lenght of the file. So, growing a big file can potentially be slow.
I did this tradeoff: prefer simplicity (a file is contigous) against potentially slow grow for large files. Here's why: typically, the files inside the storage are not very large. File which are of an order of 100MB in size are more likely to be stored as filesystem files. In storage we more often find e.g. graphic sprites a few KB in size. So the thradeoff is not a problem for the typical use. Also, it's ok to use large files as long as they don't grow often. Considering for example the case of a big log file, where data is always appended at the end. A possible solution is to pre-allocate the space by growing the file in a few big leaps rather than in many small increments.

Directory

A directory works above a file. It imposes a particular structure to the file content.

                  A directory is organized as a sequence of entries, possibly with empty space in between. The entries are variable size. Between successive directory entries there can be empty space. The empty space bytes are always zeroed. An entry has this structure:

                 

Name lenght (1 byte)
Name Variable size. Contains name lenght bytes. The name is 8bit clean (e.g. the null byte is legal in name).
Inode number and type. (4 bytes). The most significant bit is set to 1 if the referenced inode corresponds to a directory. Otherwise (most significant bit is 0) the referenced inode corresponds to a file.
The rest of the bits contain the inode number. (hence the limit of 2^31 inodes)

                The names of the directory entries are limited to 255 bytes. They are 8bit-clean. They need not be 0-terminated (in fact, 0 is a legal byte inside the name). This allows to use any encoding for the entry names (e.g. it's possible to use UTF8, UTF16, ISO8859-2 -- it's up to the application).

                The directory structure, by using variable-lenght entries, is space-efficient: the maximum name lenght is 255bytes, but if the names are short, the directory will take up less space.

                The root directory has inode number 1.

Download Lightweight Structured Storage Library

PicoStorage is available under the terms of the GNU General Public License (GPL). The source code compilation was tested with

You can use these tools:
Scons a great Make replacement.
Mingw GCC for Windows.

 

version date notes
0.0 2003/10/15 GPL license picoStorage-0.0-src-GPL.tgz
0.0 2004/04/05 BSD license picoStorage-0.0-src-BSD.tgz

 

Copyright (C) 2003 Mihai Preda.

PicoStorage : C++ Lightweight Structured Storage Library