BeOS File System API Notes
Version: 20030505 (not finished yet)
For an implementation example, see AGMSRAMFileSystem on BeBits.
These are hints about writing for the BeOS File System Independent/Interface Layer (FSIL), essentially just an explanation of the stuff in fsproto.h, which is partially quoted here (fsproto.h is "Copyright 1999, Be Incorporated. All Rights Reserved. This file may be used under the terms of the Be Sample Code License."). Thanks also go to Dominic Giampaolo for his book "Practical File System Design with the Be File System", to Standard & Western Software for the helpful comments in their NTFS source code, to George Hoffman's article in http://www-classic.be.com/aboutbe/benewsletter/Issue93.html, and to Marco Nelissen for a description of the arguments for each notification operation. This commentary is copyright © 2001 by Alexander G. M. Smith, you may copy it provided you keep all these credits and don't change it for the worse (feel free to add stuff and your own copyright). To contact me (perhaps with updates for this document), search the world wide web for "Alexander G. M. Smith" to find my web site and current e-mail address.
vnode_id
This 64 bit number identifies a particular object on a given file system, be it a file or directory, or even a deleted file or directory (one which doesn't have a parent directory and is thus largely invisible). Well, actually a 63 bit number as pointed out in the example FAT file system comments: "Unfortunately, vnode_id's are defined as signed. This causes problems with programs (notably cp) that use the modulo of a vnode_id (or ino_t) as a hash function to index an array. This means the high bit of every vnode_id is off-limits." It's up to your file system to assign meanings to the numbers, they could be disk block numbers, memory addresses, or anything you want, so long as they can uniquely identify files and directories (so each file or directory should have only one vnode_id number assigned to it). The OS finds out about new numbers by searching directories (see walk), it never guesses. The only one it knows in advance is the vnode_id for the root directory, which it gets from your file system as part of the mount operation.
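To make the 63 bit point concrete, here is a minimal sketch in C. The vnode_id typedef is a stand-in for the real one in the BeOS headers, and make_vnode_id and hash_bucket are hypothetical helpers invented for this illustration. Masking off the top bit only works if your raw identifiers never need that bit to stay unique.

```c
#include <stdint.h>

/* Stand-in for the BeOS typedef; the real one comes from the system headers. */
typedef int64_t vnode_id;

/* Hypothetical helper: turn a raw 64 bit identifier (a block number, say)
   into a vnode_id with the sign bit cleared. */
static vnode_id make_vnode_id(uint64_t raw)
{
    return (vnode_id)(raw & INT64_MAX);
}

/* The kind of naive hash the FAT comments complain about: the modulo of the
   id is used as an array index. A negative id gives a negative index. */
static int hash_bucket(vnode_id id, int bucket_count)
{
    return (int)(id % bucket_count);
}
```

With the high bit kept clear, hash_bucket always lands inside the array. Feed it a negative vnode_id and the index goes negative, which is the problem cp runs into.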
vnode & FSIL shadow vnode
Vnode stands for "virtual inode"; inodes are a term from UNIX operating systems (try "man inode" to find out more). Each one represents a file or directory (or anything else you can invent if plain files aren't good enough for you; UNIX adds sockets, pipes, special devices and other doodads). Under BeOS the OS allocates an internal vnode record for each file/directory in use, and only one (even if the same file has been opened by several different programs). Conceptually in parallel with each vnode record is your FSIL shadow vnode record, allocated at the same time the vnode record is created (see get_vnode, read_vnode). Each vnode record contains the 64 bit vnode_id, followed by the 32 bit value that your file system provided as the shadow vnode pointer via read_vnode (usually used as a private file system pointer to per-file data), plus a name space identifier (pointer or index to the file system instance/volume the file belongs to), and a reference counter, and other internal stuff. Your shadow portion of the vnode can be whatever you want it to be, so long as it can be passed around to other operations, such as open, close, read directory, etc. In other words, BeOS sees it as just a pointer sized (32 bit) value, which doesn't have to be a pointer since only your code will use it (you can use an array index, or the vnode_id itself if your vnode_id fits in 32 bits or less). Typically your shadow vnode will be an in-memory version of the data structure you have on-disk for cataloging information about a file or directory (owner, file size, date modified, and other things, maybe or maybe not including the name and parent directory depending on the design). Some file systems have it as a copy of the actual disk sector with the info about the file, but then that forces them to use one whole sector for every file's info, though it does make it easy to write the updates back to the disk.
If you want to guarantee multitasking safety on a per-file level, your shadow vnode will have a semaphore as part of its data.
nspace_id
This is the same as a dev_t, which is a long integer (32 bits) that identifies a device/volume. It stands for name-space ID. Internally BeOS has a list or array of active file system instances (each one a name-space record containing the nspace_id, a table of function pointers to your file system functions, and the data pointer (really just a 32 bit number, doesn't have to be a pointer) you returned during the mount operation). You get that data pointer passed in as the *ns argument to many functions. The actual nspace_id value is either an array index or pointer to that internal OS record. Anyway, it distinguishes between different mounted file systems, and is used to identify the particular instance of a file system for calls like Notify, which along with a vnode_id tell the OS that a particular file on a particular file system has changed.
List of Functions with Explanations
int read_vnode (void *ns, vnode_id vnid, char r, void **node);
This call from the OS converts a vnode_id into actual data stored in memory, in your shadow portion of a vnode. The ns pointer identifies the file system / name space instance (you returned it in the mount call, probably as a pointer to your file system's semi-global data). Besides returning this node thing (could be a number or anything that fits in a pointer sized 32 bits) that gets saved in the OS's internal vnode record, it also implies that someone is using that file/directory, until write_vnode (or remove_vnode) is called to tell you that the OS has finished using it. Typically your code will read the catalog info on the file/directory and store it in freshly allocated memory, returning a pointer to that memory. If you are doing file level multitasking locking, this is a good place to create a semaphore to protect that file, naturally storing it in your shadow vnode data for easy access in other file operations. If you want to do the same read_vnode operation yourself, don't call your read_vnode function, instead call get_vnode, and the OS will call read_vnode if it doesn't already have that vnode loaded. Finally, the "r" field is a boolean flag to signal a re-entrant call so that you know it was called from a function which your file system called (often from your walk function calling the OS's get_vnode which then calls your read_vnode), in case that affects your usage of semaphores or other multitasking locks. Note that read_vnode is single threaded with respect to a particular vnode_id, so you don't have to worry about two threads trying to load the same vnode_id at the same time. Returns the usual error codes.
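Here is a rough sketch of what a read_vnode hook and its matching write_vnode might look like for a toy file system that keeps its catalog in an array. Everything named my_* is hypothetical, the vnode_id typedef is a stand-in for the real header, and the semaphore creation is left as a comment since it needs the BeOS kernel calls. On BeOS the errno constants are already negative, so returning ENOENT as-is matches the "usual error codes".

```c
#include <errno.h>
#include <stdlib.h>

typedef long long vnode_id;          /* stand-in for the BeOS typedef */

#define MY_MAX_FILES 16              /* hypothetical tiny catalog size */

struct my_catalog_entry {            /* hypothetical on-disk record */
    int       in_use;
    long long size;
};

struct my_volume {                   /* the semi-global data returned by mount */
    struct my_catalog_entry catalog[MY_MAX_FILES];
};

struct my_node {                     /* the shadow vnode */
    vnode_id  id;
    long long size;
};

int my_read_vnode(void *ns, vnode_id vnid, char r, void **node)
{
    struct my_volume *vol = ns;
    struct my_node *n;

    (void)r;  /* re-entrancy flag; only matters once locks are taken here */
    if (vnid < 0 || vnid >= MY_MAX_FILES || !vol->catalog[vnid].in_use)
        return ENOENT;
    n = malloc(sizeof *n);
    if (n == NULL)
        return ENOMEM;
    n->id = vnid;
    n->size = vol->catalog[vnid].size;
    /* A per-file semaphore would be created here and stored in *n. */
    *node = n;
    return 0;
}

int my_write_vnode(void *ns, void *node, char r)
{
    (void)ns; (void)r;
    /* Delete the per-file semaphore here if one was created. */
    free(node);
    return 0;
}
```

The vnode_id is used directly as the array index, which is one of the simple numbering schemes the vnode_id section allows.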
int write_vnode (void *ns, void *node, char r);
The OS calls this when it has finished using the vnode: nobody is using it (the vnode reference count is zero) and the OS wants to free up its vnode record. This can include parent directories of an open file. To force flushing of vnodes, do an "ls -R" on some other file system which has a lot of files, as there is a limited number of active vnodes (hopefully fewer than the number of semaphores if you allocate one semaphore per active shadow vnode). See also remove_vnode, which gets called instead of this when deleting a file. You can now write the file info back to disk, or leave it in your cache (the book recommends that you don't change the disk data here; I guess they want you to do it at the actual time of the main operations). In any case, this is a cue to deallocate memory and delete semaphores you allocated in read_vnode. The r flag tells you if this is a re-entrant call (one of your file system functions called something in the OS which then called your write_vnode). Single threaded with respect to a particular vnode_id. Returns the usual error codes.
int remove_vnode (void *ns, void *node, char r);
When everybody has finished using a deleted vnode, this gets called, instead of write_vnode. Your file system should then free all resources and also remove the file/directory/thing (disk blocks etc) from your file system. As usual, the r flag is set for re-entrant calls and it returns an error code.
int secure_vnode (void *ns, void *node);
Security related feature. Verifies that the node is valid and access is allowed. This one is used by the OS directly for verifying security on everyday file operations. I think it may be missing a few arguments (I assume you will have to check the current thread's info to find security settings), and BFS apparently doesn't implement it.
int walk (void *ns, void *base, const char *file, char **newpath, vnode_id *vnid);
This is used for traversing down a path string to get the actual files (vnodes) the user wants. Given a directory (base is your shadow vnode directory info data that's associated with the directory vnode) and a file name (in file, with '/' characters already removed), see if you can find that file (or directory or whatnot) inside the base directory. You also have to handle requests for "." and maybe even "..". I incorrectly assumed that the vnodes of all parent directories up to the root level had all been get_vnode'd (otherwise where did it get the value for base?), but testing reveals that they won't stay in memory for the duration (they get flushed with write_vnode if BeOS needs vnodes for other operations). Return the vnode_id in vnid if you found it, and you also need to do a get_vnode on the found file too (otherwise you'll get a kernel error). The OS will do the corresponding put_vnode later, when it needs to recycle the vnode for other purposes. If the name doesn't exist in the directory, return ENOENT. If you find a symbolic link instead of a file, and newpath isn't NULL (if it is NULL, treat the symbolic link like a regular file), copy the symbolic link's string to newpath, using the new_path function to do the string copy/allocate (check if it fails before assigning the result), and don't do get_vnode (or undo it with put_vnode if you did it). BeOS seems to use walk in pairs to find subdirectories, once with the name in the parent directory and once with "." as the name in the vnode it found in the first walk - I think it is to verify the directory access permissions, walking the "." file tests the permissions of that particular directory.
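A stripped down walk might look something like the sketch below. It handles ".", ".." and a plain name lookup, and leaves symbolic links out entirely. The my_dir and my_entry structures are hypothetical, and stub_get_vnode is a simplified stand-in for the kernel's get_vnode that just counts calls, so don't take its signature as the real one.

```c
#include <errno.h>
#include <string.h>

typedef long long vnode_id;     /* stand-in for the BeOS typedef */

struct my_entry {               /* hypothetical directory entry */
    const char *name;
    vnode_id    id;
};

struct my_dir {                 /* hypothetical shadow vnode of a directory */
    vnode_id               self;
    vnode_id               parent;
    const struct my_entry *entries;
    int                    count;
};

/* Stub standing in for the kernel's get_vnode(); it only counts calls. */
static int get_vnode_calls;
static int stub_get_vnode(vnode_id id)
{
    (void)id;
    get_vnode_calls++;
    return 0;
}

int my_walk(void *ns, void *base, const char *file, char **newpath,
            vnode_id *vnid)
{
    const struct my_dir *dir = base;
    vnode_id found = -1;
    int i, err;

    (void)ns; (void)newpath;    /* symbolic links are not handled here */
    if (strcmp(file, ".") == 0) {
        found = dir->self;
    } else if (strcmp(file, "..") == 0) {
        found = dir->parent;
    } else {
        for (i = 0; i < dir->count; i++) {
            if (strcmp(dir->entries[i].name, file) == 0) {
                found = dir->entries[i].id;
                break;
            }
        }
    }
    if (found < 0)
        return ENOENT;          /* name isn't in this directory */
    err = stub_get_vnode(found);/* the OS does the matching put_vnode later */
    if (err != 0)
        return err;
    *vnid = found;
    return 0;
}
```

Note that get_vnode is called for "." and ".." too, since the OS expects a reference on whatever vnode_id comes back.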
int access (void *ns, void *node, int mode);
Check security permissions to see if the caller is allowed to do the specified operations (mode flags R_OK, W_OK, X_OK and F_OK) on that particular node (the Posix standard says to check that all things in the path pass the permission test, but we only get one node to check so that's it). A mode of F_OK (0) just tests if the thing exists. The others check for the various read/write/execute permissions (using the calling thread's user and group settings). Of course, with BeOS using the root user (user #0) as the default, it automatically gets read and write permission to everything. Note that read only file systems will always fail W_OK, even for the root user. Combinations of flags check that any one or more of the flagged permissions is met. See the Posix access() function for more info, on how it should be done, not how I do it. Returns 0 for access allowed, or an error code for access denied. Only used by user programs, the OS uses secure_vnode for its security checks, and you can always add your own checks to the individual file manipulation functions.
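For comparison, here is a small sketch of the permission test done the POSIX way, where every flagged permission has to be granted (not the any-one-will-do behaviour described above). The check_access function and its argument list are made up for this example; a real hook would dig the file's owner and mode out of the shadow vnode and the caller's user and group out of the thread info. Letting root through without looking at the bits is a simplification.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>   /* R_OK, W_OK, X_OK, F_OK */

/* Hypothetical helper: returns 0 if access is allowed, else an error code. */
int check_access(mode_t perms, uid_t file_uid, gid_t file_gid,
                 uid_t uid, gid_t gid, int mode, int volume_read_only)
{
    int want = 0, granted;

    if (mode == F_OK)
        return 0;                       /* existence test only */
    if ((mode & W_OK) && volume_read_only)
        return EROFS;                   /* fails even for root */
    if (uid == 0)
        return 0;                       /* root gets everything else */

    /* Translate the request into rwx bits (r = 4, w = 2, x = 1). */
    if (mode & R_OK) want |= 4;
    if (mode & W_OK) want |= 2;
    if (mode & X_OK) want |= 1;

    /* Pick the owner, group or other triplet from rwxrwxrwx. */
    if (uid == file_uid)
        granted = (perms >> 6) & 7;
    else if (gid == file_gid)
        granted = (perms >> 3) & 7;
    else
        granted = perms & 7;

    return ((granted & want) == want) ? 0 : EACCES;
}
```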
int create (void *ns, void *dir, const char *name, int omode, int perms, vnode_id *vnid, void **cookie);
Create a new file, much like the open() function with the create flag specified (I suspect that you'll never see open() called with the create flag), but it also returns a vnode_id in addition to the open file cookie. It can get called with an existing file if the user calls create with an existing file name. You should use get_vnode to get the existing file in that case, and reuse it (other programs may have it open too). The dir specifies the shadow vnode portion of the vnode for the parent directory. The name is the name of the new file, the VFS layer filters out names which are too long (more than 256 bytes), but doesn't filter out names like "..". The omode argument specifies the open flags, much like open(), which specifies if it should fail if the file exists (or is a directory), or if the existing file should be truncated, or left as is. The PFSDWTBFS book specifies the details. Perms contains the new file access permission bits (rwxrwxrwx etc), which you need to modify by clearing the bits specified by the caller's umask setting (invert the mask and AND it), well, actually BeOS does this for you before passing in the permissions. The userid of the new file is presumably the same as the calling process, while the group is copied from the directory which it is created in. Besides returning the new vnode_id, also return a cookie to the newly open file when done. If you actually created the file, you also have to use new_vnode too (and activate your shadow vnode as if your read_vnode had been called) since having a cookie without an OS vnode plus your shadow vnode is kind of weird - yup, without it you get a kernel error: KERN 'sh'[1929]: KERNEL: FS agmsrfs0: CREATE(/Test/Junk, 0601): vn (2826a0b0) NULL! 
There may be a race condition with other programs trying to open the file when they see it appearing in an index (like the index of file names), so block other threads in read_vnode until it is ready, or don't let anyone know the new vnode_id until it is ready, then call notify_listener() and do the index stuff if you have indices to update.
int mkdir (void *ns, void *dir, const char *name, int perms);
Make a new directory. If you don't create them on the fly, also add entries for "." and ".." to the new directory. No vnode_id is returned, so you only have to update your data structures, and call notify_listener(), etc. No need for a call to new_vnode. The man pages for mkdir say: The directory path is created with the access permissions specified by perms and restricted by the umask of the calling process. The directory's owner ID is set to the process's effective user ID. The directory's group ID is set to that of the parent directory in which it is created.
int symlink (void *ns, void *dir, const char *name, const char *path);
Create a new symbolic link. Add the name to the given shadow vnode directory. The path argument contains the path string, which walk() later uses to redirect to the place the path names, so save it in association with the new symbolic link's name.
int link (void *ns, void *dir, const char *name, void *node);
Make a hard link to a file/directory/thing. Basically this puts it into the given directory, using the given name, possibly giving it more than one parent directory and multiple names. Often it is a good idea to have a reference counter in your node data so that a thing only gets deleted when it has been removed from all directories it was in. As you can expect, node is your shadow vnode of the thing. Remember to add it to the relevant indices (the new name, and all the attributes if this is the first link - can happen if someone deletes the file then relinks it).
int rename (void *ns, void *olddir, const char *oldname, void *newdir, const char *newname);
Rename the file (or other thing) from the old directory (given its shadow vnode), with the old name, to a new directory (may be the same as the old) with a new name. Must be done atomically, either it succeeds or it returns an error code and does nothing. If the new name exists, it also deletes the existing file in the new directory as well as doing the replacement. This is useful for replacing a file with a new version while ensuring that there is always a file available there, as far as other threads can tell. Avoid cyclical loops - don't rename a directory to be one of its own descendants. The POSIX standard says you can't replace a file with a directory and vice versa, but BFS does allow that. And of course update indices etc.
int unlink (void *ns, void *dir, const char *name);
Remove the file or symbolic link (not directory) with the given name from the given directory (and take that name out of the index if you allow different names in different parent directories). If it is listed in zero directories after that (if you are using hard links it could be in several directories), then also delete it from the file system. Normally the final delete is done by using get_vnode, taking it out of the remaining indices, then calling the OS's remove_vnode() (confusingly named the same as your remove_vnode () hook function) which sets the delete flag on the vnode, then calling put_vnode. When all the users of the vnode have called put_vnode, the OS sees the delete flag and will call your remove_vnode to actually deallocate the file (rather than using write_vnode), so all you should be doing here is taking it out of the directory. That's why the actual deletion of the file itself isn't done here, as there may still be open file handles using the file, even though it isn't listed in any directory.
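The order of events for the final delete is easier to follow as a toy model. The sketch below fakes one kernel vnode record with a reference count and a delete flag. All the sim_* names are invented for this illustration and none of this is the real kernel interface. It also pretends write_vnode happens the instant the count hits zero, whereas the real OS waits until it wants to recycle the vnode.

```c
/* Hypothetical model of one kernel vnode record; the real one is private. */
struct sim_vnode {
    int refs;         /* outstanding get_vnode calls and open handles */
    int delete_flag;  /* set by the kernel's remove_vnode() */
    int removed;      /* your remove_vnode hook would have run */
    int written;      /* your write_vnode hook would have run */
};

static void sim_get_vnode(struct sim_vnode *v)    { v->refs++; }
static void sim_remove_vnode(struct sim_vnode *v) { v->delete_flag = 1; }

static void sim_put_vnode(struct sim_vnode *v)
{
    if (--v->refs > 0)
        return;             /* someone still uses the vnode */
    if (v->delete_flag)
        v->removed = 1;     /* OS calls your remove_vnode hook */
    else
        v->written = 1;     /* OS calls your write_vnode hook */
}

/* The last name has just been taken out of its directory. */
static void sim_unlink_last_name(struct sim_vnode *v)
{
    sim_get_vnode(v);
    /* ...take the file out of the remaining indices here... */
    sim_remove_vnode(v);    /* mark it for deletion */
    sim_put_vnode(v);       /* deleted now only if nobody else holds it */
}
```

If a program still has the file open when the unlink happens, the file survives until that program's reference is dropped, and only then does the remove_vnode hook get its turn.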
int rmdir (void *ns, void *dir, const char *name);
Remove the directory with the given name from the given directory. If the directory isn't empty (or the named thing isn't a directory), return an error code and do nothing. Otherwise similar to unlink.
int readlink (void *ns, void *node, char *buf, size_t *bufsize);
Get the path string out of the given symbolic link and copy it into the user's buffer, as much of it as fits. This seems to duplicate some of the functionality of walk, but the default internal implementation just returns an error rather than using walk to get the info. Presumably bufsize is always specified (kernel bug if NULL); on entry it contains the size of the user's buffer, and on exit the number of bytes required to hold the full symbolic link string. Note that if the buffer is too small, it still returns the number of bytes for the full string, but only writes the amount that fits. buf points to the user's buffer (allowed to be NULL if size is zero). The string written to the buffer doesn't include a trailing NUL (and the size returned reflects that). Of course, the maximum buffer size needed will be the maximum path length (1024 bytes in BeOS).
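The buffer rules are fiddly enough that a few lines of C may help. This is only a sketch of the copy step: copy_link_target is a hypothetical helper, and target stands for the path string your file system saved when the link was created.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for a readlink hook: copy what fits, report the
   full length, and never write a trailing NUL. */
int copy_link_target(const char *target, char *buf, size_t *bufsize)
{
    size_t full = strlen(target);
    size_t n = (full < *bufsize) ? full : *bufsize;

    if (n > 0)
        memcpy(buf, target, n);   /* buf may be NULL only when n is zero */
    *bufsize = full;              /* bytes needed for the whole string */
    return 0;
}
```

Calling it with a zero sized buffer is a cheap way for the caller to find out how much space the full string needs.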
int opendir (void *ns, void *node, void **cookie);
Starts examining a directory. The node pointer identifies the directory (it's the 32 bit shadow vnode pointer your file system provided during read_vnode). The cookie holds a pointer to a pointer, which you set to point to your private directory state data (usually you allocate it as part of your opendir, containing a counter to keep track of where you are in the directory) or some other 32 bit value (like an array index). The OS will keep the cookie as part of its open directory status, and pass it back to you whenever the user wants more directory information (usually the user calls the Posix functions opendir, readdir, closedir, rewinddir). Incidentally, the Posix functions also seem to open the same directory as a file (so that nobody deletes it while in use) and close it as part of closedir. So, it looks like you don't need to worry about the directory disappearing while it's open, though the contents may change. The function returns the usual error codes.
int closedir (void *ns, void *node, void *cookie);
Stop any further threads from accessing the directory. Don't deallocate the cookie just yet, there could be other threads busy reading at the moment. In fact, you should delete semaphores or otherwise unblock any threads waiting to read this directory via this cookie (and have them return with an error code), and return an error for any further attempts to use the same cookie.
int rewinddir (void *ns, void *node, void *cookie);
Sets the directory state back to the first thing in the directory, so the user can re-read the directory from the beginning.
int readdir (void *ns, void *node, void *cookie, long *num, struct dirent *buf, size_t bufsize);
Read info about one or more things (files or subdirs or ?) in the directory and advance the directory state to the appropriate next thing in the directory, so that the next readdir will continue where this one left off. Keep in mind that the directory contents may change between readdir calls. Note that the dirent structures returned in the buffer are variable sized (the file name at the end of the structure is as big as needed, or optionally padded larger), so use the buffer size in combination with the input value from *num to decide how many can be read, or if you want to keep it simple, just read one (that's what BFS currently does). As noted by S&W, you need to fill in more than just the name; the Tracker uses the d_dev/d_ino fields for displaying icons. You can also fill in the parent info in d_pdev/d_pino if you want to (assuming that there is one parent - currently unimplemented hardlinks would add more parents - but then the directory you are reading would be the obvious parent to use). You shouldn't call get_vnode to set up the shadow vnode for things in the directory, unless you need the shadow vnode for your own purposes (then you must call put_vnode later to balance it out). The d_reclen field in the dirent should contain the length of the entire record, including the name and trailing NUL and padding space (if you have any) up to the next record (so you can add it to the start of the record pointer to get the next record). But the file system examples I've seen use the length of the name, or the length of the name plus one, or other numbers. Fortunately all those examples only return 1 record at a time, so nobody needs their buggy d_reclen. You should return "." and ".." entries too, and the root directory's ".." is best set to itself (the OS will ignore your ".." and use the mount point directory's parent when walking, but just in case...). Actually, you'll see the OS do a walk and rstat on the ".." name after your readdir returns. 
Returns an error code if it fails. Also you should set *num to the number of entries it read, which can include zero (don't return an ENOENT error when the end of the directory is reached) or less than the requested number. Perhaps return EINVAL/B_BAD_VALUE if the buffer is too small to hold at least one file's dirent.
Here's the famous dirent structure, note the open ended array at the end:
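typedef struct dirent {
    dev_t d_dev;              /* device (volume) the entry is on */
    dev_t d_pdev;             /* device of the parent directory */
    ino_t d_ino;              /* vnode_id of the entry */
    ino_t d_pino;             /* vnode_id of the parent directory */
    unsigned short d_reclen;  /* length of this whole record */
    char d_name[1];           /* NUL terminated name, as long as needed */
} dirent_t;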
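Since d_reclen is the field people get wrong, here is a sketch of filling in one record so that adding d_reclen to the record's address really does land on the next record. It uses its own be_dirent structure with fixed size fields so it can be compiled anywhere; that structure and fill_dirent are assumptions made for this example, and the 8 byte padding is just one reasonable choice for keeping the next record aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the BeOS dirent, with fixed size fields. */
struct be_dirent {
    int32_t        d_dev;
    int32_t        d_pdev;
    int64_t        d_ino;
    int64_t        d_pino;
    unsigned short d_reclen;
    char           d_name[1];
};

/* Write one record at out; returns its length, or 0 if it doesn't fit. */
size_t fill_dirent(struct be_dirent *out, size_t space, int32_t dev,
                   int64_t dir_ino, int64_t ino, const char *name)
{
    size_t name_off = offsetof(struct be_dirent, d_name);
    size_t len = name_off + strlen(name) + 1;   /* header + name + NUL */

    len = (len + 7) & ~(size_t)7;               /* pad to a multiple of 8 */
    if (len > space)
        return 0;
    out->d_dev = dev;
    out->d_pdev = dev;                          /* parent is on the same volume */
    out->d_ino = ino;
    out->d_pino = dir_ino;                      /* the directory being read */
    out->d_reclen = (unsigned short)len;
    memcpy((char *)out + name_off, name, strlen(name) + 1);
    return len;
}
```

A readdir that returns several entries would call this in a loop, advancing by the returned length each time and stopping when it returns 0 or *num entries have been written.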
int free_dircookie (void *ns, void *node, void *cookie);
The last thread has finished reading the directory, you can now really close it. Deallocate the cookie and otherwise clean up here.
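Putting the directory cookie's life together (opendir allocates it, closedir only marks it dead, free_dircookie really frees it), a bare bones version could look like this. The my_dircookie structure is hypothetical and there is no locking in the sketch, which a real multithreaded file system would need.

```c
#include <errno.h>
#include <stdlib.h>

struct my_dircookie {      /* hypothetical private directory state */
    int position;          /* index of the next entry to return */
    int closed;            /* set by closedir, checked by everyone else */
};

int my_opendir(void *ns, void *node, void **cookie)
{
    struct my_dircookie *c = calloc(1, sizeof *c);

    (void)ns; (void)node;
    if (c == NULL)
        return ENOMEM;
    *cookie = c;           /* the OS hands this back on later calls */
    return 0;
}

int my_rewinddir(void *ns, void *node, void *cookie)
{
    struct my_dircookie *c = cookie;

    (void)ns; (void)node;
    if (c->closed)
        return EBADF;      /* no further use after closedir */
    c->position = 0;
    return 0;
}

int my_closedir(void *ns, void *node, void *cookie)
{
    (void)ns; (void)node;
    ((struct my_dircookie *)cookie)->closed = 1;   /* don't free yet */
    return 0;
}

int my_free_dircookie(void *ns, void *node, void *cookie)
{
    (void)ns; (void)node;
    free(cookie);          /* last user is gone, really clean up */
    return 0;
}
```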
int open (void *ns, void *node, int omode, void **cookie);
Opens an existing file or directory or other thing, using your shadow portion of the vnode, which read_vnode earlier was used to find (thus the file/dir/thing must exist for read_vnode to work, see the create() function for making new files). The open mode flags in omode specify the type of open (B_READ_ONLY/O_RDONLY, B_WRITE_ONLY/O_WRONLY, B_READ_WRITE/O_RDWR, B_FAIL_IF_EXISTS/O_EXCL, B_CREATE_FILE/O_CREAT, B_ERASE_FILE/O_TRUNC, B_OPEN_AT_END/O_APPEND). You shouldn't see the O_CREAT flag; the create() function will get called by the OS instead of open(). If you can't open the file in the requested mode, return an error, B_PERMISSION_DENIED/EACCES is good for most situations. As usual, allocate a cookie structure with state information about the file (like the open mode - so you can detect attempts to write to a read-only file, though the FSIL sometimes does it for you) and return a pointer to it in *cookie. You may also want to do your own security checking here too, if secure_vnode isn't working. Note that opening directories isn't too useful (the FreeBSD man pages say you should get an error if opening a directory for write access but BeOS allows this, and BNode/BEntry seems to reattempt an open with read access if write fails), but user programs will need to do it in order to use fsync, attribute operations and some other file handle operations (though read or write should fail when used on non-files) or merely to get exclusive access to the directory, or to prevent it from being deleted. Same thing goes for symbolic links and other non-file objects (the BeOS "Tracker" file browser tries to open everything, possibly to look for icon attributes, and locks up for some operations if it can't open the thing).
int close (void *ns, void *node, void *cookie);
Prepare to close the file or directory. Other threads may be busy reading it at this moment, so don't deallocate the cookie yet or deallocate stuff the other threads might need. Still, you might want to bump off other threads which are busy waiting for IO (delete their semaphore or something, forcing them to return an error code to the caller). Further read/writes should return an error like B_FILE_ERROR/EBADF.
int free_cookie (void *ns, void *node, void *cookie);
Nobody is using the file now, it's safe to really close it. Send the final file size change notification message if needed. Deallocate your cookie too. Single threaded with respect to the particular cookie.
int read (void *ns, void *node, void *cookie, off_t pos, void *buf, size_t *len);
Read data from the file. If the thing is a directory, return an error. As the OS is tracking individual open files and their positions, you get the position to read as one of the arguments. The OS also tracks read only and write only modes, and will reject user attempts to do inappropriate reads or writes, so you never see them. The rest of the arguments are as you would expect, with the cookie from the open, the node being your shadow vnode, and so on. See the user level read() for an explanation of how it works. If the user opened the file with the append option, read at the provided position, not the end. Returns a negative error code on failure, or 0 if successful, plus it returns the number of bytes read in *len (can be zero if attempting to read at end of file).
int write (void *ns, void *node, void *cookie, off_t pos, const void *buf, size_t *len);
Write data to a file. See read. Note that for append mode (file opened with O_APPEND), ignore the position the OS gives you (which is an incorrect value anyways - the OS thinks you are writing at the start of the file) and always write at the end (use suitable thread locking so that the active thread gets the real current end of the file position and finishes the write before other threads try to write). This is what BFS does. Oddly enough, the read position that the OS uses in append mode is also off, starting at zero (the beginning) and advancing for both writes and reads, making it useless since the actual written data is at the end of the file. Users will have to do lseek after writing to avoid confusion about the read position. Anyway, this way O_APPEND guarantees that the user's writes will be at the end, even if several threads are writing at the same time (useful for log files where the order in the file should correspond to the chronological order of writes by the threads).
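The append rule boils down to one line of code: pick the end of the file instead of the position the OS passed in. The sketch below shows it on a toy in-memory file. The my_file structure and my_write are made up for this example, the open mode would normally come out of your cookie, and the locking around the whole operation is left out.

```c
#include <errno.h>
#include <fcntl.h>       /* O_APPEND */
#include <string.h>
#include <sys/types.h>   /* off_t */

#define MY_FILE_MAX 256

struct my_file {         /* hypothetical in-memory file */
    char   data[MY_FILE_MAX];
    size_t size;
};

int my_write(struct my_file *f, int omode, off_t pos,
             const void *buf, size_t *len)
{
    /* In append mode the position from the OS is ignored. */
    size_t start = (omode & O_APPEND) ? f->size : (size_t)pos;

    if (start > MY_FILE_MAX || *len > MY_FILE_MAX - start) {
        *len = 0;        /* nothing written */
        return ENOSPC;
    }
    memcpy(f->data + start, buf, *len);
    if (start + *len > f->size)
        f->size = start + *len;   /* the file grew */
    return 0;
}
```

In a real file system the read of f->size and the copy would sit inside the same lock, so that two appending threads can't both pick the same end position.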
int readv (void *ns, void *node, void *cookie, off_t pos, const iovec *vec, size_t count, size_t *len);
Read data from a file into a bunch of buffers. I suspect that if you don't implement this, it gets converted into several single reads by the OS. Ideally, you should set up the device driver to do a scatter-gather read, where it fills the scattered buffers with one IO operation (special hardware is required). The device driver level supports a similar readv operation, so just pass the list of buffers on to the device driver and hope it does it right (the OS will break it down to individual buffer reads if the driver can't handle it).
int writev (void *ns, void *node, void *cookie, off_t pos, const iovec *vec, size_t count, size_t *len);
Write a bunch of buffers in one fast operation. See readv.
int ioctl (void *ns, void *node, void *cookie, int cmd, void *buf, size_t len);
Do special operations to the file. Things like changing a file from cachable to uncachable, or finding out internal information on the file (like the list of blocks used). It's up to you to make your own special control codes and define what they do. The device drivers have a standard set of ioctl codes, so it's best to avoid them (use numbers over B_DEVICE_OP_CODES_END/9999). You may want to pass the ioctl operations down to the hardware device you are using, perhaps only if the file is the root node.
int setflags (void *ns, void *node, void *cookie, int flags);
Implements part of the Posix fcntl function, which changes the file's open mode. For example, making a read-only file writeable. The real fcntl is also used for file locking (useful for multiuser databases), changing between blocking and nonblocking IO operations, ownership changes, and a few other things.
int rstat (void *ns, void *node, struct stat *);
Reads file status information. Things like owner, date modified, size, and whether this is a file or directory. See stat.h for details. For POSIX symbolic links (because there is no inode, just a directory entry), the permissions are those of the directory it is in, though it seems that all rwx flags are set when listed with ls under Linux and all rwx flags are clear under BeOS's devfs and set under BeOS's BFS. I don't know if the links field includes open file references, but I expect it does not, and it just counts the number of directories that the file is in. Remember to set the st_mode flags to also identify whether the item is a file or directory, otherwise it won't work!
int wstat (void *ns, void *node, struct stat *, long mask);
Write the file status. This is used to change some of the properties of the file. The mask specifies which fields to change, which can be a bitwise combination of WSTAT_MODE, WSTAT_UID, WSTAT_GID, WSTAT_SIZE, WSTAT_ATIME, WSTAT_MTIME and WSTAT_CRTIME. That implies you can truncate a file by setting the size smaller, or make it larger (contents may vary, it's up to the user to overwrite it with zeroes if they don't like what's already in the disk sectors, so if you are paranoid, overwrite your files with zeroes before deleting them). Of course, a few security checks before changing the file's owner / group / mode bits could be appropriate here.
int fsync (void *ns, void *node);
Flush all cached data for this file/directory/thing to disk. Yup, another excuse for being allowed to open directories like files. Return only when it is completely written.
int select (void *ns, void *node, void *cookie, uint8 event, uint32 ref, selectsync *sync);
Usually used for network operations, to set up a socket for receiving incoming calls. See also notify_select_event, which I expect is called by your file system when something happens. Device drivers have a similar call, which is normally not implemented (maybe it's for the future BONE networking system?). There are constants SELECT_READ, SELECT_WRITE, SELECT_EXCEPTION which may be relevant as event types, and correspond to the arguments for select() in /boot/develop/headers/be/net/socket.h.
int deselect (void *ns, void *node, void *cookie, uint8 event, selectsync *sync);
Probably turns off notification of incoming calls on a network socket.
int initialize (const char *devname, void *parms, size_t len);
Possibly used for formatting a disk with an empty file system. I can't find a user equivalent call to trigger this callback. I even tried mount() and cycled through all the flag bits. Guess it's not implemented in BeOS.
int mount (nspace_id nsid, const char *devname, ulong flags, void *parms, size_t len, void **data, vnode_id *vnid);
Starts up a file system. After some user program calls the Posix mount (const char *filesystem, const char *where, const char *device, ulong flags, void *parms, int len) function (often done when a disk is inserted, possibly automatically by a daemon looking for new disks), the OS loads your file system add-on if it isn't already there, and calls your mount function. The nsid is used by the OS to identify this particular file system instance, and should be saved for later use. Devname is the path name to the disk device to use, specified by the user program. The flags are also user specified bits, B_MOUNT_READ_ONLY (1) being the only documented one. The parms and len (length of parameters) are also provided by the user program, and can be anything you want. Once you have finished setting things up, return a pointer to your instance specific semi-global data in *data. By semi-global data, I mean a struct holding things which are specific to the file system instance as a whole (like disk space remaining, free list, volume name, etc). Since there could be more than one instance of your file system running (one instance for each disk volume/partition mounted using your kind of file system), allocate a new semi-global data area for each instance to keep them separate. This semi-global pointer will be passed in future calls in the "void *ns" field so you know which instance to use. The final vnid argument is used for returning the vnode_id of the root directory. You need to set up the root vnode in BeOS by using new_vnode and initialising it as if read_vnode had processed it (no, it doesn't call read_vnode, but it will call write_vnode when it unmounts), otherwise we get a kernel panic about root being NULL. You'd think it would do its own get_vnode when it needed the root, but it does not. You need to force feed it. Return 0 if everything went OK, otherwise return a negative number error code (see support/Errors.h for a list).
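A skeleton of the mount steps (allocate the semi-global data, remember the nsid, force feed the root vnode, hand back the pointers) could look like the sketch below. The typedefs are stand-ins for the BeOS headers, the my_* names are hypothetical, and stub_new_vnode is a do-nothing substitute for the kernel's new_vnode so the sketch can be compiled elsewhere. On BeOS the errno constants are already negative, which fits the negative error code rule.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

typedef long long vnode_id;   /* stand-ins for the BeOS typedefs */
typedef long      nspace_id;

#define MY_ROOT_ID         1  /* hypothetical root directory vnode_id */
#define MY_MOUNT_READ_ONLY 1  /* B_MOUNT_READ_ONLY, per the text above */

struct my_node   { vnode_id id; };
struct my_volume {            /* the semi-global data for one instance */
    nspace_id      nsid;
    int            read_only;
    struct my_node root;      /* shadow vnode of the root directory */
};

/* Stub standing in for the kernel's new_vnode(). */
static int stub_new_vnode(nspace_id nsid, vnode_id vnid, void *data)
{
    (void)nsid; (void)vnid;
    return (data != NULL) ? 0 : EINVAL;
}

int my_mount(nspace_id nsid, const char *devname, unsigned long flags,
             void *parms, size_t len, void **data, vnode_id *vnid)
{
    struct my_volume *vol = calloc(1, sizeof *vol);
    int err;

    (void)devname; (void)parms; (void)len;  /* no real device here */
    if (vol == NULL)
        return ENOMEM;
    vol->nsid = nsid;                       /* needed later for notifications */
    vol->read_only = (flags & MY_MOUNT_READ_ONLY) != 0;
    vol->root.id = MY_ROOT_ID;

    /* Give the OS the root vnode; it won't call read_vnode for it. */
    err = stub_new_vnode(nsid, MY_ROOT_ID, &vol->root);
    if (err != 0) {
        free(vol);
        return err;
    }
    *data = vol;                            /* comes back as "void *ns" */
    *vnid = MY_ROOT_ID;
    return 0;
}
```

Because each call allocates its own my_volume, two mounted volumes using the same add-on never share state.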
int unmount (void *ns);
Unmount the given file system instance. Only gets called when no open files are using your file system. Usually this means flushing any cached data to the media, closing any device drivers used, then returning when it is complete. You should also deallocate your semi-global data here. BeOS does a write_vnode for all outstanding vnodes (including the root directory) before calling this function.
int sync (void *ns);
Presumably implements the UNIX sync command, which flushes all cached data to the disk and returns when it has confirmed that the data has been written by the drive. Ideally the disk now contains a consistent set of data (no need for a file system check/validation). Often called before doing something which may crash the computer (like changing a laptop's power saving mode).
int rfsstat (void *ns, struct fs_info *);
Reads information about the file system instance in the fs_info structure. That includes things like the amount of free space and volume name. Return an error code or 0. This needs to be implemented or the OS will crash right after the mount, as many BeOS daemons are busy using it. Return an empty string for the device name if the file system doesn't run on a physical device. You have 8 letters for the file system name (otherwise the "df" output looks bad). The BSD statfs documentation says that fields undefined for a particular file system will be set to -1. OpenTracker will ignore file systems that aren't persistent (can't do queries on them etc), so pretend that the "persistent" flag really means "invisible to the user".
int wfsstat (void *ns, struct fs_info *, long mask);
Used for setting (W for writing) information about the file system. Currently just the volume name, with WFSSTAT_NAME (1) for the mask. Maybe the file system read-only flag too? I'll step in and suggest defining WFSSTAT_USEFLAG_READONLY (2), which will copy the read only bit (B_FS_IS_READONLY) from fs_info.flags into the active setting used by the file system. If you change to read only mode, files that are already open can still be written to, but new opens are required to be read-only.
int open_attrdir (void *ns, void *node, void **cookie);
Starts examining the list of attributes associated with a file (or directory, or even a symbolic link if your implementation allows that). Otherwise similar to open_dir in operation. Most of these operations are similar to directories, except that instead of file names in a directory you are dealing with attribute names of a file.
int close_attrdir (void *ns, void *node, void *cookie);
Stop accessing the attribute list, but don't free the cookie just yet (other threads could be reading the list).
int free_attrdircookie (void *ns, void *node, void *cookie);
The last thread has finished reading the list of attributes, you can now really close it. Deallocate the cookie and otherwise clean up here.
int rewind_attrdir (void *ns, void *node, void *cookie);
Go back to the first attribute in the list.
int read_attrdir (void *ns, void *node, void *cookie, long *num, struct dirent *buf, size_t bufsize);
Get the names of a bunch of attributes. Much like readdir() does. Remember to use dirent.d_reclen correctly!
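To show what using d_reclen correctly can look like, here is a small sketch that packs several names into one buffer. The sketch_dirent structure is a cut-down stand-in (the real struct dirent has more fields), the made-up d_ino values are placeholders, and rounding each record up to an 8 byte boundary is my own assumption to keep the following record aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for struct dirent: fixed fields, then the name. */
typedef struct sketch_dirent {
    long long d_ino;
    unsigned short d_reclen;   /* total size of this record in bytes */
    char d_name[];             /* NUL terminated name follows */
} sketch_dirent;

/* Packs up to *num names into buf (bufsize bytes, suitably aligned).
   On return *num holds the number of records actually stored. */
static int sketch_fill_dirents(const char *const *names, long count,
                               long *num, void *buf, size_t bufsize)
{
    const size_t align = sizeof(long long);
    char *out = buf;
    size_t used = 0;
    long stored = 0;

    assert(names != NULL && num != NULL && buf != NULL);
    while (stored < count && stored < *num) {
        size_t namelen = strlen(names[stored]);
        size_t reclen = offsetof(sketch_dirent, d_name) + namelen + 1;
        sketch_dirent *entry;

        reclen = (reclen + align - 1) / align * align;  /* round up */
        if (used + reclen > bufsize)
            break;                                      /* no room for this one */
        entry = (sketch_dirent *)(out + used);
        entry->d_ino = stored + 100;                    /* placeholder id */
        entry->d_reclen = (unsigned short)reclen;
        memcpy(out + used + offsetof(sketch_dirent, d_name),
               names[stored], namelen + 1);
        used += reclen;
        stored++;
    }
    *num = stored;
    return 0;
}
```

A reader of the buffer steps from one record to the next by adding d_reclen to the current record's address, which is why the value has to cover the whole record including the name and any padding.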
int remove_attr (void *ns, void *node, const char *name);
Deletes the attribute immediately, no waiting for open files to close, as there are no file handles to attributes. Also removes the attribute value from the index with the same name.
int rename_attr (void *ns, void *node, const char *oldname, const char *newname);
Changes the name of the attribute. Not implemented by BFS. If you implement it, also remember to update the indices (remove the old name/value and add the new name/value). Watch out for renaming magic attributes.
int stat_attr (void *ns, void *node, const char *name, struct attr_info *buf);
Returns information about the attribute, just size and type.
int write_attr (void *ns, void *node, const char *name, int type, const void *buf, size_t *len, off_t pos);
Write some bytes to the value of the named (by name) attribute of the given file (by shadow vnode). If the attribute doesn't exist, it is created. The type is a 4 byte code, usually as 4 ASCII characters, that identifies how the user interprets the attribute (see TypeConstants.h for definitions of things like B_FLOAT_TYPE ('FLOT'), B_TIME_TYPE ('TIME'), B_RECT_TYPE ('RECT') and so on). If the attribute is indexed, the index with the same name has to be updated to include the file's vnode_id and attribute value after the value is changed. Because attributes are stateless, there is no file handle, and all writes provide the position to write at, as well as the usual data buffer and length. You may wish to deny writes to magic attributes like "size", "name", "last_modified".
int read_attr (void *ns, void *node, const char *name, int type, void *buf, size_t *len, off_t pos);
Reads data from the attribute, starting at the given byte offset, up to the specified length in bytes, into the buffer. When successful, sets the length to the amount actually read. Probably best to set the length to zero on failure. If the attribute doesn't exist returns B_ENTRY_NOT_FOUND. If it is the wrong data type, returns B_BAD_VALUE. Other access errors are as usual. Actually, the BeBook says it ignores the type and offset. But it shouldn't!
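The length and offset handling can be sketched like this, for an attribute whose value is already in memory. The sketch_attr structure and the SKETCH_ error values are hypothetical stand-ins, and treating a read that starts at or past the end as a successful read of zero bytes is an assumption borrowed from how ordinary file reads behave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_OK 0
#define SKETCH_BAD_VALUE (-1)   /* stand-in for B_BAD_VALUE */

typedef struct sketch_attr {    /* hypothetical in-memory attribute */
    int type;
    const unsigned char *value;
    size_t size;
} sketch_attr;

/* Copies up to *len bytes starting at pos; *len becomes the amount copied. */
static int sketch_read_attr(const sketch_attr *attr, int type,
                            void *buf, size_t *len, long long pos)
{
    size_t available;

    assert(attr != NULL && buf != NULL && len != NULL);
    if (attr->type != type || pos < 0) {
        *len = 0;               /* zero the length on failure */
        return SKETCH_BAD_VALUE;
    }
    if ((unsigned long long)pos >= attr->size) {
        *len = 0;               /* nothing left to read */
        return SKETCH_OK;
    }
    available = attr->size - (size_t)pos;
    if (*len > available)
        *len = available;       /* clamp to what is really there */
    memcpy(buf, attr->value + pos, *len);
    return SKETCH_OK;
}
```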
int open_indexdir (void *ns, void **cookie);
Start reading the list of names of indices, much like opendir reads names of things in a directory.
int close_indexdir (void *ns, void *cookie);
Finished reading the list of indices, but don't free the cookie yet as other threads may still be reading.
int free_indexdircookie (void *ns, void *node, void *cookie);
The last thread has finished reading the list of indices, you can now really close it. Deallocate the cookie and otherwise clean up here. Note that node is NULL (there is no VNodeID for the index directory under BFS) but the argument is still present.
int rewind_indexdir (void *ns, void *cookie);
See rewinddir, except that this is for the names of the indices.
int read_indexdir (void *ns, void *cookie, long *num, struct dirent *buf, size_t bufsize);
See readdir, except that this is for the names of the indices.
int create_index (void *ns, const char *name, int type, int flags);
Create a new index for attributes with the given name. Fails if it already exists. Further changes to attributes with the same name will add/remove the file's vnode_id and the attribute value from the index. Unfortunately attributes from pre-existing files aren't automatically added to the index under BFS, though you could do that in your file system. I've (AGMS20020630) defined the flag B_CREATE_INDEX_WITH_REBUILD (0x00000001) to enable filling the index with the files that should be there. The type specifies the type of the index for sorting purposes. It doesn't exactly have to match the attribute type used in files, so you could have MIME string attributes (B_MIME_STRING_TYPE) using the generic string (B_STRING_TYPE) for the index type (a convenient idea, otherwise the file system would have to find out about new types somehow). Note that the PFSDWTBFS book says that they should match, but I don't think it's enforced. BFS supports B_INT32_TYPE, B_INT64_TYPE, B_FLOAT_TYPE, B_DOUBLE_TYPE, B_STRING_TYPE. I've also added a related flag, B_CREATE_INDEX_MULTIPLE_KEYWORDS (0x00000002), which will split up strings into multiple keywords (separated by spaces or control characters) and treat fixed size types like arrays of the base type. Each of the multiple keywords gets added to the index just like a single attribute would have been. This does mean that the query system users have to put up with getting the same file several times as a query result.
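The keyword splitting for string attributes could be sketched as follows. This is only an illustration, not the AGMSRAMFileSystem code: the function name and the callback type are hypothetical, and it assumes the C locale, where bytes of multibyte UTF-8 characters don't count as control characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback, invoked once per keyword (not NUL terminated). */
typedef void (*sketch_keyword_fn)(const char *start, size_t length, void *context);

/* Splits value at spaces and control characters. Calls emit (if not NULL)
   for each keyword and returns how many keywords were found. */
static int sketch_split_keywords(const char *value, sketch_keyword_fn emit,
                                 void *context)
{
    const unsigned char *p = (const unsigned char *)value;
    int count = 0;

    assert(value != NULL);
    while (*p != '\0') {
        const unsigned char *start;

        /* Skip the separators in front of the next keyword. */
        while (*p != '\0' && (*p == ' ' || iscntrl(*p)))
            p++;
        if (*p == '\0')
            break;
        start = p;
        /* The keyword runs up to the next separator or the end. */
        while (*p != '\0' && *p != ' ' && !iscntrl(*p))
            p++;
        if (emit != NULL)
            emit((const char *)start, (size_t)(p - start), context);
        count++;
    }
    return count;
}
```

Each emitted keyword would then be added to the index under the file's vnode_id, the same way a single attribute value would be.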
int remove_index (void *ns, const char *name);
Immediately deletes the given index. Zap! Probably also blows away any currently open queries using that index, or they switch to unindexed mode.
int rename_index (void *ns, const char *oldname, const char *newname);
Changes the name of the index. Presumably you could do this if you renamed an attribute in all your files. But then the attribute rename operation would enter the attribute in the new index anyways. Not implemented under BFS and not useful to implement elsewhere. Might kill open queries using the old index name.
int stat_index (void *ns, const char *name, struct index_info *buf);
Find size and type of an index, also gives ownership and a few datestamps. Returns an error if it doesn't exist (useful for quickly checking if a particular index exists).
int open_query (void *ns, const char *query, ulong flags, port_id port, long token, void **cookie);
Use this to start reading the list of files which match a query string. "query" is the string which describes the files you are looking for using the query language (BFS constructs a parse tree from your string, stores it in the cookie, and uses it to match or reject files).
If it is a live query, the port is specified and notifications are sent there (your file system calls send_notification to do that) when new matching files appear and old files disappear. The token is provided by the end user, and is returned with the notification message sent to the port. Under the hood, your file system monitors all attribute changes mentioned in the query string, and when any of them change, re-evaluates the query on that new/deleted file to see if a notification needs to be sent. So, having lots of live queries slows things down a bit. Files already in existence get sent to the user via read_query, not notifications. Files that change the indices after open_query get sent as notification messages (if it would change the value reported by read_query, even if the caller hasn't gotten that far in read_query), and maybe also show up in read_query if the underlying iteration of read_query hasn't reached that file yet. For example, when a new file is created which matches the query, the end user may get informed twice: in read_query (if the iteration hasn't gotten that far in alphabetical or whatever order it is using) and definitely once as a notification.
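The decision about whether a change needs a live query notification boils down to comparing the match result before and after the change. A minimal sketch, assuming hypothetical sketch_ structures and stand-in opcode values (a real file system would pass B_ENTRY_CREATED or B_ENTRY_REMOVED to send_notification):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the B_ENTRY_CREATED and B_ENTRY_REMOVED opcodes. */
enum sketch_query_op { SKETCH_ENTRY_CREATED = 1, SKETCH_ENTRY_REMOVED = 2 };

typedef struct sketch_live_query {   /* hypothetical per-query cookie data */
    long port;                       /* where notifications go */
    long token;                      /* handed back to the user untouched */
} sketch_live_query;

typedef struct sketch_update {       /* what would go to send_notification */
    long port;
    long token;
    int op;
    long long vnid;
} sketch_update;

/* Returns 1 and fills *update if the change to the file needs a message. */
static int sketch_live_query_update(const sketch_live_query *query,
                                    long long vnid, int matched_before,
                                    int matches_now, sketch_update *update)
{
    assert(query != NULL && update != NULL);
    if (!matched_before == !matches_now)
        return 0;                    /* still in, or still out: stay quiet */
    update->port = query->port;
    update->token = query->token;
    update->vnid = vnid;
    update->op = matches_now ? SKETCH_ENTRY_CREATED : SKETCH_ENTRY_REMOVED;
    return 1;
}
```

The file system would run this for every live query whose query string mentions the attribute that just changed.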
The query language consists of comparison expressions joined by "&&" (logical and) and "||" (logical or) operations, with round parentheses "()" to control grouping. && has grouping precedence over || if you don't override it with parentheses (in general, the same operator precedence is used as in the C programming language). You can also put a "!" (logical negation) before the expression to reverse its meaning. The comparison expression has three parts:
AttributeName Relation Value
The AttributeName is just the name of an attribute, which is case sensitive, and for speed should match an index. The Relation is one of "==" (equal), "=" (equal), "!=" (not equal), "<" (less than), ">" (greater than), ">=" (greater than or equal), "<=" (less than or equal) which have the usual meanings. The value is a string which gets converted to a number if the attribute is numeric (dates are big numbers: number of seconds (or microseconds - depends on index) since January 1, 1970), or gets converted to a regular expression if the attribute is a string. Note that hex values (0xabcd1234) can be specified even for floating point numbers. The same attribute on different files can have a different data type, making comparisons more interesting. If the attribute isn't present, the file won't be listed (so Thing!="*" returns zero files, it does not list the files without the Thing attribute). Regular expressions use "*" to stand for any number (including zero) of any character, "\" to escape the next character, "?" to match any single character, [ad-f] to match a single character specified by the actual characters or a range, [!...] to match a single character not of the specified ones. The pattern matching works on Unicode characters, so that if you store the file name as UTF-8 multibyte characters, you need to be careful about what counts as a single character for "?" and [a-z] patterns. For example a "?" will match "事", which is some Japanese character (shows up as the 3 bytes E4, BA, 8B in UTF-8 format).
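The pattern rules can be sketched as a small recursive matcher that works on whole UTF-8 characters rather than bytes. This is my own illustration, not the BFS code: the function names are hypothetical, malformed UTF-8 is simply treated as single bytes, and the simple backtracking used for "*" can get slow on patterns with many stars.

```c
#include <assert.h>
#include <stddef.h>

/* Decodes the UTF-8 character at *s and advances *s past it.
   Malformed sequences are returned one byte at a time. */
static unsigned long sketch_next_char(const char **s)
{
    const unsigned char *p = (const unsigned char *)*s;
    unsigned long c = p[0];
    int extra = 0, i;

    assert(c != 0);                         /* never called on the terminator */
    if (c >= 0xF0) { extra = 3; c &= 0x07; }
    else if (c >= 0xE0) { extra = 2; c &= 0x0F; }
    else if (c >= 0xC0) { extra = 1; c &= 0x1F; }
    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80) {        /* malformed: fall back to a byte */
            *s += 1;
            return p[0];
        }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *s += extra + 1;
    return c;
}

/* Returns 1 if the whole text matches the pattern, 0 otherwise. */
static int sketch_match(const char *pattern, const char *text)
{
    while (*pattern != '\0') {
        const char *rest = pattern;
        unsigned long pc = sketch_next_char(&rest);
        unsigned long tc;

        if (pc == '*') {                    /* try every tail of the text */
            for (;;) {
                if (sketch_match(rest, text)) return 1;
                if (*text == '\0') return 0;
                sketch_next_char(&text);
            }
        }
        if (*text == '\0') return 0;        /* everything else needs a character */
        tc = sketch_next_char(&text);
        if (pc == '[') {
            int negate = 0, found = 0;
            if (*rest == '!') { negate = 1; rest++; }
            while (*rest != '\0' && *rest != ']') {
                unsigned long lo = sketch_next_char(&rest), hi = lo;
                if (*rest == '-' && rest[1] != ']' && rest[1] != '\0') {
                    rest++;                 /* a range such as d-f */
                    hi = sketch_next_char(&rest);
                }
                if (tc >= lo && tc <= hi) found = 1;
            }
            if (*rest == ']') rest++;
            if (found == negate) return 0;
        } else if (pc != '?') {
            if (pc == '\\' && *rest != '\0')
                pc = sketch_next_char(&rest);   /* escaped literal */
            if (pc != tc) return 0;
        }
        pattern = rest;
    }
    return *text == '\0';
}
```

With this sketch, sketch_match("?", "\xE4\xBA\x8B") succeeds because the three bytes decode to the single character U+4E8B, while sketch_match("??", "\xE4\xBA\x8B") fails.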
UnaryOperator AttributeName
As a new (January 10 2003) feature in AGMSRAMFileSystem, I'm adding a couple of unary operators <> and >< to find all files that have or don't have an attribute. Use "<>AttributeName" for files with the attribute and "><AttributeName" to find files that don't have it. Note how >< looks like an X in shape, kind of implying crossing out of something, so I use it to find files that do not have the attribute. Rather than using new symbols, < and > are reused so that the list of reserved symbols doesn't change.
After a bit of experimentation (finding out you can put quotes around attribute names etc), I've come up with this description of the language. The order of the expansions is important, if the first alternative fits (such as QuotedString vs UnquotedString in the definition of String), use it before trying the remaining alternatives.
<String> ::= <WhiteSpace> ( <QuotedString> | <UnquotedString> )
<WhiteSpace> ::= { <Space-like characters> }
Meaning zero or more space or other space-like characters (tabs, newlines, and others - see the standard isspace() function). I'm not going to insert <WhiteSpace> all over the place in this language description, but it is there in places you'd expect. In practice, the query will be converted to tokens before being parsed, at that point white space is removed and the parser doesn't have to worry about it.
<QuotedString> ::= "\"" { <NonQuote> | "\\\"" } ( "\"" | "\0" )
Meaning a quote character followed by zero or more of (non-quote character or backslash quote (\") which puts a quote character in the string) followed by a quote or end of string. The string will contain the text between the quotes (including white space), with the backslash quote combination for entering one quote into the string. Other backslash combinations aren't special; they appear in the string as the backslash and the following character. You can leave off the trailing quote if the QuotedString is the last thing in the query.
<UnquotedString> ::= [ <NonSpecial character> ]
Meaning one or more of any character except (, ), !, |, &, <, >, =, white space. The string will be all characters up to the next character that might have special meaning. Yes, quotes just are ordinary characters; you can use name==*"* to find all files with a quote mark in their name. No, it's not smart enough to look ahead, so a "=" will end the string even if the next character doesn't make a "==" out of it. You may want to ignore that BFS simplification, but it doesn't matter since most queries use quoted strings.
<Expression> ::= <SimplerExpression> | <SimplerExpression> "||" <Expression>
Meaning that the || operator has lowest precedence. Also implies that a sequence of || operations will be evaluated left to right, though query optimization may change that order.
<SimplerExpression> ::= <Term> | <Term> "&&" <SimplerExpression>
Meaning that the && operator has the second lowest precedence and associates left to right.
<Term> ::= <Comparison> | "(" <Expression> ( ")" | "\0" ) | "!" <Term>
Meaning that ! (the not operator) binds tightly, so !a<b&&c<d is equivalent to (!(a<b))&&(c<d). To save on complexity, the parser can remove ! operations and reverse the logic of the items being notted, so that statement would be equivalent to (a>=b)&&(c<d). Use De Morgan's Law to apply not to logic operations. Parentheses let you group subexpressions. You can leave off the trailing ")" for expressions that end at the end of the query string.
<Comparison> ::= <String> <RelationalOperator> <String> | <UnaryOperator> <String>
Meaning that the basic comparison operation is done between a named attribute (the first string) and some sort of value (the second string). The second string will be converted to the appropriate data type to match the attribute (if it is an integer attribute, the value string will be converted to an integer, same for floating point attributes, date attributes, etc). If the conversion fails, it will use some default as the value (usually zero). In the unary version, the string is the name of an attribute.
<RelationalOperator> ::= "<" | "<=" | "==" | "=" | "!=" | ">=" | ">"
As a special case for string comparisons, if the value is a pattern (containing * [ ? or \) and the comparison is == or = or != then pattern matching will be done rather than the usual relational comparison. BFS overdoes this and gives no results if you use less or greater than with a pattern. If you are using multiple keyword indices, then the test succeeds if any one of the keywords succeeds.
<UnaryOperator> ::= "<>" | "><"
These are used for seeing if an attribute exists or if it does not exist.
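To make the grammar above more concrete, here is a sketch of a recursive descent parser for it, with one function per grammar rule. It is only an illustration under my own assumptions: all the sketch_ names are hypothetical, white space is skipped on the fly rather than in a separate tokenizer, and instead of building a parse tree it writes the query out in postfix order (operands first, then the operator) so the precedence becomes visible.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef struct sketch_parser {
    const char *in;   /* unparsed remainder of the query */
    char out[256];    /* postfix rendering built so far */
    size_t used;
    int failed;
} sketch_parser;

static void sketch_expression(sketch_parser *p);

static void sketch_emit(sketch_parser *p, const char *text, size_t n)
{
    if (p->used + n + 1 > sizeof p->out) { p->failed = 1; return; }
    memcpy(p->out + p->used, text, n);
    p->used += n;
    p->out[p->used] = '\0';
}

/* Skips white space, then consumes the literal if the input starts with it. */
static int sketch_accept(sketch_parser *p, const char *literal)
{
    size_t n = strlen(literal);
    while (isspace((unsigned char)*p->in)) p->in++;
    if (strncmp(p->in, literal, n) != 0) return 0;
    p->in += n;
    return 1;
}

static void sketch_string(sketch_parser *p)
{
    const char *start;
    if (sketch_accept(p, "\"")) {            /* QuotedString */
        start = p->in;
        while (*p->in != '\0' && *p->in != '"')
            p->in += (p->in[0] == '\\' && p->in[1] == '"') ? 2 : 1;
        sketch_emit(p, start, (size_t)(p->in - start));
        if (*p->in == '"') p->in++;          /* may be missing at the end */
        return;
    }
    start = p->in;                           /* UnquotedString */
    while (*p->in != '\0' && !isspace((unsigned char)*p->in) &&
           strchr("()!|&<>=", *p->in) == NULL)
        p->in++;
    if (p->in == start) p->failed = 1;       /* needs at least one character */
    sketch_emit(p, start, (size_t)(p->in - start));
}

static void sketch_comparison(sketch_parser *p)
{
    static const char *const ops[] =
        { "<>", "><", "<=", "==", "!=", ">=", "<", "=", ">" };
    size_t i;
    if (p->used > 0) sketch_emit(p, " ", 1);
    for (i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        if (i == 2) sketch_string(p);        /* not unary: attribute name first */
        if (sketch_accept(p, ops[i])) {
            sketch_emit(p, ops[i], strlen(ops[i]));
            sketch_string(p);
            return;
        }
    }
    p->failed = 1;
}

static void sketch_term(sketch_parser *p)
{
    if (sketch_accept(p, "!")) { sketch_term(p); sketch_emit(p, " !", 2); return; }
    if (!sketch_accept(p, "(")) { sketch_comparison(p); return; }
    sketch_expression(p);
    if (!sketch_accept(p, ")") && *p->in != '\0') p->failed = 1;
}

static void sketch_simpler(sketch_parser *p)
{
    sketch_term(p);
    if (sketch_accept(p, "&&")) { sketch_simpler(p); sketch_emit(p, " &&", 3); }
}

static void sketch_expression(sketch_parser *p)
{
    sketch_simpler(p);
    if (sketch_accept(p, "||")) { sketch_expression(p); sketch_emit(p, " ||", 3); }
}

/* Returns 1 when the whole query parses; p->out then holds the postfix form. */
static int sketch_parse_query(sketch_parser *p, const char *query)
{
    assert(p != NULL && query != NULL);
    memset(p, 0, sizeof *p);
    p->in = query;
    sketch_expression(p);
    sketch_accept(p, "");                    /* just skips trailing white space */
    return !p->failed && *p->in == '\0';
}
```

Parsing a==1||b==2&&c==3 with this sketch leaves "a==1 b==2 c==3 && ||" in out, showing that && is applied before ||, and !a<b&&c<d becomes "a<b ! c<d &&", showing that ! only covers the first comparison.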
int close_query (void *ns, void *cookie);
Finished reading the list of files matching the query, but don't free the cookie yet as other threads may still be reading. Other threads blocked while reading queries should be unblocked and return a suitable error code.
int free_querycookie (void *ns, void *node, void *cookie);
The last thread has finished reading the query results, you can now really close it. Deallocate the cookie and otherwise clean up here. Watch out for the meaningless node parameter.
int read_query (void *ns, void *cookie, long *num, struct dirent *buf, size_t bufsize);
Gets a list of files which match the query. Returns the number of files being described in *num, and the information about them as variable sized dirent structures in the buffer you provided (which is bufsize bytes long). To get all the files matching the query, keep on calling read_query until no more are returned (should just return zero in *num, not an error code). Note that BeOS will crash if you don't provide an implementation of this function - there doesn't seem to be a stub function that just returns an error code like it does for all other API calls.
vnode_ops fs_entry;
This is an array of function pointers to your file system functions. The add-on loader will look for a global variable named "fs_entry" and use that to get the pointers, and start your file system via the mount() call. The pointers are also copied to the name-space record which the OS uses to keep track of active file systems (not sure if it copies just the array address or all the function pointers).
int32 api_version;
Another magic global variable in your code which the add-on loader will look for to decide how to use your file system. The name must be "api_version" and the value must be preset (an initialized global variable) to the file system API version you are implementing, usually B_CUR_FS_API_VERSION.
int new_path (const char *path, char **copy);
Kernel function (implemented by the OS, not you) to allocate memory and copy a string. The new memory is owned by the OS, not your file system. Used during Walk operations to copy a symbolic link string into the OS memory space.
void free_path (char *p);
Presumably frees the memory allocated with new_path. Normally not needed as the OS will free strings given to it during Walk.
int notify_listener (int op, nspace_id nsid, vnode_id vnida, vnode_id vnidb, vnode_id vnidc, const char *name);
Call this OS provided function to tell it about changes in the file system. The OS will then pass on the information to interested programs. op is the event, which can be one of B_ENTRY_CREATED, B_ENTRY_REMOVED, B_ENTRY_MOVED, B_STAT_CHANGED, B_ATTR_CHANGED. The OS takes care of generating B_DEVICE_MOUNTED and B_DEVICE_UNMOUNTED so you shouldn't specify them.
Marco Nelissen provided a description of the arguments for each notification operation:
B_ENTRY_CREATED: vnida is the directory containing the file in question, vnidb is unused (0), vnidc is the vnode_id of the new file (or directory or symlink etc), and name is the name of the new file/dir/symlink/etc.
B_ENTRY_REMOVED: vnida is the directory containing the file in question, vnidb is unused (0), vnidc is the vnode_id of the file, and name is NULL or the name of the file if you know it (probably not used as the BeBook C++ API doesn't specify it as being available). Can happen as part of unlink, rmdir and rename (if overwriting a file).
B_ENTRY_MOVED: vnida is the 'old' directory, vnidb is the 'new' directory, vnidc is the file, and name is the new name of the file. Called as part of renaming a file/dir/link/etc.
B_STAT_CHANGED: vnida and vnidb are unused (0), vnidc is the file in question. Usually you call this in your wstat function, and when the file is closed. You can also call it while the file size is changing as it is being written, but no more often than about once per second, otherwise it wastes the user's time. Definitely call it when the file is closed and the final size is known.
B_ATTR_CHANGED: basically the same as B_STAT_CHANGED, except that vnidc is the vnode_id of the attribute. If you wish, you can put the name of the attribute in the name field, though it currently isn't passed to the user.
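The "not too often" advice for B_STAT_CHANGED can be sketched as a small rate limiter kept in each open file's data. The structure and function names are hypothetical, and the caller is assumed to supply the current time in microseconds (on BeOS, system_time() would be a natural source).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_INTERVAL_US 1000000LL   /* roughly once per second */

typedef struct sketch_stat_throttle {      /* hypothetical per-file state */
    long long last_sent_us;
    int sent_before;
} sketch_stat_throttle;

/* Returns 1 if a B_STAT_CHANGED notification should go out now.
   Closing always notifies, since the final size is then known. */
static int sketch_stat_change_due(sketch_stat_throttle *t,
                                  long long now_us, int closing)
{
    assert(t != NULL);
    if (!closing && t->sent_before &&
        now_us - t->last_sent_us < SKETCH_MIN_INTERVAL_US)
        return 0;                          /* too soon, skip this one */
    t->last_sent_us = now_us;
    t->sent_before = 1;
    return 1;
}
```

When it returns 1, the write or close hook would go on to call notify_listener with B_STAT_CHANGED and the file's vnode_id in vnidc.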
void notify_select_event (selectsync *sync, uint32 ref);
Kernel function probably used for the device driver select functionality (see Drivers.h). I think it has to do with waiting for incoming network connections and other such asynchronous things, but I'm not sure how it works or what it does. Maybe you call it when you get a new connection?
int send_notification (port_id port, long token, ulong what, long op, nspace_id nsida, nspace_id nsidb, vnode_id vnida, vnode_id vnidb, vnode_id vnidc, const char *name);
This is a kernel function to notify a particular program about a live query change (see open_query). Normal file system notifications use notify_listener instead (I'm guessing that it may internally call this function for you as part of notify_listener). The live query messages have what == B_QUERY_UPDATE, op will be either B_ENTRY_CREATED or B_ENTRY_REMOVED, and the rest is similar to notify_listener.
int get_vnode (nspace_id nsid, vnode_id vnid, void **data);
This function is implemented by the OS. You call it when you want to convert a vnode_id into an actual vnode (so you can find out info about the file, or open it, and otherwise use it). In turn, the OS will call the file system's read_vnode if it hasn't got a cached copy of the vnode pointer. I suspect that it returns your shadow vnode pointer in *data. It also increments the reference counter for the vnode. When you are done with the vnode, call put_vnode. Watch out for recursive calls into your file system!
int put_vnode (nspace_id nsid, vnode_id vnid);
This function is implemented by the OS. You call it when you have finished working with a vnode (see get_vnode). It decrements the vnode's reference count kept by the OS.
int new_vnode (nspace_id nsid, vnode_id vnid, void *data);
This function is implemented by the OS. It allocates a new vnode record inside the OS, kind of like using get_vnode (which then calls your read_vnode) but more directly. Often used as a convenience when you know that the OS will be asking for a vnode_id soon and you already have the file data for it. The shadow vnode pointer goes in data and the corresponding vnode ID goes in vnid. The new vnode record seems to start out with a reference count of 1, and the OS assumes you have already done the read_vnode functionality, because it directly calls write_vnode when that vnode is no longer needed.
int remove_vnode (nspace_id nsid, vnode_id vnid);
This is the OS call, not the similarly named file system hook function. Tells the OS to delete the vnode record once the reference count falls to zero. Essentially just sets a delete flag on the vnode. Typically you would call this after deleting a file from a directory (the file no longer has any directories referring to it). Then when the final user of the file goes away, the vnode gets deallocated and your remove_vnode hook gets called, rather than the usual write_vnode, before the vnode vanishes.
int unremove_vnode (nspace_id nsid, vnode_id vnid);
This function is implemented by the OS. Clears the delete flag on a vnode. In case you changed your mind.
int is_vnode_removed (nspace_id nsid, vnode_id vnid);
This function is implemented by the OS. Get the delete flag state from a vnode record.
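The interplay between the reference count and the delete flag can be modelled in a few lines. This is a toy model of the behaviour described above, not the kernel's code: the model_ functions mimic the OS calls on a single record, the two counters stand for calls to your write_vnode and remove_vnode hooks, and the hook is assumed to be called as soon as the count reaches zero (the real kernel may hold on to the record for a while).

```c
#include <assert.h>
#include <string.h>

typedef struct sketch_vnode_record {   /* hypothetical OS-side bookkeeping */
    int ref_count;
    int removed;                       /* the delete flag */
    int write_vnode_calls;             /* times your write_vnode hook ran */
    int remove_vnode_calls;            /* times your remove_vnode hook ran */
} sketch_vnode_record;

static void model_new_vnode(sketch_vnode_record *v)
{
    assert(v != NULL);
    memset(v, 0, sizeof *v);
    v->ref_count = 1;                  /* seems to start at 1 */
}

static void model_get_vnode(sketch_vnode_record *v)
{
    assert(v != NULL && v->ref_count > 0);
    v->ref_count++;
}

static void model_remove_vnode(sketch_vnode_record *v) { v->removed = 1; }
static void model_unremove_vnode(sketch_vnode_record *v) { v->removed = 0; }
static int model_is_vnode_removed(const sketch_vnode_record *v) { return v->removed; }

static void model_put_vnode(sketch_vnode_record *v)
{
    assert(v != NULL && v->ref_count > 0);
    if (--v->ref_count > 0)
        return;                        /* somebody is still using it */
    if (v->removed)
        v->remove_vnode_calls++;       /* file is gone: free its storage */
    else
        v->write_vnode_calls++;        /* normal case: flush and forget */
}
```

In the model, a file that is unlinked while still open keeps working until the last put_vnode, at which point the remove_vnode hook runs instead of write_vnode.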
int mount (const char *filesystem, const char *where, const char *device, ulong flags, void *parms, int len);
This variation of the standard POSIX mount command, found in unistd.h, is used for disk volume mounting (loading the file system code, reading the file system headers from the disk, and adding the disk volume to the system's directory tree). "filesystem" names the file system, "bfs" for example, and should match one of the file-system add-on library names (see the /boot/beos/system/add-ons/kernel/file_systems directory and corresponding user provided /home/config/add-ons/kernel/file_systems directory). "where" specifies the directory where you want the disk volume to appear; often "/volumename" is used, but it can take over any existing directory (existing contents will be inaccessible until you unmount the file system). "device" is the path to the raw device used by the file system, usually a disk partition pseudo-device; specify NULL for file systems that don't have an underlying device (an empty string isn't good enough). The flags can be set to B_MOUNT_READ_ONLY (1) to disable write access to the file system, or 0 for normal mounting. There may also be an undocumented flag which will initialize the drive with a new empty file system. "parms" is a pointer to a user provided buffer with parameters that the particular file system type understands (could be a string, or binary structures), and len is the length of that buffer.
int unmount (const char *path);
The user callable POSIX-like function from unistd.h which undoes the mount operation. The file system which was mounted at the directory specified by "path" is unmounted (files closed, data flushed to disk). May fail if there are open files still using the disk volume (close them and try again later).
Suggested order of implementation
- mount : Need this to get started.
- unmount : For symmetry. Can't very well have a way to mount without a way to unmount too.
- rfsstat : Used by daemons such as TrashWatcher, Desktop, main_mime, Deskbar to examine mounted volumes; will crash if not present.
- walk : TrashWatcher starts looking for ".", ls uses it to find files in a directory as part of stat?
- rstat : Lots of use, by daemons (DirPoller, Desktop, TrashWatcher) and even "ls" when it lists files (for each file or dir's size, mode flags, dates, etc).
- open, close, free_cookie : Used for both files and directories.
- opendir, readdir, free_dircookie : As you would expect.
- read_vnode, write_vnode : Once you have more than just a root directory.
- read, write data : Self-explanatory.
- rename : Implementing this will usually force you to redo rmdir and unlink.
- attributes : Meta-data to be handled. People files are nothing more than a series of attributes stored in the inode of the file.
- notification : Need some way to notify the system (and potentially the user).
- indices and queries : Indices make querying in the future faster. Storing queries would be good for convenience factor.
And that is all.
- Alex