231: Filesystem: Metadata Holds More Information.


By Take Up Code: build your own computer games, apps, robotics with podcasts, and live classes.

Metadata provides information about your files.

You might hear the prefix meta applied to other things. It really just adds a higher level to whatever it describes. So metadata is information about your data. It’s data that describes your data in the files.

Metadata includes things such as when a file was created, when it was last changed, who the author is, etc. And depending on the type of file, there could be more. A picture or image file could have information about where and when the picture was taken, what camera was used, the dimensions, etc.

Listen to the full episode for more information about where a filesystem might store these properties. This includes a description of a forking filesystem and alternate data streams. You can also read the full transcript below.


You might hear the prefix meta applied to other things. It really just adds a higher level to whatever it describes. So metadata is information about your data. It’s data that describes your data in the files.

If you refer to something like metaplanning, then this refers to something about how you plan. Maybe before creating an actual plan to do something, you need a higher level plan for how to create the real plan.

In general, going meta just means to take something to a higher level or to go beyond. For example, metathinking is thinking about how our minds think.

Back to filesystems, metadata includes things such as when a file was created, when it was last changed, who the author is, etc. And depending on the type of file, there could be more. A picture or image file could have information about where and when the picture was taken, what camera was used, the dimensions, etc.
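
If you want to see some of this metadata from code, here is a minimal Python sketch using the standard library's os.stat. The helper name describe_file is made up for this example, and exactly which values are available depends on your operating system and filesystem.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a few common metadata values for the file at `path`."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        # Last time the file's contents changed.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        # True when no write-permission bit is set for anyone.
        "read_only": (info.st_mode & 0o222) == 0,
    }


if __name__ == "__main__":
    # Create a small throwaway file so there is something to describe.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"hello")
    print(describe_file(handle.name))
    os.remove(handle.name)
```

None of these values live in the bytes you get when you read the file. The filesystem keeps them on the side and hands them over only when asked.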

You can use this information to gain more insight into your files. And applications can use it to give you more ways to organize your files. So instead of just looking at a list of files and folders sorted by name, you can view them sorted by their original author name.

The extra information is also called either properties or attributes. Sometimes you might also hear the term extended attributes.

This metadata needs space somewhere. I mean, remembering if a file should be read-only needs at least a single bit to be set to either 0 or 1 and that bit needs to be saved somewhere. The same thing with the name of the author.

But before we get into where this is stored, I want to clear up something about read-only. This sounds a lot like security. Should you be able to write to the file or just read? Well, it’s not really very secure. Sure, the concept of allowing some people to write while others can only read is definitely important. But that’s not what this attribute controls. The read-only attribute doesn’t really protect anything. It’s more of a reminder to applications that a particular file should not be changed or deleted.

Think of it like a sign placed on a door that says “No Entry” and compare that with an actual locked door. Anybody can ignore the sign and enter. But most people will behave and avoid going through the door. A lock is much more secure and will stop people from entering unless they either have the key, pick the lock, or break the door.

So remember that almost any security system can be broken. My dad told me when I was young that locks were only there to keep honest people out. So from that sense, there’s no difference between a sign and a lock for a person with the skill and motive to break through. A sign is a really weak form of security, just like the read-only attribute.
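
To make the sign on the door concrete, here is a small Python sketch. It assumes the program owns the file it is working with, and the helper names set_read_only and overwrite_ignoring_sign are invented for this example. The point is only that a program that wants to write can take the sign down, walk through, and hang it back up.

```python
import os
import stat
import tempfile


def set_read_only(path, read_only):
    """Set or clear the write-permission bits on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if read_only:
        mode &= ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    else:
        mode |= stat.S_IWUSR
    os.chmod(path, mode)


def overwrite_ignoring_sign(path, data):
    """Take the sign down, write anyway, then put the sign back."""
    set_read_only(path, False)
    with open(path, "wb") as handle:
        handle.write(data)
    set_read_only(path, True)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    set_read_only(path, True)
    overwrite_ignoring_sign(path, b"changed anyway")
    with open(path, "rb") as handle:
        print(handle.read())
    set_read_only(path, False)
    os.remove(path)
```

A well-behaved application checks the attribute and leaves the file alone. Nothing forces it to.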

Okay, back to attributes and where they get stored. First, we have to consider that some attributes are so common to a filesystem that they can apply to all the files. These are the things most often called attributes.

They tend to take up a fixed amount of space. Even the traditional Unix style security information could be implemented by a filesystem as attributes.

The reason a fixed amount of space is important is that the attributes can then fit into simple data structures that get stored in known locations in the file.

Let’s say that we need a total of 100 bytes for a set of attributes in our filesystem. We could put these 100 bytes at the beginning of every file. When an application opens a file to read its contents, the filesystem will just skip over the first 100 bytes as if they did not exist. Only if an application asks for one of the attributes would the filesystem read the attribute from the beginning of the file and give the value to the application.

Any time the filesystem needs to authorize access to a file, it can also read the security information from the first 100 bytes and make the decision to allow access or not. These first 100 bytes would be under the complete control of the filesystem and as far as any other application is concerned, they don’t exist.
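
Here is a rough Python sketch of that 100-byte design. The layout is purely an assumption for illustration (one flag byte, an 8-byte creation time, and 91 bytes for the author name), and the function names are made up. No real filesystem is being described here.

```python
import struct

HEADER_SIZE = 100  # the fixed attribute block from the example
# Assumed layout: 1 flag byte, 8-byte creation time, 91-byte author name.
HEADER_FORMAT = "<BQ91s"
assert struct.calcsize(HEADER_FORMAT) == HEADER_SIZE

READ_ONLY_FLAG = 0x01  # a single bit is enough to remember read-only


def pack_file(contents, read_only, created, author):
    """Build the raw bytes as stored on disk: header first, then contents."""
    flags = READ_ONLY_FLAG if read_only else 0
    # struct pads a short author name with zero bytes and truncates a long one.
    header = struct.pack(HEADER_FORMAT, flags, created, author.encode("utf-8"))
    return header + contents


def read_contents(raw):
    """What an application sees: the header is skipped as if it did not exist."""
    return raw[HEADER_SIZE:]


def read_attributes(raw):
    """What the filesystem returns only when the attributes are asked for."""
    flags, created, author = struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE])
    return {
        "read_only": bool(flags & READ_ONLY_FLAG),
        "created": created,
        "author": author.rstrip(b"\x00").decode("utf-8"),
    }


if __name__ == "__main__":
    raw = pack_file(b"hello", read_only=True, created=1500000000, author="Ada")
    print(len(raw))              # 105 bytes on disk
    print(read_contents(raw))    # the application only ever sees b"hello"
    print(read_attributes(raw))
```

Because every field sits at a known offset, the filesystem can find the read-only bit or the author without scanning anything.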

We could even extend this design to allow for the space needed to increase. Maybe one of the properties is the size of the properties. This gets harder to manage but could work. At least for the well known properties.
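
One way to picture a property that holds the size of the properties is a length prefix. This Python sketch is only an illustration: the 4-byte prefix and the choice of JSON for the property block are my assumptions, and the function names are made up.

```python
import json
import struct

LENGTH_PREFIX = struct.Struct("<I")  # 4 bytes holding the property block size


def pack_with_properties(contents, properties):
    """Store the block size, then the property block, then the contents."""
    block = json.dumps(properties, sort_keys=True).encode("utf-8")
    return LENGTH_PREFIX.pack(len(block)) + block + contents


def split_file(raw):
    """Return (properties, contents) by reading the size before anything else."""
    (size,) = LENGTH_PREFIX.unpack_from(raw, 0)
    start = LENGTH_PREFIX.size
    block = raw[start:start + size]
    return json.loads(block.decode("utf-8")), raw[start + size:]


if __name__ == "__main__":
    small = pack_with_properties(b"data", {"author": "Ada"})
    large = pack_with_properties(b"data", {"author": "Ada", "camera": "X100"})
    # The property block grew, but the contents come back unchanged.
    print(split_file(small))
    print(split_file(large))
```

The harder part this sketch skips is growing the block in place, since everything after it has to move.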

It would be harder to make this system work for special properties such as the focal length used to take a picture. As the properties become more specific to a type of file, it makes more sense to let the applications that best understand the files manage their own metadata. This clashes with an approach where the metadata storage location is kept hidden away.

Another option to consider is to store metadata outside of the file. This could be in a separate file or maybe even in a central database where all the metadata is stored for all the files on the computer.

There are tradeoffs for any engineering design decision and this approach has advantages and disadvantages.

One good thing is that you don’t have to worry about where to store extra information inside the file anymore.

But a bad thing is that the information that describes a file is now kept somewhere else where it could get lost or mixed up with information about a different file.

Another good thing is that because the information is in one place, then we can save space by not needing to repeat the same metadata all the time. If all your pictures are taken with the same camera with the same settings, then this information can be shared with all your picture files.
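
A toy version of the central database idea might look like this in Python. The MetadataStore class and its methods are hypothetical, made up just to show both the space saving and the risk of metadata getting separated from its file.

```python
class MetadataStore:
    """A toy central store that keeps each distinct metadata record once."""

    def __init__(self):
        self._records = {}   # record id -> metadata dict
        self._ids = {}       # hashable form of the metadata -> record id
        self._by_path = {}   # file path -> record id

    def attach(self, path, metadata):
        """Link `path` to a record, reusing an identical record if one exists."""
        key = tuple(sorted(metadata.items()))
        if key not in self._ids:
            record_id = len(self._records)
            self._ids[key] = record_id
            self._records[record_id] = dict(metadata)
        self._by_path[path] = self._ids[key]

    def lookup(self, path):
        """Return the metadata for `path`, or None if the link has been lost."""
        record_id = self._by_path.get(path)
        return None if record_id is None else self._records[record_id]

    def record_count(self):
        return len(self._records)


if __name__ == "__main__":
    store = MetadataStore()
    camera = {"camera": "X100", "focal_length_mm": 23}
    store.attach("beach.jpg", camera)
    store.attach("sunset.jpg", camera)
    print(store.record_count())          # two pictures share one record
    # A file renamed behind the store's back no longer finds its metadata.
    print(store.lookup("beach-renamed.jpg"))
```

The store only knows files by path, so anything that moves or renames a file without telling the store breaks the link.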

I tend to think it’s good to keep information about a file together with that file. Even if this uses more disk space. But that still leaves us with a problem of where to put all this metadata so that it stays with the file, doesn’t interfere with the actual data in the file, can grow as big as needed for special properties, and can remain small and fixed for common properties.

The solution is the idea of a fork. Think about this for a moment, have you ever seen a fork with just a single tine? Those are usually called skewers or even toothpicks. Even sporks have multiple pointy parts.

Well, a forking filesystem allows you to add multiple places in a file where information can be stored. Usually, one of these is considered to be the main stream. And the additional tines on the fork are alternate streams.

The ideas of files and streams work well together because usually, we start reading a file from the beginning and advance through the information until we reach the end. It’s like watching the information pass by like watching water pass by in a stream.

Now, some filesystems might not support alternate streams at all. Some might support just a few for each file. And some might let you create as many alternate streams as you want and give each stream its own name.

What happens if you try to copy a file that contains alternate streams to a filesystem that doesn’t support streams? Well, hopefully, the filesystem will warn you first because usually, this means only the main stream will be copied. You can lose all the extra information by moving to a different filesystem.

You can also lose information by using applications that ignore alternate streams. Is it the filesystem’s fault if you use some utility application you found somewhere to copy a file and that application knows nothing about alternate streams? What’s the application going to do? It’ll open the file with all the streams and ask the filesystem for just the main stream. Since that’s all it knows to ask for, then that’s all it gets. Then the application will open another file and write all the information it reads from the original file. You end up with a copy. But only a copy of the main stream.
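
Here is a toy model of that situation in Python. It does not call any real filesystem API. The ForkedFile class, the stream name "main", and both copy functions are made up to show why a copy tool that only asks for the main stream loses everything else.

```python
MAIN = "main"  # made-up name for the main stream


class ForkedFile:
    """A toy file made of named streams; not a real filesystem API."""

    def __init__(self, main=b""):
        self.streams = {MAIN: main}


def naive_copy(source):
    """Copy like a tool that knows nothing about alternate streams."""
    return ForkedFile(source.streams[MAIN])


def stream_aware_copy(source):
    """Copy every stream, main and alternate."""
    copy = ForkedFile()
    copy.streams = dict(source.streams)
    return copy


if __name__ == "__main__":
    photo = ForkedFile(b"pixels")
    photo.streams["camera"] = b"focal length 50mm"

    print(sorted(naive_copy(photo).streams))         # only the main stream
    print(sorted(stream_aware_copy(photo).streams))  # both streams survive
```

The naive copy is not buggy in any obvious way. It reads everything it knows to ask for, which is exactly the problem.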

When you’re writing code to work with filesystems, just be aware that there could be information sitting inside a file that you won’t get unless you know how to ask for it. This could change from one filesystem to another. So going into the details here probably won’t help you any. But the awareness that multiple forks or alternate streams could exist should be enough to get you to know what to look for.

By the way, this is how Internet Explorer marks a file that was downloaded from the internet. When it saves the file to your disk, it creates an alternate stream inside the file where it stores the source of the file.

Then anytime you try to run a file, the shell looks for this alternate stream by a given name that it knows was used by Internet Explorer. If it finds such a stream and the stream identifies the file as coming from an untrustworthy location on the internet, then the shell can stop you from running the application. Or maybe it can warn you. You can get rid of this warning by removing the alternate stream. Of course, you might have to write your own application for this.
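
The episode doesn't name the stream, but on Windows with NTFS it is called Zone.Identifier, and its contents are a few lines of INI-style text with a ZoneId value, where 3 means the internet zone. Here is a hedged Python sketch that only interprets that text. The function names are made up, and the sample text is typical rather than taken from any particular file.

```python
import configparser

INTERNET_ZONE = 3  # Windows security zone number for the internet zone


def zone_id_from_text(stream_text):
    """Return the ZoneId stored in Zone.Identifier text, or None if absent."""
    # Interpolation is off so a '%' in other fields cannot cause trouble.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(stream_text)
    if not parser.has_option("ZoneTransfer", "ZoneId"):
        return None
    return parser.getint("ZoneTransfer", "ZoneId")


def came_from_internet(stream_text):
    """True when the text marks the file as internet zone or more restricted."""
    zone = zone_id_from_text(stream_text)
    return zone is not None and zone >= INTERNET_ZONE


if __name__ == "__main__":
    sample = "[ZoneTransfer]\r\nZoneId=3\r\n"
    print(zone_id_from_text(sample))
    print(came_from_internet(sample))
```

On Windows with an NTFS volume, you should be able to get this text by opening the file's path with ":Zone.Identifier" added to the end. On other systems that path will not work, which is why the sketch takes the text as a parameter instead.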
