You can’t always tell what’s inside a file from its extension. In this post we’ll delve into the contents of JFIF/JPEG files.
JPEG is often used as a catchall term for both the encoding (whether lossy JPEG compression, as we discussed in a previous post, or one of the other options, such as lossless JPEG compression) and the file itself, which contains JPEG-coded data. Things get even more confusing because file extensions associated with JPEG-coded data range from .jpg to .jpeg to .jfif, and beyond. Most software applications don’t rely on the file extension to determine how to handle the data; they rely instead on magic numbers, which are format-specific byte sequences at known positions in the file, and on other embedded information. Occasionally, however, we might find an application that automatically decodes files labelled .jpg but not files with a .jfif extension.
We can understand what is in a file (regardless of its file extension) by stepping through the file’s contents. Let’s take a look at what is in a file with a .jpg file extension that was produced using a scanner. We’ll use the free HxD tool (https://mh-nexus.de/en/hxd/). This tool displays every single byte of a file; each byte is written in hexadecimal as a pair of digits drawn from 0 through 9 and A through F. In this file we see, reading the top row in black type from left to right, that the first byte is FF and the second is D8.
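To make the hex-editor layout concrete, here is a minimal Python sketch that prints bytes in a similar three-column form (offset, hex pairs, text translation). The function name hex_dump and the 20-byte sample are our own illustration, not anything taken from HxD or from the scanned file itself.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex pairs, and a printable-ASCII column."""
    pad = width * 3 - 1  # width of a full row of "XX XX ... XX"
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Bytes outside printable ASCII are shown as '.', as hex editors commonly do.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)


# Hypothetical sample: an SOI marker followed by a minimal JFIF APP0 segment.
sample = bytes.fromhex("FFD8FFE000104A46494600010101004800480000")
print(hex_dump(sample))
# First row: 00000000  FF D8 FF E0 00 10 4A 46 49 46 00 01 01 01 00 48  ......JFIF.....H
```

The text column is where readable strings such as “JFIF” show up, which is why it is so useful for a first look at an unknown file.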
As we keep looking at the file data one byte at a time, we can see from the top line of the data, and from its text translation at the far right, that this is actually a JFIF file. (The values FF D8 FF E0 and then the later values 4A 46 49 46 00, which spell “JFIF” followed by a zero byte, are the magic numbers for JFIF.) So, any application that can decode and display this file is probably not just looking at its file extension (.jpg) but is actually reading and interpreting the header to first find the values FF D8 FF E0 and then 4A 46 49 46 00.
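A check along these lines might look like the following sketch. The function name looks_like_jfif is hypothetical, and real applications may well test more (or different) bytes than this; the only facts relied on are the byte values quoted above and the two length bytes that sit between them.

```python
SOI_APP0 = bytes.fromhex("FFD8FFE0")  # SOI marker followed by the APP0 marker
JFIF_ID = b"JFIF\x00"                 # 4A 46 49 46 00


def looks_like_jfif(header: bytes) -> bool:
    """Return True if the bytes begin with the JFIF magic numbers."""
    # Bytes 0-3: FF D8 FF E0.
    # Bytes 4-5: length of the APP0 segment (not checked here).
    # Bytes 6-10: the identifier "JFIF" plus a terminating zero byte.
    return header[:4] == SOI_APP0 and header[6:11] == JFIF_ID
```

To apply it to a file, we would open the file in binary mode and pass in its first 11 bytes; the file extension never enters into the decision.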
To actually decode JPEG-formatted data, the decoder must look for specific markers in the file, much as it did for FF D8 FF E0 and then 4A 46 49 46 00. Each marker is a two-byte value that begins with FF. These markers indicate data sections, encoded image data and tables, and other information the decoder needs to interpret and decode the data.
Using Bevara software on the same file we see the markers we expect, including the Start of Image (SOI) marker and the End of Image (EOI) marker, which tell us where the image starts and ends. Two bytes into the file, right after the SOI, we also see the application marker (APP), which tells us to look for things like the JFIF magic numbers. After that we see a number of markers that segment off the tables that are needed for decoding (DQT and DHT), along with a frame marker (SOF) that tells us that this will be a baseline-format JPEG image. Finally, we see the Start of Scan (SOS) marker, which is where the last few parameters and the actual JPEG-compressed data are located in the file.
This is a fairly simple file (the SOI marker and all of the JFIF information together occupy only the first 20 bytes). It doesn’t even contain a thumbnail image (which would be stored in the APP section if it were present).
marker FF D8  (SOI) is at location 0
marker FF E0  (APP) is at location 2
marker FF DB  (DQT) is at location 20
marker FF C0  (SOF) is at location 89
marker FF C4  (DHT) is at location 102
marker FF C4  (DHT) is at location 131
marker FF DA  (SOS) is at location 215
marker FF D9  (EOI) is at location 16578
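A listing like this one can be produced by walking the file from segment to segment. The sketch below is our own simplified illustration, not the Bevara implementation: it assumes a well-formed file, knows only the marker names that appear in this post, and uses a tiny hand-built byte string (not a decodable image) as its input. It relies on two facts about the format: most markers are followed by a two-byte big-endian length that counts itself, and inside the compressed data after SOS an FF byte is always followed by 00 (or a restart marker, FF D0 through FF D7) unless a real marker begins there.

```python
NAMES = {0xD8: "SOI", 0xC0: "SOF", 0xC4: "DHT", 0xDB: "DQT", 0xDA: "SOS", 0xD9: "EOI"}


def marker_name(code: int) -> str:
    """Map a marker code to the short names used in this post."""
    return "APP" if 0xE0 <= code <= 0xEF else NAMES.get(code, "???")


def find_markers(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, marker code) pairs, walking segment by segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker at offset 0")
    found = [(0, 0xD8)]
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        if code == 0xFF:  # fill byte in front of a marker
            pos += 1
            continue
        found.append((pos, code))
        if code == 0xD9:  # EOI: nothing more to read
            break
        if 0xD0 <= code <= 0xD7 or code == 0x01:  # markers with no length field
            pos += 2
            continue
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        pos += 2 + length  # skip the marker plus its segment
        if code == 0xDA:
            # Compressed data follows the SOS header; scan for the next real marker.
            while pos + 1 < len(data):
                nxt = data[pos + 1]
                if data[pos] == 0xFF and nxt not in (0x00, 0xFF) and not 0xD0 <= nxt <= 0xD7:
                    break
                pos += 1
    return found


# Hypothetical miniature input: structurally valid segments, not a real image.
sample = (
    b"\xff\xd8"                                                          # SOI
    + b"\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x48\x00\x48\x00\x00"  # APP0 (JFIF)
    + b"\xff\xdb\x00\x04\x00\x01"                                        # DQT with a dummy payload
    + b"\xff\xda\x00\x02"                                                # SOS with an empty header
    + b"\x12\xff\x00\x34"                                                # scan data with a stuffed FF 00
    + b"\xff\xd9"                                                        # EOI
)
for offset, code in find_markers(sample):
    print(f"marker FF {code:02X}  ({marker_name(code)}) is at location {offset}")
```

Jumping by the declared segment length, rather than searching for every FF byte, avoids mistaking table or metadata bytes for markers.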
Now let’s take a look at a more complex file that also has a .jpg file extension, but was produced by a camera:
Again, the first line tells us that this is actually a JFIF file, but some interesting details appear in the second and later rows of the translation column on the right, like “PROFILE”, “Lino”, and “Copyright.” If we look at the markers in the file, we can see that there are many more APP markers than were in the first example:
marker FF D8  (SOI) is at location 0
marker FF E0  (APP) is at location 2
marker FF E2  (APP) is at location 20
marker FF E1  (APP) is at location 3182
marker FF E1  (APP) is at location 3842
marker FF DB  (DQT) is at location 4214
marker FF DB  (DQT) is at location 4283
marker FF C0  (SOF) is at location 4352
marker FF C4  (DHT) is at location 4371
marker FF C4  (DHT) is at location 4404
marker FF C4  (DHT) is at location 4587
marker FF C4  (DHT) is at location 4620
marker FF DA  (SOS) is at location 4803
marker FF D9  (EOI) is at location 2858284
The information contained after each of these APP markers is application-specific, and so can contain whatever the software/hardware designer wants. For example, if we take a look at APP marker FF E2 on the second line, and also look at the translation on the right, we see that this section contains an ICC_PROFILE, which is a set of information about how the device that produced the image represents color, such as the profile version, color space, and device details.
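Because each APP section is free-form, software usually identifies it by the text at the start of its payload rather than by the marker number alone. The sketch below shows the idea for a handful of widely used identifiers; the function name describe_app_payload is hypothetical, the table is far from exhaustive, and “payload” here means the bytes that follow the marker and its two length bytes.

```python
# Identifier strings that commonly open an APP payload (not an exhaustive list).
KNOWN_PREFIXES = [
    (b"JFIF\x00", "JFIF header"),
    (b"Exif\x00\x00", "EXIF metadata"),
    (b"http://ns.adobe.com/xap/1.0/\x00", "XMP packet"),
    (b"ICC_PROFILE\x00", "ICC profile"),
]


def describe_app_payload(payload: bytes) -> str:
    """Guess what an APP section holds from the identifier at its start."""
    for prefix, label in KNOWN_PREFIXES:
        if payload.startswith(prefix):
            return label
    return "unknown (application-specific)"


print(describe_app_payload(b"ICC_PROFILE\x00\x01\x01"))  # prints "ICC profile"
```

Anything that does not match a known identifier is still valid; it simply means the section belongs to an application this table does not know about.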
If we go even further into the image (see below), we can see that on the line numbered 00003840 right after the application marker FF E1 we have the start of an XMP section, which holds custom metadata.
Importantly, if we use one of the more common software tools to open the file, like Microsoft Photos or Microsoft Paint, there will be very limited or even no indication of the customized metadata:
We can use other tools, like the Microsoft Windows file “Properties” dialog, to see some of the information embedded in the file, but this tool doesn’t extract and display all of the information:
This raises several questions: How can the rest of the metadata be retrieved? What is actually in the various pieces of metadata, and is it important? If it is important, how can one be sure that it has been retained during the normalization/preservation process? In our next post we’ll look more deeply into XMP, EXIF, and ICC_PROFILE, and discuss what these metadata formats hold, what they might be used for, and how to preserve them.
Questions? Contact us at support@bevara.com. We’ll either answer you directly or address it in a future post. For an overview of how our patented data preservation process works, download our white paper.