YAFFS2 Implementation Notes
See YAFFS2 for general information about the YAFFS2 file system.
- 1 General File System Reconstruction
- 2 Finding Deleted/Older Versions of Objects
- 3 Possible Future Work
General File System Reconstruction
As outlined in the YAFFS2 page, the file system relies on the 'spare area' of each flash memory page to store information about what file system chunk that is stored in that page. The YAFFS2 spec does not specify though the exact format of how that data is stored because devices can choose how to use the spare area for other things, such as check sums. Further, there is nothing in the file system that stores how the spare area is organized, so tools must guess.
There are two things that tools must guess: - Page size and spare area size - Layout of spare area
TSK assumes that all pages are 2048-bytes and spare area is 64-bytes. It knows about several spare area layouts, but does not know all of them. We need to add the ability to pass in the layout details if you have reverse engineered them.
Determining the format of the spare area
The YAFFS2 specification does not specify exactly where the data in the header should be stored or exactly how the extra fields should fit in. In practice we've seen this format:
- 4 byte sequence number
- 4 byte object ID (with the object type in the high four bits if the header flag is set in the next entry)
- 4 byte chunk ID OR high bit set (header flag) and the parent ID in the rest
- 4 byte number of bytes
At present we've seen these 16 bytes start at offset 0, offset 2, or offset 30. TSK attempts to determine where the fields are by reading in a reasonable number of intialized spare areas and then doing some tests on each possible offset:
- Sequence numbers should be the same for all chunks in a block
- Sequence number shouldn't be null or 0xffffffff
- Object ID can't be zero
- If we have other options, we'd prefer the sequence number not to start with 0xff
These tests could certainly be improved upon.
Constructing the current version of each object
- Read in the sequence number, object ID, and chunk ID from the spare area of each chunk and record the offset of the chunk
- Make a list of chunks for each object ID and sort it by sequence number and then offset, resulting in a chronological list of chunks for each object
- The current header and data chunks can then be found by reading backwards through the list
The inode for the current version of an object is its object ID.
Constructing the file hierarchy
- The root directory always has object ID 1
- To find the children of a directory, search over all objects to find those with the appropriate parent ID
Finding Deleted/Older Versions of Objects
As described in the previous section, for each object we create a chronologically ordered list of chunks with that object ID. The current version of an object is created by starting at the end of the list and reading backwards, but we can start at any point in the list and read backwards to create an older version of the object. Object IDs tend to get reused frequently, so in addition to multiple versions of the same file we also expect to see old files that have been deleted.
The question is, where do we start these versions? Starting at every chunk in the list would produce way too many files. Even only starting at header chunks seems like it would create more versions than could reasonably be worked with. For now, we base versions around changes in the sequence number and create them as follows:
- A version has a number, a pointer to its most recent header, and a pointer its the most recent chunk
- Go through the sorted list of chunks adding each to the current version (i.e., updating the most recent header and chunk pointers) until the sequence number changes
- Note that if we find an unlinked/deleted header we don't record it in the header chunk pointer unless we have no previous header. These headers give us no information and can cause us to entirely miss files created and deleted in one block, as in the following example:
Obj id Seq num Offset Type Parent Name 000005f9 00001b3c 07229a00 file 00000004 deleted 000005f9 00001b3d 07240d40 file 0000032e im.db-journal 000005f9 00001b3d 07245780 file 0000032e im.db-journal 000005f9 00001b3d 07246800 file 0000032e im.db-journal 000005f9 00001b3d 0724b240 file 00000003 unlinked 000005f9 00001b3d 0724ba80 file 00000004 deleted
- Save that version and create a new one, incrementing the version number. This new version will start with the older header pointer as its most recent header. A few exceptions:
- If we're looking at a directory (which will only have a header chunk) and its name hasn't changed, don't start a new version. Multiple copies of the same directory name don't give us much information
- If we never found a header for the previous version, don't start a new version. We can't do anything with a version with no header
Inodes and filenames
Inodes for older versions are created using the object ID and version number. A version number of zero always returns the current version of the object. To avoid name conflicts, non-current versions have their version number and object ID appended to the file name.
r/r 764: pvcodec.txt r/r * 2360060: pvcodec.txt#764,9 r/r * 2097916: pvcodec.txt#764,8 r/r * 1835772: pvcodec.txt#764,7
Determining allocated/unallocated status of versions and chunks
We consider a version of an object to be unallocated if it is not the most recent version or if the most recent header block is a deleted block. We consider a chunk to be allocated if it is part of an allocated version of an object. Since each chunk contains the object ID it belongs to, linking chunks with objects to determine their allocated status is fairly simple.
Possible Future Work
- Improve the spare area format detection
- Add parameters to allow a user to set the spare area format and page/spare size
- Consider possible improvements to how versions are created, maybe allowing different algorithms based on parameters