Over the past couple of months I've been writing a lot about the progress of Lightroom over the 8 years it's been around, and over that time I've been getting into increasingly technical details. This time I want to talk, and to be completely honest rant a little, about one very technical point of the Lightroom catalog and how Adobe stores data.
Let me also be up front about this: I don't work for Adobe, nor have I ever. I don't know precisely why they made the decisions they made, and I'd like to believe they had good performance-driven reasons to do what they did. At the same time, it's entirely possible they did it because it was faster to write or saved them a couple of lines of code, and they didn't anticipate that people would end up with heavily fragmented multi-GB catalogs, or assumed that any problems would be masked by better hardware down the road.
In a previous post I lamented how the unbounded growth of the preview cache folder can be problematic when working on systems with relatively small storage limits, or where you intend to push the storage to its limits. Inefficiency and unbounded growth seem to be the rule rather than the exception when it comes to the inner workings of Lightroom. This time I'm going to talk a little about the develop history mechanism: specifically, how Adobe stores the data, and how that data grows and expands the catalog size tremendously.
The Premise of Lightroom’s Develop History
In my last post I touched on how Lightroom stores the develop settings for an image as structured data in the catalog, as opposed to altering the pixels in the source image. Space-efficient virtual copies are one extension of this mechanism. Another is the ability to persistently save an undo history of all develop steps, to an extent and with a persistence that's not really possible in a bitmap editor.
Where a bitmap editor, like Photoshop, has to store every pixel value that's changed to be able to undo those changes, the vast majority of Lightroom's adjustments can be stored in a handful of bytes. This enables Lightroom to be relatively aggressive in creating history states and persisting them for the life of the image in the catalog.
All told, between editing an image and exporting or publishing it, it's not uncommon to end up with many saved history states for any given image. In my library I average around 15 states per image; however, because I don't process every one of my images, that actually means I have a lot of history states spread over comparatively few images.
Investigating History States
Unlike the rest of this Lightroom series, this post actually started as a specific investigation into a problem.
One of my friends has been using Lightroom almost as long as I have. Unlike me, though, he shoots a lot more than I do and his catalog is much larger than mine. One day he called me up and asked me about a strange error he was getting when he started Lightroom; something about not being able to connect to something. If he let Lightroom load for a while, the error would go away and Lightroom would work normally. Moreover, the error wouldn't always appear on subsequent restarts.
Not having seen the actual error, my first guess was that it was maybe something to do with Lightroom trying to connect to Adobe's servers. The next time I was at his place, I took a look at it and started digging into what could be causing it. I quickly determined that the error had nothing to do with internet connectivity; it was something internal to Lightroom. For some reason my mind immediately jumped to the SQLite subsystem that is the core of the Lightroom catalog file.
The first place I looked was his catalog file; it was approaching 4 GB, and a quick look with a defragmenting tool showed that it was heavily fragmented.
While fragmentation has become much less of a concern with modern filesystems, like NTFS, it can still be a problem if a file is extremely fragmented. In this case, there was a 4 GB file that didn't have a single fragment bigger than a few tens of MB. That level of fragmentation, especially paired with spinning disks, created a significant decrease in disk performance and therefore increased loading time, which was what was ultimately causing the error that the server wasn't responding.
I did a poor man’s defragmentation on the file by moving it to another drive and back. As an aside, it’s necessary to move the file off the file system (or copy it and remove the original) in order to get the OS to write the data out in as few fragments as possible — though be aware this method will not always work optimally.
That seemed to fix his problem, at least temporarily. But it also got me looking at the Lightroom catalog file.
He has quite a few more images than I do, but at the time he didn't have 4 times as many, yet his catalog was 4 times larger than mine. At the same time, I have titles and captions on a lot of my images, and an extensive keyword library. While I wouldn't expect the keywords to take up too much space, the titles and captions are all unique and each several hundred bytes long. None of that explains why his catalog was disproportionately bigger than mine with less metadata involved. This suggested to me that there might be something up with what Adobe is storing in the catalog.
This is where that core open source technology SQLite comes in. Since the catalog isn’t a proprietary format, it’s possible to examine it (or potentially repair it) with readily available tools.
Step one: dump the table sizes
My first plan of attack was to look at how big the tables were to see if there were any obvious problem spots.
SQLite provides a utility, sqlite3_analyzer, that will generate a whole slew of statistics for an SQLite file, including table sizes, utilization, storage efficiency, and so forth.
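Running it is a single command pointed at the catalog file; the filename here is just an example, and it's worth running it against a copy rather than the live catalog:

sqlite3_analyzer "Lightroom Catalog.lrcat" > catalog-analysis.txt

The relevant chunk of the report for my catalog looked like this.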
*** Page counts for all tables with their indices *****************************
ADOBE_LIBRARYIMAGEDEVELOPHISTORYSTEP.............. 126636 50.5%
ADOBE_ADDITIONALMETADATA.......................... 67200 26.8%
ADOBE_IMAGEDEVELOPSETTINGS........................ 25920 10.3%
ADOBE_IMAGEDEVELOPBEFORESETTINGS.................. 7420 3.0%
AGMETADATASEARCHINDEX............................. 4136 1.6%
ADOBE_IMAGES...................................... 3767 1.5%
AGLIBRARYFILE..................................... 3356 1.3%
If the table containing the develop settings was consuming 10.3%, why would the develop history table be 50.5% of my catalog file?
Sure, there should be more history states than current develop settings, but 5 times as much data stored? In absolute terms, that's more than 400 MB of history data.
In any event, shouldn’t Adobe be storing develop settings as efficiently as possible?
So what does the history state table look like?
Running .schema Adobe_libraryImageDevelopHistoryStep in the sqlite3 client returns the following.
CREATE TABLE Adobe_libraryImageDevelopHistoryStep (
id_local INTEGER PRIMARY KEY,
id_global UNIQUE NOT NULL,
dateCreated,
digest,
hasDevelopAdjustments,
image INTEGER,
name,
relValueString,
text,
valueString
);
CREATE INDEX index_Adobe_libraryImageDevelopHistoryStep_imageDateCreated ON
Adobe_libraryImageDevelopHistoryStep( image, dateCreated );
That's certainly not what I expected. I expected to see a whole slew of columns, one for each develop setting that needs to be stored. Maybe that was a naive view on my part.
Okay, let's pull a row from the table and see what's actually being stored.
select * from Adobe_libraryImageDevelopHistoryStep limit 1;
id_local|id_global|dateCreated|digest|hasDevelopAdjustments|image|name|
relValueString|text|valueString
928|21A0EDF0-3FF9-4503-B1BB-986330914768|465813406.266058|
b05afdbad359c8337b9bb6e663ca8aec|-1.0|916|
Import (10/6/15 04:36:46)||s = { AutoGrayscaleMix = true,
AutoLateralCA = 0,
AutoWhiteVersion = 134348800,
Blacks2012 = 0,
Brightness = 50,
CameraProfile = "Adobe Standard",
CameraProfileDigest = "BA45C872F6A5D11497D00CBA08D5783F",
Clarity2012 = 0,
ColorNoiseReduction = 25,
Contrast = 25,
Contrast2012 = 0,
ConvertToGrayscale = false,
DefringeGreenAmount = 0,
DefringeGreenHueHi = 60,
DefringeGreenHueLo = 40,
DefringePurpleAmount = 0,
DefringePurpleHueHi = 70,
DefringePurpleHueLo = 30,
Exposure = 0,
Exposure2012 = 0,
GrainSize = 25,
Highlights2012 = 0,
LensManualDistortionAmount = 0,
LensProfileEnable = 1,
LensProfileSetup = "LensDefaults",
LuminanceSmoothing = 10,
PerspectiveHorizontal = 0,
PerspectiveRotate = 0,
PerspectiveScale = 100,
PerspectiveVertical = 0,
ProcessVersion = "6.7",
RedEyeInfo = { },
RetouchInfo = { },
Shadows = 5,
Shadows2012 = 0,
SharpenDetail = 30,
SharpenEdgeMasking = 0,
SharpenRadius = 1,
Sharpness = 50,
ToneCurve = { 0,
0,
32,
22,
64,
56,
128,
128,
192,
196,
255,
255 },
ToneCurveBlue = { 0,
0,
255,
255 },
ToneCurveGreen = { 0,
0,
255,
255 },
ToneCurveName = "Medium Contrast",
ToneCurveName2012 = "Linear",
ToneCurvePV2012 = { 0,
0,
255,
255 },
ToneCurvePV2012Blue = { 0,
0,
255,
255 },
ToneCurvePV2012Green = { 0,
0,
255,
255 },
ToneCurvePV2012Red = { 0,
0,
255,
255 },
ToneCurveRed = { 0,
0,
255,
255 },
Version = "9.2",
WhiteBalance = "As Shot",
Whites2012 = 0 }
First reaction: WTF?
A Quick Primer on Datatypes and Storage
In a computer, all data is stored in a binary format; this is the whole ones and zeros thing. However, the meaning and arrangement of those ones and zeros, and therefore what they ultimately represent, varies with the data's type.
Broadly speaking, there are 3 types of data that computers deal with. Integer types store whole numbers (e.g. -5, 0, 1, 1000), and do so efficiently and in a format that virtually all CPUs can process natively. Floating point numbers store a representation of decimal or fractional data (e.g. 35.4 or -10,005.35); like integers, most floating point numbers are stored in a standard format that can also be processed natively by CPUs. Finally, strings store the representation of text as a sequence of integers that correspond to characters in a table.
Each of those types offers various pros and cons.
For example, an 8-bit/1-byte integer can store 2^8, or 256, values. If the type is signed, meaning it can represent negative and positive numbers, those values range from -128 to 127. If the type is unsigned, it can store values from 0 to 255. Integers are always exactly what they represent, and the math on them is done in an exactly predictable way.
Floating point numbers are much more complex, and I’m not going to get into the details of them, but if you’re interested Wikipedia has a reasonable article about floating point numbers. Floats can represent huge ranges of numbers, but they are approximations due to the binary conversion, and these errors lead to some small levels of imprecision in calculations.
The final type I called text. Text is stored as a sequence of integers, but unlike integer values, the integers storing text represent characters rather than numbers directly. Unlike integers or floats, a number stored as text is not something the computer can directly access and manipulate in hardware; it has to have software tell it how to translate the human-assigned meanings into something it can process. Moreover, the amount of storage required to store a number as text depends on the number of digits in the number. And this is the critical point with respect to what's going on here in Lightroom.
For example, consider how one might store a value that ranges from 0 to 100 (like many of Lightroom's sliders cover). To store this as an integer, only 101 values are needed, which is easily covered by the 256 possible values available in a single-byte integer (1 byte = 8 bits = 2^8 = 256 options). On the other hand, if this is stored as text, it could use 1, 2, or 3 characters, and since each character requires at least 1 byte of storage, it could take as much as 3 bytes of memory to store the value.
In binary, an unsigned integer representing a value of 50 would read as 00110010. However, the textual representation ‘50’ would be two characters 5 and 0, which translate to ASCII values of 53 and 48, which in binary would be 00110101 00110000.
Now consider adding 5 to that 50. If the 50 is in binary, the computer can just add 5 (00000101) to the existing 00110010 directly; the addition works by basically the same rules you learned in elementary school, only you carry when you sum up to 2 instead of 10. The processor gets 00110111, which is 55 in binary.
On the other hand if you were trying to do this with the string representation, first some code would have to be called that understood how to convert the two characters into a computer usable number (and the 5 if that’s a string too). Then it would have to do the same math as done for the native computer usable type. Then if you wanted the data back as a string, convert the 55 back to two characters.
Which brings up a second aspect. Going from text to an integer — which is what the computer fundamentally requires to process data — requires more processing than if the computer already has the numbers in a format it can deal with.
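To put some rough numbers on that, here's a quick Python sketch comparing how many bytes the two representations need; the exact encoding varies by system and database, so treat these as illustrative rather than definitive:

import struct

value = 50

# As a 1-byte unsigned integer: always exactly one byte.
print(len(struct.pack("B", value)))       # 1

# As ASCII text: one byte per digit.
print(len(str(value).encode("ascii")))    # 2, for '50'
print(len(str(100).encode("ascii")))      # 3, for '100'

# Adding 5 to the integer is a single native operation...
print(value + 5)                          # 55
# ...while the text form has to be parsed back into a number first.
print(int("50") + 5)                      # 55, after a conversion step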
Serialized Data in a Database
Before I started looking at the library in SQLite, I had a mental model that each of the develop settings would be stored in its optimal form as an integer, float, or string, leveraging the database's ability to relate values to fields to ascribe their meaning.
SQLite, like most SQL databases, has no significant problem doing this. Each column is assigned a logical meaning via a name, and the data stored in that column is implicitly understood to be that information by the programmers using it. Keep in mind, all of this is logical and for the benefit of the programmer; the name need not be meaningful, but it sure makes life a lot easier if it is.
Admittedly there are limits, but by default SQLite supports tables with up to 2000 columns, and can be configured to support up to 32,767 columns though it’s not recommended. Lightroom currently has about 60–65 develop settings that need to be kept track of, which is well below the 2000 column limit let alone the maximum limits.
Instead, Adobe is doing something different.
What Adobe is doing is serializing a Lua object and storing the resulting string in the database. Serialization does have some advantages; the primary one is that it allows for simpler interoperability.
Using serialization isn't entirely unreasonable. One of the primary functions of serialization is to convert an object in the computer's memory into a format that can be stored on disk without having to worry about specific formatting and implementation details. This is admittedly what Adobe is doing: saving the state of some object in Lightroom's memory into the database.
However, serialization has limitations. One of those is that the serialized string form of an object generally will take up more space than the object did in memory. Some of that is the inherent inefficiency of converting native binary types into string representations of their value.
A second source of inefficiency is the addition of the characters that are needed to delineate and organize the representation of data. These are the curly braces, equals signs, and commas.
The Lua serializer that Adobe is using goes a step further and adds human-readable flourishes to the data in the form of spaces and new lines. Fields are already separated by commas, and the field name and value are delineated by equals signs. There's precious little reason to further pretty up the formatting for human readability. In fact, removing the spaces and new line characters reduces the size of the string by about 20%.
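That 20% figure is easy enough to sanity check. A rough Python sketch along these lines, run against a copy of the catalog, does the trick; the catalog filename is an example, and the text column name comes from the schema shown earlier. It strips only the newlines, the indentation that follows them, and the spaces around the equals signs, so quoted values like "Adobe Standard" are left alone:

import re
import sqlite3

# Work on a copy of the catalog, not the live file.
conn = sqlite3.connect("Lightroom Catalog copy.lrcat")
row = conn.execute(
    "SELECT text FROM Adobe_libraryImageDevelopHistoryStep "
    "WHERE text IS NOT NULL LIMIT 1").fetchone()
serialized = row[0]

# Strip the cosmetic formatting: newlines, trailing indentation,
# and the spaces around the equals signs.
minified = re.sub(r"\n\s*", "", serialized).replace(" = ", "=")

print(len(serialized), len(minified),
      round(100 * (1 - len(minified) / len(serialized))), "% smaller")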
Probably the biggest source of inefficiency though comes from having to define labels so the software reading the serialized data knows what it’s reading. Those are all the human readable labels in the earlier sample.
In a database, those labels are defined essentially once, and every time you store another row of data, the storage efficiency of those labels improves. With serialized data, on the other hand, they have to be repeated every time you generate the serialized representation. Instead of a situation where the more you store, the less relative waste you have, the waste increases at the same rate as the storage does.
Database Efficiency
I wanted to test the theory that the database would be considerably more efficient if the individual settings were broken out into their own columns.
SQLite is a little different from most SQL database engines in how it stores data. Each column doesn’t need to have a size and type parameter specified for it — though you can as a hint. SQLite uses dynamic typing to store data in what it determines is the most efficient manner.
For example, integer types are stored using 1, 2, 3, 4, 6, or 8 bytes, depending on the value that’s sent to the SQLite engine. Store 50 and SQLite will use 1 byte. Store 200,000 and it’ll use 3 bytes.
I threw together a test program in Python to create two databases. One was loosely modeled after the table in Lightroom, though simplified to only have an index and the string column or the serialized text. The second database utilized separate columns for each of the 61 aspects that were stored in the test develop settings. I then inserted the data 10,000 times in each of the two tables.
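A stripped-down sketch of that test looks something like the following. The real script used all 61 settings and the actual serialized blob pulled from my catalog; here the settings list is trimmed and the table and column names are only illustrative, but the shape of the comparison is the same:

import os
import sqlite3

# A trimmed-down stand-in for a develop settings snapshot.
settings = {
    "Blacks2012": 0, "Contrast2012": 0, "Exposure2012": 0.0,
    "Highlights2012": 0, "Shadows2012": 0, "Sharpness": 50,
    "WhiteBalance": "As Shot", "CameraProfile": "Adobe Standard",
}

# The serialized form, mimicking the Lua-style "s = { ... }" string.
serialized = "s = { " + ",\n".join(
    f"{name} = {value!r}" for name, value in settings.items()) + " }"

# Database 1: a single TEXT column holding the serialized string.
db1 = sqlite3.connect("serialized_test.db")
db1.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, text TEXT)")
db1.executemany("INSERT INTO history (text) VALUES (?)",
                [(serialized,)] * 10000)
db1.commit()

# Database 2: one column per setting.
columns = ", ".join(settings)
placeholders = ", ".join("?" * len(settings))
db2 = sqlite3.connect("columnar_test.db")
db2.execute(f"CREATE TABLE history (id INTEGER PRIMARY KEY, {columns})")
db2.executemany(f"INSERT INTO history ({columns}) VALUES ({placeholders})",
                [tuple(settings.values())] * 10000)
db2.commit()

print(os.path.getsize("serialized_test.db"),
      os.path.getsize("columnar_test.db"))

With only a handful of settings the absolute numbers will differ from my full test, but the gap between the two files is obvious either way.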
Not surprisingly, the serialized string test database was much bigger than the columnar database; 20 MB versus 5 MB.
In my opinion a 75% smaller catalog is not something to sneeze at. For my more-than-40,000-image catalog, the reduction in the history steps table alone would shave 365 MB off my catalog file while retaining all the history. For my friend's 4 GB catalog, of which 60% is history steps, reducing the storage requirements by 75% would shave 1.8 GB of disk space from his catalog.
Alternatively, I could delete all the image history in my catalog and free up about 450 MB. But in doing so I lose all that history information.
And keep in mind, the problem here isn't that the catalog is big and disk space is in demand; it's that the catalog has to be loaded into memory to be processed, and the larger the catalog file is, and especially the more fragmented it is, the longer that takes. That's when you potentially get the problem I talked about at the start of this post.
As a secondary note, I was also curious about the performance of the serialized table versus the columnar one. As such, I timed 10,000 insertions into each to see what kind of performance I was getting.
At least with Python’s implementation for SQLite, there was functionally no difference between a single big insert as a serialized string, and an insert with 61 separate parameters. The minor variations I did see, on the order of 1 microsecond, are within the realm of experimental error.
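For reference, the timing itself was nothing more sophisticated than wrapping the inserts in a timer, roughly like this (an in-memory database and made-up columns, just to show the method):

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, c)")
rows = [(1, 2.5, "text")] * 10000

start = time.perf_counter()
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
conn.commit()
elapsed = time.perf_counter() - start

print(elapsed / len(rows), "seconds per insert on average")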
I should also point out that the history step table isn't the only place where Adobe serializes develop setting data into the catalog. The Adobe_ImageDevelopSettings table uses the same structure, and that accounts for a further 10% of my catalog size.
Storing Positional Data
There’s a second aspect to the history and develop data that I haven’t touched on but it’s also a major contributing factor to storage consumption by history steps and overall database efficiency. There are a number of places in Lightroom where the engine has to store x,y type positional information.
One example, and one that can be seen in the sample data earlier, is the tone curve. X and Y values are stored as sequential integers from 0 to 255. A straight linear mapping is stored as {0, 0, 255, 255}, and a curved mapping would be stored with more points, such as {0, 0, 64, 60, 128, 140, 255, 255}.
Beyond the curve, there are several tools that store image-positional information: the spot healing tool, the gradient tools, and the local adjustment brush. These all store sets of x,y coordinates along with brush size and other parameters defining the affected areas.
PaintBasedCorrections = { { CorrectionActive = true,
CorrectionAmount = 1,
CorrectionID = "CFD53BB4-F91E-4616-BFDE-ECE323554311",
CorrectionMasks = {
{ CenterWeight = 0,
Dabs = { "d 0.846878 0.382692",
"d 0.821662 0.399986",
"d 0.795618 0.408876",
"d 0.769187 0.411538",
"d 0.743287 0.422603",
"d 0.717651 0.436220",
"d 0.692825 0.456801",
"d 0.666778 0.461538",
"d 0.640369 0.465035",
"d 0.614742 0.477943",
"d 0.593083 0.511154",
"d 0.568074 0.528500",
"d 0.548332 0.559867" },
Flow = 1,
MaskID = "39B9AC8F-BB01-4241-B67B-26AB767B356B",
MaskValue = 1,
Radius = 0.088226,
What = "Mask/Paint"
},
{ CenterWeight = 0,
Dabs = { "d 0.765612 0.659615",
"d 0.791546 0.647717",
"d 0.817480 0.635819",
"d 0.843414 0.623920" },
Flow = 1,
…
Note: I’ve indented the above excerpt to better show the structure, the spaces/tabs are not stored in the catalog.
As an aside, Lightroom's painting is image-size agnostic: the coordinates are saved as decimal fractions of the image dimensions (between 0 and 1) with 6 decimal places of precision. This is especially handy as it enables Lightroom to use the same position data at any export resolution without multiple transformations.
Using 6 places of precision also shouldn't be a problem any time soon, as it allows 1-pixel accuracy in images where the long edge is up to 100,000 pixels. For a 3:2 aspect ratio image, that would be a 6.66 gigapixel image. For panos with high aspect ratios, a 100,000-pixel width corresponds to substantially lower overall resolution, so the limit might be approachable now or in the near future. However, 1-pixel accuracy is probably not necessary for most masks anyway, given the coarseness of the brushes and the sizes of the images where it starts breaking down.
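To make the size-agnostic point concrete, mapping a stored dab back to pixels is just a multiplication. Here's a quick sketch using the first dab from the excerpt above, assuming the x value is relative to the image width and the y value to the height; the image dimensions are hypothetical:

# First dab from the sample above: "d 0.846878 0.382692"
x_frac, y_frac = 0.846878, 0.382692

# The same stored values work at any output size.
for width, height in [(6000, 4000), (1500, 1000)]:
    print(round(x_frac * width), round(y_frac * height))
# 5081 1531 at the full size, 1270 383 at the smaller export size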
But back to the point about storage. These coordinate driven tools create significantly more data than the regular Lightroom adjustments do. Moreover, because Lightroom’s history steps are a complete snapshot of all settings at the time the history step was created, these large coordinate sets will propagate forward with subsequent history steps.
In other words, if you do any extensive work with the adjustment brush, spot healing tool, or the gradient tools (especially with the new masking capabilities), you can rapidly generate large history states that get stored in the database. In my database, I have a number of images with more than 10 MB of history data as a result of this behavior.
My recommendation is that, whenever possible, spot healing, local adjustments, and any gradient masking be saved for the end of the develop process to minimize the impact they have on ballooning catalog sizes as they propagate through the history table.
Conclusions
I'm hesitant to blast Adobe too hard for doing what they did. I'm not a Lua expert, and I don't have access to the Lightroom source code (I'm even speculating that it's mostly written in Lua). There may be very good reasons why Adobe has elected to store develop settings as serialized Lua objects.
Thinking on that, about the only viable angle that I see is that by storing big text strings Adobe can add develop capabilities without having to update the catalog’s structure with a new column to represent the setting. When Adobe added dehaze capabilities, they only had to store that value in the string, and didn’t have to update the catalog’s tables to support the new feature.
At the same time, major Lightroom revisions have seen changes to some structure in the catalog that required the catalog file to be updated. Given the low computational bar for adding an additional column to a table in SQLite (it wouldn't even require doing something expensive like copying the table data), I'm not sure that should be a significant consideration.
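For comparison, adding a column in SQLite is a single, cheap statement. A hypothetical dehaze column on the develop settings table would look something like this (Dehaze is an invented column name for illustration, not Adobe's actual field):

ALTER TABLE Adobe_imageDevelopSettings ADD COLUMN Dehaze;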
Certainly one could argue that with big modern disks, and certainly modern SSDs, the size of the Lightroom catalog isn't that big of a deal. And that's probably true; saving a couple hundred MB, or 1.8 GB in my friend's case, isn't that big of a deal when you have a 4 TB disk to store everything on.
On the other hand, as a programmer, the whole situation just looks messy to me on a number of levels. There's a lot of extra data added to the serialized output that shouldn't really be needed for it to be deserialized. Just removing the spaces and new line characters reduced the size of the serialized test block by around 20%.
Further storage savings could be had by normalizing the coordinate-based tools' settings, even if they remained stored as serialized objects. This would allow much smaller index-based references to be propagated from state to state instead of potentially multiple KB of repeated text entries.
Furthermore, at least based on my cursory observations of Lightroom's catalog design, there are a number of places where develop settings are stored redundantly. The Adobe_imageDevelopSettings table stores the current develop settings (in serialized form) and a reference to an entry in the Adobe_libraryImageDevelopHistoryStep table, which also stores the same serialized data.
There is also an Adobe_imageDevelopBeforeSettings table that appears to store yet more serialized develop settings, presumably the import defaults that get applied to the image. It, too, is filled with lots of large serialized objects stored as strings. And honestly, I'm not even sure why this table exists.
Given that all three of these tables reference the same structure of data, and in the case of the develop settings and history tables mirror identical datasets, there could be a whole lot of space savings in properly normalizing develop settings into their own table for all states, history and otherwise.
Bear in mind, again, that between them the history table, the develop settings table, and the before settings table account for nearly 64% of the data in my Lightroom catalog. Even just normalizing the develop settings and history steps should free up about 10% of the space my catalog spends storing redundant data.
Admittedly, the catalog is nowhere near the limits imposed by being an SQLite database. And there should be no added risk of data loss or corruption from having a large catalog file. The only negatives that should occur stem from slower random disk access and higher I/O loads, especially on spinning disks, and even more so when the catalog is stored at the "ends" of a spinning disk. Keep your catalog file on a fast SSD, and the size and fragmentation of the catalog shouldn't become an issue until your catalog file is absolutely huge.