Ghost Ride The (Pencil) Whip

During my undergraduate study at The University of Tulsa, I was required to take Organic Chemistry and lab as part of the core Chemical Engineering requirements. Although I enjoyed the subject, the lab portion of the class was boring and tedious, and my work tended to be sloppy and rushed.

One thing we were forced to do periodically was find the melting point of whatever solid we had precipitated out of solution that day. I remember this task distinctly because it was always the last thing we had to do before the lab was finished. The first time we had to do this I rushed the test, turning up my Bunsen burner as high as I could to get out of the lab and on to something more interesting. My melting point came out way lower than the literature value for the pure component, so I had to perform the test again to make sure there wasn’t some contaminant in my precipitate that lowered the result. What I discovered was that my mercury thermometer’s reading couldn’t quite keep up with the actual temperature of the material when heated so quickly, and there was still some lag in the reading even when I heated the precipitate at a “normal” rate.

Following this discovery, I began performing this test more slowly to make sure my numbers were correct… Just kidding, I started making my melting points a little bit less than what the textbook said they should be and called it a day. I was a freshman in college and had more interesting things to do than spend an extra 30 seconds watching mercury rise. I guess the moral of that story, other than “Cory was a terrible O-Chem lab student”, is that if you tell someone to perform a test and all they need to document is a single, relatively predictable number, you can expect that number might be made up.

It seems strange now that my terrible lab work would teach me some of my best lessons. If I was going to make up or massage lab numbers, I was going to give good numbers, or at least believable ones. I figured out that different types of analysis have different biases, different ways they naturally skew data. I knew straightaway that I couldn’t make my numbers too perfect, but they also couldn’t be messed up in a way the test method never would have produced. In the real world, this knowledge is far more valuable than most of the things you officially ‘learn’ in lab, because as it turns out, test results get made up constantly.

Back in 2014 I was just starting my stint as an Offshore Process Engineer for an FPSO in Brazil. One of my first orders of business was chasing down strange results we were having with a field test that consistently produced different results from an onshore lab, and I decided to make my own standard to verify the device we were using was accurate.

The results of the field test were in parts per million (ppm), and the calibrated range of the device should have gone up to at least 100 ppm. However, when I tried to make my own 100 ppm standard the device read it as 350 ppm. I made two more 100 ppm standards and tested them across all of the available field testing devices, and got readings ranging from 340 to 380 ppm. I may not have been a great chemistry lab student, but I’m pretty sure I can pipette a standard volume of fluid without being off by several hundred percent. At least, I never had pipetting problems in the past, but now I was starting to wonder whether I had skipped the class where they explained how pipettes will secretly suck up some random quantity of liquid unless you know the magic word[i].

OK, enough pipette talk. Creating a standard wasn’t the first troubleshooting method attempted. A colleague of mine, who was convinced the issue was with the onshore lab, had left on my desk four signed and dated calibration certificates from a licensed third-party contractor showing that all four devices I had tested were nearly perfect. Each certificate showed how the device read the standards created by the contractor, who presumably hadn’t been absent from the aforementioned pipette awareness day in college:

Prepared Standard (ppm)   Device 1 Reading   Device 2 Reading   Device 3 Reading   Device 4 Reading
0                         0                  0                  0                  0.1
10                        9.7                10.2               10                 10.4
50                        49.4               50.3               50.5               50.7
100                       99.6               100                101                99.9


Drawing upon my experience as a terrible lab student, I immediately knew something was wrong, or we had gotten some sort of magical chemistry wizard out to prepare the most accurate standards I had ever seen. I refused to believe this guy prepared a bunch of standards so precisely that he got each of the devices to read within about 1% of where they should be while I seemed to be hitting everything but the lottery with my results. When I wrote to the vendor support site, even the president of the company that manufactured the equipment felt the need to chime in and note that they could not reproduce in their own lab the kind of numbers the contractor had provided. The certificates were almost certainly BS.

But just to be certain… I had our certified third-party calibration expert come back out and watched over his shoulder as he got the devices all calibrated and reading correctly, this time within a +/- 15% range. As suspected, all of the previously calibrated devices were off by a factor of about 3.5, and my ability to work a pipette was validated. I distinctly remember being more upset that the guy wasn’t competent enough to make up reasonable numbers than I was that he didn’t actually do any work. The latter I had already become accustomed to expecting in Brazil; the former would take me another few months to get used to as I had more run-ins with what they call the Jeitinho Brasileiro.

A few months later I ran into an issue with results coming from the same device. To be fair, the problem wasn’t so much that the device itself was giving bad readings as that the people responsible for collecting data seemed to disappear for long periods of time without performing any tests. Since I had thousands of data points from them, I decided I might as well run some sort of randomization analysis on their results to see if anything was amiss[ii].

Randomization analysis is tricky. It’s not often that you can say you have data that should be truly random. Even cherry-picking a random decimal point in a dataset might not work if the results cluster or there are too few significant digits in the data. For example, imagine a test reporting out to the tenths digit numbers that typically fell around 1.9 to 2.3, but never fell below 1.8. This would cause the results to bunch around the minimum, over-representing numbers close to it. Even the tenths digit would be a poor choice of a random number, as this would be skewed towards the digits that appeared in the most common results, 1.9, 2.0, 2.1, 2.2, etc., while I would also expect a relative dearth of threes, fours, fives, sixes, and sevens if the result rarely rose much past 2. However, if you were to examine a test where the results were spread between 15 and 100 (increasing your number of significant digits to 3 in the process), those tenths-digit numbers might start to look very random.
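To see the difference for yourself, here is a rough simulation rather than real data. The two result distributions are assumptions picked to mimic the examples above: a bell curve around 2.1 with a floor at 1.8, and an even spread between 15 and 100. The function names, sample size, and seed are all made up for illustration.

```python
import random
from collections import Counter


def tenths_digit(value: float) -> int:
    """Return the tenths digit of a value recorded to one decimal place."""
    return round(value * 10) % 10


def digit_shares(values):
    """Fraction of values whose tenths digit is 0, 1, ..., 9."""
    counts = Counter(tenths_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(10)]


rng = random.Random(2014)  # fixed seed so the sketch is repeatable
n = 10_000

# Assumed "narrow" test: results bunch around 2.1 and never fall below 1.8.
narrow = [round(max(1.8, rng.gauss(2.1, 0.15)), 1) for _ in range(n)]

# Assumed "wide" test: results spread evenly between 15 and 100.
wide = [round(rng.uniform(15, 100), 1) for _ in range(n)]

narrow_shares = digit_shares(narrow)
wide_shares = digit_shares(wide)

for d in range(10):
    print(f"{d}: narrow {narrow_shares[d]:.1%}   wide {wide_shares[d]:.1%}")
```

In the narrow case the tenths digit mostly tells you where the results cluster, while in the wide case each digit should land somewhere near a tenth of the total.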

Fortunately, these happened to be the type of results I recently had to review. They ranged from the mid-teens to over a hundred, with no clear low-end asymptote to skew that last tenths digit one way or another. Perhaps if some of the operators performing the test rounded off results to the nearest whole number I would expect to see an excess of zeroes, but other than that I couldn’t think of a reason that tenths digit wouldn’t appear to be random. Most importantly, I had been diligently recording all of these results into an Excel spreadsheet every day for over a year, and everyone knows that you can’t have more than a hundred numbers in any given spreadsheet without some sort of dubious statistical analysis being performed on it[iii].

I took all of the results and deleted all of the times where no reading was taken (as these blanks would register as 0.0 and throw off the analysis), then used Excel’s MOD function to get the remainder left over when you divide each result by 1. Multiplying this by 10 gave me a neat set of whole numbers between 0 and 9. I used the Excel data analysis add-in to run a histogram on these numbers and found that out of 7358 readings the tenths digit was zero only 531 times. Using Excel’s binomial distribution function, I calculated the odds that a random sample of 7358 digits between 0 and 9 would contain 531 or fewer zeroes. The odds of that happening naturally in a set without any inherent bias are, to put it lightly, low.
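For anyone who would rather not do this in a spreadsheet, the same steps can be sketched in Python. This is not the original workbook: the function names and the handful of sample readings are hypothetical, and it assumes every reading was recorded to exactly one decimal place. In Python at least, taking the remainder first can leave you with 3.9999... instead of 4 because of floating point, so this sketch rounds to a whole number of tenths before taking the last digit.

```python
from collections import Counter
from math import comb


def tenths_digit(reading: float) -> int:
    """Tenths digit of a reading recorded to one decimal place."""
    return round(reading * 10) % 10


def digit_histogram(readings):
    """Count tenths digits, skipping blanks (None) instead of treating them as 0.0."""
    return Counter(tenths_digit(r) for r in readings if r is not None)


def binomial_cdf(k: int, n: int, p_num: int = 1, p_den: int = 10) -> float:
    """P(X <= k) for X ~ Binomial(n, p_num / p_den).

    The sum is done in exact integer arithmetic and only converted to a
    float at the end, so very small tail probabilities do not underflow
    or lose precision along the way.
    """
    q_num = p_den - p_num
    total = sum(comb(n, i) * p_num**i * q_num ** (n - i) for i in range(k + 1))
    return total / p_den**n


# Hypothetical sample readings, with one blank where no test was run.
sample = [23.4, None, 17.0, 99.9, 45.0]
print(digit_histogram(sample))

# Chance of 531 or fewer zeroes among 7358 digits if each digit is equally likely.
p_few_zeroes = binomial_cdf(531, 7358)
print(f"{p_few_zeroes:.3e}")
```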

This is the part where I want to caution that an unlikely outcome does not necessarily mean foul play was involved. For instance, you would only have a 12.5% chance of winning a coinflip three times in a row, but I wouldn’t automatically call you a liar or a cheat if this happened. However, the odds of the results containing this few zeroes are the same as the odds of winning that coinflip 56 times in a row. As they like to say at the coinflip table in Vegas: “Fool me 55 times, shame on you, fool me 56 times, shame on me.” To make matters worse, if any real results were rounded to the nearest whole number, the number of zeroes that showed up by chance would be even lower. Also, it just so happens that when human beings try to create random numbers, they generally select 0 a lot less frequently than other numbers[iv].
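The coinflip comparison is just a change of units: a probability p is as unlikely as winning a fair coinflip -log2(p) times in a row. A minimal sketch, with a function name of my own choosing:

```python
from math import log2


def equivalent_coin_flips(probability: float) -> float:
    """Number of consecutive fair-coin wins that is equally unlikely."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -log2(probability)


print(equivalent_coin_flips(0.125))  # 3.0, the three-in-a-row example
```

Feeding in the tail probability from whatever binomial calculation you trust gives the matching number of flips.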

Digit Frequency
0 531
1 843
2 873
3 661
4 818
5 705
6 658
7 795
8 750
9 724
Total 7358
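The table can also be checked digit by digit. The sketch below is a companion to the zero-only binomial check, not the analysis I actually ran: it compares every digit against the 735.8 you would expect if all ten digits were equally likely, and adds up a chi-square goodness-of-fit statistic, which is a technique swapped in here for illustration. The variable names are hypothetical; the counts are the ones in the table.

```python
from math import sqrt

# Tenths-digit counts copied from the table above.
observed = {0: 531, 1: 843, 2: 873, 3: 661, 4: 818,
            5: 705, 6: 658, 7: 795, 8: 750, 9: 724}

total = sum(observed.values())
expected = total / 10                 # equal share for each digit
std_dev = sqrt(total * 0.1 * 0.9)     # binomial standard deviation for one digit

# How many standard deviations each digit sits from its expected count.
z_scores = {d: (count - expected) / std_dev for d, count in observed.items()}

# Chi-square goodness-of-fit statistic against a uniform distribution (9 degrees of freedom).
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())

for d in range(10):
    print(f"{d}: {observed[d]:4d}  deviation {observed[d] - expected:+7.1f}  z {z_scores[d]:+.2f}")
print(f"chi-square = {chi_square:.1f}")
```

For reference, standard chi-square tables put the 0.1% critical value for 9 degrees of freedom at about 27.9, so a statistic well above that says the digits as a whole are not behaving like random ones.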


So there are approximately 200 zeroes missing, plus or minus maybe 30, since there is about an 8% chance you would get fewer than 700 zeroes, which is back within the realm of possibility. A tenth of honest readings should end in zero, so on average someone would have to make up 10 readings without a zero for one zero to go missing, and 2000 readings for 200 zeroes to go missing. And that assumes they never make up readings with round numbers. If they make up readings that are whole numbers 5% of the time instead of never, then about 4000 of the readings are fabricated, as it would take 20 bad readings to get rid of a zero. And of course, if any of those guys ever round off their real numbers, boosting the number of zeroes artificially, the situation could be even worse than this indicates.
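The back-of-the-envelope math above can be written as a small function. It assumes honest readings end in zero 10% of the time and fabricated ones end in zero at some lower rate, then solves for how many fabricated readings would explain the shortfall. The function and parameter names are hypothetical, and it uses the exact shortfall of 204.8 rather than my rounded 200, so the answers come out slightly higher than 2000 and 4000.

```python
def estimated_fabricated(total_readings: int,
                         observed_zeroes: int,
                         fake_zero_rate: float = 0.0,
                         true_zero_rate: float = 0.1) -> float:
    """Estimate how many readings were made up from the shortfall in zeroes.

    Assumes observed_zeroes = true_zero_rate * honest + fake_zero_rate * fabricated,
    with honest + fabricated = total_readings.
    """
    if fake_zero_rate >= true_zero_rate:
        raise ValueError("fabricated readings must avoid zero more than honest ones")
    missing_zeroes = true_zero_rate * total_readings - observed_zeroes
    return missing_zeroes / (true_zero_rate - fake_zero_rate)


# Fabricators who never write a zero in the tenths place.
print(round(estimated_fabricated(7358, 531)))

# Fabricators who write a whole number 5% of the time.
print(round(estimated_fabricated(7358, 531, fake_zero_rate=0.05)))
```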

For me, the moral here is that if the number looks strange, investigate the number first. This doesn’t just apply to numbers supplied by humans either; transmitters drift out of calibration or break for other reasons all the time[v]. I have found data scrambled in so many ways that it would be impossible to remember them all. However, what I do know is that it’s much easier to troubleshoot why a number is wrong than to scour your process in vain searching for explanations for garbage data.

[i] Skittles

[ii] This is apparently not everybody’s first response to a problem like this. In fact, my former rotating alternate had this to say about my analysis: “Wow, you really went full Beautiful Mind on that. I am surprised that I did not go into our cabin to find a bunch of newspaper articles cut out with pins and red string connecting them. Or photos of people with the eyes cut out. I think that was a different movie.”

[iii] Cough, six sigma, cough.

[iv] I know I read this somewhere, but I can’t remember where. It makes sense to me though. I can certainly imagine I would instinctively put a non-zero number at the end of my bogus data if my goal was to create random-looking numbers, or at least numbers that didn’t seem made up.

[v] Seems odd to leave a footnote a sentence before the end of the article, but seems wrong to leave out one of my favorite bad data troubleshooting stories entirely. Two machines testing a refined product specification in a refinery had never disagreed until, one day, they did… repeatedly. Nobody could figure out why, but they could pinpoint that the machines started disagreeing after the specification in question was changed from -40 degrees to -50 or so. Eventually somebody found out that the first machine was configured to read Celsius, while the second one read Fahrenheit. I guess that just goes to show that there’s nothing like the Jet A cloud point specification to bring together the English and SI systems of measurement.
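The reason nobody noticed sooner is that -40 is the one temperature where the two scales agree. A quick check, using the standard conversion formula and a function name of my own:

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard Celsius to Fahrenheit conversion."""
    return celsius * 9 / 5 + 32


print(celsius_to_fahrenheit(-40))  # -40.0, so both machines show the same number
print(celsius_to_fahrenheit(-50))  # -58.0, so the readings split apart
```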