Colby Cosh: Geneticists and Microsoft were eyeball to eyeball, and the geneticists blinked

I am unsure if this news item is amusing or ominous, but this week there was a small seismic event in the world of biology. A group called the HUGO Gene Nomenclature Committee, which does what it sounds like it does, published updated guidelines for naming genes. Genes, you will be thrilled to know, are given a short “symbol” and a longer descriptive name. A familiar example would be the sequence on human chromosome 17 that exposes women to a risk of breast cancer: its symbol is BRCA1, and its name used to be “breast cancer (gene) 1.” (There is also a BRCA2, “breast cancer (gene) 2.”)

Gene nomenclature, it turns out, is due for housecleaning, and so the committee in charge has created some common-sense new style rules. Symbols are now explicitly confined to Latin letters and Arabic numerals; they shouldn’t contain “G” for “gene,” because that’s silly; and they should avoid possible offensiveness, which means that a family of genes called “DOPEY” is acquiring new names.

(The original DOPEY gene, DOPEY1, expresses a “domain protein” first found in yeast. DOP1 became “DOPEY” out of laboratory whimsy. But when DOPEYs were found to be involved in cognitive deficits in humans, the joke wore out fast.)

There were other changes, but the one that got everybody’s attention and even created a minor stir on Twitter involved … Microsoft Excel, the spreadsheet software on which the world, from finance and accounting to research and engineering, is not-so-secretly run.

Excel, in the tradition of Microsoft products, is designed for the naive user. One of the helpful features designed to accommodate those users is that Excel turns every string of text that looks even slightly like a date into, well, a fully formatted date. So if you input “DEC2” the program will hasten to assume you are talking about Dec. 2, and reformat the cell automatically.

You can probably see where this is going. There is, in fact, a gene that used to be called DEC2. The “DEC” stands for “differentially expressed in chondrocytes,” and, no, I don’t have any idea what that means. The point is that old gene names like DEC2 and OCT11 and SEPT9 were causing very serious harm to scientific research.
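For the curious, the collision is easy to sketch in a few lines of Python. This is only an illustration, not Excel's actual parsing logic: the month list, the 1-to-31 day range, and the function names `looks_like_date` and `flag_symbols` are all assumptions made for the example. It simply flags symbols made of a month abbreviation followed by a plausible day number.

```python
import re

# Month abbreviations a spreadsheet might read as the start of a date.
# Assumption for illustration only; Excel's real rules are broader.
MONTH_PREFIXES = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")

# A month prefix followed directly by a one- or two-digit number, e.g. "DEC2".
_DATE_LIKE = re.compile(
    r"^(?:%s)(\d{1,2})$"
    % "|".join(sorted(MONTH_PREFIXES, key=len, reverse=True)),
    re.IGNORECASE,
)


def looks_like_date(symbol: str) -> bool:
    """Return True if the symbol reads like month-plus-day, such as DEC2."""
    match = _DATE_LIKE.match(symbol.strip())
    if match is None:
        return False
    # Only treat the number as a day of the month if it is in 1..31.
    return 1 <= int(match.group(1)) <= 31


def flag_symbols(symbols):
    """Return the symbols from the input that look like dates, in order."""
    return [s for s in symbols if looks_like_date(s)]


if __name__ == "__main__":
    print(flag_symbols(["BRCA1", "DEC2", "OCT11", "SEPT9"]))
```

Run against the names mentioned here, the sketch flags DEC2, OCT11 and SEPT9 and leaves BRCA1 alone. A check like this could be run over a gene list before it goes anywhere near a spreadsheet.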

Almost all quantitative scientists use Excel at some point in their data-slinging. Young researchers might receive advice about the dangers of Excel in scientific applications, but most probably don’t. It may be taken for granted when they become grad students that they are familiar with Excel, or that they, being clever enough to know what a chondrocyte is, can learn Excel overnight.

So the gene names that resemble dates were causing chaos. The seminal 2004 paper on this is called “Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics.” Large datasets from genetic microarrays, then just beginning to be created and exchanged, were being pushed through Excel and creating errors in output — and sometimes in statistical conclusions.

There are workarounds for this problem (and others like it!), but they are laborious, and you know what humans are like. The republic of science is large, but it is badly outnumbered by the people who use spreadsheet software to keep track of appointments, budgets and fantasy baseball leagues. Microsoft wasn’t going to change. Scientists, who weren’t giving up Excel either, had to bend. The HUGO Gene Nomenclature Committee has abolished, by rule, DEC2 and OCT11 and their confusing brethren.

Probably no one will remember this problem in 50 years’ time, and you and I may have forgotten it in 10 days. But there is something both unnerving and majestic about it. Computer spreadsheets are so ubiquitous that it is hard to imagine, though true, that they have a single substantial inventor, Dan Bricklin, who is still alive and working: he turns 70 next year.

He thought of the original spreadsheet program, VisiCalc, while auditing a business class, and it was small businessmen wanting the advantages of VisiCalc who were responsible for the first takeoff of Apple Inc. The old joke is that Apple got huge selling $2,000 computers to people who wanted to run one program costing $100. When the company making VisiCalc got sick of the joke and hiked the price, competitors appeared, and VisiCalc was succeeded as the market leader by Lotus 1-2-3 and then Excel. But most people still use spreadsheets in exactly the way a bodega manager might have used VisiCalc in 1981.

Probably nobody ever thought that this would have anything to do with research into the human genome: either that spreadsheets would be essential to everyday work on human genetics, or that the obnoxious habits of one particular model of spreadsheet software would impede that work.

Computer specialists, across all echelons from holders of academic chairs to the Postmedia IT people currently giving me a hard time, share a terrible knowledge. They know the whole world does run on software, that dams can burst and business empires shrivel if it fails. They know that the collision between spreadsheets and genetics took a quarter-century to become manifest. And they know that similar bugs/features may be all around us, swarming greedily in the matrix we inhabit.

National Post

Twitter.com/ColbyCosh

Copyright Postmedia Network Inc., 2020
