Setting up good data management habits

The Buchwalter lab is growing, projects are taking shape, and lab members are starting to generate data! Now is an opportune time to establish some good habits for data management.

What are our goals here? In the short term, we need to keep track of the data we generate so that we can analyze, share, present, and publish it. In the long term, we have a responsibility to save all published data in an identifiable and accessible form for future reference. Questions about data or analyses sometimes arise after publication, and we need to be able to retrace our work after the fact.

How do we meet these goals? We need data storage habits that keep our data redundant and identifiable.

Redundancy: data is stored in at least two places. Those are:

  • Box Cloud storage
  • local server and/or hard drive storage: stay tuned! I am working on solutions for us. This will be hosted either by our department or within our lab space.

Identifiability: it’s clear who generated what data, and when.

There’s only one good way to keep data identifiable: more data! By that I mean METADATA (“data about your data”). This metadata should answer these questions:

  • Who generated it: it’s saved in your folder (within the Buchwalter folder on Box, or in your folder on the server/hard drive)
  • When was it generated: full MM/DD/YYYY date; this could be in the name of the subfolder, for instance.
  • What is it: is it a raw or processed microscopy image? a Western blot? raw data from a qPCR run? This could be included in the file name.
  • What are the samples being analyzed, and how were they generated? This level of detail gets tough to fit into a filename. It may be necessary to include a small text file in the folder with the data that explains what all the samples are, perhaps referencing back to a page in your lab notebook that describes the sample preparation (see the example folder layout after this list).
  • Is this raw data or processed / analyzed data? If the latter, where is the raw data? Include identifying info to maintain that link.
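
To make this concrete, here is one way a well-labeled experiment folder could look. Everything in this example (the experiment, filenames, sample details, and notebook page) is made up purely for illustration:

    Buchwalter (Box) > YourName > 03-15-2024_qPCR_HSP70-heat-shock
        03-15-2024_qPCR_raw_Cq_values.xlsx     (raw export from the qPCR instrument)
        03-15-2024_qPCR_ddCt_analysis.xlsx     (processed data, calculated from the raw Cq file)
        README.txt

    and README.txt could say something like:
        Date: 03/15/2024
        Experiment: qPCR of HSP70 induction after heat shock
        Samples: 1–3 = untreated; 4–6 = heat-shocked (42 °C, 1 h)
        Sample prep: lab notebook p. 47
        Raw data: 03-15-2024_qPCR_raw_Cq_values.xlsx (this folder)
        Analysis: ddCt calculated in 03-15-2024_qPCR_ddCt_analysis.xlsx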

A proposal: Let’s have a lab-wide data cleanup day every 6 months. This will be an opportunity to:

  • rescue data that is living on various temporary storage sites (e.g. instrument computers)
  • make sure that raw data is backed up on Box
  • make sure that your raw data has METADATA associated with it to make clear what the experiment was
  • back up data that fits these criteria onto the lab server:
    • has associated METADATA
    • is from an experiment that technically worked (regardless of outcome)
  • do something fun! (lunch, bubble tea, mini golf…)
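
One more idea for cleanup day: a short script can list any experiment folders that are missing a metadata file, so it is easy to see what still needs attention. Below is a minimal sketch in Python; it assumes each experiment lives in its own subfolder and that the metadata file is called README.txt, so adjust the path and filename to whatever convention we settle on:

    # Minimal sketch: list experiment folders that have no README.txt metadata file.
    # The root path and metadata filename below are assumptions, not lab policy.
    from pathlib import Path

    DATA_ROOT = Path("~/Box/Buchwalter/YourName").expanduser()  # hypothetical location
    METADATA_NAME = "README.txt"                                # hypothetical filename

    def folders_missing_metadata(root):
        """Return subfolders of root that do not contain a metadata file."""
        return [folder for folder in sorted(root.iterdir())
                if folder.is_dir() and not (folder / METADATA_NAME).exists()]

    if __name__ == "__main__":
        for folder in folders_missing_metadata(DATA_ROOT):
            print(f"No {METADATA_NAME} in: {folder}")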