Think Day On Data Management


As I’ve been working on my think days, I’ve made it a habit to share the generalized prompts here for anyone who may want to go through this exercise. I design the prompts together with ChatGPT and then tailor them to my situation.

1. Framing the problem (10 minutes)

Data management matters in my group because the intellectual value of our work increasingly outlives individual projects and individual people. At the moment, too much knowledge is stored in heads, personal laptops, and tacit habits. This makes the group fragile: when someone leaves, context leaves with them. Poor data management has already cost time (re-deriving results), credibility (difficulty reproducing internal analyses), and strategic momentum (datasets that could have seeded follow-up projects but didn’t).

In 3–5 years, I do not want to run an artisanal research atelier that only works because I personally remember everything. I want an institutional-grade research group that scales, survives turnover, and supports ambitious interdisciplinary work.

Anchor principle: data management is not administration; it is research infrastructure, comparable to labs, HPC access, or funding pipelines.


2. Mapping current reality (20 minutes)

Data creation
Highly diverse: experiments, simulations, field measurements, surveys, literature datasets. Methods are strong; documentation varies by person and seniority.

Storage
Fragmented. Combination of personal laptops, cloud folders, institutional drives, collaborator systems. No single “source of truth.”

Structure
Mostly ad hoc. Folder logic differs per PhD/postdoc. Naming conventions inconsistent. Versioning often implicit (“final_final_v3”).

Documentation
Uneven. Some projects have excellent internal notes; others rely on memory or code comments. README files are rare unless required by journals.

Access & sharing
Works while people are present. Becomes difficult when someone is traveling, overloaded, or leaving. New group members struggle to onboard into existing datasets.

Archiving & reuse
Data is usually archived for publication, not for future research. Reuse across projects is the exception, not the norm.

Security & ethics
Handled carefully in spirit, but practices are implicit rather than standardized. Backups depend on individual discipline.

Failure points
Transitions: people leaving, projects ending, collaborations pausing. This is where knowledge leakage happens.


3. Pain points and risks (15 minutes)

1. Knowledge loss when researchers leave

  • Cost: high (lost intellectual capital, repeated work)
  • Frequency: structural
  • Fixability: medium

2. Internal reproducibility gaps

  • Cost: medium–high (time, trust)
  • Frequency: recurring
  • Fixability: high

3. Time wasted deciphering old data/code

  • Cost: high (PI time especially)
  • Frequency: recurring
  • Fixability: high

4. Version confusion in collaborative writing and analysis

  • Cost: medium
  • Frequency: recurring
  • Fixability: high

5. Underuse of existing datasets for new funding or papers

  • Cost: strategic opportunity loss
  • Frequency: recurring
  • Fixability: medium

Highest-impact issues:
1, 3, and 5.


4. Defining “good enough” standards (20 minutes)

Non-negotiable (mandatory):

  • One standard top-level folder structure per project.
  • Clear, human-readable file naming (date + content + version, e.g. 2025-03-12_survey-cleaned_v02.csv).
  • One README per dataset or project folder.
  • Explicit handover package when someone leaves (data + explanation).
  • Final datasets stored in a designated group repository after publication.
  • Minimum backup standard (automated or institutional).

Recommended (not enforced):

  • Version control for code.
  • Templates for metadata and documentation.
  • Internal changelog for major analyses.

Explicit philosophy:
We aim for clarity over elegance and continuity over optimization. Standards exist to reduce cognitive load, not to police creativity.
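To show how little effort the mandatory standards actually require, here is a minimal scaffolding sketch. It is illustrative only and not part of the original think day: the subfolder names and the README template are assumptions standing in for whatever conventions a group agrees on.

```python
#!/usr/bin/env python3
"""Illustrative sketch: create one standard project folder structure plus a README stub."""
from datetime import date
from pathlib import Path

# Assumed top-level layout; adapt to your group's own conventions.
SUBFOLDERS = ["data/raw", "data/processed", "code", "docs", "results"]

# Assumed README stub; the point is that every project starts with one.
README_TEMPLATE = """# {project}
Created: {created}
Contact: <name and email of responsible researcher>

## Contents
- data/raw: original, untouched data
- data/processed: cleaned or derived data
- code: analysis scripts
- docs: documentation and metadata
- results: figures and tables
"""


def scaffold(root: Path, project: str) -> Path:
    """Create the standard folder structure and a README stub for a new project."""
    base = root / project
    for sub in SUBFOLDERS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(
            README_TEMPLATE.format(project=project, created=date.today().isoformat())
        )
    return base


if __name__ == "__main__":
    import sys

    scaffold(Path("."), sys.argv[1] if len(sys.argv) > 1 else "new-project")
```

Running the script once per new project enforces the “one standard top-level folder structure” and “one README per project folder” rules without anyone having to remember them.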


5. Tools, roles, and responsibility (15 minutes)

Tools

  • The problem is not lack of tools but inconsistent use.
  • Priority is standard behavior on existing platforms before adding new ones.

Roles

  • PI (me): define standards, enforce handovers, model good practice.
  • PhDs/Postdocs: responsible for documentation and clarity of their own data.
  • Group as a whole: shared responsibility for onboarding and continuity.

Institutional memory

  • Currently resides too much with you.
  • Needs partial externalization into structure and documentation.

Centralization vs decentralization

  • Standards and archiving: centralized.
  • Day-to-day working files: decentralized within a shared logic.

6. Improvement roadmap (10 minutes)

Immediate (next 3 months)

  • Define and circulate a standard folder structure.
  • Introduce mandatory README files.
  • Require a data handover checklist for leaving researchers.

Mid-term (6–12 months)

  • Clean up legacy projects with highest reuse potential.
  • Create onboarding documentation for new group members.
  • Align practices across USFQ and TU Delft projects where possible.

Long-term (1–3 years)

  • Treat datasets as strategic assets for funding and collaboration.
  • Integrate data management explicitly into supervision and evaluation.
  • Move toward institutional alignment (libraries, research office, ethics).

7. Reflection and alignment (10 minutes)

My current approach reflects a leadership style that prioritizes trust, autonomy, and intellectual depth, but sometimes underestimates the cost of informality at scale. I need to be stricter about exits, handovers, and minimum standards, even if that feels bureaucratic.

At the same time, I should resist micromanagement: good systems should allow me to trust more, not less.

If the group doubles in size, the current setup will not hold. Data management is one of the clearest places where early structure buys long-term freedom.

Draft data management philosophy:

Data in my group is a shared intellectual asset. It should be understandable, reusable, and ethically sound beyond the lifespan of individual projects or people. Our goal is not perfection, but continuity.


If you want, the next useful step would be to overlay your handwritten version with this one and mark:

  • where you are stricter than this,
  • where you are more permissive,
  • and where your emotional resistance shows up.

That contrast is usually where the real leadership insight lives.

Can you return to the topics that we outlined for the 1.5-hour think day, and turn this into a more general description for any other colleague who may want to use this?

Absolutely. Below is a generalized, reusable description of the 1.5-hour Think Day, written so any colleague—PI, group leader, or senior researcher—can use it without needing your personal context. It keeps the intellectual spine of what we designed, but removes the “Eva-specific” layer.


Think Day Design (1.5 hours)

Data Management and Improvement in a Research Group

Purpose

This Think Day is designed to help research group leaders step back from day-to-day firefighting and reflect strategically on how data is handled across their group. The focus is not on compliance or tooling for its own sake, but on improving research quality, efficiency, continuity, and resilience.

The session treats data management as research infrastructure: something that quietly determines whether a group scales, survives turnover, and can build cumulatively on its own work.


Intended outcomes

By the end of the session, participants should have:

  • A clear picture of how data currently flows through their group.
  • Insight into the main risks and inefficiencies in current practices.
  • A small set of realistic, enforceable standards.
  • A concrete improvement roadmap with priorities and ownership.
  • Greater clarity on their own leadership stance toward structure, autonomy, and responsibility.

Structure and topics (90 minutes)

1. Framing the problem (≈10 minutes)

Participants begin by clarifying why data management matters for their specific context, rather than in abstract policy terms.

Reflection topics include:

  • How poor data practices already affect time, quality, or continuity.
  • Whether current practices depend on individuals rather than systems.
  • What kind of research group they want to be running in 3–5 years.

The goal is to articulate a problem statement that anchors the rest of the session.


2. Mapping current reality (≈20 minutes)

This segment focuses on making implicit practices explicit.

Participants map their group’s current data lifecycle, typically including:

  • Data creation (experiments, simulations, surveys, fieldwork, etc.).
  • Storage locations and access.
  • Folder structures and naming conventions.
  • Documentation and metadata practices.
  • Sharing within the group and with collaborators.
  • Archiving after publications or project completion.
  • Security, backups, and ethical considerations.

Attention is paid to where things work smoothly and where they tend to fail—especially during transitions such as people leaving or projects ending.


3. Identifying pain points and risks (≈15 minutes)

Participants identify recurring problems and risks related to data management.

Typical prompts include:

  • Where time is repeatedly lost.
  • Where reproducibility breaks down.
  • Where knowledge disappears.
  • Where strategic reuse of data fails.

Each issue is considered in terms of impact, frequency, and ease of improvement, allowing participants to focus on the few problems that matter most.


4. Defining “good enough” standards (≈20 minutes)

This section is about designing minimum viable rigor, not perfection.

Participants reflect on what should be:

  • Mandatory across the group.
  • Recommended but flexible.
  • Explicitly left to individual preference.

Topics often include folder structures, naming conventions, documentation requirements, handover rules, version control, and long-term storage.

The emphasis is on standards that are:

  • Easy to explain.
  • Easy to enforce.
  • Effective in reducing confusion and rework.

5. Tools, roles, and responsibility (≈15 minutes)

Here the focus shifts from norms to execution.

Participants reflect on:

  • Which tools are already in use and how consistently they are applied.
  • Where behavioral standardization matters more than new tools.
  • Who currently holds “institutional memory” in the group.

Roles and responsibilities are clarified across levels (PI, PhD students, postdocs, group-level practices), with attention to balancing trust and accountability.


6. Improvement roadmap (≈10 minutes)

Participants translate insights into action by sketching a simple roadmap:

  • Short-term improvements (next few months).
  • Medium-term structural changes (6–12 months).
  • Long-term ambitions (1–3 years).

Each action is associated with an owner, a timeline, and a success criterion, keeping the plan realistic and implementable.


7. Reflection and leadership alignment (≈10 minutes)

The session closes by connecting data practices to leadership identity.

Participants reflect on:

  • How their current approach reflects their leadership style.
  • Where they may need to be firmer or clearer.
  • How current practices would scale if the group grew significantly.

Many conclude by drafting a short data management philosophy for their group, articulating the values that guide decisions going forward.
