# Research Artifact Archive Spec
## Project: "LONGRUN — Longitudinal Gut Microbiome Shifts in Endurance Athletes"
## Duration: 5 years (2026-2031), shared across institutional collaborators and external data consumers
---
## 1. Top-Level Directory Structure
```
longrun/
├── 00_protocol/ # IRB, pre-registrations, SOPs, consent forms — human-readable only
├── 01_raw/ # Immutable source data; never edited, never deleted
├── 02_processed/ # Cleaned, QC'd, derivative datasets; regeneratable from raw + code
├── 03_analysis/ # Notebooks, scripts, statistical models — version-controlled
├── 04_figures/ # Publication-ready figures with generating code linked
├── 05_manuscripts/ # Drafts, submissions, reviews; one folder per submission
├── 06_presentations/ # Conference talks, posters, public-facing outputs
├── 07_supplementary/ # Supplementary data, appendices, response-to-reviewer files
├── 08_personnel/ # Onboarding docs, contribution records, handoff notes
└── 99_archive/ # Frozen project snapshots at publication milestones
```
**Rationale:** Numeric prefixes enforce sort order and make the workflow stages visible at a glance. Raw and processed data are strictly separated so processing pipelines stay reproducible. `99_archive` preserves point-in-time snapshots even as active folders evolve.
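The layout above can be scaffolded with a short script. The folder names are taken verbatim from the tree; the function name and `root` argument are illustrative:

```python
from pathlib import Path

# Top-level folders, verbatim from section 1.
FOLDERS = [
    "00_protocol", "01_raw", "02_processed", "03_analysis",
    "04_figures", "05_manuscripts", "06_presentations",
    "07_supplementary", "08_personnel", "99_archive",
]

def scaffold(root: str) -> Path:
    """Create the LONGRUN directory skeleton under `root` (idempotent)."""
    base = Path(root) / "longrun"
    for name in FOLDERS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

Because the creation is idempotent, the same script can be re-run safely when new collaborators clone the repository.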
## 2. Naming Convention
**Pattern (regex):**
`^(?<date>\d{8})_(?<project>[a-z0-9]+)_(?<stage>[a-z]+)(_(?<subject>[A-Z0-9-]+))?_(?<desc>[a-z0-9-]+)(_v(?<version>\d+))?(?<ext>(\.[a-z0-9]+)+)$`
**Examples:**
- `20260314_longrun_raw_S042-W00_16s-reads.fastq.gz`
- `20260314_longrun_raw_S042-W00_metadata.json`
- `20260421_longrun_processed_S042-W00_asv-table_v3.tsv`
- `20260501_longrun_analysis_cohort-qc_v2.ipynb`
- `20260715_longrun_figures_fig2-alpha-diversity_v1.pdf`
- `20261001_longrun_manuscripts_main-text-submission1.docx`
- `20270115_longrun_presentations_iscb-talk_v2.pdf`
Rules: lowercase kebab-case for descriptors; uppercase alphanumeric for subject IDs (S042 = subject 42, W00 = week 0). The subject block is omitted for cohort-level artifacts (analyses, figures, manuscripts). Date is ISO 8601 without separators. Version suffix only when the same artifact is iterated. Compound extensions such as `.fastq.gz` are allowed.
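A minimal parser sketch for the pattern, translated to Python's `(?P<name>)` group syntax; the subject block is optional for cohort-level names and compound extensions like `.fastq.gz` are accepted:

```python
import re

# Naming pattern from section 2, in Python's named-group syntax.
NAME_RE = re.compile(
    r"^(?P<date>\d{8})"
    r"_(?P<project>[a-z0-9]+)"
    r"_(?P<stage>[a-z]+)"
    r"(_(?P<subject>[A-Z0-9-]+))?"   # omitted for cohort-level artifacts
    r"_(?P<desc>[a-z0-9-]+)"
    r"(_v(?P<version>\d+))?"
    r"(?P<ext>(\.[a-z0-9]+)+)$"      # allows compound extensions (.fastq.gz)
)

def parse_name(filename: str) -> dict:
    """Return the named fields of a conforming filename, or raise ValueError."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"non-conforming filename: {filename}")
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

Running this against every incoming filename in a pre-commit hook is one way to keep the convention enforced rather than aspirational.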
## 3. Version Control Strategy
- **Git (GitHub organization repo):** Everything in `00_protocol`, `03_analysis`, `05_manuscripts`, `08_personnel`. Small (<10MB) reference tables in `02_processed`.
- **DVC + S3-compatible object store (institutional MinIO):** All of `01_raw` and `02_processed`. DVC pointers committed to git; data itself lives in object storage with versioning enabled.
- **Data repositories (Zenodo for public releases, institutional repository for internal snapshots):** Frozen snapshots at each manuscript acceptance. DOIs assigned.
- **Never in git:** Participant-identifying data, consent forms with signatures, large binaries without DVC.
## 4. Metadata Schema (YAML sidecar per artifact)
```yaml
# Required
artifact_id: 20260314_longrun_raw_S042-W00_16s-reads  # filename stem, per section 2
created: 2026-03-14T09:22:00Z
created_by: a.diallo@institution.edu
stage: raw | processed | analysis | figures | manuscripts | presentations  # same vocabulary as the filename stage field
subject_id: S042 # if applicable
timepoint: W00 # if applicable
description: "16S rRNA amplicon reads from subject 42, baseline visit, sequenced on MiSeq v3."
license: CC-BY-4.0
checksum_sha256: 3f8a...
# Optional but encouraged
upstream_artifacts: [ ... ] # parent artifact IDs
downstream_artifacts: [ ... ]
pipeline_version: dada2-v1.28
related_publications: [ doi:10.xxxx/... ]
notes: free text
```
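A sketch of a sidecar check, assuming the YAML has already been loaded into a dict (e.g. with `yaml.safe_load`). Only the required fields from the schema above are verified, plus the checksum against the artifact itself; the function name is illustrative:

```python
import hashlib
from pathlib import Path

# Required sidecar fields from section 4 (subject_id/timepoint are
# "if applicable", so they are not enforced here).
REQUIRED = {"artifact_id", "created", "created_by", "stage",
            "description", "license", "checksum_sha256"}

def validate_sidecar(meta: dict, artifact_path: Path) -> list:
    """Return a list of problems found in a parsed sidecar (empty = valid)."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - meta.keys())]
    if "checksum_sha256" in meta and artifact_path.exists():
        digest = hashlib.sha256(artifact_path.read_bytes()).hexdigest()
        if digest != meta["checksum_sha256"]:
            problems.append("checksum mismatch")
    return problems
```

Returning a problem list instead of raising lets a batch audit report every defective sidecar in one pass.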
## 5. Archiving Milestones
| Stage | Frozen to `99_archive/` | DOI issued |
|-------|--------------------------|------------|
| Pre-registration lock | Protocol + analysis plan | Internal repo |
| Data collection complete | Raw data + metadata | Zenodo |
| Preprint submission | Analysis code + figures + manuscript | Zenodo |
| Journal acceptance | Final published artifact bundle | Zenodo (public) |
| Project end | Everything, with README.md explaining layout | Institutional repo |
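One way the freeze step could look. This is a hypothetical helper, assuming snapshots land in dated subfolders of `99_archive/` with a sha256 manifest; the function name and manifest filename are illustrative:

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path

def freeze_snapshot(root: Path, milestone: str, folders: list) -> Path:
    """Copy the given folders into a dated 99_archive/ snapshot with a manifest."""
    snap = root / "99_archive" / f"{date.today():%Y%m%d}_{milestone}"
    snap.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a frozen snapshot
    lines = []
    for name in folders:
        shutil.copytree(root / name, snap / name)
        for f in sorted((snap / name).rglob("*")):
            if f.is_file():
                digest = hashlib.sha256(f.read_bytes()).hexdigest()
                lines.append(f"{digest}  {f.relative_to(snap).as_posix()}")
    (snap / "MANIFEST.sha256").write_text("\n".join(lines) + "\n")
    return snap
```

The `exist_ok=False` guard matters: a snapshot that can be silently regenerated is not frozen.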
## 6. FAIR and Access Control
- **Findable:** Every archived artifact receives a persistent identifier. Metadata indexed in institutional catalog.
- **Accessible:** Public artifacts released under CC-BY-4.0. Sensitive data released via controlled access under a data use agreement (DUA).
- **Interoperable:** Use standard formats (FASTQ, TSV, BIOM, CSV). Avoid proprietary formats in archived copies.
- **Reusable:** Every dataset accompanied by README, metadata sidecar, and a dated snapshot of the processing code that produced it.
## 7. Deaccession Policy
At project end plus 10 years, review each artifact against institutional retention policy:
- Publications remain permanent.
- Raw sequencing reads are retained indefinitely if the cohort consented to indefinite retention; otherwise destroyed per the DUA.
- Analysis intermediates may be deleted if fully regeneratable from raw data and published code.
- The deaccession log is committed to git before any deletion.