# Research Artifact Archive Spec
## Project: "LONGRUN — Longitudinal Gut Microbiome Shifts in Endurance Athletes"
## Duration: 5 years (2026-2031), shared across institutional collaborators and external data consumers
---
## 1. Top-Level Directory Structure
```
longrun/
├── 00_protocol/ # IRB, pre-registrations, SOPs, consent forms — human-readable only
├── 01_raw/ # Immutable source data; never edited, never deleted
├── 02_processed/ # Cleaned, QC'd, derivative datasets; regeneratable from raw + code
├── 03_analysis/ # Notebooks, scripts, statistical models — version-controlled
├── 04_figures/ # Publication-ready figures with generating code linked
├── 05_manuscripts/ # Drafts, submissions, reviews; one folder per submission
├── 06_presentations/ # Conference talks, posters, public-facing outputs
├── 07_supplementary/ # Supplementary data, appendices, response-to-reviewer files
├── 08_personnel/ # Onboarding docs, contribution records, handoff notes
└── 99_archive/ # Frozen project snapshots at publication milestones
```
**Rationale:** Numeric prefixes enforce sort order and make the workflow stages visible at a glance. Raw and processed data are strictly separated so processing pipelines stay reproducible. `99_archive` preserves point-in-time snapshots even as active folders evolve.
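The layout above can be scaffolded with a short script. The folder names are taken verbatim from the tree; the function name and `root` argument are illustrative:

```python
from pathlib import Path

# Top-level folders, verbatim from section 1.
FOLDERS = [
    "00_protocol", "01_raw", "02_processed", "03_analysis",
    "04_figures", "05_manuscripts", "06_presentations",
    "07_supplementary", "08_personnel", "99_archive",
]

def scaffold(root: str) -> Path:
    """Create the LONGRUN directory skeleton under `root` (idempotent)."""
    base = Path(root) / "longrun"
    for name in FOLDERS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

Because the creation is idempotent, the same script can be re-run safely when new collaborators clone the repository.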
## 2. Naming Convention
**Pattern (regex):**
`^(?<date>\d{8})_(?<project>[a-z0-9]+)_(?<stage>[a-z]+)(_(?<subject>[A-Z0-9-]+))?_(?<desc>[a-z0-9-]+)(_v(?<version>\d+))?(?<ext>(\.[a-z0-9]+)+)$`
**Examples:**
- `20260314_longrun_raw_S042-W00_16s-reads.fastq.gz`
- `20260314_longrun_raw_S042-W00_metadata.json`
- `20260421_longrun_processed_S042-W00_asv-table_v3.tsv`
- `20260501_longrun_analysis_cohort-qc_v2.ipynb`
- `20260715_longrun_figures_fig2-alpha-diversity_v1.pdf`
- `20261001_longrun_manuscripts_main-text-submission1.docx`
- `20270115_longrun_presentations_iscb-talk_v2.pdf`
Rules: lowercase kebab-case for descriptors; uppercase alphanumeric for subject IDs (S042 = subject 42, W00 = week 0). The subject block is omitted for cohort-level artifacts (analyses, figures, manuscripts). Date is ISO 8601 without separators. Version suffix only when the same artifact is iterated. Compound extensions such as `.fastq.gz` are allowed.
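A minimal parser sketch for the pattern, translated to Python's `(?P<name>)` group syntax; the subject block is optional for cohort-level names and compound extensions like `.fastq.gz` are accepted:

```python
import re

# Naming pattern from section 2, in Python's named-group syntax.
NAME_RE = re.compile(
    r"^(?P<date>\d{8})"
    r"_(?P<project>[a-z0-9]+)"
    r"_(?P<stage>[a-z]+)"
    r"(_(?P<subject>[A-Z0-9-]+))?"   # omitted for cohort-level artifacts
    r"_(?P<desc>[a-z0-9-]+)"
    r"(_v(?P<version>\d+))?"
    r"(?P<ext>(\.[a-z0-9]+)+)$"      # allows compound extensions (.fastq.gz)
)

def parse_name(filename: str) -> dict:
    """Return the named fields of a conforming filename, or raise ValueError."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"non-conforming filename: {filename}")
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

Running this against every incoming filename in a pre-commit hook is one way to keep the convention enforced rather than aspirational.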
## 3. Version Control Strategy
- **Git (GitHub organization repo):** Everything in `00_protocol`, `03_analysis`, `05_manuscripts`, `08_personnel`. Small (<10MB) reference tables in `02_processed`.
- **DVC + S3-compatible object store (institutional MinIO):** All of `01_raw` and `02_processed`. DVC pointers committed to git; data itself lives in object storage with versioning enabled.
- **Data repositories (Zenodo for public releases, institutional repository for internal snapshots):** Frozen snapshots at each manuscript acceptance. DOIs assigned.
- **Never in git:** Participant-identifying data, consent forms with signatures, large binaries without DVC.
## 4. Metadata Schema (YAML sidecar per artifact)
```yaml
# Required
artifact_id: 20260314_longrun_raw_S042-W00_16s-reads  # filename stem, per section 2
created: 2026-03-14T09:22:00Z
created_by: a.diallo@institution.edu
stage: raw | processed | analysis | figures | manuscripts | presentations  # same vocabulary as the filename stage field
subject_id: S042 # if applicable
timepoint: W00 # if applicable
description: "16S rRNA amplicon reads from subject 42, baseline visit, sequenced on MiSeq v3."
license: CC-BY-4.0
checksum_sha256: 3f8a...
# Optional but encouraged
upstream_artifacts: [ ... ] # parent artifact IDs
downstream_artifacts: [ ... ]
pipeline_version: dada2-v1.28
related_publications: [ doi:10.xxxx/... ]
notes: free text
```
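A sketch of a sidecar check, assuming the YAML has already been loaded into a dict (e.g. with `yaml.safe_load`). Only the required fields from the schema above are verified, plus the checksum against the artifact itself; the function name is illustrative:

```python
import hashlib
from pathlib import Path

# Required sidecar fields from section 4 (subject_id/timepoint are
# "if applicable", so they are not enforced here).
REQUIRED = {"artifact_id", "created", "created_by", "stage",
            "description", "license", "checksum_sha256"}

def validate_sidecar(meta: dict, artifact_path: Path) -> list:
    """Return a list of problems found in a parsed sidecar (empty = valid)."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - meta.keys())]
    if "checksum_sha256" in meta and artifact_path.exists():
        digest = hashlib.sha256(artifact_path.read_bytes()).hexdigest()
        if digest != meta["checksum_sha256"]:
            problems.append("checksum mismatch")
    return problems
```

Returning a problem list instead of raising lets a batch audit report every defective sidecar in one pass.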
## 5. Archiving Milestones
| Stage | Frozen to `99_archive/` | DOI issued |
|-------|--------------------------|------------|
| Pre-registration lock | Protocol + analysis plan | Internal repo |
| Data collection complete | Raw data + metadata | Zenodo |
| Preprint submission | Analysis code + figures + manuscript | Zenodo |
| Journal acceptance | Final published artifact bundle | Zenodo (public) |
| Project end | Everything, with README.md explaining layout | Institutional repo |
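One way the freeze step could look. This is a hypothetical helper, assuming snapshots land in dated subfolders of `99_archive/` with a sha256 manifest; the function name and manifest filename are illustrative:

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path

def freeze_snapshot(root: Path, milestone: str, folders: list) -> Path:
    """Copy the given folders into a dated 99_archive/ snapshot with a manifest."""
    snap = root / "99_archive" / f"{date.today():%Y%m%d}_{milestone}"
    snap.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a frozen snapshot
    lines = []
    for name in folders:
        shutil.copytree(root / name, snap / name)
        for f in sorted((snap / name).rglob("*")):
            if f.is_file():
                digest = hashlib.sha256(f.read_bytes()).hexdigest()
                lines.append(f"{digest}  {f.relative_to(snap).as_posix()}")
    (snap / "MANIFEST.sha256").write_text("\n".join(lines) + "\n")
    return snap
```

The `exist_ok=False` guard matters: a snapshot that can be silently regenerated is not frozen.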
## 6. FAIR and Access Control
- **Findable:** Every archived artifact receives a persistent identifier. Metadata indexed in institutional catalog.
- **Accessible:** Public artifacts released under CC-BY-4.0. Sensitive data released via controlled access under a data use agreement (DUA).
- **Interoperable:** Use standard formats (FASTQ, TSV, BIOM, CSV). Avoid proprietary formats in archived copies.
- **Reusable:** Every dataset accompanied by README, metadata sidecar, and a dated snapshot of the processing code that produced it.
## 7. Deaccession Policy
At project end plus 10 years, review each artifact against institutional retention policy:
- Publications remain permanent.
- Raw sequencing reads are retained indefinitely if the cohort consented to indefinite retention; otherwise destroyed per the DUA.
- Analysis intermediates may be deleted if fully regeneratable from raw data and published code.
- The deaccession log is committed to git before any deletion.