
AI visibility report
lakeFS ranks #1 in AI Data Curation and Dataset Versioning AI search.
Outside the top three on 5 of the 25 prompts buyers actually ask.
DataChain is cited on 2 of those losses.
Best among 7 vendors · still absent from 89.3% of tracked prompt responses
Top-3 citations across 75 prompt × platform pairs
Leader, with room to expand. lakeFS leads this category on presence and share of voice, but appears in only 10.7% of tracked prompt responses. The priority is defending current wins while expanding absolute coverage.
Where lakeFS is losing
Prompts where competitors are visible and lakeFS is not.
These prompt-level losses are the first prompts to track and repair.
Where lakeFS is winning (5)
Looking for a Git-like workflow for branching, committing, and merging changes to large training datasets stored in S3.
Avg # 1.0 · 1 platform
Looking for a Python-native data pipeline framework that handles parallelism, checkpointing, and lineage without ETL infrastructure.
Avg # 1.0 · 1 platform
Which tool gives me reproducible dataset snapshots without copying terabytes of data?
Avg # 1.0 · 1 platform
How do I build a reproducible data preprocessing pipeline that reads from S3, applies Python transforms, and writes a versioned dataset?
Avg # 1.0 · 2 platforms
What's the cleanest way to version control datasets alongside code for an ML project?
Avg # 3.0 · 2 platforms
Where lakeFS is losing (5)
Which tool supports incremental dataset builds - only reprocess the new files when underlying storage changes?
Competitors on 1 platform
How are teams using embedding maps to surface coverage gaps and bias in training data?
Competitors on 1 platform
I have millions of unlabeled videos in S3 - which tool can help me filter and enrich them with model-generated metadata before training?
Competitors on 1 platform
What's the best way to curate a large image and video dataset for training a multimodal model?
Competitors on 1 platform
Looking for a Python SDK that lets me apply LLMs and vision models to clean and enrich a training dataset without moving data out of cloud storage.
Competitors on 1 platform
Research dossier
Capabilities, use cases, sources, reviews, pricing, and FAQ
Overview
lakeFS, developed by Treeverse, is an open-source data version control system that applies Git-like operations—branching, committing, merging, and reverting—to data lakes stored in object storage (Amazon S3, Azure Blob Storage, Google Cloud Storage). Founded in 2020 by Oz Katz and Dr. Einat Orr, lakeFS operates as a metadata layer atop existing storage without moving or duplicating data. It enables data and AI/ML teams to create isolated environments for testing, ensure reproducibility of model training, enforce data quality via CI/CD hooks, roll back from data incidents, and maintain full data lineage and audit trails. Available as a free open-source edition and a commercial Enterprise tier, lakeFS is used by organizations including Arm, Netflix, Volvo, Lockheed Martin, NASA, and the U.S. Department of Energy. In November 2025, lakeFS acquired the DVC open-source project from Iterative.ai, extending its reach from enterprise to individual practitioners.
lakeFS is an open-source and enterprise data version control platform that transforms object storage into Git-like repositories, enabling data and AI teams to branch, commit, merge, and roll back datasets at petabyte scale without copying data. Built by Treeverse and backed by $43M in funding, it supports reproducible ML workflows, data quality enforcement, and governance across multi-cloud and on-premises data lakes, with deep integration across the modern data and AI tooling stack.
Key Facts
- Founded: 2020
- HQ: New York, NY, USA
- Founders: Oz Katz, Einat Orr
- Employees: 11-50
- Funding: $43M
- Customers: >90,000 organizations
- Status: Private
Key Capabilities (10)
- Git-like branching, committing, merging, and reverting for object storage data lakes at petabyte scale
- Zero-copy isolated dev/test environments via branches without data duplication
- Atomic commits and instant rollback for data pipeline error recovery
- Data CI/CD via configurable pre- and post-commit hooks for quality gates
- Full data lineage tracking and built-in audit trail for governance and compliance
- S3-compatible API enabling seamless integration with existing tools and frameworks
- Role-based access control (RBAC), SSO, and SCIM support (Enterprise tier)
- Iceberg REST Catalog support and format-agnostic versioning (structured and unstructured)
- lakeFS Mount for local filesystem-style access to remote data without full downloads
- Transactional mirroring and multi-storage backend support (Enterprise tier)
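The branching, commit, and rollback capabilities above rest on the idea of versioning pointers to objects rather than the objects themselves. The sketch below is a toy, in-memory illustration of that idea only. `ToyRepo` and all of its methods are hypothetical names invented for this example; it is not the lakeFS API and makes no claim about how lakeFS is implemented internally.

```python
import hashlib


class ToyRepo:
    """Toy model of pointer-based versioning; NOT how lakeFS is implemented."""

    def __init__(self):
        self.objects = {}               # content hash -> bytes (stand-in for object storage)
        self.commits = {}               # commit id -> {path: content hash}
        self.branches = {"main": None}  # branch name -> commit id (None = empty)

    def snapshot(self, branch):
        """Return the path -> content-hash mapping a branch currently points at."""
        commit_id = self.branches[branch]
        return dict(self.commits[commit_id]) if commit_id else {}

    def _record(self, branch, snapshot):
        """Store a snapshot as a new commit and advance the branch pointer."""
        parent = self.branches[branch]
        payload = repr((parent, sorted(snapshot.items()))).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def commit(self, branch, changes):
        """Write new objects once, then record a new snapshot on the branch."""
        snapshot = self.snapshot(branch)
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # identical content is stored once
            snapshot[path] = digest
        return self._record(branch, snapshot)

    def create_branch(self, name, source):
        """Branching copies one pointer; no object is duplicated."""
        self.branches[name] = self.branches[source]

    def merge(self, source, target):
        """Simplified merge: source paths win; no conflict or delete handling."""
        merged = self.snapshot(target)
        merged.update(self.snapshot(source))
        return self._record(target, merged)

    def revert(self, branch, commit_id):
        """Roll back by repointing the branch at an earlier commit."""
        self.branches[branch] = commit_id


repo = ToyRepo()
first = repo.commit("main", {"train/a.csv": b"1,2,3"})
repo.create_branch("experiment", "main")
objects_after_branch = len(repo.objects)           # branching stored nothing new
repo.commit("experiment", {"train/b.csv": b"4,5,6"})
main_before_merge = sorted(repo.snapshot("main"))  # main is isolated from the branch
repo.merge("experiment", "main")
main_after_merge = sorted(repo.snapshot("main"))
repo.revert("main", first)
main_after_revert = sorted(repo.snapshot("main"))
print(objects_after_branch, main_before_merge, main_after_merge, main_after_revert)
```

In this toy, creating a branch or reverting touches only the `branches` mapping, which is why neither operation grows `objects`. A real system additionally needs conflict detection, deletes, and garbage collection, all of which are omitted here.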
Key Use Cases (8)
- Reproducible ML/AI model training with versioned, immutable dataset snapshots
- Isolated ETL testing on production data without copying or risking production state
- Data quality enforcement via Write-Audit-Publish pipeline patterns
- Instant rollback and recovery from bad data incidents in production data lakes
- ML experiment tracking tied to specific data versions for auditability
- Data governance and compliance for regulated industries (FDA, DOE, defense)
- Multi-team collaboration on shared data lakes with branch-level isolation
- Managing petabyte-scale multimodal AI training data lifecycle across cloud environments
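The Write-Audit-Publish pattern named in the list above can be sketched without any storage system at all. The following is a minimal illustration under stated assumptions: snapshots are plain dictionaries, and `write_audit_publish` and `no_null_labels` are hypothetical helpers written for this example, not lakeFS functions. In a lakeFS deployment the staging step would typically be a branch and the audit step a hook, but that mapping is not shown here.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, List[dict]]   # path -> rows (toy stand-in for a dataset version)
Check = Callable[[Snapshot], bool]


def no_null_labels(snapshot: Snapshot) -> bool:
    """Example quality gate: every row must carry a non-empty label."""
    return all(row.get("label") for rows in snapshot.values() for row in rows)


def write_audit_publish(production: Snapshot, new_data: Snapshot,
                        checks: List[Check]) -> Snapshot:
    # Write: stage the change on an isolated view (shallow copy of path pointers).
    staging = {**production, **new_data}
    # Audit: run every gate against the staged state.
    failed = [check.__name__ for check in checks if not check(staging)]
    if failed:
        # Production is returned untouched, so consumers never see the bad data.
        print(f"rejected by: {failed}")
        return production
    # Publish: promote the staged state in a single step.
    return staging


prod = {"train/part-0": [{"id": 1, "label": "cat"}]}
good = {"train/part-1": [{"id": 2, "label": "dog"}]}
bad = {"train/part-2": [{"id": 3, "label": None}]}

prod = write_audit_publish(prod, good, [no_null_labels])       # passes the gate
after_bad = write_audit_publish(prod, bad, [no_null_labels])   # blocked by the gate
print(sorted(prod), sorted(after_bad))
```

The design point is that the audit runs against the staged state before anything is promoted, so a failed check leaves the published dataset exactly as it was.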
lakeFS customer outcomes
80% reduction in testing time
Enigma adopted lakeFS data branching and within days of migration had reduced testing time on two different data pipeline projects, with CTO Ryan Green crediting data branching for improved product velocity.
6 models launched per 2-week sprint with half the team (vs. 2–3 models with full team prior)
Ellips's ML engineering team implemented lakeFS and dramatically accelerated model deployment cadence, launching more models per sprint with a smaller team than previously possible.
Arm implemented lakeFS for automated data cleaning, version control, and governance across distributed teams, resulting in faster go-to-market, reduced storage costs, improved development velocity, and stronger data governance.
Paige AI used lakeFS alongside dbt to enable reproducible ML experiments, increase data team productivity, and satisfy FDA compliance requirements for AI-powered cancer diagnostics.
How AI describes lakeFS (3)
lakeFS (most direct answer): lakeFS treats your data lake (S3/GCS/Azure Blob) like a Git repository.
Which tool gives me reproducible dataset snapshots without copying terabytes of data?
...A common stack is:
| Purpose | Tools |
| --- | --- |
| Code versioning | Git |
| Dataset versioning | DVC, lakeFS |
| Experiment tracking | MLflow |
| Lineage metadata | OpenLineage, Marquez, DataHub |
A widely used pattern is: Git...
How do I track dataset lineage from raw files through preprocessing to the final training set so experiments are reproducible?
LakeFS / Delta Lake / Apache Hudi / Iceberg
- Each training dataset = snapshot ID
- Writes are append/merge; reads are “as-of” versions
- No in-place mutation of training data
Key property: “Given snapshot X, the...
Need atomic commits across data and code so I can roll back a model regression to its exact training snapshot — what works at scale?
Most cited sources (8)
7 · Preprocessing Data Locally with Zero Copy Using lakeFS Mount
lakefs.io · Blog Post
6 · Benefits Of Etl Pipeline...
lakefs.io · Blog Post
4 · Data Reproducibility and other Data Lake Best Practices
lakefs.io · Blog Post
3 · GitHub - treeverse/lakeFS: lakeFS - Data version control for your data lake | Git for data · GitHub
github.com · Product Page
2 · Best Data Version Control Tools in 2026 | lakeFS
lakefs.io · Blog Post
2 · Data Version Control with Python | lakeFS
lakefs.io · Listicle
Alternatives in AI Data Curation and Dataset Versioning (6)
lakeFS positions itself as the enterprise-grade 'control plane for AI-ready data,' differentiating through Git-like branching and versioning applied at petabyte scale to object storage (S3, GCS, Azure Blob).
- Unlike annotation- or labeling-focused tools in the AI data curation space (Encord, Roboflow, Voxel51), lakeFS operates at the data infrastructure layer, providing reproducibility, lineage, and governance for data lakes underpinning AI/ML pipelines.
- Its November 2025 acquisition of DVC from Iterative.ai extended market coverage from enterprise data engineering teams down to individual data scientists.
- It was named a Representative Vendor in the 2025 Gartner Market Guide for DataOps Tools, and is one of the few open-source-core data version control systems with a commercial enterprise tier at this scale.
Reviews
Praised
- Familiar Git-like UX for data engineers and developers
- Zero-copy branching with no data duplication overhead
- S3 API compatibility requiring no changes to existing tools
- Fast quickstart and easy local sandbox setup
- Format-agnostic versioning (Parquet, images, video, JSON, CSV)
- Active open-source community, Slack support, and sample notebooks
- Atomic commits that prevent partial or inconsistent data states
- Broad integration coverage across modern data and ML stack
Criticized
- Enterprise features gated behind contact-sales pricing with no public tiers
- Self-hosted open-source deployment can be complex for smaller teams
- Limited publicly verifiable third-party reviews on enterprise platforms
- Architecture tightly coupled to S3-compatible object storage
- Governance features (RBAC, SSO, audit logs) only available in Enterprise tier
- Historically less accessible for individual data scientists on small datasets
Formal review scores are not publicly available on major enterprise review platforms (G2, Gartner Peer Insights, TrustRadius) as of early 2026, reflecting lakeFS's open-source heritage and early commercial maturity. Community sentiment from GitHub (5.3k stars, 446 forks) and Slack is strongly positive, with practitioners praising Git-like UX, zero-copy branching, and broad S3 compatibility. Independent technical evaluations (e.g., Data Minded, 2021) historically noted deployment complexity and overhead for smaller teams, though the product has evolved significantly since.
Pricing
lakeFS offers two tiers: Open Source (free forever, self-hosted, includes core data version control, branching, merging, hooks, garbage collection, and S3 API compatibility) and Enterprise (unlimited seats, contact sales for pricing, adds RBAC, SSO, SCIM, IAM Roles, lakeFS Mount, Audit Logs, Transactional Mirroring, Iceberg REST Catalog, Metadata Search, multi-storage backend support, simplified garbage collection, SOC2 certification, and a support SLA). Cloud-hosted access ('Try lakeFS') is available. No public per-seat or consumption-based Enterprise pricing is disclosed.
Limitations
- Enterprise-tier features (RBAC, SSO, SCIM, audit logs, lakeFS Mount, Iceberg REST Catalog, transactional mirroring, and SLA-backed support) require contacting sales; no pricing is published.
- lakeFS is architecturally tied to S3-compatible object storage and is less suitable for non-object-storage data environments.
- Structured review presence on major enterprise platforms (G2, Gartner Peer Insights, TrustRadius) is minimal, limiting third-party validation for procurement teams.
- Self-hosting the open-source edition at scale may require significant DevOps expertise.
- The platform has historically been less suited for lightweight individual data science workflows, a gap partially addressed by the November 2025 acquisition of DVC.
Topic coverage
Coverage by buyer topic
Curating multimodal training datasets: 0/5 cited (0%)
- Which platform handles parallel inference across millions of files for dataset enrichment without hitting OOM on a single machine?
- How do teams curate diverse, high-quality fine-tuning datasets for vision-language models from raw object storage?
- I have millions of unlabeled videos in S3 - which tool can help me filter and enrich them with model-generated metadata before training?
- What's the best way to curate a large image and video dataset for training a multimodal model?
- Looking for a Python SDK that lets me apply LLMs and vision models to clean and enrich a training dataset without moving data out of cloud storage.

Dataset versioning and lineage for ML: 4/5 cited (80%)
- What's the cleanest way to version control datasets alongside code for an ML project?
- Need atomic commits across data and code so I can roll back a model regression to its exact training snapshot - what works at scale?
- How do I track dataset lineage from raw files through preprocessing to the final training set so experiments are reproducible?
- Looking for a Git-like workflow for branching, committing, and merging changes to large training datasets stored in S3.
- Which tool gives me reproducible dataset snapshots without copying terabytes of data?

Detecting and fixing label errors: 0/5 cited (0%)
- What's the fastest workflow to find and re-label outliers in a 1M-image dataset?
- How do production ML teams audit annotation quality across labeling vendors before they ship to training?
- Which platforms use confident learning or model-based heuristics to flag bad labels for review?
- Looking for a tool that surfaces ambiguous and noisy labels in a multimodal dataset before I retrain.
- How can I automatically detect mislabeled examples in a computer vision training set?

Embedding-based dataset exploration and deduplication: 0/5 cited (0%)
- How are teams using embedding maps to surface coverage gaps and bias in training data?
- Looking for a tool that clusters and deduplicates an image dataset based on semantic similarity.
- How do I find near-duplicate examples across a multimodal training corpus before fine-tuning?
- Which platform lets me search a dataset by example - give an image or text, get nearest neighbors with metadata?
- What's the best way to explore a huge text dataset visually using embeddings?

Reproducible data pipelines over object storage: 2/5 cited (40%)
- Which tool supports incremental dataset builds - only reprocess the new files when underlying storage changes?
- How do I keep training datasets in sync with raw object storage while preserving versioned metadata, lineage, and access control?
- What's the cleanest way to author a dataset pipeline locally and scale it to hundreds of cloud workers without rewriting?
- Looking for a Python-native data pipeline framework that handles parallelism, checkpointing, and lineage without ETL infrastructure.
- How do I build a reproducible data preprocessing pipeline that reads from S3, applies Python transforms, and writes a versioned dataset?
Vertical Ranking
| # | Brand | Presence | Share of Voice | Docs | Blog | Mentions | Avg Pos | Sentiment |
|---|---|---|---|---|---|---|---|---|
| 1 | lakeFS | 10.7% | 44.1% | 0.0% | 9.3% | 8.0% | #4.8 | +0.53 |
| 2 | Encord | 8.0% | 17.6% | 0.0% | 6.7% | 2.7% | #6.5 | +0.33 |
| 3 | Voxel51 | 5.3% | 11.8% | 0.0% | 5.3% | 1.3% | #4.8 | +0.38 |
| 4 | Roboflow | 5.3% | 11.8% | 0.0% | 4.0% | 0.0% | #7.5 | +0.34 |
| 5 | DataChain | 4.0% | 8.8% | 2.7% | 0.0% | 4.0% | #7.0 | +0.70 |
| 6 | Activeloop | 1.3% | 5.9% | 0.0% | 0.0% | 1.3% | #13.0 | +0.50 |
| 7 | Nomic AI | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | — | — |
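The Presence column can be sanity-checked against the 75 prompt × platform pairs the report says it tracks. The sketch below assumes, without confirmation from the report, that Presence is the number of responses mentioning a brand divided by 75; under that assumption each reported percentage maps back to a whole number of responses.

```python
# Assumption (not stated in the report): presence = responses mentioning the brand / 75.
TOTAL_PAIRS = 75  # "75 prompt x platform pairs", as stated in the report

presence_pct = {
    "lakeFS": 10.7, "Encord": 8.0, "Voxel51": 5.3, "Roboflow": 5.3,
    "DataChain": 4.0, "Activeloop": 1.3, "Nomic AI": 0.0,
}

# Back out the nearest whole number of responses implied by each percentage.
implied = {brand: round(pct / 100 * TOTAL_PAIRS) for brand, pct in presence_pct.items()}

for brand, count in implied.items():
    recomputed = round(count / TOTAL_PAIRS * 100, 1)
    print(f"{brand:<10} {count:>2}/{TOTAL_PAIRS} = {recomputed}% (reported {presence_pct[brand]}%)")

# Share of responses in which the category leader does not appear at all.
absent_pct = round(100 - presence_pct["lakeFS"], 1)
print(f"lakeFS absent from {absent_pct}% of tracked responses")
```

If the assumption holds, the category leader appears in only a handful of the 75 tracked responses, which is consistent with the report's framing of lakeFS as a leader with low absolute coverage.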