Question 1

What does lakeFS do?

Accepted Answer

lakeFS, developed by Treeverse, is an open-source data version control system that applies Git-like operations—branching, committing, merging, and reverting—to data lakes stored in object storage (Amazon S3, Azure Blob Storage, Google Cloud Storage). Founded in 2020 by Oz Katz and Dr. Einat Orr, lakeFS operates as a metadata layer atop existing storage without moving or duplicating data. It enables data and AI/ML teams to create isolated environments for testing, ensure reproducibility of model training, enforce data quality via CI/CD hooks, roll back from data incidents, and maintain full data lineage and audit trails. Available as a free open-source edition and a commercial Enterprise tier, lakeFS is used by organizations including Arm, Netflix, Volvo, Lockheed Martin, NASA, and the U.S. Department of Energy. In November 2025, lakeFS acquired the DVC open-source project from Iterative.ai, extending its reach from enterprise to individual practitioners.

lakeFS is an open-source and enterprise data version control platform that transforms object storage into Git-like repositories, enabling data and AI teams to branch, commit, merge, and roll back datasets at petabyte scale without copying data. Built by Treeverse and backed by $43M in funding, it supports reproducible ML workflows, data quality enforcement, and governance across multi-cloud and on-premises data lakes, with deep integration across the modern data and AI tooling stack.
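
The "metadata layer" and "without copying data" ideas above can be illustrated with a toy model. This is a minimal sketch and not lakeFS's implementation or API: `ToyRepo`, its methods, and the `s3://bucket/...` addresses are hypothetical names chosen for illustration. It assumes a commit is just a mapping from logical paths to object addresses, so creating a branch copies a pointer rather than any data.

```python
class ToyRepo:
    """Toy model of Git-like versioning over object storage (illustrative only)."""

    def __init__(self):
        root = {"message": "init", "objects": {}, "parent": None}
        self.branches = {"main": root}  # branch name -> head commit
        self.staged = {"main": {}}      # branch name -> uncommitted path changes

    def create_branch(self, name, source):
        # Zero-copy: the new branch points at the same commit as the source.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch, path, address):
        # Record where the object lives; the bytes themselves stay in storage.
        self.staged[branch][path] = address

    def commit(self, branch, message):
        parent = self.branches[branch]
        # Only the small path -> address mapping is copied, never the data.
        objects = {**parent["objects"], **self.staged[branch]}
        self.branches[branch] = {"message": message, "objects": objects, "parent": parent}
        self.staged[branch] = {}

    def merge(self, source, dest):
        # Simplified merge: entries from the source win, no conflict detection.
        src, dst = self.branches[source], self.branches[dest]
        objects = {**dst["objects"], **src["objects"]}
        self.branches[dest] = {"message": f"merge {source}", "objects": objects, "parent": dst}

    def revert(self, branch):
        # Roll the branch back to the commit before its current head.
        parent = self.branches[branch]["parent"]
        if parent is None:
            raise ValueError("nothing to revert")
        self.branches[branch] = parent


repo = ToyRepo()
repo.put("main", "events/day1.parquet", "s3://bucket/obj-a")
repo.commit("main", "load day 1")

repo.create_branch("experiment", "main")
shares_commit = repo.branches["experiment"] is repo.branches["main"]

repo.put("experiment", "events/day1.parquet", "s3://bucket/obj-b")
repo.commit("experiment", "rewrite day 1")
main_before_merge = repo.branches["main"]["objects"]["events/day1.parquet"]

repo.merge("experiment", "main")
main_after_merge = repo.branches["main"]["objects"]["events/day1.parquet"]

repo.revert("main")
main_after_revert = repo.branches["main"]["objects"]["events/day1.parquet"]
```

In this sketch the experiment branch changes nothing on main until the merge, and the revert restores the earlier snapshot by moving a pointer.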

Sources

lakefs.io, github.com, prnewswire.com

Question 2

Who is lakeFS best for?

Accepted Answer

lakeFS is built for data engineers building and maintaining large-scale data lakes and ETL pipelines; ML/AI engineers and MLOps teams managing model training data and experiments; data scientists who need reproducible, versioned datasets for research; and DataOps and platform engineering teams at data-intensive enterprises. Common use cases include reproducible ML/AI model training with versioned, immutable dataset snapshots; isolated ETL testing on production data without copying it or risking production state; and data quality enforcement via the Write-Audit-Publish pipeline pattern.
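
The Write-Audit-Publish pattern mentioned above can be sketched in a few lines. This is a generic, in-memory illustration and does not use lakeFS or its hooks: `write_audit_publish`, `no_null_ids`, and the sample rows are hypothetical. It assumes new data is written to an isolated staging copy, checked there, and made visible to consumers only if every check passes.

```python
def no_null_ids(rows):
    """Audit check: every row must carry a non-null id."""
    return all(row.get("id") is not None for row in rows)


def write_audit_publish(published, new_rows, checks):
    """Return (dataset, failed_check_names); publish only if all checks pass."""
    # Write: build an isolated staging version; consumers still see `published`.
    staging = list(published) + list(new_rows)
    # Audit: run every check against the staging version.
    failures = [check.__name__ for check in checks if not check(staging)]
    if failures:
        # Reject: the published dataset is returned unchanged.
        return published, failures
    # Publish: the audited staging version becomes the visible dataset.
    return staging, []


published = [{"id": 1, "value": 10}]

good, good_failures = write_audit_publish(published, [{"id": 2, "value": 20}], [no_null_ids])
bad, bad_failures = write_audit_publish(published, [{"id": None, "value": 30}], [no_null_ids])
```

With a versioning system in place, the staging copy would typically be a branch and the publish step a merge, so a failed audit never touches the production state.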

Question 3

How is lakeFS priced?

Accepted Answer

lakeFS offers two tiers: Open Source (free forever, self-hosted, includes core data version control, branching, merging, hooks, garbage collection, and S3 API compatibility) and Enterprise (unlimited seats, contact sales for pricing, adds RBAC, SSO, SCIM, IAM Roles, lakeFS Mount, Audit Logs, Transactional Mirroring, Iceberg REST Catalog, Metadata Search, multi-storage backend support, simplified garbage collection, SOC2 certification, and a support SLA). Cloud-hosted access ('Try lakeFS') is available. No public per-seat or consumption-based Enterprise pricing is disclosed.

Question 4

What are the alternatives to lakeFS?

Accepted Answer

Common alternatives to lakeFS in the AI Data Curation and Dataset Versioning category include Encord, Voxel51, Nomic AI, Activeloop, and DataChain. See the full comparison hub at /verticals/ai-data-curation-and-dataset-versioning/compare.

Question 5

What do users praise about lakeFS?

Accepted Answer

Users frequently praise the familiar Git-like UX for data engineers and developers; zero-copy branching with no data duplication overhead; S3 API compatibility that requires no changes to existing tools; a fast quickstart and easy local sandbox setup; format-agnostic versioning (Parquet, images, video, JSON, CSV); an active open-source community, Slack support, and sample notebooks; atomic commits that prevent partial or inconsistent data states; and broad integration coverage across the modern data and ML stack.

Question 6

What are common complaints about lakeFS?

Accepted Answer

Frequently cited limitations: Enterprise features are gated behind contact-sales pricing with no public tiers; self-hosted open-source deployment can be complex for smaller teams; publicly verifiable third-party reviews on enterprise platforms are limited; the architecture is tightly coupled to S3-compatible object storage; governance features (RBAC, SSO, audit logs) are available only in the Enterprise tier; and the product has historically been less accessible to individual data scientists working with small datasets.

Question 7

When was lakeFS founded and where?

Accepted Answer

lakeFS was founded in 2020 by Oz Katz and Einat Orr and is headquartered in New York, NY, USA.

Question 8

How big is lakeFS?

Accepted Answer

lakeFS reports 11-50 employees and more than 90,000 customer organizations.

Prompt	Gemini Search	ChatGPT	Perplexity
Curating multimodal training datasets: 0/5 cited (0%)
Which platform handles parallel inference across millions of files for dataset enrichment without hitting OOM on a single machine?
I have millions of unlabeled videos in S3 — which tool can help me filter and enrich them with model-generated metadata before training?
Looking for a Python SDK that lets me apply LLMs and vision models to clean and enrich a training dataset without moving data out of cloud storage.
How do teams curate diverse, high-quality fine-tuning datasets for vision-language models from raw object storage?
What's the best way to curate a large image and video dataset for training a multimodal model?
Dataset versioning and lineage for ML: 1/5 cited (20%)
What's the cleanest way to version control datasets alongside code for an ML project?
Looking for a Git-like workflow for branching, committing, and merging changes to large training datasets stored in S3.
How do I track dataset lineage from raw files through preprocessing to the final training set so experiments are reproducible?
Need atomic commits across data and code so I can roll back a model regression to its exact training snapshot — what works at scale?
Which tool gives me reproducible dataset snapshots without copying terabytes of data?
Detecting and fixing label errors: 0/5 cited (0%)
What's the fastest workflow to find and re-label outliers in a 1M-image dataset?
Looking for a tool that surfaces ambiguous and noisy labels in a multimodal dataset before I retrain.
Which platforms use confident learning or model-based heuristics to flag bad labels for review?
How can I automatically detect mislabeled examples in a computer vision training set?
How do production ML teams audit annotation quality across labeling vendors before they ship to training?
Embedding-based dataset exploration and deduplication: 0/5 cited (0%)
Which platform lets me search a dataset by example — give an image or text, get nearest neighbors with metadata?
How do I find near-duplicate examples across a multimodal training corpus before fine-tuning?
How are teams using embedding maps to surface coverage gaps and bias in training data?
What's the best way to explore a huge text dataset visually using embeddings?
Looking for a tool that clusters and deduplicates an image dataset based on semantic similarity.
Reproducible data pipelines over object storage: 1/5 cited (20%)
Looking for a Python-native data pipeline framework that handles parallelism, checkpointing, and lineage without ETL infrastructure.
What's the cleanest way to author a dataset pipeline locally and scale it to hundreds of cloud workers without rewriting?
Which tool supports incremental dataset builds — only reprocess the new files when underlying storage changes?
How do I build a reproducible data preprocessing pipeline that reads from S3, applies Python transforms, and writes a versioned dataset?
How do I keep training datasets in sync with raw object storage while preserving versioned metadata, lineage, and access control?

#	Brand	Presence	Share of Voice	Docs	Blog	Mentions	Avg Pos	Sentiment
1	Voxel51	4.0%	23.1%	0.0%	2.7%	1.3%	#6.0	+0.50
2	Encord	4.0%	38.5%	0.0%	4.0%	0.0%	#6.4	+0.00
3	lakeFS	2.7%	23.1%	0.0%	2.7%	1.3%	#4.7	+0.00
4	Nomic AI	1.3%	15.4%	1.3%	0.0%	0.0%	#6.0	+0.70
5	Activeloop	0.0%	0.0%	0.0%	0.0%	0.0%	—	—
6	DataChain	0.0%	0.0%	0.0%	0.0%	0.0%	—	—
7	Roboflow	0.0%	0.0%	0.0%	0.0%	0.0%	—	—

AI visibility report for lakeFS
