Zeming Chen

Researcher in Multi-view Generation & Autonomous Driving

Email: czm369gg@gmail.com

GitHub: github.com/Czm369

Biography

I work on generative modeling and 3D perception for autonomous driving. My research focuses on using AIGC techniques such as diffusion models and autoencoders to construct world models from multi-view images. I am also interested in vision-language models (VLMs) and vision-language-action (VLA) models for end-to-end autonomous driving, aiming to bridge perception, reasoning, and planning.

Education

Experience

OpenMMLab, Shanghai AI Lab (2021.11–2023.03)

Research

BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving

BEV-VAE structure
Overview of the BEV-VAE framework with spatially consistent generation.
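The core idea behind BEV-VAE can be illustrated with a toy sketch: all camera views are encoded into a single shared bird's-eye-view (BEV) latent, and each view is decoded from that same latent, so cross-view spatial consistency comes from the shared representation. This is a minimal, invented illustration (toy linear encoder/decoder, made-up sizes), not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_view(img, W_enc):
    # Project one camera view into the shared BEV latent space (toy linear encoder).
    return img.reshape(-1) @ W_enc

def decode_view(z_bev, W_dec):
    # Decode one camera view from the shared BEV latent (toy linear decoder).
    return (z_bev @ W_dec).reshape(8, 8)

# Toy setup: 6 camera views of 8x8 images and a 16-dim BEV latent
# (sizes are illustrative, not the paper's).
views = rng.normal(size=(6, 8, 8))
W_enc = rng.normal(size=(64, 16)) * 0.1
W_decs = [rng.normal(size=(16, 64)) * 0.1 for _ in range(6)]

# Fuse all views into one BEV latent; every decoded view reads from it,
# which is the mechanism that enforces cross-view consistency.
z_bev = np.mean([encode_view(v, W_enc) for v in views], axis=0)
recons = [decode_view(z_bev, W) for W in W_decs]
```

In the real model the encoders and decoders are deep networks and the latent is regularized variationally; the sketch only shows the shared-latent data flow.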

MixPL: Semi-supervised Object Detection with Mixed Pseudo Labels

MixPL structure
The framework of DetMeanTeacher and MixPL for Semi-supervised Object Detection.
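MixPL mixes pseudo labels across unlabeled images with Mosaic- and Mixup-style augmentation. A toy sketch of the mosaic-style step, assuming an invented helper `mosaic_pseudo_labels` and simple (x1, y1, x2, y2, label) box tuples (not the paper's actual data structures):

```python
def mosaic_pseudo_labels(boxes_a, boxes_b, offset_x):
    # Place image B to the right of image A and merge their pseudo boxes:
    # boxes from B are shifted by the horizontal offset of image B's placement.
    shifted = [(x1 + offset_x, y1, x2 + offset_x, y2, label)
               for (x1, y1, x2, y2, label) in boxes_b]
    return boxes_a + shifted

# Usage: merge pseudo boxes from two unlabeled images stitched side by side.
boxes_a = [(0, 0, 10, 10, "car")]
boxes_b = [(5, 5, 15, 15, "pedestrian")]
mixed = mosaic_pseudo_labels(boxes_a, boxes_b, offset_x=100)
```

The mixed image-plus-boxes pair is then treated as an ordinary pseudo-labeled training sample for the student detector.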

Contact

I'm open to collaborations on world models and end-to-end autonomous driving. Feel free to reach out via email or connect on GitHub.