Schedule for: 23w5153 - 3D Generative Models

Beginning on Sunday, July 9, and ending Friday, July 14, 2023

All times in Banff, Alberta time, MDT (UTC-6).

Sunday, July 9
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering
Meet & Greet in BIRS Lounge (PDC 2nd Floor)
(Other (See Description))
Monday, July 10
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 09:00 Introduction from BIRS Staff (TCPL 201)
09:00 - 10:00 Ke Li: Introduction and Welcome from Organizers (TCPL 201)
10:00 - 10:25 Towaki Takikawa: Copyright, Provenance, and Attribution for Generative Art
The advent of text-based generative art (both 2D and 3D) and its recent widespread availability through open source projects like Stable Diffusion and threestudio has led to rapid development of these systems. At the same time, this technology can potentially pose an economic threat to an entire industry of visual artists, and there seems to be a need for an incentive structure for artists through generative AI. This talk will outline some of the current regulations (and challenges) around intellectual property for generative art, and introduce some research challenges that potential incentive structures (like provenance and attribution) pose for generative art. $$$$ Bio: Towaki Takikawa is a Ph.D. student at the University of Toronto, advised by Professors Alec Jacobson and Sanja Fidler. He is also a research scientist at NVIDIA Research, working with David Luebke in the Hyperscale Graphics Systems research group. His research resides at the intersection of computer graphics, computer vision, and machine learning, with specific interests in algorithms and data structures for efficient neural representations for applications like 3D reconstruction and 3D generative modeling. He received his undergraduate degree in computer science from the University of Waterloo and enjoys playing sports and music in his free time.
(TCPL 201)
10:25 - 10:50 David Fleet: 3D Structure and Motion at Atomic Resolutions
One of the foremost problems in structural biology concerns the inference of the atomic-resolution 3D structure of biomolecules from electron cryo-microscopy (cryo-EM). The problem, in a nutshell, is a form of multi-view 3D reconstruction, inferring the 3D electron density of a particle from large numbers of noisy images from an electron microscope. I'll outline the nature of the problem and several of the key algorithmic developments, with particular emphasis on the challenging case in which the imaged molecule exhibits a wide range of (non-rigid) conformational variation. Through single particle cryo-EM, methods from computer vision and machine learning are reshaping structural biology and drug discovery. Joint work with Ali Punjani. $$$$ Bio: David Fleet is a Research Scientist at Google DeepMind (since 2020) and a Professor of Computer Science at the University of Toronto (since 2004). From 2012-2017 he served as Chair of the Department of Computer and Mathematical Sciences, University of Toronto Scarborough. Before joining the University of Toronto, he worked at Xerox PARC (1999-2004) and Queen's University (1991-1998). He received his PhD in Computer Science from the University of Toronto in 1991. He was awarded an Alfred P. Sloan Research Fellowship in 1996 for his research on visual neuroscience. He received research paper awards at ICCV 1999, CVPR 2001, UIST 2003, BMVC 2009, and NeurIPS 2022. In 2010, with Michael Black and Hedvig Sidenbladh, he received the Koenderink Prize for fundamental contributions to computer vision. In 2022, his work on cryo-EM with Ali Punjani received the Paper of the Year Award from the Journal of Structural Biology. He served as Associate Editor of IEEE Trans PAMI (2000-2004), as Program Co-Chair for CVPR (2003) and ECCV (2014), and as Associate Editor-In-Chief for IEEE Trans PAMI (2005-2008). He was a Senior Fellow of the Canadian Institute for Advanced Research (2005-2019), and currently holds a Canada CIFAR AI Chair. His current research interests span computer vision, image processing, machine learning and computational biology.
(TCPL 201)
10:50 - 11:15 Alex Yu: Connecting NeRFs, generative 3D content, and real problems
Luma's vision is 3D for everyone -- making 3D interactive content abundant and accessible like videos are today. Since launching our NeRF capture app and text-to-3D service about half a year ago, Luma has attracted much attention both in academic circles and among an enthusiast community, but a substantial gap remains before this 3D content, captured and/or generated, would be broadly appealing. In this talk, I will outline particular challenges in developing and applying models in the 3D medium vs. images/videos, introduce some of our work towards resolving practical issues, and argue how insights from NeRFs and neural rendering will continue to be relevant in generative 3D research. $$$$ Bio: Alex is a co-founder at Luma AI (lumalabs.ai), a startup re-imagining 3D content creation using neural fields and generative models. As CTO, Alex is leading the development of Luma’s NeRF-based 3D reconstruction and neural rendering technologies as well as Luma’s 3D generative model efforts. Prior to Luma, he worked with Prof. Angjoo Kanazawa and Matthew Tancik at UC Berkeley on NeRF-related research such as Plenoxels (CVPR 22), PlenOctrees (ICCV 21), and PixelNeRF (CVPR 21).
(TCPL 201)
11:15 - 11:40 Helge Rhodin: Can Generative Models Surpass Contrastive Representation Learning?
First, I will introduce the generative models I have developed for the domains of human motion capture, pose estimation, and novel view synthesis. Specifically, I will focus on models that produce disentangled and interpretable latent codes. Next, I will outline our ongoing work on expanding these methods to general representation learning, offering an alternative approach to contrastive learning. $$$$ Bio: Helge Rhodin is an Assistant Professor at UBC and is affiliated with the computer vision and graphics labs. Prior to this, he held positions as a lecturer and postdoc at EPFL and completed his PhD at the MPI for Informatics at Saarland University. Rhodin's research has made significant contributions to 3D computer vision and self-supervised machine learning, enabling applications in sports, medicine, neuroscience, and augmented reality.
(TCPL 201)
12:00 - 13:00 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:00 - 14:00 Guided Tour of The Banff Centre
Meet at the PDC front desk for a guided tour of The Banff Centre campus.
(PDC Front Desk)
14:00 - 14:20 Group Photo
Meet in the foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
14:30 - 14:55 Ben Poole: Neural unReal Fields
TBD
(TCPL 201)
14:55 - 15:20 Zhiqin Chen: Interactive modeling via conditional detailization
There have been significant advancements in deep generative models, particularly in the field of text-to-shape models powered by diffusion and large language models. However, several fundamental challenges remain unaddressed, including the lack of fine-grained control and slow inference speed. In this talk, I will introduce a different approach to generate high-resolution, detailed geometries via conditional detailization. Specifically, our deep generative network detailizes an input coarse voxel shape through voxel upsampling, while conditioning the style of the geometric details on a high-resolution exemplar shape. Our approach provides fine-grained structural control as artists can freely edit the coarse voxels to define the overall structure of the detailized shape. Additionally, our approach offers fast inference, enabling real-time interactive 3D modeling. $$$$ Bio: Zhiqin is an incoming research scientist at Adobe. He received his PhD and Master's degree from Simon Fraser University, supervised by Prof. Hao (Richard) Zhang, and obtained his Bachelor's degree from Shanghai Jiao Tong University. He won the best student paper award at CVPR 2020 and was a best paper award candidate at CVPR 2023. He was an NVIDIA graduate fellowship finalist and received a Google PhD Fellowship in 2021. He has also interned at Adobe, NVIDIA, and Google in the past. His research interest is in computer graphics with a specialty in geometric modeling, machine learning, 3D reconstruction, and shape synthesis.
(TCPL 201)
15:20 - 15:45 Rana Hanocka: Data-Driven Shape Editing - without 3D Data
Much of the current success of deep learning has been driven by massive amounts of curated data, whether annotated or unannotated. Compared to image datasets, developing large-scale 3D datasets is either prohibitively expensive or impractical. In this talk, I will present several works which harness the power of data-driven deep learning for tasks in shape editing and processing, without any 3D datasets. I will discuss works which learn to synthesize and analyze 3D geometry using large image datasets. $$$$ Bio: Rana Hanocka is an Assistant Professor at the University of Chicago and holds a courtesy appointment at the Toyota Technological Institute at Chicago (TTIC). Rana founded and directs the 3DL (Threedle) research collective, comprised of enthusiastic researchers passionate about 3D, machine learning, and visual computing. Rana’s research interests span computer graphics, computer vision, and machine learning. Rana completed her Ph.D. at Tel Aviv University under the supervision of Daniel Cohen-Or and Raja Giryes. Her Ph.D. research focused on building neural networks for irregular 3D data and applying them to problems in geometry processing.
(TCPL 201)
15:45 - 16:10 Noam Rotstein: FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions
Abstract: Image captioning is a central task in computer vision which has experienced substantial progress following the advent of vision-language pre-training techniques. In this work, we highlight a frequently overlooked limitation of captioning models that often fail to capture semantically significant elements. This drawback can be traced back to the text-image datasets; while their captions typically offer a general depiction of image content, they frequently omit salient details. To mitigate this limitation, we propose FuseCap - a novel method for enriching captions with additional visual information, obtained from vision experts, such as object detectors, attribute recognizers, and Optical Character Recognizers (OCR). Our approach fuses the outputs of such vision experts with the original caption using a large language model (LLM), yielding enriched captions that present a comprehensive image description. We validate the effectiveness of the proposed caption enrichment method through both quantitative and qualitative analysis. Our method is then used to curate the training set of a BLIP-based captioning model, which surpasses current state-of-the-art approaches in generating accurate and detailed captions while using significantly fewer parameters and training data. In future work, we aim to adopt techniques presented in this work into the 3D domain. $$$$ Bio: Noam Rotstein is a PhD student based in Haifa District, Israel, specializing in computer vision, 3D vision, and deep learning. Currently enrolled at the Technion - Israel Institute of Technology, he conducts research under the guidance of Professor Ron Kimmel. Having successfully completed his Master's degree under Professor Kimmel, Noam also holds a Bachelor's degree in Electrical Engineering from the Technion. In his professional capacity, he is an AI Algorithm Developer at Lumix.AI, focusing on video analytics following previous research internships at Intel RealSense and Eye-Minders. His current research centers around multimodal learning.
(TCPL 201)
16:30 - 18:30 Hike / Free time (TCPL Foyer)
18:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Tuesday, July 11
07:00 - 08:25 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:25 - 08:50 Andrea Tagliasacchi: Neural fields for 3D Vision, a MAP perspective
Neural fields rapidly emerged as an essential tool in 3D visual perception. When there is an abundance of data, we can employ simple maximum-likelihood techniques to optimize the parameters of a neural field. However, what if we don't have enough data? In such cases, how can we incorporate effective prior knowledge to eliminate unfavorable local minima during optimization? (i.e., optimizing in a maximum-a-posteriori manner). In this presentation, I will provide a historical overview of how priors have been developed in my previous research, and offer an educated guess on their potential integration into general-purpose 3D reasoning networks, towards the realization of 3D foundation models. $$$$ Bio: Andrea Tagliasacchi is an associate professor at Simon Fraser University (Vancouver, Canada), where he holds the appointment of “visual computing research chair” within the school of computing science. He is also a part-time (20%) staff research scientist at Google Brain (Toronto), as well as an associate professor (status only) in the computer science department at the University of Toronto. Before joining SFU, he spent four wonderful years as a full-time researcher at Google (mentored by Paul Lalonde, Geoffrey Hinton, and David Fleet). Before joining Google, he was an assistant professor at the University of Victoria (2015-2017), where he held the Industrial Research Chair in 3D Sensing (jointly sponsored by Google and Intel). His alma maters include EPFL (postdoc), SFU (PhD, NSERC Alexander Graham Bell fellow), and Politecnico di Milano (MSc, gold medalist). His research focuses on 3D visual perception, which lies at the intersection of computer vision, computer graphics, and machine learning.
(TCPL 201)
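For readers less familiar with the maximum-likelihood vs. maximum-a-posteriori distinction the abstract draws on, the standard formulation is as follows (the notation is illustrative, not the speaker's):

    \theta_{\mathrm{MLE}} = \arg\max_{\theta} \textstyle\sum_i \log p(I_i \mid \theta), \qquad
    \theta_{\mathrm{MAP}} = \arg\max_{\theta} \big[ \textstyle\sum_i \log p(I_i \mid \theta) + \log p(\theta) \big],

where \theta denotes the neural field parameters, \{I_i\} the observed images, and p(\theta) a prior over fields; the talk concerns where such a prior can come from when the images alone under-constrain the optimization.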
08:50 - 09:15 Shangzhe Wu: Learning Dynamic 3D Objects in the Wild
We live in a dynamic physical world, surrounded by all kinds of 3D objects. Designing perception systems that can see the world in 3D from only 2D observations is not only key to many AR and robotics applications, but also a cornerstone for general visual understanding. Prevalent learning-based methods often treat images simply as compositions of 2D patterns, ignoring the fact that they arise from a 3D world. The major obstacle is the lack of large-scale 3D annotations for training, which are prohibitively expensive to collect. Natural intelligences, on the other hand, develop comprehensive 3D understanding of the world primarily by observing 2D projections, without relying on extensive 3D supervision. Can machines learn to see the 3D world without explicit 3D supervision? In this talk, I will present some of our recent efforts in learning physically-grounded, disentangled representations for dynamic 3D objects from raw 2D observations in the wild, through an inverse rendering framework. In particular, I will focus on a recent project, MagicPony, which learns articulated 3D animals simply from online image collections. $$$$ Bio: Shangzhe Wu is a Postdoctoral Researcher at Stanford University working with Jiajun Wu. He obtained his PhD from the University of Oxford, advised by Andrea Vedaldi and Christian Rupprecht at the Visual Geometry Group (VGG). His current research focuses on unsupervised 3D perception and inverse rendering. He also spent time interning at Google Research with Noah Snavely. His work on unsupervised learning of symmetric 3D objects received the Best Paper Award at CVPR 2020. Homepage: https://elliottwu.com.
(TCPL 201)
09:15 - 09:40 Jiguo Cao: Supervised two-dimensional functional principal component analysis with time-to-event outcomes and mammogram imaging data
Screening mammography aims to identify breast cancer early and secondarily measures breast density to classify women at higher or lower than average risk for future breast cancer in the general population. Despite the strong association of individual mammography features to breast cancer risk, the statistical literature on mammogram imaging data is limited. While functional principal component analysis (FPCA) has been studied in the literature for extracting image-based features, it is conducted independently of the time-to-event response variable. With the consideration of building a prognostic model for precision prevention, we present a set of flexible methods, supervised FPCA (sFPCA) and functional partial least squares (FPLS), to extract image-based features associated with the failure time while accommodating the added complication from right censoring. Throughout the article, we hope to demonstrate that one method is favored over the other under different clinical setups. The proposed methods are applied to the motivating data set from the Joanne Knight Breast Health cohort at Siteman Cancer Center. Our approaches not only obtain the best prediction performance compared to the benchmark model, but also reveal different risk patterns within the mammograms. $$$$ Bio: Dr. Jiguo Cao is the Canada Research Chair in Data Science and Professor at the Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada. Dr. Cao’s research interests include functional data analysis (FDA), image analysis, and machine learning. His statistical methods are applied to real-world problems across various disciplines, including neuroscience, public health, image analysis, genetics, pharmacology, ecology, environment, and engineering. He was awarded the prestigious CRM-SSC award in 2021 jointly from the Statistical Society of Canada (SSC) and Centre de recherches mathématiques (CRM) to recognize his research excellence and accomplishments.
(TCPL 201)
09:40 - 10:05 Ke Li: Regression Done Right
Regression is the tool of choice every time one would like to predict a continuously-valued variable, be it the colour of a pixel, a depth value or a 3D coordinate. It is commonly used to solve inverse problems, where the goal is to recover original data from its corrupted version. Examples of inverse problems abound in both 2D and 3D, e.g., super-resolution, colourization, depth estimation, 3D reconstruction, etc. The corruption process is typically many-to-one, so there could be many possible versions of the original data that are all consistent with the corrupted observation. Yet, regression only yields a single prediction, which is often blurry or desaturated. In this talk, I will show a general way to solve this problem using conditional Implicit Maximum Likelihood Estimation (cIMLE), which can generate arbitrarily many predictions that are all consistent with the corrupted data. I will then illustrate applications to super-resolution, colourization, image synthesis, 3D shape synthesis, monocular depth estimation and exposure control. $$$$ Bio: Ke Li is an Assistant Professor and Visual Computing Chair in the School of Computing Science at Simon Fraser University and is broadly interested in computer vision, machine learning, NLP and algorithms. He was previously a Member of the Institute for Advanced Study (IAS), and received his Ph.D. from UC Berkeley and B.Sc. from the University of Toronto.
(TCPL 201)
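As a rough illustration of the cIMLE objective described in the abstract, the sketch below shows the core idea in PyTorch: for each input, several latent codes are sampled and only the prediction closest to the ground truth is penalized, so the model is rewarded for covering a plausible output rather than averaging over all of them. The generator architecture, dimensions, and hyperparameters are placeholders for illustration, not the speaker's implementation.

# Minimal sketch of a conditional IMLE (cIMLE) training objective.
# All names and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Toy conditional generator: maps (input x, latent z) -> prediction y_hat."""
    def __init__(self, x_dim=16, z_dim=8, y_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def cimle_loss(G, x, y, num_samples=10, z_dim=8):
    """For each (x, y) pair: draw several latent codes, keep only the sample
    closest to the ground truth, and penalize its distance. This lets the
    generator produce many consistent predictions instead of a blurry average."""
    B = x.shape[0]
    z = torch.randn(num_samples, B, z_dim)
    preds = torch.stack([G(x, z[j]) for j in range(num_samples)])  # (S, B, y_dim)
    dists = ((preds - y.unsqueeze(0)) ** 2).mean(dim=-1)           # (S, B)
    nearest = dists.min(dim=0).values                              # (B,)
    return nearest.mean()

# Usage: one gradient step on random toy data.
G = CondGenerator()
opt = torch.optim.Adam(G.parameters(), lr=1e-3)
x, y = torch.randn(4, 16), torch.randn(4, 16)
loss = cimle_loss(G, x, y)
opt.zero_grad(); loss.backward(); opt.step()

At test time, sampling different latent codes z for the same corrupted input yields distinct predictions, each consistent with the observation.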
10:05 - 10:35 Coffee Break (TCPL Foyer)
10:35 - 11:00 Noam Aigerman: Deformations for Generative Modeling
Deformations of shapes are one of the most expressive mathematical tools in geometry processing. Beyond immediate 3D vision and computer graphics tasks such as registration and animation, they are a key ingredient in most generative methods related to explicit domains (e.g., triangle meshes), and can also be extremely beneficial in other modalities, such as neural fields. In this talk I will discuss a few of my recent works on incorporating geometric deformations into the deep learning pipeline, focusing on a few examples showing their versatility and efficacy within generative techniques, such as text-guided deformation of meshes and text-guided generation of Escher-like tileable 2D illustrations. $$$$ Bio: Noam Aigerman is an incoming Assistant Professor at the University of Montreal, and currently a research scientist at Adobe Research. Prior to that he obtained his PhD at the Weizmann Institute of Science under the supervision of Prof. Yaron Lipman. His field of research lies at the intersection between geometry processing and deep learning.
(TCPL 201)
11:00 - 11:25 Robin Rombach: Generative Modeling in \strikethrough{Latent Space} Compute-Constrained Environments
The open source release of Stable Diffusion has recently caused a "Cambrian explosion of creative AI tools". In this talk, I will discuss the underlying generative modeling paradigm and the training process of Stable Diffusion. In particular, I will discuss two-stage approaches for efficient generative modeling, focusing on autoregressive transformers and diffusion models trained in the latent space of VQGAN and related autoencoding models. $$$$ Bio: Robin is a research scientist at Stability AI and a PhD student at LMU Munich. After studying physics at the University of Heidelberg from 2013-2020, he started a PhD in computer science in the Computer Vision group in Heidelberg in 2020 under the supervision of Björn Ommer and moved to LMU Munich with the research group in 2021. His research focuses on generative deep learning models, in particular text-to-image systems. During his PhD, Robin was instrumental in the development and publication of several now widely used projects, such as VQGAN and Taming Transformers, and Latent Diffusion Models. In collaboration with Stability AI, Robin scaled the latent diffusion approach and published a series of models now known as Stable Diffusion, which have been widely adopted by the community and sparked a series of research papers. Robin is a proponent of open source ML models.
(TCPL 201)
11:30 - 12:15 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
12:30 - 18:00 Lake Louise / Rafting / Free time
Activity depends on what was booked by the attendee
(TCPL Foyer)
18:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Wednesday, July 12
07:00 - 08:25 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:30 - 08:55 Terrance DeVries: Generative Scene Networks
Most recent work on 3D generation focuses on simplified settings such as isolated central objects or aligned forward facing scenes. In this work we make an initial attempt at extending 3D generation to the realm of complex, realistic indoor scenes. We introduce Generative Scene Networks (GSN), a 3D generative model which is trained on posed RGBD images and generates 3D scenes parameterized as neural radiance fields (NeRFs), allowing for unconstrained camera movement and exploration. In order to handle the additional complexity of larger scenes we introduce a local conditioning mechanism for NeRF, allowing us to model a diverse distribution of scenes with higher fidelity than previous methods. $$$$ Bio: Terrance DeVries is a research scientist and founding team member at Luma AI. His research revolves around improving the accessibility and efficiency of 3D content creation, with a specific focus on 3D capture and generation using foundational models. Terrance received his PhD from the University of Guelph in 2021, where he was advised by Graham Taylor.
(TCPL 201)
08:55 - 09:20 Noah Snavely: What Goes Wrong When Running 3D Reconstruction at Scale
Abstract: TBD $$$$ Bio: Noah Snavely is an associate professor of Computer Science at Cornell University and Cornell Tech, and also a research scientist at Google. Noah's research interests are in computer vision and graphics, in particular 3D understanding and depiction of scenes from images. Noah is the recipient of a PECASE, an Alfred P. Sloan Fellowship, a SIGGRAPH Significant New Researcher Award, and a Helmholtz Prize.
(TCPL 201)
09:20 - 09:45 Bolei Zhou: Large-scale Scene Generation and Simulation from Bird's-Eye View Layout
A Bird's-Eye View (BEV) layout explicitly describes the spatial configuration of objects and their relations in a scene. In this talk, I will present our work on utilizing BEV as the intermediate representation for scene generation and scene simulation. Given a BEV layout, we can render objects such as vehicles and pedestrians into realistic traffic scenes. By looping in a physical simulator, we can further simulate a diverse set of interactive environments and use them to train and evaluate embodied AI agents. $$$$ Bio: Bolei Zhou is an Assistant Professor in the Computer Science Department at the University of California, Los Angeles (UCLA). He earned his Ph.D. from MIT in 2018. His research interest lies at the intersection of computer vision and machine autonomy, focusing on enabling interpretable human-AI interaction. He has developed many widely used neural network interpretation methods such as CAM and Network Dissection, as well as the computer vision benchmarks Places and ADE20K. He has been an area chair for CVPR, ECCV, ICCV, and AAAI. He received MIT Tech Review's Innovators under 35 in Asia-Pacific Award.
(TCPL 201)
09:45 - 10:10 Ben Mildenhall: 3D Representations: Learning vs. Knowing
Computer graphics gives us excellent models for rendering physically realistic images from virtual representations of 3D content. In a sense, we (and physics) “know” these forward models to be true. However, it is not always the case that the most precisely modeled forward pipeline is the easiest to invert, and we often must strike a balance between hard-coded and “learned” components in our representations and neural rendering pipelines. This talk will address this tension, with some anecdotes from work on 3D reconstruction and generation. $$$$ Bio - Ben Mildenhall is a research scientist at Google, where he works on problems at the intersection of graphics and computer vision. He received his PhD in computer science at UC Berkeley in 2020, advised by Ren Ng. His thesis work on neural radiance fields was awarded the ACM Doctoral Dissertation Award Honorable Mention and David J. Sakrison Memorial Prize.
(TCPL 201)
10:05 - 10:45 Coffee Break (TCPL Foyer)
10:45 - 11:10 Arianna Rampini: Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation
In recent years, large pre-trained models have revolutionized creative applications in 3D vision, particularly in tasks like text-to-shape generation. This talk explores the remarkable potential of using these models to generate 3D shapes from sketches. By conditioning a 3D generative model on features extracted from synthetic renderings, we can effectively generate 3D shapes from sketches at inference time, without the need for paired datasets during training. I will show how our method can generalize across voxel, implicit, and CAD representations and synthesize consistent 3D shapes from a variety of inputs ranging from casual doodles to professional sketches with different levels of ambiguity. $$$$ Bio: Arianna Rampini has been a Research Scientist at Autodesk since 2022. She holds a Ph.D. in Computer Science and a master’s degree in Theoretical Physics from Sapienza University (Italy). During her doctoral studies, she was a member of the GLADIA research group on geometry processing and machine learning, under the supervision of Emanuele Rodolà. At Autodesk, Arianna has worked on projects involving 3D generation and learning similarity in 3D data.
(TCPL 201)
11:10 - 11:35 Jungtaek Kim: Combinatorial 3D Shape Assembly with LEGO Bricks
A 3D shape is generated by placing unit primitives, e.g., voxels and points. Supposing that LEGO bricks are given as unit primitives, the problem of 3D shape generation can be considered a class of combinatorial optimization problems. In particular, to solve such a problem, we need to take into account complex constraints between LEGO bricks, such as preventing overlaps and ensuring feasible attachments. In this talk, we present our methods for 3D shape assembly using sequential decision-making processes, i.e., Bayesian optimization and reinforcement learning, and recent work on efficient constraint satisfaction for shape assembly with LEGO bricks. $$$$ Bio: Jungtaek Kim is a postdoctoral associate at the University of Pittsburgh, working with Prof. Paul W. Leu, Prof. Satish Iyengar, Prof. Lucas Mentch, and Prof. Oliver Hinder. Before that, he was a postdoctoral researcher at POSTECH, in the group of Prof. Minsu Cho. He received a B.S. in Mechanical Engineering and Computer Science and Engineering from POSTECH in 2015, and a Ph.D. in Computer Science and Engineering from POSTECH in 2022, under the supervision of Prof. Seungjin Choi and Prof. Minsu Cho. He interned at the Vector Institute and SigOpt (acquired by Intel) during his Ph.D. program. He has presented his work as the first author or a co-author at top-tier machine learning conferences such as NeurIPS, AISTATS, UAI, ICML, and ICLR.
(TCPL 201)
11:35 - 13:00 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:00 - 13:25 Shunsuke Saito: Interaction-aware Composition of 3D Generative Avatars
Deep generative models have recently been extended to synthesizing 3D digital humans. However, full-body humans with clothing are typically treated as a single chunk of geometry without considering the compositionality of clothing and accessories. As a result, individual items cannot be naturally composed into novel identities, leading to limited expressiveness and controllability of generative 3D avatars. Also, modeling each item in isolation does not lead to plausible compositions, due to the geometric and appearance interactions between objects and humans. In this talk, I will introduce our recent works to address these challenges. In particular, I will first discuss how interaction-aware 3D generative models can be learned from the joint capture of eyeglasses and human heads. Then, I will introduce our recent attempt to make interaction-aware generative modeling more scalable by leveraging paired 3D data of humans with and without objects, and applying a neural version of arithmetic operations. $$$$ Bio: Shunsuke Saito is a Research Scientist at Meta Reality Labs Research in Pittsburgh. He obtained his PhD degree at the University of Southern California. Prior to USC, he was a Visiting Researcher at the University of Pennsylvania in 2014. He obtained his BE (2013) and ME (2014) in Applied Physics at Waseda University. His research lies in the intersection of computer graphics, computer vision and machine learning, especially centered around digital humans, 3D reconstruction, and performance capture. His real-time volumetric teleportation work won the Best in Show award at SIGGRAPH 2020 Real-Time Live!.
(TCPL 201)
13:25 - 13:50 Shichong Peng: Editable 3D Scene Representations with Proximity Attention Point Rendering (PAPR)
Abstract: Neural Radiance Fields (NeRF) have shown great promise in modeling static 3D scenes by encoding volumetric scene representations into neural network weights. However, it is non-trivial to modify the weights of the network to edit the scene. Current NeRF-based approaches often resort to global deformation fields or explicit geometry proxies, which are laborious and prone to errors. In contrast, methods that directly model the scene with explicit geometry do not face these limitations and are better equipped for scene editing tasks. In this talk, I will introduce a novel approach called Proximity Attention Point Rendering (PAPR) that directly learns explicit scene geometry using point clouds. Our method addresses the vanishing gradient problem encountered in existing point-based methods and effectively learns a point cloud that uniformly spreads across the scene's surface, even when the point cloud initialization significantly deviates from the target geometry. Remarkably, our method enables scene editing in various aspects without the need for additional supervision. It allows users to perform geometry editing, object manipulation, texture transfer, and exposure control seamlessly. By enabling end-to-end learning and effortless editing of real-world scenes, our approach opens up new possibilities for animating objects and facilitating rich interactions within virtual environments. $$$$ Bio: Shichong Peng is a third year PhD student in the APEX lab at Simon Fraser University, supervised by Ke Li. He received his Master’s degree from UC San Diego in 2021, and Bachelor’s degree from the University of Toronto in 2019. His research interests are generative models, 3D reconstruction and animation, computer vision and graphics, and machine learning.
(TCPL 201)
13:50 - 14:15 Igor Gilitschenski: Data-Driven Environment Models for Robotics
As autonomous robots and vehicles see growing deployment, ensuring operational safety in real-world edge cases has become one of the main challenges for widespread adoption. Using handcrafted perception, dynamics, and behavior models is typically insufficient to capture the nuance required for handling rare scenarios and challenging appearance conditions. On the other hand, collecting datasets containing a sufficient amount of such edge cases can be dangerous or even completely infeasible. Thus, there has been growing interest in developing techniques for better leveraging real-world data. This talk will discuss some of the challenges and opportunities of building environment models from real-world data for robotics and what recent advances in generative modelling and neural radiance fields can bring to the table. $$$$ Bio: Igor Gilitschenski is an Assistant Professor of Computer Science at the University of Toronto where he leads the Toronto Intelligent Systems Lab. He is also a (part-time) Research Scientist at the Toyota Research Institute. Prior to that, Dr. Gilitschenski was a Research Scientist at MIT’s Computer Science and Artificial Intelligence Lab and the Distributed Robotics Lab (DRL), where he was the technical lead of DRL’s autonomous driving research team. He joined MIT from the Autonomous Systems Lab of ETH Zurich, where he worked on robotic perception, particularly localization and mapping. Dr. Gilitschenski obtained his doctorate in Computer Science from the Karlsruhe Institute of Technology and a Diploma in Mathematics from the University of Stuttgart. His research interests involve developing novel robotic perception and decision-making methods for challenging dynamic environments. He is the recipient of several best paper awards, including at the American Control Conference, the International Conference on Information Fusion, and the Robotics and Automation Letters.
(TCPL 201)
14:15 - 14:40 David Lindell: Neural rendering at one trillion frames per second
The world looks different at one trillion frames per second: light slows to a crawl, and we can observe “transients”—ultrafast light transport phenomena that are completely invisible to the naked eye. These transients provide valuable cues for recovering 3D appearance and geometry and can be a powerful signal for scene reconstruction when combined with neural representations. While neural representations have led to profound improvements in 3D reconstruction using conventional RGB images, application to transients has been largely unexplored despite the widespread deployment of ultrafast imagers (e.g., the lidar array on the iPhone). In this talk, I overview applications of ultrafast imaging and describe recent work on transient neural rendering for 3D reconstruction and synthesis of transients from novel views. I also discuss challenges and opportunities for neural rendering at one trillion frames per second, including free viewpoint rendering of transient phenomena, handling of relativistic effects, and applications to imaging with individual photons. $$$$ Bio: David Lindell is an Assistant Professor in the Department of Computer Science at the University of Toronto. His research combines optics, emerging sensor platforms, neural representations, and physics-based algorithms to enable new capabilities in visual computing. Prof. Lindell’s research has a wide array of applications including autonomous navigation, virtual and augmented reality, and remote sensing. Prior to joining the University of Toronto, he received his Ph.D. from Stanford University. He is the recipient of the 2021 ACM SIGGRAPH Outstanding Dissertation Honorable Mention Award.
(TCPL 201)
14:40 - 15:20 Coffee Break (TCPL Foyer)
15:20 - 15:45 Qixing Huang: Geometric Regularizations for Generative Modeling
Parametric generative models, which map a latent parameter space to instances in an ambient space, enjoy various applications in 3D vision and related domains. A standard scheme for these models is probabilistic: it aligns the ambient distribution induced by mapping a prior distribution on the latent space through the generative model with the empirical ambient distribution of training instances. While this paradigm has proven to be quite successful on images, its current applications in 3D shape generation encounter fundamental challenges with limited training data and generalization behavior. The key difference between image generation and shape generation is that 3D shapes possess various priors in geometry, topology, and physical properties. Existing probabilistic 3D generative approaches do not preserve these desired properties, resulting in synthesized shapes with various types of distortions. In this talk, I will discuss recent works in my group that seek to establish a novel geometric framework for learning shape generators. The key idea is to view a generative model as a sub-manifold embedded in the ambient space and develop differential geometry tools to model various geometric priors of 3D shapes through differential quantities of this sub-manifold. We will discuss applications in deformable shape generation and joint shape matching. $$$$ Bio: Qixing Huang is an associate professor with tenure at the computer science department of the University of Texas at Austin. His research sits at the intersection of graphics, geometry, optimization, vision, and machine learning. He has published more than 90 papers at leading venues across these areas. His research has received several awards, including multiple best paper awards, the best dataset award at the Symposium on Geometry Processing 2018, an IJCAI 2019 early career spotlight, and a 2021 NSF CAREER award. He has also served as an area chair of CVPR and ICCV and on the technical papers committees of SIGGRAPH and SIGGRAPH Asia, and co-chaired the Symposium on Geometry Processing 2020.
(TCPL 201)
15:45 - 16:10 Ali Mahdavi-Amiri: Generative AI for VFX and Design
In this talk, I will begin by providing an overview of the research problems I focus on in the domains of generative modeling, 3D deep learning, and geometric modeling. Subsequently, I will delve into two specific issues related to font design and VFX, respectively. Initially, I will discuss a recent work in the field of artistic typography, wherein we employ a diffusion-based generative model to create visually appealing typographic designs. In artistic typography, the fonts employed often embody a specific meaning or convey a relevant message. Following this, I will explore the utilization of machine learning to accelerate various VFX tasks. More specifically, I will address the labor-intensive nature of beauty-related work in the VFX industry, such as aging, de-aging, and acne removal. To combat these challenges, we have developed a system that significantly expedites these tasks, saving valuable time for artists and reducing costs for studios. $$$$ Bio: Ali Mahdavi-Amiri is the research director at MARZ VFX and an assistant professor at Simon Fraser University. His research focuses on 3D deep learning, geometric modeling, and AI for VFX. He has an extensive publication record with over 40 papers in prestigious conferences and journals such as SIGGRAPH, SIGGRAPH Asia, CVPR, NeurIPS, among others. Additionally, he has contributed to the development of several patents. Notably, at MARZ VFX, he and his team have created Vanity AI, a product utilized in various television shows including "Dr. Death," "Zoey's Extraordinary Playlist," "Being the Ricardos," and more.
(TCPL 201)
16:10 - 16:35 Silvia Sellán: Uncertainty Quantification in 3D Reconstruction and Modelling
We study different strategies aimed at quantifying predictive uncertainty from partial observations. We will begin with the simplest example of surface reconstruction from point clouds, where Gaussian Processes can be employed to obtain the probability of each potential reconstruction conditioned on a discrete set of points. Even in this traditional setup, this quantified uncertainty can be used to answer many queries central to the reconstruction process. We will consider different methods to compute similar distributions with neural techniques, with the potential to quantify the uncertainty in learned latent shape spaces that can be used for stochastic shape completion and modelling. $$$$ Bio: Silvia is a fourth year Computer Science PhD student at the University of Toronto. She is advised by Alec Jacobson and working in Computer Graphics and Geometry Processing. She is a Vanier Doctoral Scholar, an Adobe Research Fellow and the winner of the 2021 University of Toronto Arts & Science Dean’s Doctoral Excellence Scholarship. She has interned twice at Adobe Research and twice at the Fields Institute of Mathematics. She is also a founder and organizer of the Toronto Geometry Colloquium and a member of WiGRAPH. She is currently looking to survey potential future postdoc and faculty positions, starting Fall 2024.
(TCPL 201)
16:35 - 17:00 Coffee Break (TCPL Lounge)
17:00 - 17:25 Francis Williams: Neural Kernel Surface Reconstruction
Neural fields have become ubiquitous for representing geometric signals in computer vision and graphics. In this talk, we present Neural Kernel Fields, a novel representation of geometric fields. Neural Kernel Fields enable the development of network architectures which are capable of out-of-distribution generalization and can scale to very large problems. We explore Neural Kernel Fields through the lens of 3D surface reconstruction from points, leading to a practical state-of-the-art algorithm. Finally, we show additional applications of neural kernels in textured reconstruction, fluid simulation and neural radiance fields. $$$$ Bio: I am a senior research scientist at NVIDIA in NYC working at the intersection of computer vision, machine learning, and computer graphics. My research is a mix of theory and application, aiming to solve practical problems in elegant ways. In particular, I’m very interested in 3D shape representations which can enable deep learning on “real-world” geometric datasets which are often noisy, unlabeled, and consisting of very large inputs. I completed my PhD at NYU in 2021, where I worked in the Math and Data Group and the Geometric Computing Lab. My advisors were Joan Bruna and Denis Zorin. In addition to doing research, I am the creator and maintainer of several open source projects. These include NumpyEigen, Point Cloud Utils, and FML.
(TCPL 201)
17:25 - 17:50 Angela Dai: Learning to Model Sparse Geometric Surfaces
With high-quality imaging, and even depth imaging, now available in commodity sensors comes the potential to democratize 3D content creation. However, generating 3D shape and scene geometry becomes challenging due to the high dimensionality of the data. In this talk, we explore shape and scene representations that can compactly represent high-fidelity geometric descriptions for generative 3D modeling. In particular, we introduce a new paradigm to learn a shape manifold from optimized neural fields that enables a dimension-agnostic approach for high-dimensional generative modeling. We then aim to learn 3D scene priors as sets of object descriptors as a compact scene representation, without relying on explicit 3D supervision. $$$$ Bio: Angela Dai is an Assistant Professor at the Technical University of Munich where she leads the 3D AI group. Angela's research focuses on understanding how the 3D world around us can be modeled and semantically understood. Previously, she received her PhD in computer science from Stanford in 2018 and her BSE in computer science from Princeton in 2013. Her research has been recognized through an ERC Starting Grant, Eurographics Young Researcher Award, Google Research Scholar Award, ZDB Junior Research Group Award, an ACM SIGGRAPH Outstanding Doctoral Dissertation Honorable Mention, as well as a Stanford Graduate Fellowship.
(TCPL 201)
17:50 - 18:15 Abhijit Kundu: Neural Fields for Semantic 3D Scene Understanding
Recent advances in neural scene representations have given us very effective differentiable rendering algorithms that can be used to build 3D representations using analysis-by-synthesis style self-supervision from just images alone. This talk will explore neural scene representations for 3D scene understanding tasks, beyond the usual novel view synthesis task. I will focus most of the talk on two of our recent works called Nerflets and Panoptic Neural Fields (PNF). They provide an efficient neural scene representation that also captures the semantic and instance properties of the scene apart from the 3D structure and appearance. I will also present results from these works on real-world dynamic scenes. We find that our model can be used effectively for several tasks like 2D semantic segmentation, 2D instance segmentation, 3D scene editing, and multiview depth prediction. I’ll also present a brief overview of pre-deep-learning efforts to combine semantics with 3D reconstruction and what I think are still open problems in this domain. $$$$ Bio: Abhijit Kundu is a research scientist at Google working on 3D vision topics. His current research work involves developing next generation perception systems that understand and reason in 3D about object shape and scene semantics. Before starting at Google Research, Abhijit received his PhD in Computer Science from Georgia Tech, co-advised by Jim Rehg and Frank Dellaert. Abhijit has published several papers in computer vision and robotics conferences. More details are available on his homepage: https://abhijitkundu.info/
(TCPL 201)
18:00 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Thursday, July 13
07:00 - 08:25 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:30 - 08:55 Thu Nguyen: Neural stylization - Towards creative applications in Mixed Reality
I will present our latest research progress in the transformation of lifelike 3D scenes and avatars into various artistic and creative styles. Imagine the multitude of possibilities with a wide range of 3D filters. You can effortlessly turn your living room into a van Gogh-inspired Impressionist painting, your office into a scene from the Matrix, or even morph your avatar into a spooky Halloween zombie. This research forms an integral part of our long-term effort to develop captivating visual effects for Mixed Reality (MR) applications. Our primary focus is on 3D content, ensuring faster processing and delivering higher-quality outcomes that can be easily scaled up to accommodate millions of users. $$$$ Bio: Thu is a research scientist at Reality Labs Research at Meta. An architect who went rogue, Thu is now interested in machine learning and everything to do with 3D vision and computer graphics. In particular, she works on neural rendering and inverse rendering.
(TCPL 201)
08:55 - 09:20 Anurag Ranjan: FaceLit: Neural 3D Relightable Faces
Recent years have seen significant advancements in photo-realistic 3D graphics. The advancements in this research field can be broadly categorized into two areas: neural fields, which are capable of modeling photo-realistic 3D representations, and generative models, which are able to generalize to large-scale data and produce photo-realistic images. However, physical effects such as illumination or material properties are not factored out in these models, restricting their use in the current graphics ecosystem. In this work, we propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setups or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model in the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of pose and illumination, they produce photorealistic face images with multiview 3D and illumination consistency. Our method enables state-of-the-art photorealistic generation of faces with explicit illumination and view controls on multiple datasets – FFHQ, MetFaces and CelebA-HQ. $$$$ Bio: Anurag is a researcher at Apple Machine Learning Research. His interests lie at the intersection of deep learning, computer vision and 3D geometry. He did his PhD at the Max Planck Institute for Intelligent Systems with Michael Black, with a thesis on Geometric Understanding of Motion. His work links the motion of objects in videos to structure and geometry in the world using deep learning paradigms and self-supervised learning.
(TCPL 201)
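For reference, the Phong reflectance model mentioned in the abstract takes the standard form (how FaceLit parameterizes these quantities inside its volume renderer is not spelled out here):

    L = k_a\, i_a + k_d\, (\hat{n} \cdot \hat{l})\, i_d + k_s\, (\hat{r} \cdot \hat{v})^{\alpha}\, i_s,

where k_a, k_d, k_s are the ambient, diffuse, and specular coefficients, \hat{n} is the surface normal, \hat{l} the light direction, \hat{r} the reflection of \hat{l} about \hat{n}, \hat{v} the view direction, \alpha the shininess exponent, and i_a, i_d, i_s the corresponding light intensities.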
09:20 - 09:45 Katja Schwarz: WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space
Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this talk, I will introduce a new approach to 3D-aware image synthesis, WildFusion, that models instances in view space and is based on latent diffusion models (LDMs). Importantly, WildFusion is trained without any supervision from multiview images or 3D geometry and does not require posed images or learned pose distributions. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. $$$$ Bio: Katja is a 4th-year PhD student in the Autonomous Vision Group at Tuebingen University. She received her bachelor's degree in 2016 and master's degree in 2018 from Heidelberg University. In July 2019, she started her PhD at Tuebingen University under the supervision of Andreas Geiger. During her PhD, she did an internship with Sanja Fidler at NVIDIA. Her research lies at the intersection of computer vision and graphics and focuses on generative modeling in 2D and 3D. https://katjaschwarz.github.io
(TCPL 201)
09:45 - 10:10 Sergey Tulyakov: Volumetric Animation and Manipulation: ‘Do as I Do’ and ‘Do as I Say’
In this talk, I will show two key techniques for animating objects: 'Do as I do' and 'Do as I say'. With the first technique, we can bring static objects to life by applying motions from video sequences. With the second, we can instruct an agent to perform an action or achieve a goal. I will first present a new ‘Do as I Do’ 3D animation framework, trained purely on monocular videos. Provided with only a single 2D image of an object, the model can reconstruct its volumetric representation, render novel views, and animate simultaneously. Then, I will introduce the idea of neural game engines, which can manipulate real-world videos of 3D games using text prompts, styles, and camera trajectories in a ‘Do as I Say’ manner, where only a high level command is specified. Neural game engines learn the geometry and physics of a scene and enable game AI, allowing users to specify high-level goals for the game AI to follow. $$$$ Bio: Sergey Tulyakov is a Principal Research Scientist at Snap Inc. He leads the Creative Vision team and focuses on creating methods for transforming the world via computer vision and machine learning. His work includes 2D and 3D synthesis, photorealistic manipulation and animation, video synthesis, prediction and retargeting. Sergey pioneered the unsupervised image animation domain with MonkeyNet and First Order Motion Model that sparked a number of startups in the domain. His work on Interactive Video Stylization received the Best in Show Award at SIGGRAPH Real-Time Live! 2020. He has published 40+ top conference papers, journals and patents resulting in multiple innovative products, including Snapchat Pet Tracking, OurBaby, Real-time Neural Lenses, recent real-time try-on and many others. Before joining Snap Inc., Sergey was with Carnegie Mellon University, Microsoft, NVIDIA. He holds a PhD degree from the University of Trento, Italy.
(TCPL 201)
10:05 - 10:45 Coffee Break (TCPL Foyer)
10:45 - 11:10 Lingjie Liu: From 3D Reconstruction to 3D Generation
Abstract: TBD $$$$ Bio: Lingjie Liu is an Assistant Professor in the Department of Computer and Information Science at the University of Pennsylvania. Before that, she was a Lise Meitner postdoctoral researcher in the Visual Computing and AI Department at Max Planck Institute for Informatics. She obtained her Ph.D. degree from the University of Hong Kong in 2019. Her research interests are Neural Scene Representations, Neural Rendering, Human Performance Modeling and Capture, and 3D Reconstruction.
(TCPL 201)
11:10 - 11:35 Aayush Bansal: Treasurable 2D pixels for detailed 3D modeling
In this talk, I will present three experiments that leverage 2D pixels in different ways for detailed 3D modeling. The first experiment uses low-level image statistics to efficiently mine hard examples for better learning. Simply biasing ray sampling towards hard ray examples enables learning of neural fields with more accurate high-frequency detail in less time. The second experiment leverages 2D pixels to learn a denoising model from a collection of images. This denoising model enables detailed high-frequency outputs from a model trained on low-resolution samples. The final experiment builds a representation of a pixel that contains color and depth information accumulated from multiple views for a particular location and time along a line of sight. This pixel-based representation, alongside a multi-layer perceptron, allows us to synthesize novel views given a discrete set of multi-view observations as input. The proposed formulation reliably operates on sparse and wide-baseline multi-view images/videos and can be trained efficiently within a few seconds to 10 minutes for hi-res (12MP) content. $$$$ Bio: Aayush Bansal received his Ph.D. in Robotics from Carnegie Mellon University under the supervision of Prof. Deva Ramanan and Prof. Yaser Sheikh. He was a Presidential Fellow at CMU, and a recipient of the Uber Presidential Fellowship (2016-17), Qualcomm Fellowship (2017-18), and Snap Fellowship (2019-20). His research has been covered by various national and international media such as NBC, CBS, WQED, 90.5 WESA FM, France TV, and Journalist. He has also worked with production houses such as BBC Studios, Full Frontal with Samantha Bee (TBS), etc. More details are available on his webpage: https://www.aayushbansal.xyz/
(TCPL 201)
11:35 - 12:40 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
12:40 - 13:05 Jun Gao: Towards High-Quality 3D Content Creation with a Hybrid Representation
With the increasing demand for creating large-scale 3D digital worlds across various industries, there is an immense need for diverse and high-quality 3D content, and machine learning is increasingly enabling this quest. In this talk, I will discuss how exploiting 3D modeling techniques from computer graphics, as well as high-resolution 2D diffusion models, is a promising avenue for this problem. To this end, I will first introduce a hybrid 3D representation employing marching tetrahedra that converts neural field-based representations into triangular meshes in a differentiable manner to facilitate efficient and flexible 3D modeling. By incorporating differentiable rendering, our representation effectively leverages Stable Diffusion as a 2D data prior and moves a step towards generating high-quality 3D content from text prompts. $$$$ Bio: Jun Gao is a PhD student at the University of Toronto advised by Prof. Sanja Fidler. He is also a Research Scientist at the NVIDIA Toronto AI lab. His research interests focus on the intersection of 3D computer vision and computer graphics, particularly developing machine learning tools to facilitate large-scale and high-quality 3D content creation and drive real-world applications. Many of his contributions have been deployed in products, including NVIDIA Picasso, GANVerse3D, Neural DriveSim and the Toronto Annotation Suite.
(TCPL 201)
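The differentiable neural-field-to-mesh conversion referenced above hinges on placing mesh vertices at zero crossings of the signed distance along tetrahedron edges. The sketch below, assuming PyTorch and omitting the per-tetrahedron triangulation table that a full marching-tetrahedra implementation needs, only illustrates that interpolation step; it is not the speaker's implementation.

    import torch

    def edge_zero_crossings(x_a, x_b, s_a, s_b):
        # x_a, x_b: (E, 3) edge endpoint positions; s_a, s_b: (E,) signed distances.
        # Intended for edges where s_a * s_b < 0, i.e. the surface crosses the edge.
        denom = (s_b - s_a).unsqueeze(-1)        # (E, 1)
        t = (-s_a).unsqueeze(-1) / denom         # interpolation weight in [0, 1]
        return x_a + t * (x_b - x_a)             # (E, 3), differentiable in s and x

    # Because the vertex positions depend smoothly on the signed-distance values,
    # gradients from a differentiable renderer can flow back into the neural field.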
13:05 - 13:30 Jeong Joon Park: Scene-Level 3D Generations
In this talk, I will introduce my recent efforts toward scene-level 3D generation, focusing on 1) unbounded scene generation, 2) generating scene variations, and 3) exploiting compositionality. The first part of the talk is a generative approach to reconstructing unbounded 3D scenes from a few input images, leveraging a new image-based representation and diffusion models. Second, I will discuss a new concept of reconstruction that captures the possible variability of a single scene, ultimately training a 3D generative model on a single scene. Lastly, I will cover my works that utilize the objectness of man-made scenes for effective 3D scene generation. $$$$ Bio: Jeong Joon (JJ) Park is an assistant professor in Computer Science and Engineering at the University of Michigan, Ann Arbor. Previously, he was a postdoctoral researcher at Stanford University, working with Professors Leonidas Guibas and Gordon Wetzstein. His main research interests lie in the intersection of computer vision, graphics, and machine learning, where he studies realistic reconstruction and synthesis of 3D scenes using neural and physical representations. He did his PhD in computer science at the University of Washington, Seattle, under the supervision of Professor Steve Seitz, during which he was supported by an Apple AI/ML Fellowship. He is the lead author of DeepSDF, which introduced neural implicit representations and made a profound impact on 3D computer vision. Prior to his PhD, he received his Bachelor of Science from the California Institute of Technology. More information can be found on his webpage: https://jjparkcv.github.io/.
(TCPL 201)
14:00 - 17:30 Climbing / Hiking / Free time
Activity depends on what was booked by each attendee
(TCPL Foyer)
18:00 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Friday, July 14
07:00 - 08:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:30 - 08:55 Srinath Sridhar: Building Multimodal Datasets for Immersive Neural Fields
Advances in neural fields are enabling high-fidelity capture of the shape and appearance of static and dynamic scenes. However, their capabilities lag behind those offered by representations such as pixels or meshes due to algorithmic challenges and the lack of large-scale real-world datasets. In this talk, I will discuss how to approach the second challenge. I will introduce DiVA-360, a real-world 360° dynamic visuo-audio dataset with synchronized multimodal video, audio, and textual information about table-scale scenes. It contains 46 dynamic scenes, 30 static scenes, and 95 static objects spanning 11 categories captured using a new hardware system with 53 RGB cameras and 6 microphones. This dataset contains 8.6M image frames and 1360 seconds of dynamic data, all with detailed text descriptions, foreground-background segmentation masks, and category-specific 3D pose alignment. I will describe some problems this dataset can help address and metrics for evaluating different methods. $$$$ Bio: Srinath Sridhar (https://srinathsridhar.com) is an assistant professor of computer science at Brown University. He received his PhD at the Max Planck Institute for Informatics and was subsequently a postdoctoral researcher at Stanford. His research interests are in 3D computer vision and machine learning. Specifically, his group focuses on visual understanding of 3D human physical interactions with applications ranging from robotics to mixed reality. He is a recipient of the NSF CAREER award, Google Research Scholar award, and his work received the Eurographics Best Paper Honorable Mention. He spends part of his time as a visiting academic at Amazon Robotics and has previously spent time at Microsoft Research Redmond and Honda Research Institute.
(Online)
08:55 - 09:20 Jiatao Gu: Towards Efficient Diffusion Models for 3D Generation
In this talk, we explore the construction of efficient diffusion-based generative models for 3D generation. We address the limitations of costly knowledge distillation from 2D diffusion models by presenting two methods that directly learn diffusion models in 3D space. These methods tackle challenges related to controllability and efficient generation. The first paper, Control3Diff, combines diffusion models and 3D GANs to enable versatile and controllable 3D-aware image synthesis using single-view datasets. The second paper, SSDNeRF, proposes a unified approach for 3D generation by jointly learning a generalizable 3D diffusion prior and performing multi-view image reconstructions. Additionally, we introduce BOOT, a novel distillation technique that further accelerates the diffusion process by distilling a single-step generative model from any pre-trained teacher (a generic sketch of this distillation setup follows this entry). Unlike existing approaches, training BOOT is entirely data-free, making it suitable for various large-scale text-to-image applications and advancing the field of efficient and effective 3D generative models. $$$$ Bio: Jiatao Gu is a Machine Learning Researcher at Apple AI/ML (MLR) in New York City. Previously, he worked as a Research Scientist at Meta AI (FAIR Labs). He obtained his Ph.D. degree from the University of Hong Kong and was a visiting scholar at NYU, where he focused on efficient neural machine translation systems. His research interests span representation learning and generative AI, encompassing natural language, image, 3D, and speech domains. He leads research efforts in developing efficient generative systems for 3D and multimodal environments.
(Online)
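The single-step distillation setup can be illustrated with a heavily simplified, generic sketch: a hypothetical one-step student is trained to match samples that a multi-step teacher_sample produces from the same noise, so no real training images are required. BOOT's actual bootstrapped, data-free objective differs in its details; this PyTorch sketch only conveys the broad setup, and both callables are assumptions.

    import torch

    def distill_step(student, teacher_sample, optimizer, batch_size=8, latent_dim=64):
        # Draw noise, run the expensive multi-step teacher without gradients,
        # and train the cheap one-step student to reproduce the teacher's output.
        z = torch.randn(batch_size, latent_dim)
        with torch.no_grad():
            target = teacher_sample(z)       # multi-step teacher sampler (assumed)
        pred = student(z)                    # single forward pass of the student
        loss = torch.nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()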
09:20 - 09:45 Vincent Sitzmann: Towards 3D Representation Learning at Scale
Given only a single picture, people are capable of inferring a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people’s ability to understand, navigate, and interact with their surroundings. In my talk, I will discuss how this motivates a 3D approach to self-supervised learning for vision. I will then present recent advances from my research group towards training self-supervised scene representation learning methods at scale, on uncurated video without pre-computed camera poses. I will further present recent advances in modeling uncertainty in 3D scenes, as well as progress on endowing neural scene representations with more semantic, high-level information. $$$$ Bio: Vincent Sitzmann is an Assistant Professor at MIT EECS, where he leads the Scene Representation Group. Previously, he completed his Ph.D. at Stanford University and a postdoc at MIT CSAIL. His research interest lies in neural scene representations - the way we can let neural networks learn to reconstruct the state of 3D scenes from vision. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene with information on geometry, material, lighting, etc. from only a few observations, a task that is simple for humans but currently impossible for AI.
(Online)
09:45 - 10:10 Adriana Schulz: Reshaping Design for Manufacturing: The Potential of Generative Models in CAD Systems
In recent years, the emergence of AI-driven tools has significantly revolutionized the design landscape across a myriad of domains. However, the deployment of these transformative tools within the critical field of design for manufacturing remains limited—an area where creativity and practicality intersect, shaping economies and daily life. In this talk, I will delve into the technical challenges specific to this domain, with particular emphasis on Computer-Aided Design (CAD) systems. We will investigate the opportunities that novel generative models present, examining their potential to create next-generation CAD systems and possibly ignite a new revolution in manufacturing-oriented design. $$$$ Bio: Adriana Schulz is an Assistant Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She is a member of the Computer Graphics Group (GRAIL) and co-director of the Digital Fabrication Center at UW (DFab). Her research group creates innovative manufacturing design systems to revolutionize how physical artifacts are built. Her highly interdisciplinary work combines ideas from geometry processing, machine learning, and formal reasoning, and has been featured in premier venues of computer graphics, computer vision, robotics, and programming languages. Most recently she earned MIT Technology Review's "35 Innovators Under 35" visionary award for her body of work and accomplishments in democratizing fabrication. In addition to her research, Professor Schulz is the founder and chair for the ACM Community Group for Women in Computer Graphics, which supports gender diversity in the field.
(Online)
10:10 - 10:20 Andrea Tagliasacchi: Closing remarks (TCPL 201)
10:30 - 11:00 Checkout by 11AM
5-day workshop participants are welcome to use BIRS facilities (TCPL) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 11AM.
(Front Desk - Professional Development Centre)
11:30 - 13:30 Lunch (Vistas Dining Room)