Understanding the driving forces behind mesoscale cloud organization is fundamental to reducing uncertainties in cloud climate feedbacks. Traditional climate models cannot explicitly resolve mesoscale cloud structures because of their limited resolution, which leads to large uncertainties in cloud climate feedback estimates. Storm-resolving models that simulate the atmosphere at kilometre resolution have the potential to reduce these uncertainties, yet they still show biases in cloud organizational structure relative to satellite observations. Approaches that constrain cloud feedbacks directly from the satellite record are promising, but they often rely on manually chosen cloud controlling factors (CCFs) that do not necessarily capture all the information needed to explain mesoscale organizational structures. They also generally use only linear models to predict cloud radiative properties from the CCFs.

We present CloudDiff, a probabilistic machine learning model that generates mesoscale cloud structures at kilometre resolution conditioned on the environmental state of the atmosphere, namely the temperature and humidity profiles as well as the vertical and horizontal winds. The model is trained on MODIS Level 1 satellite data and environmental conditions from the ECMWF ERA5 reanalysis. CloudDiff is able to reconstruct realistic MODIS observations from matching ERA5 environmental conditions and achieves a lower reconstruction error than generating MODIS observations solely from pre-defined CCFs. In CloudDiff’s generation stage, the environmental conditions are compressed into a latent representation using an attention mechanism. This latent representation can be interpreted as a set of CCFs that have been learned purely from data. We will discuss the properties of the learned CCFs, including how they relate to existing CCFs, their geographical distribution, and their predictive power for the radiative properties of cloud fields.
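
To illustrate how an attention mechanism can compress environmental profiles into a small latent representation, the following is a minimal sketch of generic cross-attention pooling. It is not CloudDiff's actual architecture, which the abstract does not specify. The function name `attention_pool`, the array shapes (37 vertical levels, 5 variables, 8 latent vectors of width 16), and the random weights standing in for learned parameters are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(env, queries, w_k, w_v):
    """Compress a vertical profile into a fixed number of latent vectors.

    env:      (n_levels, n_vars) environmental profile, e.g. temperature,
              humidity and wind components on each vertical level
    queries:  (n_latent, d_model) query vectors, one per latent factor
              (hypothetical stand-in for learned parameters)
    w_k, w_v: (n_vars, d_model) key and value projections (hypothetical)
    """
    keys = env @ w_k                                      # (n_levels, d_model)
    values = env @ w_v                                    # (n_levels, d_model)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])   # (n_latent, n_levels)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    latent = weights @ values                             # (n_latent, d_model)
    return latent, weights


# Illustrative sizes only; random values replace trained weights and real data.
rng = np.random.default_rng(0)
n_levels, n_vars, d_model, n_latent = 37, 5, 16, 8
env = rng.standard_normal((n_levels, n_vars))
queries = rng.standard_normal((n_latent, d_model))
w_k = rng.standard_normal((n_vars, d_model))
w_v = rng.standard_normal((n_vars, d_model))

latent, weights = attention_pool(env, queries, w_k, w_v)
```

In a sketch like this, each row of `weights` shows which vertical levels a given latent vector draws on, which is one way such a representation could be inspected and compared with hand-picked CCFs.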