We don't know. OpenAI refused to publish any details about the architecture in the technical report. We don't know the parameter count, we don't know the depth, we don't know exactly how it integrates image data (ViT-style, maybe?), and we don't know anything about the training data. Right now it's a giant black box.