Zhanqiang Guo1,2∗, Jiamin Wu1,3∗, Yonghao Song2, Jiahui Bu4, Weijian Mai1,5,
Qihao Zheng1, Wanli Ouyang1,3†, Chunfeng Song1†
1Shanghai Artificial Intelligence Laboratory, 2Tsinghua University,
3The Chinese University of Hong Kong, 4Shanghai Jiao Tong University,
5South China University of Technology
Abstract
Human perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and associated code will be made publicly available.
1 Introduction
“The brain is wider than the sky.” — Emily Dickinson
The endeavor to comprehend how the human brain perceives the visual world has long been a central focus of cognitive neuroscience [24, 18]. As we navigate through the environment, our perception of the three-dimensional world is shaped by both fine details and the diverse perspectives from which we observe objects. This stereo experience of color, depth, and spatial relationships forms complex neural activity in the brain's cortex. Unraveling how the brain processes 3D perception remains an appealing challenge in neuroscience. Recently, electroencephalography (EEG), a non-invasive neuroimaging technique favored for its safety and ethical suitability, has been widely adopted in 2D visual decoding [77, 7, 8, 28, 41, 33] to reconstruct static visual stimuli. With the aid of EEG and generative techniques, an intriguing question arises: can we directly reconstruct the original 3D visual stimuli from dynamic brain activity?
To address this question, in this paper we explore a new task, 3D visual decoding from EEG signals, shedding light on the brain mechanisms for perceiving natural 3D objects in the real world. Specifically, this task aims to reconstruct 3D objects from EEG signals in the form of colored point clouds, as shown in Fig. 1. The task involves not only extracting semantic features but also capturing intricate visual cues, e.g., color, shape, and structural information, underlying dynamic neural signals, all of which are essential for a thorough understanding of 3D visuals. When observing the surrounding world, humans form 3D perception through shifting views of objects in continuous movement over time. EEG provides an effective means of tracking neural dynamics in this evolving perceptual process for the 3D decoding task, owing to its high temporal resolution with millisecond precision [12, 16]. This property distinguishes it from other neuroimaging techniques such as fMRI, which offers high spatial resolution but an extremely low temporal resolution of a few seconds [12]. Furthermore, as EEG offers the advantages of cost-effectiveness and portability, EEG-based 3D visual decoding research could be employed in real-time applications such as clinical scenarios [44, 45].
However, when delving into this task, two critical challenges need to be addressed. (1) Limited data availability: currently, there is no publicly available dataset that provides paired EEG signals and 3D stimulus data. (2) Complexity of neural representation: neural representations are inherently complex [23]. This complexity is amplified by the low signal-to-noise ratio of non-invasive neuroimaging techniques, making it challenging to learn robust neural representations and recover complex 3D visual cues from brain signals. Thus, how to construct a robust 3D visual decoding framework is a critical issue.
To address the first challenge, we develop a new EEG dataset, named EEG-3D, comprising paired EEG signals collected from 12 participants while watching 72 categories of 3D objects. To create diverse 3D stimuli, we select a subset of common objects from the Objaverse dataset [10, 76]. Previous works [14, 58] have shown that 360-degree rotating videos effectively represent 3D objects. Thus, we capture rotational videos of colored 3D objects to serve as visual stimuli, as shown in Fig. 1. Compared to existing datasets [26, 5, 74, 14, 29, 16, 20, 1], the EEG-3D dataset offers several distinctive features: (1) Comprehensive EEG signals in diverse states. In addition to EEG signals from video stimuli, our dataset includes signals from static images and resting-state activity, providing diverse neural responses and insights into brain perception mechanisms across dynamic and static scenes. (2) Multimodal analysis data with high-quality annotations. The dataset comprises high-resolution videos, static images, text captions, and corresponding 3D objects with geometry and color details, supporting a wide range of visual decoding and analysis tasks.
Building upon the EEG-3D dataset, we introduce an EEG-based 3D visual decoding framework, termed Neuro-3D, to reconstruct 3D visual cues from complex neural signals. We first propose a Dynamic-Static EEG-Fusion Encoder to extract robust and discriminative EEG features against noise. Given EEG recordings evoked by dynamic and static stimuli, we design an attention-based neural aggregator to adaptively fuse the two types of EEG signals, exploiting their complementary characteristics to extract robust neural representations. Subsequently, to recover 3D perception from the EEG embedding, we propose a Colored Point Cloud Decoder, whose first stage generates the shape and whose second stage assigns colors to the generated point cloud. To enhance precision in the generation process, we further decouple the EEG embedding into distinct geometry and appearance components, enabling targeted conditioning of shape and color generation. To learn discriminative and semantically meaningful EEG features, we align them with visual features of the observed videos through contrastive learning [73]. Finally, using the aligned geometry feature as condition, a 3D diffusion model generates the point cloud of the 3D object, which is then combined with the appearance EEG feature for color prediction. Our main contributions can be summarized as follows:
- We are the first to explore the task of 3D visual decoding from EEG signals, which serves as a critical step for advancing neuroscience research into the brain's 3D perceptual mechanism.
- We present EEG-3D, a pioneering dataset accompanied by both multimodal analysis data and comprehensive EEG recordings from 12 subjects watching 72 categories of 3D objects. This dataset fills a crucial gap in 3D-stimulus neural data for the computer vision and neuroscience communities.
- We propose Neuro-3D, a 3D visual decoding framework based on EEG signals. A diffusion-based colored point cloud decoder is proposed to recover both shape and color characteristics of 3D objects from adaptively fused EEG features captured under static and dynamic 3D stimuli.
- Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis.
2 Related Work
2.1 2D Visual Decoding from Brain Activity
Visual decoding from brain activity [33, 77, 8, 7, 9] has gained substantial attention in computer vision and neuroscience, emerging as an effective technique for understanding and analyzing human visual perception mechanisms. Early approaches in this area predominantly utilized Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) to model brain activity signals and interpret visual information [26, 62, 4, 37, 48]. Recently, the use of newly emerged diffusion models [25, 47] and vision-language models [36, 81] has advanced visual generation from various neural signals, including fMRI [8, 69, 60, 67, 61] and EEG [65, 2, 35, 63]. These methods typically perform contrastive alignment [73] between neural signal embeddings and image or text features derived from the pre-trained CLIP model [51]. The aligned neural embeddings are then fed into a diffusion model to conditionally reconstruct images that correspond to the visually evoked brain activity. Beyond static images, research has begun to extend these approaches to the reconstruction of video information from fMRI data, further advancing the field [9, 66, 74, 72, 32]. Though impressive, these methods, limited to 2D visual perception, fall short of capturing the full depth of human 3D perceptual experience in real-world environments. Our method expands the scope of brain visual decoding to three dimensions by reconstructing 3D objects from real-time EEG signals.
2.2 3D Reconstruction from fMRI
Reconstructing 3D objects from brain signals holds significant potential for advancing both brain analysis applications and our understanding of the brain's visual system. Towards this goal, several works [14, 13] have made initial strides in 3D object reconstruction from fMRI, yielding promising results in interpreting 3D spatial structures. Mind-3D [14] proposes the first dataset of paired fMRI and 3D shape data and develops a diffusion-based framework to decode 3D shape from fMRI signals. A subsequent work, fMRI-3D [13], expands the dataset to include a broader range of categories across five subjects.
However, the previous task setup has several limitations that prevent it from simulating real-time, natural 3D perception scenarios. First, fMRI equipment is non-portable, expensive, and difficult to operate, potentially hindering its application in brain-computer interfaces (BCIs). Beyond its high acquisition cost, fMRI is limited by its inherently low temporal resolution, which hinders real-time responsiveness to dynamic stimuli. Second, existing brain 3D reconstruction methods focus exclusively on reconstructing the 3D shape of objects, neglecting the color and appearance information that is crucial for real-world perception. To address these challenges, we introduce a 3D visual decoding framework based on EEG signals, along with a new dataset of paired EEG signals and colored 3D objects. To the best of our knowledge, this is the first work to interpret 3D objects from EEG signals, offering a comprehensive dataset, benchmarks, and a decoding framework.
2.3 Diffusion Models
Diffusion models have recently emerged as powerful generative frameworks known for their high-quality image synthesis capabilities. Inspired by non-equilibrium thermodynamics, diffusion models are formulated as Markov chains. The model first progressively corrupts the target data distribution by adding noise until it conforms to a standard Gaussian distribution, and subsequently generates samples by predicting and reversing the noise process through network learning [25, 47]. Diffusion models and their variants have been extensively applied to tasks such as image generation [54, 30, 56, 52] and image editing [78, 79].
Building on advancements in 2D image generation, Luo et al. [42] and Zhou et al. [80] extended pixel-based approaches to 3D coordinates, enabling the generation of point clouds. This has spurred further research into 3D generation [53, 70], text-to-3D reconstruction [49, 46], and 2D-to-3D generation [43, 75, 40], demonstrating the capability of these models to capture intricate spatial structures and textures of 3D objects. In our study, we extend the 3D diffusion model to brain activity analysis, reconstructing colored 3D objects from EEG signals.
3 EEG-3D Dataset
In this section, we introduce the detailed procedures for building the EEG-3D dataset.
3.1 Participants
We recruited 12 healthy adult participants ( males, females; mean age: years) for the study. All participants had normal or corrected-to-normal vision. Informed written consent was obtained from all individuals after a detailed explanation of the experimental procedures, and participants received monetary compensation for their involvement. The study protocol was reviewed and approved by the Ethics Review Committee.
3.2 Stimuli
The stimuli employed in this study were derived from the Objaverse dataset [10, 76], which offers an extensive collection of common 3D object models. We selected 72 categories with different shapes, each containing 10 objects accompanied by text captions. For each category, 8 objects were randomly allocated to the training set, while the remaining 2 were reserved for the test set. Additionally, we assigned color-type labels to the objects, dividing them into six categories according to their main color style. To generate the visual stimuli, we followed the procedure of Zero-123 [38], using Blender to simulate a camera that captured 360-degree views of each object through incremental rotations, yielding 180 high-resolution images (1024 × 1024 pixels) per object. The objects were tilted at an optimal angle to provide comprehensive perspectives.
Rotating 3D object videos offer multi-perspective views, capturing the overall appearance of 3D objects. However, the prolonged duration of such videos, coupled with factors such as eye movements, blink artifacts, task load, and lack of focus, often leads to EEG signals with a lower signal-to-noise ratio. In contrast, static image stimuli provide single-perspective but more stable information, which can complement the dynamic EEG signals by mitigating their noise impact. Therefore, we collected EEG signals for both dynamic video and static image stimuli. The stimulus presentation paradigm is shown in Fig. 2. Specifically, the multi-view images were compiled into a 6-second video at 30 Hz. Each object stimulus block consisted of an 8-second sequence of events: a 0.5-second static image stimulus at the beginning and end, a 6-second rotating video, and a brief blank-screen transition between segments (sketched below). During each experimental session, a 3D object was randomly selected from each category, with a 1-second fixation cross between object blocks to direct participants' attention. Participants manually initiated each new object presentation. Training set objects had measurement repetitions, while test set objects had , resulting in sessions in total. Participants took 2-3 minute breaks between sessions. Following established protocols [16], 5-minute resting-state data were recorded at the start and end of all sessions to support further analysis. Each participant's total experiment time was approximately 5.5 hours, divided into two acquisitions.
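To make the paradigm concrete, the per-block timing can be written down as a simple schedule. In the sketch below the 0.5-second blank-screen durations are an assumption (only the image and video durations and the 8-second total are stated above), and the constant name `BLOCK_SCHEDULE` is hypothetical.

```python
# Sketch of one 8-second object block in the presentation paradigm.
# The 0.5 s blank-screen durations are assumptions; only the image/video
# durations and the 8 s total are stated in the text.
BLOCK_SCHEDULE = [
    ("static_image", 0.5),    # single-view image (first frame of the video)
    ("blank", 0.5),           # assumed transition length
    ("rotating_video", 6.0),  # 180 frames at 30 Hz, full 360-degree rotation
    ("blank", 0.5),           # assumed transition length
    ("static_image", 0.5),    # closing single-view image
]
assert sum(duration for _, duration in BLOCK_SCHEDULE) == 8.0
```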
3.3 Data Acquisition and Preprocessing
During the experiment, images and videos were presented on a screen with a resolution of 1920 × 1080 pixels. Participants were seated approximately 95 cm from the screen, ensuring that the stimuli occupied a visual angle of approximately 8.4 degrees to optimize perceptual clarity. EEG data were recorded using a 64-channel EASYCAP equipped with active silver chloride electrodes, adhering to the international 10-10 system for electrode placement. Data acquisition was conducted at a sampling rate of 1000 Hz. Data preprocessing was performed using MNE [17], and more details are shown in the Supplementary Material.
Table 1. Comparison of EEG-3D with existing datasets. Brain activity recordings: Re (resting-state), St (static stimuli), Dy (dynamic stimuli); analysis data: Img (images), Vid (videos), 3D (S) (3D shape), 3D (C) (colored 3D objects), Text (text captions).

| Dataset | Re | St | Dy | Img | Vid | 3D (S) | 3D (C) | Text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GOD [26] | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| BOLD5000 [5] | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| NSD [1] | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
| Video-fMRI [74] | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Mind-3D [14] | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
| ImgNet-EEG [29] | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Things-EEG [16] | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| EEG-3D | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
3.4 Dataset Attributes
Tab. 1 presents a comparison between EEG-3D and other commonly used datasets [26, 5, 74, 14, 29, 16, 20, 1]. Our dataset addresses the gap in extracting 3D information from EEG signals. The EEG-3D dataset distinguishes itself from existing datasets by the following attributes:
- Comprehensive EEG signal recordings. Our dataset includes resting-state EEG data as well as EEG responses to static image and dynamic video stimuli. These signals enable more comprehensive investigations into neural activity, particularly in understanding the brain's response mechanisms to 3D visual stimuli, as well as comparative analyses of how the visual processing system engages with different types of visual input.
- Multimodal analysis data and labels. The EEG-3D dataset includes static images, high-resolution videos, text captions, and 3D shapes with color attributes aligned with the EEG recordings. Each 3D object is annotated with a category label and a main color style label. This comprehensive dataset, with multimodal analysis data and labels, supports a broad range of EEG signal decoding and analysis tasks.
These attributes provide a strong basis for exploring the brain's response mechanisms to dynamic and static stimuli, positioning the dataset as a valuable resource for advancing research in neuroscience and computer vision.
4 Method
4.1 Overview
As depicted in Fig. 3, our framework consists of two principal components. 1) Dynamic-Static EEG-Fusion Encoder: given the static and dynamic EEG signals ($x_s$ and $x_d$) from EEG-3D, the encoder is responsible for extracting discriminative neural features by adaptively aggregating dynamic and static EEG features, leveraging their complementary characteristics. 2) Colored Point Cloud Decoder: to reconstruct 3D objects, a two-stage decoder is proposed to generate 3D shape and color sequentially, conditioned on the decoupled geometry and appearance EEG features ($f_g$ and $f_a$), respectively.
4.2 Dynamic-Static EEG-fusion Encoder
Given EEG recordings under static and dynamic 3D visual stimuli, extracting robust and discriminative neural representations becomes a critical issue. EEG signals have inherently high noise levels, and prolonged exposure to rapidly changing video stimuli introduces further interference. To address this challenge, we propose to adaptively fuse dynamic and static EEG signals to learn a comprehensive and robust neural representation.
EEG Embedder. Let $x_s$ and $x_d$ denote the preprocessed EEG signals recorded under the static image stimulus (the initial frame of the video) and the dynamic video stimulus of the rotating 3D object, respectively. We design two EEG embedders, $E_s$ and $E_d$, to extract static and dynamic EEG features from $x_s$ and $x_d$:
$$f_s = E_s(x_s), \qquad f_d = E_d(x_d). \tag{1}$$
Specifically, the embedders consist of multiple temporal self-attention layers that apply the self-attention mechanism [71] along the EEG temporal dimension, capturing and integrating the temporal dynamics of brain responses over the duration of the stimulus. Subsequently, an MLP projection layer generates the output EEG embeddings.
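A minimal PyTorch sketch of such an embedder is given below. It is not the released implementation: the channel count follows the 64-electrode setup, but the hidden size, layer count, output dimension, and the name `EEGEmbedder` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EEGEmbedder(nn.Module):
    """Sketch: temporal self-attention over EEG time steps followed by an MLP projection."""
    def __init__(self, n_channels=64, d_model=256, n_layers=4, n_heads=8, d_out=1024):
        super().__init__()
        self.input_proj = nn.Linear(n_channels, d_model)            # per-time-step projection
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.TransformerEncoder(layer, n_layers) # attention along time
        self.mlp = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_out))

    def forward(self, x):             # x: (B, channels, time) preprocessed EEG epoch
        x = x.transpose(1, 2)         # -> (B, time, channels) so attention runs over time
        h = self.temporal_attn(self.input_proj(x))
        return self.mlp(h)            # (B, d_out) EEG embedding
```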
Neural Aggregator. The static image stimulus, with a duration of 0.5 seconds, helps the subject capture relatively stable single-view information about the 3D object. In contrast, the dynamic video stimulus renders a holistic 3D representation with rotating views of the object, but its long duration may introduce additional noise. To leverage their complementary characteristics, we introduce an attention-based neural aggregator that integrates static and dynamic EEG embeddings in an adaptive way. Specifically, query features are derived from the static EEG features $f_s$, while key and value features are obtained from the dynamic EEG features $f_d$:
$$Q = f_s W_Q, \qquad K = f_d W_K, \qquad V = f_d W_V. \tag{2}$$
The attention-based aggregation can be defined as follows:
$$f = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \tag{3}$$
where $f$ is the aggregated EEG feature and $d_k$ is the feature dimension. The attentive aggregation leverages the stability provided by the static image responses and the temporal dependencies inherent in the video data, enabling robust and comprehensive neural representation learning against high signal noise.
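The aggregation in Eqs. (2)-(3) can be sketched with a standard cross-attention layer, as below; the token layout, pooling, and dimensions are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class NeuralAggregator(nn.Module):
    """Sketch of the attention-based fusion: queries from static EEG features,
    keys/values from dynamic EEG features."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, f_s, f_d):
        # f_s: (B, T_s, D) static-stimulus tokens, f_d: (B, T_d, D) dynamic-stimulus tokens
        fused, _ = self.cross_attn(query=f_s, key=f_d, value=f_d)
        return fused.mean(dim=1)      # pooled aggregated EEG feature f: (B, D)
```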
4.3 Colored Point Cloud Decoder
To recover 3D experience from neural representations, we propose a colored point cloud decoder that first generates the shape and then assigns colors to the generated point cloud, conditioned on the decoupled EEG representations.
Decoupled Learning of EEG Features. Directly using the same EEG feature for the two generation stages may result in information interference and redundancy. Therefore, to enable targeted conditioning of shape and color generation, we learn distinct geometry and appearance components from the EEG embedding in a decoupled way. Given the EEG feature $f$ extracted by the EEG-fusion encoder, we decouple it into distinct geometry and appearance embeddings ($f_g$ and $f_a$) through individual MLP projection layers. To learn discriminative and semantically meaningful EEG features, we align them with video features encoded by the pre-trained CLIP vision encoder through a contrastive loss and an MSE loss:
$$\mathcal{L}_{\mathrm{align}} = \mathcal{L}_{\mathrm{con}} + \lambda_1 \mathcal{L}_{\mathrm{mse}}, \tag{4}$$
$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(\operatorname{sim}(f_{*,i}, v_i)/\tau\big)}{\sum_{j=1}^{B}\exp\!\big(\operatorname{sim}(f_{*,i}, v_j)/\tau\big)}, \qquad \mathcal{L}_{\mathrm{mse}} = \lVert f_{*} - v \rVert_2^2, \tag{5}$$
where $f_*$ represents $f_g$ or $f_a$, $v$ denotes the CLIP feature of the downsampled video sequence, $\operatorname{sim}(\cdot,\cdot)$ is the similarity function, $\tau$ is a temperature, $B$ is the batch size, and $\lambda_1$ is a weighting coefficient. To enhance the learning of geometry and appearance features, a categorical loss is introduced to ensure that the decoupled geometry and appearance features can be correctly classified into the ground-truth shape and color categories:
$$\mathcal{L}_{\mathrm{cls}} = \operatorname{CE}(\hat{y}_s, y_s) + \operatorname{CE}(\hat{y}_c, y_c), \tag{6}$$
where $\hat{y}_s$ and $\hat{y}_c$ are the shape and color predictions produced by linear classifiers, $y_s$ and $y_c$ denote the corresponding ground-truth labels, and $\operatorname{CE}$ denotes the cross-entropy loss. The final loss integrates the alignment loss and the categorical loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{align}} + \lambda_2 \mathcal{L}_{\mathrm{cls}}. \tag{7}$$
Subsequently, $f_g$ and $f_a$ are fed into the shape generation and color generation streams, respectively, for precise brain visual interpretation and reconstruction.
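A compact sketch of this training objective (Eqs. (4)-(7)) is shown below, assuming batched, pooled features. The projection heads, linear classifiers, similarity form, and temperature are hypothetical placeholders; only the loss weights mirror the values reported later in the implementation details.

```python
import torch
import torch.nn.functional as F

def alignment_and_category_loss(f, v, y_shape, y_color,
                                geo_head, app_head, shape_cls, color_cls,
                                lam1=0.01, lam2=0.1, tau=0.07):
    """Sketch of Eqs. (4)-(7): decouple the fused EEG feature into geometry and
    appearance embeddings, align both with CLIP video features, add the categorical loss."""
    f_g, f_a = geo_head(f), app_head(f)                  # decoupled MLP projections
    align_terms = []
    for f_x in (f_g, f_a):
        logits = F.normalize(f_x, dim=-1) @ F.normalize(v, dim=-1).T / tau
        targets = torch.arange(f.size(0), device=f.device)
        l_con = F.cross_entropy(logits, targets)         # contrastive alignment (Eq. 5)
        l_mse = F.mse_loss(f_x, v)                       # feature regression (Eq. 5)
        align_terms.append(l_con + lam1 * l_mse)         # Eq. (4)
    l_align = sum(align_terms)
    l_cls = F.cross_entropy(shape_cls(f_g), y_shape) + F.cross_entropy(color_cls(f_a), y_color)
    return l_align + lam2 * l_cls                        # Eq. (7)
```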
Shape Generation. Noise is incrementally added to the point cloud $x_0$ associated with the stimulus until it converges to an isotropic Gaussian distribution. The noise addition follows a Markov process characterized by Gaussian transitions, with variances scheduled by hyperparameters $\beta_t$, defined as:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right). \tag{8}$$
The cumulative noise introduction aligns with the Markov chain assumption, enabling derivation of:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s). \tag{9}$$
Our objective is to generate the 3D point cloud conditioned on the geometry EEG feature $f_g$. This is achieved through a reverse diffusion process, which reconstructs the corrupted data by modeling the posterior distribution at each diffusion step. The transition from the Gaussian state back to the initial point cloud can be represented as:
$$p_\theta(x_{0:T} \mid f_g) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, f_g), \tag{10}$$
$$p_\theta(x_{t-1} \mid x_t, f_g) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, f_g),\ \sigma_t^2 \mathbf{I}\right), \tag{11}$$
where the parameterized network $\mu_\theta$ is a learnable model that iteratively predicts the reverse diffusion steps, ensuring that the learned reverse process closely approximates the forward process. To optimize network training, the diffusion model employs variational inference, maximizing the variational lower bound of the log-likelihood, which ultimately yields a noise-prediction loss:
$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t, f_g) \rVert_2^2 \,\right]. \tag{12}$$
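For clarity, one training step of the conditional shape diffusion under the standard DDPM objective of Eq. (12) can be sketched as follows; `eps_model` stands in for the Point-Voxel Network denoiser, and its call signature is an assumption.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, f_g, alphas_bar):
    """Sketch of Eq. (12): noise the ground-truth point cloud x0 (B, N, 3) to a
    random timestep and regress the injected noise, conditioned on the geometry
    EEG feature f_g."""
    B = x0.size(0)
    t = torch.randint(0, alphas_bar.size(0), (B,), device=x0.device)
    a_bar = alphas_bar[t].view(B, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward process, Eq. (9)
    eps_pred = eps_model(x_t, t, f_g)                      # conditional noise prediction
    return F.mse_loss(eps_pred, eps)
```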
Color Generation. Previous research on point cloud generation suggests that jointly generating geometry and color information often leads to performance degradation and model complexity [43, 75]. Therefore, following [43], we learn a separate single-step coloring model to reconstruct object color in addition to object shape. Specifically, we take the generated point cloud together with the appearance EEG feature $f_a$ as the condition and feed them into the coloring model to estimate the color of the point cloud. Due to the limited information provided by EEG signals, predicting distinct colors for each point of a 3D structure is highly challenging. As an initial step toward addressing this issue, we simplify the task by aggregating color information from the ground-truth point cloud: through a majority-voting mechanism, we select dominant colors to represent the entire object, thereby reducing the complexity of the color prediction process.
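A plausible implementation of this majority-voting color target is sketched below; it keeps a single dominant color per object, and the RGB bin count used for quantization is an arbitrary choice for illustration, not the paper's setting.

```python
import numpy as np

def dominant_color(gt_colors, n_bins=6):
    """Sketch: quantize per-point RGB values of the ground-truth point cloud and
    return the most frequent bin as the object-level color target."""
    quantized = np.floor(gt_colors * (n_bins - 1e-6)).astype(int)    # (N, 3), values in [0, n_bins)
    codes = quantized[:, 0] * n_bins**2 + quantized[:, 1] * n_bins + quantized[:, 2]
    winner = np.bincount(codes).argmax()                             # majority vote over points
    r, g, b = winner // n_bins**2, (winner // n_bins) % n_bins, winner % n_bins
    return np.array([r, g, b]) / (n_bins - 1)                        # representative RGB in [0, 1]
```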
5 Experiments
5.1 Experimental Setup
Implementation Details. We utilize the AdamW optimizer [31] with an initial learning rate of . The loss coefficients $\lambda_1$ and $\lambda_2$ in Eq. (4) and Eq. (7) are set to 0.01 and 0.1, respectively. The dimension of the extracted features ($f_g$ and $f_a$) is . The point cloud consists of points, and each video sequence is downsampled to frames for feature extraction to facilitate alignment with EEG features. Our method is implemented in PyTorch on a single A100 GPU. In the colored point cloud decoder, the Point-Voxel Network (PVN) [39] is used as the denoising function of the shape diffusion model and as the single-step color prediction model.
Evaluation Benchmarks. To thoroughly evaluate 3D decoding performance on EEG-3D, we construct two evaluation benchmarks: a 3D visual classification benchmark for evaluating the EEG encoder, and a 3D object reconstruction benchmark for assessing the full 3D reconstruction pipeline. (1) 3D visual classification benchmark. To assess high-level visual semantic decoding performance on EEG signals, we evaluate two classification tasks: object classification (72 categories) and color type classification (6 categories), using top-K accuracy as the evaluation metric. (2) 3D object reconstruction benchmark. Following 2D visual decoding methods [35, 9, 8], we adopt N-way top-K accuracy to assess the semantic fidelity of generated 3D objects. Specifically, we train an additional classifier to predict the object categories of point clouds, with training data derived from the Objaverse dataset [10, 76]. The evaluation metrics include 2-way top-1 and 10-way top-3 accuracies, calculated both as the average across five generation results and from the best-performing result in each case. Further details on the evaluation protocol are provided in the Supplementary Material.
5.2 Classification Task
We assess the performance of the proposed dynamic-static EEG-fusion encoder on the classification tasks.
5.2.1 Comparison with Related Methods
We re-implement several state-of-the-art EEG encoders [59, 34, 64, 65] for comparative analysis by training separate object and color classifiers. Tab. 2 presents the overall accuracy of the various EEG classifiers. All methods exceed chance-level performance by a significant margin, suggesting that the collected EEG signals successfully capture visual perceptual processing in the brain. Notably, our proposed EEG-fusion encoder outperforms all baseline methods across all metrics, demonstrating its superior ability to extract semantically meaningful and discriminative neural representations related to high-level visual perception.
| Method | Object top-1 | Object top-5 | Color top-1 | Color top-2 |
| --- | --- | --- | --- | --- |
| Chance level | 1.39 | 6.94 | 16.67 | 33.33 |
| DeepNet (2017) [59] | 3.70 | 9.90 | 20.95 | 49.71 |
| EEGNet (2018) [34] | 3.82 | 9.72 | 18.35 | 46.47 |
| Conformer (2023) [64] | 4.05 | 10.30 | 18.27 | 35.81 |
| TSConv (2024) [65] | 4.05 | 10.13 | 31.13 | 59.49 |
| Neuro-3D | 5.91 | 16.30 | 39.93 | 61.40 |
| St. | Dy. | Agg. | Object top-1 | Object top-5 | Color top-1 | Color top-2 |
| --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✗ | ✗ | 5.10 | 15.62 | 37.50 | 57.64 |
| ✗ | ✓ | ✗ | 4.75 | 13.89 | 35.65 | 55.61 |
| ✓ | ✓ | ✗ | 5.44 | 15.86 | 39.12 | 58.85 |
| ✓ | ✓ | ✓ | 5.91 | 16.30 | 39.93 | 61.40 |
5.2.2 Ablation Study
We conduct an ablation study to assess the impact of using different EEG signals and modules, as shown in Tab. 3. Compared to using only dynamic features, the performance improves when static features are incorporated. This enhancement may be attributed to the longer duration of the video stimulus, during which factors such as blinking and distraction introduce noise into the dynamic signal, thereby reducing its effectiveness. When the static and dynamic features are simply concatenated, the performance improves compared to using either signal alone, suggesting complementary information between the two signals. Further performance gains are achieved through our attention-based neural aggregator, which adaptively integrates the dynamic and static features. This demonstrates that our method can leverage the information from both EEG features while mitigating the challenges posed by the low signal-to-noise ratio inherent in EEG, thereby enhancing model robustness.
| Method | Avg. 2-way top-1 | Avg. 10-way top-3 | Best-of-5 2-way top-1 | Best-of-5 10-way top-3 |
| --- | --- | --- | --- | --- |
| Static | 51.64 | 32.39 | 68.75 | 55.14 |
| Dynamic | 50.86 | 31.50 | 71.25 | 54.30 |
| Concat | 53.22 | 34.11 | 69.72 | 56.53 |
| w/o De. | 53.94 | 34.42 | 65.00 | 48.54 |
| Full | 55.81 | 35.89 | 72.08 | 57.64 |
5.3 3D Reconstruction Task
5.3.1 Quantitative Results
Quantitative evaluation results of the baseline models and our proposed Neuro-3D are presented in Tab. 4. The generation performance is notably reduced when employing static or dynamic EEG features in isolation, particularly with dynamic features alone, potentially due to the increased noise levels inherent in dynamic EEG signals. Static EEG features offer stability yet lack sufficient 3D details, whereas EEG features evoked by dynamic videos provide a more comprehensive 3D representation but suffer from a lower signal-to-noise ratio. Integrating static and dynamic features leads to a more comprehensive and stable neural representation, thereby enhancing generation performance. Furthermore, compared to direct feature concatenation, our proposed neural aggregator more effectively merges static and dynamic information, reducing noise interference and further improving reconstruction performance. The decoupling of shape and color features minimizes cross-feature interference, yielding significant gains in 3D generation quality. Additionally, comparing Tab. 4 with Tab. 3 reveals a positive correlation between generation quality and classification accuracy, confirming that enriching features with high-level semantics enhances visual reconstruction performance.
5.3.2 Reconstructed Examples
Fig. 4 presents the generated results produced by Neuro-3D alongside the corresponding ground-truth objects. The results demonstrate that Neuro-3D not only successfully reconstructs simpler objects such as kegs and pottery but also performs well on more complex structures (such as elephants and horses), underscoring the model's robust shape perception capabilities. In terms of color generation, while the low spatial resolution of EEG signals poses challenges for detailed texture synthesis, our method effectively captures color styles that closely resemble those of the actual objects. Further results and an analysis of failure cases are provided in the Supplementary Material.
5.4 Analysis of Brain Regions
To examine the contribution of different brain regions to 3D visual perception, we generated saliency maps for 3 subjects by sequentially removing each of the 64 electrode channels, as illustrated in Fig. 5(a). Notably, the removal of occipital electrodes has the most significant effect on performance, as this region is strongly linked to the brain's visual processing pathways. This finding aligns with previous neuroscience discoveries regarding the brain's visual processing mechanisms [19, 65, 35]. Moreover, previous studies have identified the inferior temporal cortex in the temporal lobe as crucial for high-level semantic processing and object recognition [11, 3]. Consistent with this, the results shown in Fig. 5(a) suggest a potential correlation between visual decoding performance and this brain region. A comparative analysis of classification results across subjects reveals substantial variability in EEG signals between individuals. For a more in-depth examination, please refer to the Supplementary Material.
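One plausible way to compute such channel-level saliency is to ablate each electrode in turn and measure the drop in decoding accuracy. The sketch below zeroes a channel rather than physically removing it, which is an assumption about the exact procedure, and `model` is a hypothetical trained EEG classifier.

```python
import numpy as np
import torch

def channel_ablation_saliency(model, eeg, labels, n_channels=64):
    """Accuracy drop when each electrode channel is zeroed out; eeg: (B, channels, time)."""
    model.eval()
    with torch.no_grad():
        base_acc = (model(eeg).argmax(-1) == labels).float().mean().item()
        drops = np.zeros(n_channels)
        for c in range(n_channels):
            ablated = eeg.clone()
            ablated[:, c, :] = 0.0                # silence one channel
            acc = (model(ablated).argmax(-1) == labels).float().mean().item()
            drops[c] = base_acc - acc             # larger drop => more important channel
    return drops
```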
We further assess visual decoding performance by sequentially removing electrodes from five distinct brain regions. As shown in Fig. 5(b), removing electrodes from the occipital or temporal regions leads to a marked decrease in performance, which is consistent with our expectations. Additionally, removing electrodes from the temporal or parietal regions results in a more pronounced performance decline for dynamic stimuli than for static stimuli. This effect is likely attributable to the involvement of the dorsal visual pathway, which is responsible for motion perception and runs from the middle temporal visual area through the medial superior temporal area to the ventral intraparietal cortex in the parietal lobe [57, 21, 6].
6 Discussion and Conclusion
Limitations and Future Work. A limitation of our study is the simplification of texture generation to main color style prediction, owing to the complexity of detailed texture synthesis. Extending this work to generate complete 3D textures is a key direction for future research. Moreover, given the substantial individual variations in EEG, future work should also aim to enhance cross-subject generalization.
Conclusion. We explore a new task of reconstructing colored 3D objects from EEG signals, which is challenging but holds considerable importance for understanding the brain's mechanisms of real-time 3D perception. To facilitate this task, we develop the EEG-3D dataset, which integrates multimodal data and extensive EEG recordings. This dataset addresses the scarcity of EEG-3D object pairings, providing a valuable resource for future research in this domain. Furthermore, we propose a new framework, Neuro-3D, for extracting EEG-based visual features and reconstructing 3D objects. Neuro-3D leverages a diffusion-based 3D decoder for shape and color generation, conditioned on adaptively fused EEG features captured under static and dynamic 3D stimuli. Extensive experiments demonstrate the feasibility of decoding 3D information from EEG signals and confirm the alignment between EEG visual decoding and biological visual perception mechanisms.
References
- Allen etal. [2022]EmilyJ Allen, Ghislain St-Yves, Yihan Wu, JesseL Breedlove, JacobS Prince,LoganT Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest,etal.A massive 7t fmri dataset to bridge cognitive neuroscience andartificial intelligence.Nature Neuroscience, 25(1):116–126, 2022.
- Bai etal. [2023]Yunpeng Bai, Xintao Wang, Yan-pei Cao, Yixiao Ge, Chun Yuan, and Ying Shan.Dreamdiffusion: Generating high-quality images from brain eegsignals.arXiv preprint arXiv:2306.16934, 2023.
- Bao etal. [2020]Pinglei Bao, Liang She, Mason McGill, and DorisY Tsao.A map of object space in primate inferotemporal cortex.Nature, 583(7814):103–108, 2020.
- Beliy etal. [2019]Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, and MichalIrani.From voxels to pixels and back: Self-supervision in natural-imagereconstruction from fMRI.Advances in Neural Information Processing Systems, 32, 2019.
- Chang etal. [2019]Nadine Chang, JohnA Pyles, Austin Marcus, Abhinav Gupta, MichaelJ Tarr, andElissaM Aminoff.BOLD5000, a public fMRI dataset while viewing 5000 visual images.Scientific Data, 6(1):49, 2019.
- Chen etal. [2011]Aihua Chen, GregoryC DeAngelis, and DoraE Angelaki.Representation of vestibular and visual cues to self-motion inventral intraparietal cortex.Journal of Neuroscience, 31(33):12036–12052, 2011.
- Chen etal. [2024a]Yuqi Chen, Kan Ren, Kaitao Song, Yansen Wang, Yifan Wang, Dongsheng Li, andLili Qiu.Eegformer: Towards transferable and interpretable large-scale eegfoundation model.arXiv preprint arXiv:2401.10278, 2024a.
- Chen etal. [2023]Zijiao Chen, Jiaxin Qing, Tiange Xiang, WanLin Yue, and JuanHelen Zhou.Seeing beyond the brain: Masked modeling conditioned diffusionmodel for human vision decoding.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, 2023.
- Chen etal. [2024b]Zijiao Chen, Jiaxin Qing, and JuanHelen Zhou.Cinematic mindscapes: High-quality video reconstruction from brainactivity.Advances in Neural Information Processing Systems, 36,2024b.
- Deitke etal. [2023]Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, EliVanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and AliFarhadi.Objaverse: A universe of annotated 3D objects.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 13142–13153, 2023.
- DiCarlo and Cox [2007]JamesJ DiCarlo and DavidD Cox.Untangling invariant object recognition.Trends in Cognitive Sciences, 11(8):333–341, 2007.
- Engel etal. [1994]StephenA Engel, DavidE Rumelhart, BrianA Wandell, AdrianT Lee, GaryHGlover, Eduardo-Jose Chichilnisky, MichaelN Shadlen, etal.fMRI of human visual cortex.Nature, 369(6481):525–525, 1994.
- Gao etal. [2024a]Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, and Yanwei Fu.fmri-3d: A comprehensive dataset for enhancing fmri-based 3dreconstruction.arXiv preprint arXiv:2409.11315, 2024a.
- Gao etal. [2024b]Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, and Yanwei Fu.Mind-3D: Reconstruct high-quality 3D objects in human brain.In European Conference on Computer Vision, 2024b.
- Gibson etal. [2022]Erin Gibson, NancyJ Lobaugh, Steve Joordens, and AnthonyR McIntosh.Eeg variability: Task-driven or subject-driven signal of interest?NeuroImage, 252:119034, 2022.
- Gifford etal. [2022]AlessandroT Gifford, Kshitij Dwivedi, Gemma Roig, and RadoslawM Cichy.A large and rich EEG dataset for modeling human visual objectrecognition.NeuroImage, 264:119754, 2022.
- Gramfort etal. [2013]Alexandre Gramfort, Martin Luessi, Eric Larson, DenisA Engemann, DanielStrohmeier, Christian Brodbeck, Roman Goj, Mainak Jas, Teon Brooks, LauriParkkonen, etal.MEG and EEG data analysis with MNE-Python.Frontiers in Neuroinformatics, 7:267, 2013.
- Grill-Spector and Malach [2004]Kalanit Grill-Spector and Rafael Malach.The human visual cortex.Annual Review of Neuroscience, 27(1):649–677, 2004.
- Grill-Spector etal. [2001]Kalanit Grill-Spector, Zoe Kourtzi, and Nancy Kanwisher.The lateral occipital complex and its role in object recognition.Vision Research, 41(10-11):1409–1422,2001.
- Grootswagers etal. [2022]Tijl Grootswagers, Ivy Zhou, AmandaK Robinson, MartinN Hebart, and ThomasACarlson.Human EEG recordings for 1,854 concepts presented in rapid serialvisual presentation streams.Scientific Data, 9(1):3, 2022.
- Gu etal. [2012]Yong Gu, GregoryC DeAngelis, and DoraE Angelaki.Causal links between dorsal medial superior temporal area neurons andmultisensory heading perception.Journal of Neuroscience, 32(7):2299–2313,2012.
- Guggenmos etal. [2018]Matthias Guggenmos, Philipp Sterzer, and RadoslawMartin Cichy.Multivariate pattern analysis for meg: A comparison of dissimilaritymeasures.NeuroImage, 173:434–447, 2018.
- Hebb [2005]DonaldOlding Hebb.The organization of behavior: A neuropsychological theory.Psychology press, 2005.
- Hendee and Wells [1997]WilliamR Hendee and PeterNT Wells.The perception of visual information.Springer Science & Business Media, 1997.
- Ho etal. [2020]Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems,33:6840–6851, 2020.
- Horikawa and Kamitani [2017]Tomoyasu Horikawa and Yukiyasu Kamitani.Generic decoding of seen and imagined objects using hierarchicalvisual features.Nature Communications, 8(1):15037, 2017.
- Huang etal. [2023]Gan Huang, Zhiheng Zhao, Shaorong Zhang, Zhenxing Hu, Jiaming Fan, Meisong Fu,Jiale Chen, Yaqiong Xiao, Jun Wang, and Guo Dan.Discrepancy between inter-and intra-subject variability in eeg-basedmotor imagery brain-computer interface: Evidence from multiple perspectives.Frontiers in Neuroscience, 17:1122661, 2023.
- Jiang etal. [2024]Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu.Large brain model for learning generic representations withtremendous eeg data in bci.In International Conference on Learning Representations, 2024.
- Kavasidis etal. [2017]Isaak Kavasidis, Simone Palazzo, Concetto Spampinato, Daniela Giordano, andMubarak Shah.Brain2image: Converting brain signals into images.In Proceedings of the 25th ACM International Conference onMultimedia, pages 1809–1817, 2017.
- Kim etal. [2022]Gwanghyun Kim, Taesung Kwon, and JongChul Ye.Diffusionclip: Text-guided diffusion models for robust imagemanipulation.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 2426–2435, 2022.
- Kingma and Ba [2014]DiederikP Kingma and Jimmy Ba.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
- Kupershmidt etal. [2022]Ganit Kupershmidt, Roman Beliy, Guy Gaziv, and Michal Irani.A penny for your (visual) thoughts: Self-supervised reconstructionof natural movies from brain activity.arXiv preprint arXiv:2206.03544, 2022.
- Lahner etal. [2024]Benjamin Lahner, Kshitij Dwivedi, Polina Iamshchinina, Monika Graumann, AlexLascelles, Gemma Roig, AlessandroThomas Gifford, Bowen Pan, SouYoung Jin,NApurva RatanMurty, etal.Modeling short visual events through the bold moments video fmridataset and metadata.Nature Communications, 15(1):6241, 2024.
- Lawhern etal. [2018]VernonJ Lawhern, AmeliaJ Solon, NicholasR Waytowich, StephenM Gordon,ChouP Hung, and BrentJ Lance.EEGNet: a compact convolutional neural network for EEG-basedbrain–computer interfaces.Journal of Neural Engineering, 15(5):056013, 2018.
- Li etal. [2024]Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu.Visual decoding and reconstruction via EEG embeddings with guideddiffusion.In Advances in Neural Information Processing Systems, 2024.
- Li etal. [2023]Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen imageencoders and large language models.In International Conference on Machine Learning, pages19730–19742. PMLR, 2023.
- Lin etal. [2022]Sikun Lin, Thomas Sprague, and AmbujK Singh.Mind reader: Reconstructing complex images from brain activities.Advances in Neural Information Processing Systems,35:29624–29636, 2022.
- Liu etal. [2023]Ruoshi Liu, Rundi Wu, Basile VanHoorick, Pavel Tokmakov, Sergey Zakharov, andCarl Vondrick.Zero-1-to-3: Zero-shot one image to 3D object.In Proceedings of the IEEE/CVF International Conference onComputer Vision, pages 9298–9309, 2023.
- Liu etal. [2019]Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han.Point-voxel cnn for efficient 3d deep learning.Advances in Neural Information Processing Systems, 32, 2019.
- Long etal. [2024]Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu,Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, etal.Wonder3D: Single image to 3D using cross-domain diffusion.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 9970–9980, 2024.
- Luo etal. [2024]Andrew Luo, Maggie Henderson, Leila Wehbe, and Michael Tarr.Brain diffusion for visual exploration: Cortical discovery usinglarge scale generative models.Advances in Neural Information Processing Systems, 36, 2024.
- Luo and Hu [2021]Shitong Luo and Wei Hu.Diffusion probabilistic models for 3D point cloud generation.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 2837–2845, 2021.
- Melas-Kyriazi etal. [2023]Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi.Pc2: Projection-conditioned point cloud diffusion for single-image3D reconstruction.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 12923–12932, 2023.
- Metzger etal. [2023]SeanL Metzger, KayloT Littlejohn, AlexanderB Silva, DavidA Moses,MargaretP Seaton, Ran Wang, MaximilianE Dougherty, JessieR Liu, Peter Wu,MichaelA Berger, etal.A high-performance neuroprosthesis for speech decoding and avatarcontrol.Nature, 620(7976):1037–1046, 2023.
- Moses etal. [2021]DavidA Moses, SeanL Metzger, JessieR Liu, GopalaK Anumanchipalli, JosephGMakin, PengfeiF Sun, Josh Chartier, MaximilianE Dougherty, PatriciaM Liu,GaryM Abrams, etal.Neuroprosthesis for decoding speech in a paralyzed person withanarthria.New England Journal of Medicine, 385(3):217–227, 2021.
- Nichol etal. [2022]Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen.Point-e: A system for generating 3D point clouds from complexprompts.arXiv preprint arXiv:2212.08751, 2022.
- Nichol and Dhariwal [2021]AlexanderQuinn Nichol and Prafulla Dhariwal.Improved denoising diffusion probabilistic models.In International Conference on Machine Learning, pages8162–8171. PMLR, 2021.
- Ozcelik etal. [2022]Furkan Ozcelik, Bhavin Choksi, Milad Mozafari, Leila Reddy, and RufinVanRullen.Reconstruction of perceived images from fMRI patterns and semanticbrain exploration using instance-conditioned GANs.In 2022 International Joint Conference on Neural Networks,pages 1–8. IEEE, 2022.
- Poole etal. [2022]Ben Poole, Ajay Jain, JonathanT Barron, and Ben Mildenhall.Dreamfusion: Text-to-3D using 2D diffusion.arXiv preprint arXiv:2209.14988, 2022.
- Qi etal. [2017]CharlesRuizhongtai Qi, Li Yi, Hao Su, and LeonidasJ Guibas.Pointnet++: Deep hierarchical feature learning on point sets in ametric space.Advances in Neural Information Processing Systems, 30, 2017.
- Radford etal. [2021]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,etal.Learning transferable visual models from natural languagesupervision.In International Conference on Machine Learning, pages8748–8763. PMLR, 2021.
- Ramesh etal. [2021]Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, AlecRadford, Mark Chen, and Ilya Sutskever.Zero-shot text-to-image generation.In International Conference on Machine Learning, pages8821–8831. PMLR, 2021.
- Ren etal. [2024]Zhiyuan Ren, Minchul Kim, Feng Liu, and Xiaoming Liu.TIGER: Time-varying denoising model for 3D point cloudgeneration with diffusion process.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 9462–9471, 2024.
- Rombach etal. [2022]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and BjörnOmmer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 10684–10695, 2022.
- Saha and Baumert [2020]Simanto Saha and Mathias Baumert.Intra-and inter-subject variability in eeg-based sensorimotor braincomputer interface: a review.Frontiers in Computational Neuroscience, 13:87,2020.
- Saharia etal. [2022]Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, EmilyLDenton, Kamyar Ghasemipour, Raphael GontijoLopes, Burcu KaragolAyan, TimSalimans, etal.Photorealistic text-to-image diffusion models with deep languageunderstanding.Advances in Neural Information Processing Systems,35:36479–36494, 2022.
- Salzman etal. [1992]CDaniel Salzman, ChiekoM Murasugi, KennethH Britten, and WilliamT Newsome.Microstimulation in visual area mt: effects on directiondiscrimination performance.Journal of Neuroscience, 12(6):2331–2355,1992.
- Sargent etal. [2024]Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, YunzhiZhang, EricRyan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, etal.Zeronvs: Zero-shot 360-degree view synthesis from a single realimage.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 9420–9429, 2024.
- Schirrmeister etal. [2017]RobinTibor Schirrmeister, JostTobias Springenberg, Lukas DominiqueJosefFiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann,Frank Hutter, Wolfram Burgard, and Tonio Ball.Deep learning with convolutional neural networks for eeg decoding andvisualization.Human Brain Mapping, 38(11):5391–5420,2017.
- Scotti etal. [2024a]Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen,Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, KennethNorman, etal.Reconstructing the mind’s eye: fMRI-to-image with contrastivelearning and diffusion priors.Advances in Neural Information Processing Systems, 36,2024a.
- Scotti etal. [2024b]PaulS Scotti, Mihir Tripathy, Cesar KadirTorrico Villanueva, Reese Kneeland,Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, ThomasNaselaris, KennethA Norman, etal.Mindeye2: Shared-subject models enable fMRI-to-image with 1 hourof data.In International Conference on Machine Learning,2024b.
- Shen etal. [2019]Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani.Deep image reconstruction from human brain activity.PLoS Computational Biology, 15(1):e1006633, 2019.
- Singh etal. [2023]Prajwal Singh, Pankaj Pandey, Krishna Miyapuram, and Shanmuganathan Raman.EEG2IMAGE: image reconstruction from EEG brain signals.In ICASSP 2023-2023 IEEE International Conference on Acoustics,Speech and Signal Processing, pages 1–5. IEEE, 2023.
- Song etal. [2022]Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao.Eeg conformer: Convolutional transformer for eeg decoding andvisualization.IEEE Transactions on Neural Systems and RehabilitationEngineering, 31:710–719, 2022.
- Song etal. [2024]Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, and XiaorongGao.Decoding natural images from EEG for object recognition.In The Twelfth International Conference on LearningRepresentations, 2024.
- Sun etal. [2024a]Jingyuan Sun, Mingxiao Li, Zijiao Chen, and Marie-Francine Moens.Neurocine: Decoding vivid video sequences from human brain activties.arXiv preprint arXiv:2402.01590, 2024a.
- Sun etal. [2024b]Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, andMarie-Francine Moens.Contrast, attend and diffuse to decode high-resolution images frombrain activities.Advances in Neural Information Processing Systems, 36,2024b.
- Sur and Sinha [2009]Shravani Sur and VinodKumar Sinha.Event-related potential: An overview.Industrial Psychiatry Journal, 18(1):70–73, 2009.
- Takagi and Nishimoto [2023]Yu Takagi and Shinji Nishimoto.High-resolution image reconstruction with latent diffusion modelsfrom human brain activity.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 14453–14463, 2023.
- Vahdat etal. [2022]Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, KarstenKreis, etal.Lion: Latent point diffusion models for 3D shape generation.Advances in Neural Information Processing Systems,35:10021–10039, 2022.
- Vaswani etal. [2017]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in Neural Information Processing Systems, 2017.
- Wang etal. [2022]Chong Wang, Hongmei Yan, Wei Huang, Jiyi Li, Yuting Wang, Yun-Shuang Fan, WeiSheng, Tao Liu, Rong Li, and Huafu Chen.Reconstructing rapid natural vision with fMRI-conditional videogenerative adversarial network.Cerebral Cortex, 32(20):4502–4511, 2022.
- Wang and Isola [2020]Tongzhou Wang and Phillip Isola.Understanding contrastive representation learning through alignmentand uniformity on the hypersphere.In International Conference on Machine Learning, pages9929–9939. PMLR, 2020.
- Wen etal. [2018]Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and ZhongmingLiu.Neural encoding and decoding with deep learning for dynamic naturalvision.Cerebral Cortex, 28(12):4136–4160, 2018.
- Wu etal. [2023]Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, and Ajmal Mian.Sketch and text guided diffusion model for colored point cloudgeneration.In Proceedings of the IEEE/CVF International Conference onComputer Vision, pages 8929–8939, 2023.
- Xu etal. [2024]Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin.Pointllm: Empowering large language models to understand pointclouds.European Conference on Computer Vision, 2024.
- Yi etal. [2024]Ke Yi, Yansen Wang, Kan Ren, and Dongsheng Li.Learning topology-agnostic eeg representations with geometry-awaremodeling.Advances in Neural Information Processing Systems, 36, 2024.
- Yu etal. [2024]Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, and Bin Cui.Accelerating text-to-image editing via cache-enabled sparse diffusioninference.In Proceedings of the AAAI Conference on ArtificialIntelligence, pages 16605–16613, 2024.
- Zhang etal. [2023]Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.Adding conditional control to text-to-image diffusion models.In Proceedings of the IEEE/CVF International Conference onComputer Vision, pages 3836–3847, 2023.
- Zhou etal. [2021]Linqi Zhou, Yilun Du, and Jiajun Wu.3D shape generation and completion through point-voxel diffusion.In Proceedings of the IEEE/CVF International Conference onComputer Vision, pages 5826–5835, 2021.
- Zhu etal. [2023]Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advancedlarge language models.arXiv preprint arXiv:2304.10592, 2023.
Supplementary Material
7 EEG Data Preprocessing
In this section, we describe the EEG preprocessing pipeline. During data acquisition, each static 3D image and dynamic 3D video stimulus was preceded by a marker to streamline subsequent data processing. The continuous EEG recordings were preprocessed using MNE [17]. The data were segmented into fixed-length epochs (1 s for static stimuli and 6 s for dynamic stimuli), time-locked to stimulus onset, with baseline correction achieved by subtracting the mean signal amplitude during the pre-stimulus period. The signals were downsampled from 1000 Hz to 250 Hz, and a bandpass filter (0.1–100 Hz) was applied in conjunction with a 50 Hz notch filter to mitigate noise. To normalize signal amplitude variability across channels, multivariate noise normalization was employed [22]. Consistent with established practices [35], two stimulus repetitions were treated as independent samples during training to enhance learning, while testing involved averaging across four repetitions to improve the signal-to-noise ratio, following principles similar to those used in Event-Related Potential (ERP) analysis [68].
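A rough MNE-based sketch of this pipeline is shown below for the dynamic-stimulus epochs; the file name, event codes, pre-stimulus window, and the exact ordering of steps are assumptions rather than the released pipeline, and multivariate noise normalization is left as a separate step on the extracted arrays.

```python
import mne

# Sketch of the preprocessing described above (dynamic-stimulus epochs shown;
# static epochs would use a 1 s window analogously).
raw = mne.io.read_raw_brainvision("sub-01.vhdr", preload=True)  # 64-channel recording (assumed format)
raw.filter(l_freq=0.1, h_freq=100.0)            # band-pass 0.1-100 Hz
raw.notch_filter(freqs=50.0)                    # suppress 50 Hz line noise
events, _ = mne.events_from_annotations(raw)    # stimulus-onset markers
epochs = mne.Epochs(raw, events, event_id={"dynamic": 2},
                    tmin=-0.2, tmax=6.0, baseline=(None, 0), preload=True)
epochs.resample(250)                            # downsample 1000 Hz -> 250 Hz
data = epochs.get_data()                        # (n_epochs, 64, n_times); apply
                                                # multivariate noise normalization here
```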
8 Evaluation Metrics for Reconstruction Benchmark
To assess the quality of the generated outputs, we adopt the N-way top-K metric, a standard approach in 2D image decoding [35, 9, 8]. For 2D image evaluation, a pre-trained ImageNet-1K classifier is typically used to classify both the generated images and their corresponding ground-truth images. Analogously, we use data from Objaverse [10] to pre-train a PointNet++ model [50]. To ensure classifier reliability, the network is trained on all Objaverse data with category labels, excluding the test set used in our study. The point cloud data corresponding to the 3D objects are sourced from [76]. During evaluation, both the generated point clouds and their corresponding ground-truth point clouds are classified using the trained network. The results are then analyzed to confirm whether the reconstructed object is correctly identified within the top K categories among the N selected candidates. For efficiency of evaluation, we utilize data from the first five subjects to train and evaluate the reconstruction model. Moreover, a distinctive feature of the diffusion model is its dependence on the initialization noise, which can influence the generated outputs. We therefore perform five independent inferences for each object and compute the average N-way top-K metric across these runs. Additionally, to capture the potential best-case performance, we identify the optimal result based on the classifier's predicted scores across the five inferences and compute the N-way top-K metric on it.
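The metric itself is straightforward to express in code. The sketch below assumes per-object classifier logits for each of the five generations; the number of random distractor draws per trial is an assumption, since the exact sampling protocol is not specified here.

```python
import numpy as np

def n_way_top_k(logits, gt, n_way, top_k, rng):
    """One N-way top-K trial: is the true class ranked in the top-K among itself
    plus (n_way - 1) randomly sampled distractor classes?"""
    distractors = rng.choice(np.setdiff1d(np.arange(logits.shape[-1]), gt),
                             size=n_way - 1, replace=False)
    cand = np.concatenate(([gt], distractors))
    order = cand[np.argsort(logits[cand])[::-1]]         # candidate classes by descending score
    return gt in order[:top_k]

def evaluate_runs(run_logits, gt, n_way=2, top_k=1, n_trials=100, seed=0):
    """Average metric over the five generations per object plus the best-scoring run."""
    rng = np.random.default_rng(seed)
    per_run = [np.mean([n_way_top_k(l, gt, n_way, top_k, rng) for _ in range(n_trials)])
               for l in run_logits]                       # run_logits: list of 5 arrays (n_classes,)
    best = run_logits[int(np.argmax([l[gt] for l in run_logits]))]
    best_acc = np.mean([n_way_top_k(best, gt, n_way, top_k, rng) for _ in range(n_trials)])
    return float(np.mean(per_run)), float(best_acc)
```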
9 Analysis of Individual Difference
We present the performance variability across individuals on two classification tasks, as illustrated in Fig. 6. On both tasks, individual performance consistently exceeds chance level, demonstrating that EEG signals encode visual perception information and that our method effectively extracts and utilizes this information for decoding. Notably, performance varies across tasks for the same individual. For instance, participant performs significantly below average in object classification but achieves above-average results in color classification, suggesting distinct neural mechanisms underlying the processing of different visual attributes and their representation in EEG signals.
Furthermore, it has been widely confirmed that EEG signals exhibit substantial individual variations [15, 55, 27]. As shown in Fig. 6, significant differences are observed between individuals performing the same task, particularly in object classification, where and exhibit superior performance, while , and fall markedly below average. Similar variability is observed in color classification, albeit to a lesser extent. These results verify the pronounced inter-subject differences in EEG signals and highlight a critical challenge for cross-subject EEG visual decoding, where performance remains suboptimal. Addressing this variability is a key focus for future research.
10 More Reconstructed Samples
Additional reconstructed results alongside their corresponding ground truth point clouds are presented in Fig. 7. The proposed Neuro-3D framework exhibits robust performance, effectively capturing semantic categories, shape details, and the overall color of various objects.
11 Analysis of Failure Cases
Fig. 8 illustrates representative failure cases, which fall into two principal types: inaccuracies in detailed shape prediction and semantic reconstruction errors. Despite these limitations, certain features of the stimulus objects, including shape contours and color information, are partially preserved in the displayed reconstructions. These shortcomings primarily arise from the inherent challenges of the low signal-to-noise ratio and limited spatial resolution of EEG signals, which constrain the performance of 3D object reconstruction. Addressing these issues is a promising direction for future improvement.