Di Wang

Physicians Committee for Responsible Medicine

ORCID: 0000-0001-6360-4360

Publishes on Remote-Sensing Image Classification, Advanced Image and Video Retrieval Techniques, Image Retrieval and Classification Techniques. 84 papers and 2.2k citations.

Top publications by citations

INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps
Wei Zhan, Liting Sun, Di Wang et al.|arXiv (Cornell University)|2019
Cited by 355|Open Access

Behavior-related research areas such as motion prediction/planning, representation/imitation learning, behavior modeling/generation, and algorithm testing, require support from high-quality motion datasets containing interactive driving scenarios with different driving cultures. In this paper, we present an INTERnational, Adversarial and Cooperative moTION dataset (INTERACTION dataset) in interactive driving scenarios with semantic maps. Five features of the dataset are highlighted. 1) The interactive driving scenarios are diverse, including urban/highway/ramp merging and lane changes, roundabouts with yield/stop signs, signalized intersections, intersections with one/two/all-way stops, etc. 2) Motion data from different countries and different continents are collected so that driving preferences and styles in different cultures are naturally included. 3) The driving behavior is highly interactive and complex with adversarial and cooperative motions of various traffic participants. Highly complex behavior such as negotiations, aggressive/irrational decisions and traffic rule violations are densely contained in the dataset, while regular behavior can also be found from cautious car-following, stop, left/right/U-turn to rational lane-change and cycling and pedestrian crossing, etc. 4) The levels of criticality span wide, from regular safe operations to dangerous, near-collision maneuvers. Real collision, although relatively slight, is also included. 5) Maps with complete semantic information are provided with physical layers, reference lines, lanelet connections and traffic rules. The data is recorded from drones and traffic cameras. Statistics of the dataset in terms of number of entities and interaction density are also provided, along with some utilization examples in a variety of behavior-related research areas. The dataset can be downloaded via https://interaction-dataset.com.

Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model
Di Wang, Qiming Zhang, Yufei Xu et al.|IEEE Transactions on Geoscience and Remote Sensing|2022
Cited by 262

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers (ViTs) being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this article, we resort to plain ViTs with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mean average precision (mAP) on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring. The code and models will be released at https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA.

An Empirical Study of Remote Sensing Pretraining
Di Wang, Jing Zhang, Bo Du et al.|IEEE Transactions on Geoscience and Remote Sensing|2022
Cited by 216|Open Access

Deep learning has largely reshaped remote sensing (RS) research for aerial image understanding and achieved great success. Nevertheless, most of the existing deep models are initialized with ImageNet pretrained weights. Because natural images inevitably present a large domain gap relative to aerial images, this probably limits the fine-tuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of RS pretraining (RSP) on aerial images. To this end, we train different networks from scratch with the help of the largest RS scene recognition dataset to date, MillionAID, to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNNs) and vision transformers, such as Swin and ViTAE, which have shown promising performance on computer vision tasks. Then, we investigate the impact of RSP on representative downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using these CNN and vision transformer backbones. The empirical study shows that RSP can help deliver distinctive performances in scene recognition tasks and in perceiving RS-related semantics, such as “Bridge” and “Airplane.” We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require different representations from scene recognition tasks. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The code and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.

Adaptive Spectral–Spatial Multiscale Contextual Feature Extraction for Hyperspectral Image Classification
Di Wang, Bo Du, Liangpei Zhang et al.|IEEE Transactions on Geoscience and Remote Sensing|2020
Cited by 129

In this article, we propose an end-to-end adaptive spectral-spatial multiscale network to extract multiscale contextual information for hyperspectral image (HSI) classification, which contains spectral feature extraction (FE) and spatial FE subnetworks. In the spectral FE subnetwork, unlike previous methods where features are obtained at a single scale, which limits the accuracy improvement, we propose two schemes based on a band grouping strategy, and the long short-term memory (LSTM) model is used for perceiving spectral multiscale information. In the spatial FE subnetwork, building on an existing multiscale architecture, the spatial contextual features that are usually ignored in previous literature are obtained with the aid of a convolutional LSTM (ConvLSTM) model. In addition, a new spatial grouping strategy is proposed so that the ConvLSTM can extract more discriminative features. Then, a novel adaptive feature combination scheme is proposed that accounts for the different importance of the spectral and spatial parts. Experiments on three public data sets in the HSI community demonstrate that our methods achieve competitive results compared with other state-of-the-art methods.