Natural Language Understanding Wiki

Adversarial examples are small perturbations to an input that are negligible to humans but change the decision of a computer system. They were first discovered in object recognition (Szegedy et al. 2014)[1] but were later found in natural language systems as well (Jia and Liang, 2017)[2]. In terms of models, neural networks, linear models (e.g. SVMs) and decision trees are all known to suffer from adversarial examples (Zhou et al. 2021[3], among others). The phenomenon was broadly popularized via news about autonomous cars misinterpreting stop signs as speed-limit signs, and about state-of-the-art computer vision systems misinterpreting cats as desktop computers, mistaking faces for non-faces, gibberish patterns for faces, and one face for another. The phenomenon reveals a fundamental flaw in a big class of classifiers (Goodfellow et al. 2014)[4].


Subspaces of transferable adversarial examples: Tramèr et al. (2017)[5]

Universal adversarial perturbation:


From Goodfellow (2017):

  1. “Adversarial Classification”, Dalvi et al. 2004: fool spam filters
  2. “Evasion Attacks Against Machine Learning at Test Time”, Biggio et al. 2013: fool neural nets
  3. Szegedy et al. 2013: fool ImageNet classifiers imperceptibly
  4. Goodfellow et al. 2014: cheap, closed-form attack
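Goodfellow et al. (2014)'s cheap, closed-form attack is the fast gradient sign method (FGSM): move every input coordinate by ε in the direction of the sign of the loss gradient. A minimal sketch on a toy logistic-regression model (the weights, dimensions and ε below are illustrative, not from any real system):

```python
import numpy as np

def predict(x, w, b):
    """Linear classifier: class 1 iff w.x + b > 0."""
    return 1 if x @ w + b > 0 else 0

def fgsm(x, w, b, y, eps):
    """Fast gradient sign method: shift each coordinate of x by eps in the
    direction that increases the cross-entropy loss of the true label y."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(y=1|x)
    grad_x = (p - y) * w                    # closed-form input gradient
    return x + eps * np.sign(grad_x)

# Toy demo: a 100-dimensional input the model classifies confidently.
rng = np.random.default_rng(0)
w, b = rng.normal(size=100), 0.0
x = 0.05 * np.sign(w)                # "clean" input, confidently class 1
x_adv = fgsm(x, w, b, y=1, eps=0.1)  # each coordinate moves by at most 0.1
```

A per-coordinate change of only 0.1 flips the prediction here, because the tiny changes add up linearly across all 100 dimensions — the intuition behind the linearity hypothesis listed below.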


TODO: a survey with a list of hypotheses: Serban et al. (2020)[6]

  • Linearity: Goodfellow et al. (2014)[4]
  • Data complexity of robust generalization (with no prior at all? what about robust generalization with the right prior?): Schmidt et al. (2018)[7]
  • "Identifying a robust classifier from limited training data is information theoretically possible but computationally intractable" (at least for a family of models called "statistical query"): Bubeck et al. (2018)[8]
  • "high dimensional geometry of data manifold" (but hey, people can do it...): Gilmer et al. (2018)[9]
  • inevitable consequence of "concentration of measure" in metric measure spaces (but does our problem have it?): Mahloujifar et al. (2019)[10]
  • non-robust features (of the input) that are useful for normal classification but not for robust classification: Ilyas et al. (2019)[11]
    • "We define a feature to be a function mapping from the input space X to the real numbers, ... Note that this formal definition also captures what we abstractly think of as features (e.g., we can construct an f that captures how “furry” an image is)"
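To make Ilyas et al.'s definition concrete, here is a toy sketch of two hand-made features f: X → R (both hypothetical, purely for illustration): one that barely moves under an imperceptible perturbation and one whose value swings wildly, a caricature of a non-robust feature:

```python
import numpy as np

# Per Ilyas et al. (2019), a feature is any function f: X -> R.
def brightness(img):
    # A robust-style feature: mean intensity barely moves under small noise.
    return float(img.mean())

def brittle(img):
    # A caricature of a non-robust feature: a high-frequency function of the
    # pixels whose value changes drastically under tiny perturbations.
    return float(np.sin(1000.0 * img).sum())

rng = np.random.default_rng(0)
img = 0.5 * np.ones((32, 32, 3))                 # a flat gray "image"
delta = 1e-3 * rng.standard_normal(img.shape)    # imperceptible noise

robust_shift = abs(brightness(img + delta) - brightness(img))   # tiny
brittle_shift = abs(brittle(img + delta) - brittle(img))        # large
```

Note the caricature only captures brittleness; Ilyas et al.'s non-robust features are additionally *useful* for classification on natural data, which a toy like this does not demonstrate.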

Some claim that adversarial examples are inevitable (hey, humans seem to be robust against them?):

  • Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier, 2018.
  • Justin Gilmer, Luke Metz, Fartash Faghri, Sam Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. In International Conference on Learning Representations Workshop, 2018.

Adversarial examples in computer vision


  • Object recognition: see the survey Serban et al. (2020)[6]
  • Edge detection: Cosgrove and Yuille (2020)[12]
  • Semantic segmentation: Xie et al. (2017)[13]
  • Facial recognition: Sharif et al. (2016)[14]
  • Video classification: Li et al. (2018)[15]

TODO: Simen Thys, Wiebe Van Ranst, and Toon Goedemé. 2019. Fooling automated surveillance cameras: Adversarial patches to attack person detection. arXiv:1904.08653 (2019).

TODO: Xingxing Wei, Siyuan Liang, Xiaochun Cao, and Jun Zhu. 2018. Transferable adversarial attacks for image and video object detection. arXiv:1811.12641 (2018).


TODO: find refs

  • Small perturbation/imperceptible
    • norm constrained (l2, l-inf, etc.): lots of papers
    • "perceptual feature fidelity" constraint? [16]
    • SmoothFool[17]
  • Color attacks
    • Negative images[18]
    • Random color substitution[19]
    • ColorFool[20]
    • Small recoloring (combined with perturbation)[21]
    • contrast, brightness, grayscale conversion, intensity, solarize: Volpi & Murino (2019)[22]
    • lots of filters: FilterFool (Shamsabadi et al. 2020)[23]
    • Adversarial color enhancement (Zhao et al. 2020)[24][25]
    • more color filters: Kantipudi et al. (2020)[26]
    • yet more color perturbation: Bhattad et al. (2020)[27]
  • structure-preserving
    • Peng et al. (2018)???[28]
    • sharpness: Volpi & Murino (2019)[22]
    • EdgeFool[29]
    • Shifting/deforming: [30]
    • Rotation & translation: Li et al. 2020[31]
  • camera shake???[32]
  • Few-pixel attack
    • One pixel
    • k pixels
  • Semantic attacks
  • Shadow attacks[33]
  • Juxtaposition/occlusion attacks
    • "Adversarial turtles"
    • "Invisible cloaks"
    • banners
    • Adversarial Laser Beam: Duan et al. (2021)[34]
    • Adversarial Camouflage[35]
  • Feature-space attacks (white box, using the internal features to craft images)
    • D2B (Xu et al. 2020)[36]
    • Xu et al. (2020)[37]
  • Generative attacks (not using internal features)
    • using GAN: Song et al. (2018)[38]
    • differentiable rendering: Jain (2020)[39]
    • VAE??[40]
    • pose???[32]
    • more GAN??[41]
    • yet more GAN?? [42]
    • style transfer (texture)?[27]
  • Structure-preserving attack? Peng et al. (2020)[43]
  • move across time (in a video): Shankar et al. (2019)[44]
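The norm-constrained threat model at the top of this list (l2, l-inf) is usually enforced by projecting the perturbation back into an lp ball after each attack step. A minimal sketch (the radius values are illustrative):

```python
import numpy as np

def project_linf(delta, eps):
    """Project a perturbation onto the l-infinity ball of radius eps:
    clip every coordinate independently into [-eps, eps]."""
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    """Project a perturbation onto the l2 ball of radius eps:
    rescale it if it is too long, otherwise leave it unchanged."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)
```

Iterative attacks such as PGD alternate a gradient step with one of these projections, which is what keeps the final perturbation "small" in the chosen norm.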


Adversarial training: so far the most successful defense.
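Adversarial training in the standard (Madry et al.-style) formulation solves a min-max problem: an inner loop finds a worst-case perturbation (e.g. with PGD), and the outer loop updates the model on that perturbed input instead of the clean one. A minimal sketch with logistic regression standing in for the model (all hyperparameters here are illustrative):

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps, alpha, steps):
    """Inner maximization: projected gradient ascent on the loss, staying
    inside the l-infinity ball of radius eps around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the ball
    return x_adv

def adversarial_training_step(w, b, x, y, eps, lr):
    """Outer minimization: one gradient step on the adversarial example
    rather than the clean example."""
    def grad_x(x_, y_):
        p = 1.0 / (1.0 + np.exp(-(x_ @ w + b)))
        return (p - y_) * w                      # input gradient of the loss
    x_adv = pgd_linf(x, y, grad_x, eps, alpha=eps / 4, steps=10)
    p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
    return w - lr * (p - y) * x_adv, b - lr * (p - y)
```

The expensive part is the inner loop: every training step now costs `steps` extra gradient computations, which is why adversarial training is typically several times slower than standard training.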

TODO: lots and lots of defences

  • SVD (Jere et al. 2020)[45]

Confirmed fails

TODO: a lot of defences fall into this category

Works on simple datasets only

  • Tested on MNIST only: convex outer polytope (Wong & Kolter, 2018[50])?

Adversarial examples in natural language processing


Classifying text into categories (e.g. Sports, Business) or reviews into good/bad (Soll et al. 2019)[51]


From (Soll et al. 2019)[51]: "algorithm by Samanta and Mehta [22], where the candidate pool P, from which possible words for insertion and replacement are drawn, was created from the following sources:

  • Synonyms gathered from the WordNet dataset [5],
  • Typos from a dataset [16] to ensure that the typos inserted are not recognized as artificial since they occur in normal texts written by humans, and
  • Keywords specific for one input class which were found by looking at all training sentences and extracting words only found in one class."
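The procedure above can be sketched as a greedy search: at each word position, try the candidates from the pool P and keep the substitution that most reduces the classifier's confidence in the true class. `model_confidence` and the candidate pool below are hypothetical stand-ins, not the authors' actual components:

```python
def substitution_attack(words, candidates, model_confidence, true_label):
    """Greedy single pass: at each position, try every candidate replacement
    (synonym, typo, class-specific keyword) and keep the one that most
    reduces the model's confidence in the true label."""
    words = list(words)
    for i in range(len(words)):
        best, best_conf = words[i], model_confidence(words, true_label)
        for cand in candidates.get(words[i], []):
            trial = words[:i] + [cand] + words[i + 1:]
            conf = model_confidence(trial, true_label)
            if conf < best_conf:
                best, best_conf = cand, conf
        words[i] = best
    return words
```

For example, with a toy sentiment model whose confidence depends on the keyword "good", the pool entry `{"good": ["goood"]}` would make the attack swap in the typo and lower the score while keeping the text readable to a human.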


Defensive distillation is shown to be ineffective (again) (Soll et al. 2019)[51]

TODO: Jia and Liang (2017)[52]: data augmentation not effective?

Adversarial examples in other machine learning areas

TODO: in reinforcement learning

TODO: from Serban et al. (2020)[6]: malware detection [68, 78, 94, 101, 179] is heavily explored because it implies direct consequences on security. Other tasks such as reinforcement learning [10, 80, 106], speech recognition [23, 27], facial recognition [150], semantic segmentation [178] [...] are also explored

TODO: Yefet, N., Alon, U., & Yahav, E. (2020). Adversarial examples for models of code. Proceedings of the ACM on Programming Languages, 4(OOPSLA), 1-30.



  • Careful to avoid gradient obfuscation: Athalye et al. (2018)[46]
    • Check if random/transfer/blackbox attacks are more effective than whitebox attacks (red flag)
  • Always hand-design adaptive attacks for evaluation (unless simpler attacks suffice); be careful, as many things can go wrong there: Tramer et al. (2020)[53]. The authors identified the following "themes" for creating effective adaptive attacks:
    • T0: Strive for simplicity
    • T1: Attack (a function close to) the full defense
    • T2: Identify and target important defense parts
    • T3: Adapt the objective to simplify the attack
    • T4: Ensure the loss function is consistent
    • T5: Optimize the loss function with different methods
    • T6: Use a strong adaptive attack for adversarial training



TODO: Qin et al. (2019)[54]

Datasets with built-in perturbation

These are easier to use but weaker, since the perturbations are fixed in advance rather than adapted to the model under attack.



  1. Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. ICLR 2014, arXiv:1312.6199.
  2. Jia, Robin, and Percy Liang. "Adversarial Examples for Evaluating Reading Comprehension Systems." arXiv preprint arXiv:1707.07328 (2017).
  3. Zhou, D., Liu, T., Han, B., Wang, N., Peng, C., & Gao, X. (2021). Towards Defending against Adversarial Examples via Attack-Invariant Features. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (Vol. 139, pp. 12835–12845). PMLR.
  4. 4.0 4.1 Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
  5. Tramèr, Florian, et al. "The Space of Transferable Adversarial Examples." arXiv preprint arXiv:1704.03453 (2017).
  6. 6.0 6.1 6.2 Serban, A., Poll, E., & Visser, J. (2020). Adversarial Examples on Object Recognition. ACM Computing Surveys, 53(3), 1–38.
  7. Schmidt, L., Talwar, K., Santurkar, S., Tsipras, D., & Madry, A. (2018). Adversarially robust generalization requires more data. Advances in Neural Information Processing Systems (NeurIPS 2018), 5014–5026.
  8. Bubeck, S., Price, E., & Razenshteyn, I. (2018). Adversarial examples from computational constraints, 1–19.
  9. Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., & Goodfellow, I. (2018). The Relationship Between High-Dimensional Geometry and Adversarial Examples.
  10. Mahloujifar, S., Diochnos, D. I., & Mahmoody, M. (2019). The Curse of Concentration in Robust Learning: Evasion and Poisoning Attacks from Concentration of Measure. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 4536–4543.
  11. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial Examples Are Not Bugs, They Are Features.
  12. Cosgrove, C., & Yuille, A. L. (2020). Adversarial examples for edge detection: They exist, and they transfer. Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020, 1059–1068.
  13. Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, A. (2017). Adversarial Examples for Semantic Segmentation and Object Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), 1378–1387.
  14. Sharif, M., Bhagavatula, S., Bauer, L., & Reiter, M. K. (2016). Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. Proceedings of the ACM Conference on Computer and Communications Security, 1528–1540.
  15. Li, S., Neupane, A., Paul, S., Song, C., Krishnamurthy, S. V., Chowdhury, A. K. R., & Swami, A. (2018). Adversarial Perturbations Against Real-Time Video Classification Systems. ArXiv.
  16. Quan, P., Guo, R., & Srivastava, M. (n.d.). Towards Imperceptible Query-limited Adversarial Attacks with Perceptual Feature Fidelity Loss, 1–11.
  17. Dabouei, A., Soleymani, S., Taherkhani, F., Dawson, J., & Nasrabadi, N. M. (2020). SmoothFool: An efficient framework for computing smooth adversarial perturbations. Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020, 2654–2663.
  18. Hosseini, H., Xiao, B., Jaiswal, M., & Poovendran, R. (2017). On the limitation of convolutional neural networks in recognizing negative images. Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017, 352–358.
  19. Hosseini, H., & Poovendran, R. (2018). Semantic Adversarial Examples. CVPR 2018, 1727–1732.
  20. Shahin Shamsabadi, A., Sanchez-Matilla, R., & Cavallaro, A. (2020). ColorFool: Semantic Adversarial Colorization. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1148–1157.
  21. Laidlaw, C., & Feizi, S. (2019). Functional Adversarial Attacks. ArXiv, (NeurIPS), 1–16.
  22. 22.0 22.1 Volpi, R., & Murino, V. (2019). Addressing model vulnerability to distributional shifts over image transformation sets. ArXiv.
  23. Peng, D., Zheng, Z., Luo, L., & Zhang, X. (2020). Structure matters: Towards generating transferable adversarial images. Frontiers in Artificial Intelligence and Applications, 325, 1419–1426.
  24. Zhao, Z., Liu, Z., & Larson, M. (2020). Adversarial color enhancement: Generating unrestricted adversarial images by optimizing a color filter. ArXiv, 1–14.
  25. Zhao, Z., Liu, Z., & Larson, M. (2020). Adversarial robustness against image color transformation within parametric filter space. ArXiv, 1–20.
  26. Kantipudi, J., Dubey, S. R., & Chakraborty, S. (2020). Color Channel Perturbation Attacks for Fooling Convolutional Neural Networks and A Defense Against Such Attacks. IEEE Transactions on Artificial Intelligence.
  27. 27.0 27.1 Bhattad, A., Chong, M. J., Liang, K., Li, B., & Forsyth, D. A. (2019). Unrestricted adversarial examples via semantic manipulation. ArXiv, (2018), 1–19.
  28. Peng, D., Zheng, Z., & Zhang, X. (2018). Structure-preserving transformation: Generating diverse and transferable adversarial examples. ArXiv.
  29. Shamsabadi, A. S., Oh, C., & Cavallaro, A. (2020). Edgefool: An Adversarial Image Enhancement Filter. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2020-May(2), 1898–1902.
  30. Naderi, H., Goli, L., & Kasaei, S. (2021). Generating Unrestricted Adversarial Examples via Three Parameters.
  31. Li, L., Weber, M., Xu, X., Rimanic, L., Xie, T., Zhang, C., & Li, B. (2020). Provable robust learning based on transformation-specific smoothing. In ICML Workshop on Uncertainty & Robustness in Deep Learning (UDL) 2020.
  32. 32.0 32.1 Ho, C. H., Leung, B., Sandstrom, E., Chang, Y., & Vasconcelos, N. (2019). Catastrophic child’s play: Easy to perform, hard to defend adversarial attacks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June, 9221–9229.
  33. Ghiasi, A., Shafahi, A., & Goldstein, T. (2020). Breaking certified defenses: semantic adversarial examples with spoofed robustness certificates. In International Conference on Learning Representations.
  34. Duan, R., Mao, X., Qin, A. K., Yang, Y., Chen, Y., Ye, S., & He, Y. (2021). Adversarial Laser Beam: Effective Physical-World Attack to DNNs in a Blink.
  35. Duan, R., Ma, X., Wang, Y., Bailey, J., Qin, A. K., & Yang, Y. (2020). Adversarial camouflage: Hiding physical-world attacks with natural styles. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 997–1005.
  36. Xu, Q., Tao, G., & Zhang, X. (2020). D2B: Deep distribution bound for natural-looking adversarial attack. ArXiv, 1–26.
  37. Xu, Q., Tao, G., Cheng, S., Tan, L., & Zhang, X. (2020). Towards feature space adversarial attack. ArXiv.
  38. Song, Y., Kushman, N., Shu, R., & Ermon, S. (2018). Constructing unrestricted adversarial examples with generative models. Advances in Neural Information Processing Systems, 2018-December(NeurIPS), 8312–8323.
  39. Jain, L. (2020). Generating Semantic Adversarial Examples through Differentiable Rendering.
  40. Wang, D., Li, C., Wen, S., Nepal, S., & Xiang, Y. (2019). Man-in-the-middle attacks against machine learning classifiers via malicious generative models. ArXiv, (October), 1–12.
  41. Dunn, I., Hanu, L., Pouget, H., Kroening, D., & Melham, T. (2020). Evaluating Robustness to Context-Sensitive Feature Perturbations of Different Granularities.
  42. Song, Y., Kushman, N., Shu, R., & Ermon, S. (2018). Generative Adversarial Examples, 8312–8323.
  43. Peng, D., Zheng, Z., Luo, L., & Zhang, X. (n.d.). Structure Matters : Towards Generating Transferable Adversarial Images.
  44. Shankar, V., Dave, A., Roelofs, R., Ramanan, D., Recht, B., & Schmidt, L. (2019). Do Image Classifiers Generalize Across Time? Retrieved from
  45. Jere, M., Kumar, M., & Koushanfar, F. (2020). A singular value perspective on model robustness. ArXiv.
  46. 46.0 46.1 Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. 35th International Conference on Machine Learning, ICML 2018, 1, 436–448.
  47. Buckman, J., Roy, A., Raffel, C., & Goodfellow, I. (2018). Thermometer encoding: One hot way to resist adversarial examples. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, (2016), 1–22.
  48. Croce, F., & Hein, M. (2020). Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. 37th International Conference on Machine Learning, ICML 2020, 2184–2194.
  49. Taghanaki, S. A., Abhishek, K., Azizi, S., and Hamarneh, G. A kernelized manifold mapping to diminish the effect of adversarial perturbations. In CVPR, 2019.
  50. Wong, E., & Kolter, J. Z. (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. 35th International Conference on Machine Learning, ICML 2018, 12, 8405–8423.
  51. 51.0 51.1 51.2 Soll, M., Hinz, T., Magg, S., & Wermter, S. (2019). Evaluating Defensive Distillation for Defending Text Processing Neural Networks Against Adversarial Examples. International Conference on Artificial Neural Networks (ICANN), 685–696.
  52. Jia, R., Liang, P.: Adversarial examples for evaluating reading comprehension systems. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2021–2031 (2017). DOI: 10.18653/v1/D17-1215
  53. Tramer, F., Carlini, N., Brendel, W., & Madry, A. (2020). On adaptive attacks to adversarial example defenses. ArXiv Preprint ArXiv:2002.08347.
  54. Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., … Kohli, P. (2019). Adversarial Robustness through Local Linearization, (NeurIPS), 1–17.