Adversarial examples are inputs formed by applying a small perturbation to an example, one that is negligible to humans but changes the decision of a computer system. They were first discovered in object recognition (Szegedy et al. 2014)[1] and later found in natural language systems as well (Jia and Liang, 2017)[2]. The phenomenon was popularized by news about autonomous cars misreading stop signs as speed limit signs and state-of-the-art computer vision systems mistaking cats for desktop computers, faces for non-faces, gibberish patterns for faces, and one face for another. It reveals a fundamental flaw in a large class of classifiers (Goodfellow et al. 2014)[3].
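
One common way to make this definition precise is sketched below. The symbols are assumptions introduced here for illustration, not notation taken from the cited papers: f is the classifier, x the clean input, δ the perturbation, ‖·‖_p a norm measuring its size, and ε the budget below which the change is treated as negligible.

```latex
x' = x + \delta, \qquad \|\delta\|_p \le \epsilon, \qquad f(x') \ne f(x)
```

The choice of norm and budget is only a proxy for "negligible to humans" and varies between papers.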


Subspaces of transferable adversarial examples: Tramèr et al. (2017)[4]

Universal adversarial perturbation:

History

From Goodfellow (2017):

  1. “Adversarial Classification”, Dalvi et al. 2004: fool spam filter
  2. “Evasion Attacks Against Machine Learning at Test Time”, Biggio et al. 2013: fool neural nets
  3. Szegedy et al. 2013: fool ImageNet classifiers imperceptibly
  4. Goodfellow et al. 2014: cheap, closed-form attack

Explanation

  • Linearity: Goodfellow et al. (2014)[3]
  • Data complexity of robust generalization (with no prior at all? what about robust generalization with the right prior?): Schmidt et al. (2018)[5]
  • "Identifying a robust classifier from limited training data is information theoretically possible but computationally intractable" (at least for a family of models called "statistical query"): Bubeck et al. (2018)[6]
  • "high dimensional geometry of data manifold" (but hey, people can do it...): Gilmer et al. (2018)[7]
  • inevitable consequence of "concentration of measure" in a metric measure space (but does our problem have it?): Mahloujifar et al. (2019)[8]
  • non-robust features (of the input) that are useful for normal classification but not for robust classification: Ilyas et al. (2019)[9]
    • "We define a feature to be a function mapping from the input space X to the real numbers, ... Note that this formal definition also captures what we abstractly think of as features (e.g., we can construct an f that captures how “furry” an image is)"
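
The linearity explanation in the first bullet can be illustrated with a minimal NumPy sketch. It applies a one-step sign-gradient perturbation (the fast gradient sign method of Goodfellow et al. (2014)[3]) to a logistic regression model. The model, its weights `w`, the input `x`, and the budget `eps` are all hypothetical values chosen for illustration; nothing here reproduces an experiment from the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_linear(x, y, w, b, eps):
    """One-step sign-gradient attack on logistic regression.

    y is a label in {-1, +1}; the loss is log(1 + exp(-y * (w.x + b))).
    """
    margin = y * (w @ x + b)
    grad_x = -y * sigmoid(-margin) * w  # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)    # move each coordinate by at most eps

# Hypothetical toy setup: n input dimensions, weights of magnitude 1.
rng = np.random.default_rng(0)
n = 1000
w = rng.choice([-1.0, 1.0], size=n)
b = 0.0
x = 0.05 * w   # clean input, confidently classified as +1
y = 1
eps = 0.1      # per-coordinate perturbation budget

x_adv = fgsm_linear(x, y, w, b, eps)
clean_score = w @ x + b      # approximately +50
adv_score = w @ x_adv + b    # approximately -50

print("max change per coordinate:", np.max(np.abs(x_adv - x)))
print("clean score:", clean_score, "adversarial score:", adv_score)
```

Each coordinate moves by at most `eps`, but the score shifts by `eps` times the L1 norm of `w`, which grows with the input dimension. That is the sense in which many small changes add up to a large change in a linear model.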

Some claim that adversarial examples are inevitable (hey, humans seem to be robust against them?):

  • Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier, 2018.
  • Justin Gilmer, Luke Metz, Fartash Faghri, Sam Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. In International Conference on Learning Representations Workshop, 2018.

Adversarial examples in natural language processing

Tasks

Classifying text into categories (e.g. Sports, Business) and reviews into good/bad (Soll et al. 2019)[10]

Attacks

From Soll et al. (2019)[10]: "algorithm by Samanta and Mehta [22], where the candidate pool P, from which possible words for insertion and replacement are drawn, was created from the following sources:

  • Synonyms gathered from the WordNet dataset [5],
  • Typos from a dataset [16] to ensure that the typos inserted are not recognized as artificial since they occur in normal texts written by humans, and
  • Keywords specific for one input class which were found by looking at all training sentences and extracting words only found in one class."
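
The three sources in the quote can be sketched in Python as follows. This is a rough illustration, not the algorithm of Samanta and Mehta or the code of Soll et al.: the function names, the toy sentences, and the `synonyms` and `typos` dictionaries (stand-ins for WordNet and a typo dataset) are hypothetical. Which class's keywords enter the pool is passed in as a parameter, because the quote does not specify it.

```python
from collections import defaultdict

def class_specific_keywords(sentences, labels):
    """Map each label to the words found only in sentences of that label."""
    classes_per_word = defaultdict(set)
    for sentence, label in zip(sentences, labels):
        for word in sentence.lower().split():
            classes_per_word[word].add(label)
    keywords = defaultdict(set)
    for word, classes in classes_per_word.items():
        if len(classes) == 1:  # word occurs in exactly one class
            keywords[next(iter(classes))].add(word)
    return dict(keywords)

def candidate_pool(word, keyword_label, synonyms, typos, keywords):
    """Union of synonyms, typos, and class keywords for one word."""
    pool = set(synonyms.get(word, ()))
    pool |= set(typos.get(word, ()))
    pool |= keywords.get(keyword_label, set())
    pool.discard(word)  # the word itself is not a useful candidate
    return pool

# Hypothetical toy training data.
sentences = [
    "the team won the match",
    "the striker scored a goal",
    "the company reported a profit",
    "shares of the company fell",
]
labels = ["Sports", "Sports", "Business", "Business"]

synonyms = {"profit": ["gain", "earnings"]}  # stand-in for WordNet lookups
typos = {"profit": ["proft"]}                # stand-in for a typo dataset

keywords = class_specific_keywords(sentences, labels)
pool = candidate_pool("profit", "Sports", synonyms, typos, keywords)
print(sorted(pool))
```

Words such as "the" and "a" appear in both classes and are therefore excluded from the keyword sets.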

Defenses

Distillation is shown to be ineffective (again) (Soll et al. 2019) [10]

TODO: Jia and Liang (2017)[11]: data augmentation not effective?

Evaluation

CIFAR-10

TODO: Qin et al. (2019)[12]

Failed defenses

Software

References

  1. Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. ICLR, 2014. arXiv preprint arXiv:1312.6199.
  2. Jia, Robin, and Percy Liang. "Adversarial Examples for Evaluating Reading Comprehension Systems." arXiv preprint arXiv:1707.07328 (2017).
  3. Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
  4. Tramèr, Florian, et al. "The Space of Transferable Adversarial Examples." arXiv preprint arXiv:1704.03453 (2017).
  5. Schmidt, L., Talwar, K., Santurkar, S., Tsipras, D., & Madry, A. (2018). Adversarially robust generalization requires more data. Advances in Neural Information Processing Systems (NeurIPS), 5014–5026.
  6. Bubeck, S., Price, E., & Razenshteyn, I. (2018). Adversarial examples from computational constraints, 1–19.
  7. Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., & Goodfellow, I. (2018). The Relationship Between High-Dimensional Geometry and Adversarial Examples.
  8. Mahloujifar, S., Diochnos, D. I., & Mahmoody, M. (2019). The Curse of Concentration in Robust Learning: Evasion and Poisoning Attacks from Concentration of Measure. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 4536–4543.
  9. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial Examples Are Not Bugs, They Are Features.
  10. Soll, M., Hinz, T., Magg, S., & Wermter, S. (2019). Evaluating Defensive Distillation for Defending Text Processing Neural Networks Against Adversarial Examples. International Conference on Artificial Neural Networks (ICANN), 685–696.
  11. Jia, R., Liang, P.: Adversarial examples for evaluating reading comprehension systems. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 2021–2031 (2017). DOI: 10.18653/v1/D17-1215
  12. Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., … Kohli, P. (2019). Adversarial Robustness through Local Linearization. NeurIPS, 1–17.
  13. Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. 35th International Conference on Machine Learning, ICML 2018, 1, 436–448.