Adversarial examples are small perturbations of an input that are negligible to humans but change the decision of a computer system. They were first discovered in object recognition (Szegedy et al. 2014) and later found in natural language systems as well (Jia and Liang, 2017). The phenomenon was broadly popularized by news stories about autonomous cars misinterpreting stop signs as speed-limit signs, and state-of-the-art computer vision systems mistaking cats for desktop computers, faces for non-faces, gibberish patterns for faces, and one face for another. It reveals a fundamental flaw in a large class of classifiers (Goodfellow et al. 2014).
Subspaces of transferable adversarial examples: Tramèr et al. (2017)
Universal adversarial perturbations: Moosavi-Dezfooli et al. (2017), https://arxiv.org/pdf/1610.08401.pdf
From Goodfellow (2017):
- “Adversarial Classification” Dalvi et al 2004: fool spam filter
- “Evasion Attacks Against Machine Learning at Test Time”, Biggio et al. 2013: fool neural nets
- Szegedy et al 2013: fool ImageNet classifiers imperceptibly
- Goodfellow et al 2014: cheap, closed-form attack (the fast gradient sign method, FGSM)
- Distillation: Papernot et al. (2016)
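The cheap, closed-form attack of Goodfellow et al. 2014 computes x_adv = x + ε·sign(∇ₓ L(θ, x, y)), i.e. it steps each input dimension by ±ε in the direction that increases the loss. A minimal sketch, assuming a toy logistic-regression classifier with random weights (the model and values are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast gradient sign method: x_adv = x + eps * sign(grad_x loss).

    For logistic regression with cross-entropy loss, the input gradient
    has the closed form (p - y) * w, where p is the predicted P(y=1).
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Toy model: random weights standing in for a trained classifier.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
b = 0.0
x = rng.normal(size=100)
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0  # the model's own prediction

x_adv = fgsm(x, y, w, b, eps=0.25)
print("clean confidence:", sigmoid(w @ x + b))
print("adversarial confidence:", sigmoid(w @ x_adv + b))
```

Because each coordinate moves only by ±ε, the perturbation is small in the max norm, yet the change in the logit accumulates across all dimensions, which is why even a tiny ε can flip the prediction in high-dimensional input spaces.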
- ↑ Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. "Intriguing properties of neural networks." ICLR, 2014. URL http://arxiv.org/abs/1312.6199.
- ↑ Jia, Robin, and Percy Liang. "Adversarial Examples for Evaluating Reading Comprehension Systems." arXiv preprint arXiv:1707.07328 (2017).
- ↑ Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
- ↑ Tramèr, Florian, et al. "The Space of Transferable Adversarial Examples." arXiv preprint arXiv:1704.03453 (2017).