Facepalm: Training generative AI models requires coordination and cooperation among many developers, and that scale of data sharing demands additional security checks. Microsoft has clearly fallen short in this regard, as the corporation put vast amounts of data at risk for years.
Between July 20, 2020, and June 24, 2023, Microsoft exposed a vast trove of data to the public web through a public GitHub repository. Cloud security company Wiz discovered and reported the issue to Microsoft on June 22, 2023, and Microsoft invalidated the insecure token two days later. The incident is only being disclosed to the public now, as Wiz has detailed its findings on its official blog.
By incorrectly using a feature of the Azure platform known as Shared Access Signature (SAS) tokens, Wiz researchers say, Microsoft accidentally exposed 38 terabytes of private data through the robust-models-transfer GitHub repository. The repository was used to host open-source code and AI models for image recognition, and Microsoft AI researchers were sharing their files through an excessively permissive SAS token.
SAS tokens provide a way to create signed URLs that grant granular access to data hosted in Azure Storage accounts. The access level can be customized by the user, and the particular SAS token employed by Microsoft researchers pointed to a misconfigured Azure Storage account containing loads of sensitive data.
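As a rough illustration of the mechanism, the sketch below uses the Python azure-storage-blob SDK to mint a narrowly scoped SAS URL: read-only, limited to a single blob, and expiring after one hour. The account, container, and blob names are hypothetical and are not taken from the Microsoft incident.

```python
# Minimal sketch: generating a tightly scoped SAS URL with azure-storage-blob.
# All names below are illustrative, not from the actual incident.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "examplestorageacct"        # hypothetical storage account
account_key = "<account-key>"              # never commit this to a repository
container_name = "models"
blob_name = "image-recognition/model.ckpt"

# Grant read-only access to a single blob, expiring in one hour.
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# Anyone holding this URL can read the blob until the token expires.
sas_url = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}?{sas_token}"
)
print(sas_url)
```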
Besides training data for its AI models, Microsoft exposed a disk backup of two employees’ workstations, according to Wiz. The backup included “secrets,” private cryptographic keys, passwords, and over 30,000 internal Microsoft Teams messages belonging to 359 Microsoft employees. A total of 38 TB of private files could have been accessed by anyone, at least until Microsoft revoked the dangerous SAS token on June 24, 2023.
Despite their usefulness, SAS tokens pose a security risk due to a lack of monitoring and governance. Wiz says their usage should be "as limited as possible," since the tokens are hard to track and Microsoft doesn't provide a centralized way to manage them through the Azure portal.
Furthermore, SAS tokens can be configured to last “effectively forever,” as Wiz explains. The first token that Microsoft committed to its AI GitHub repository was added on July 20, 2020, and it remained valid until October 5, 2021. A second token was subsequently added to GitHub, with an expiration date set for October 6, 2051.
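For comparison, the sketch below (again with hypothetical names and the same Python SDK) shows how an account-level SAS token can be issued with broad permissions and an expiry set decades in the future, mirroring the 2051 expiry Wiz describes. Such a token stays valid until it is explicitly revoked or the underlying account key is rotated.

```python
# Sketch of an overly broad, long-lived account-level SAS token.
# Hypothetical names only; this does not reproduce the token Microsoft used.
from datetime import datetime, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

overly_broad_sas = generate_account_sas(
    account_name="examplestorageacct",     # hypothetical
    account_key="<account-key>",
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(
        read=True, write=True, delete=True, list=True
    ),
    # Expiry set roughly three decades out: the token remains usable until
    # it is revoked or the account key is rotated.
    expiry=datetime(2051, 10, 6, tzinfo=timezone.utc),
)
```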
Microsoft’s multi-terabyte incident highlights the risks associated with AI model training, according to Wiz. This emerging technology requires “large sets of data to train on,” the researchers explain, with many development teams handling “massive amounts of data,” sharing it with their peers, or collaborating on public open-source projects. Incidents like Microsoft’s are becoming “increasingly hard to monitor and avoid.”