Apple Details New Technique That Compares Synthetic Emails With Real Ones Using Embeddings And Differential Privacy To Improve Apple Intelligence Text Output
Apple was expected to ship its highly anticipated Personalized Siri feature last month with the release of iOS 18.4, but the company later confirmed that the feature has been delayed until next year. A new report has now emerged detailing how Apple trains the AI models behind Apple Intelligence.
Apple's AI models learn from synthetic emails while keeping user data private throughout the process
Even though Apple has officially stated that the Personalized Siri feature would be delayed until next year, employees within the company are reportedly growing confident that it will be ready for launch later this year. In a new report, Bloomberg highlights how Apple trains its AI models for Apple Intelligence, citing a blog post on Apple's Machine Learning Research website that describes how the company uses synthetic data for training.
We have previously reported on several occasions that Apple is lagging behind its competitors in the AI race, and the company's strategy of training AI models on synthetic data is somewhat unconventional and has limitations. For one, models trained purely on synthetic data struggle to capture real-world usage trends for features like summarization or writing tools, which involve longer sentences or full-fledged emails.
Apple has taken note of this and highlighted a new technique that allows it to work around these limitations by comparing its synthetic data against a small sample of recent user emails on participating devices, without compromising user privacy. The company explains:
To improve our models we need to generate a set of many emails that cover topics that are most common in messages. To curate a representative set of synthetic emails, we start by creating a large set of synthetic messages on a variety of topics. For example, we might create a synthetic message, “Would you like to play tennis tomorrow at 11:30AM?”
This is done without any knowledge of individual user emails. We then derive a representation, called an embedding, of each synthetic message that captures some of the key dimensions of the message like language, topic, and length. These embeddings are then sent to a small number of user devices that have opted in to Device Analytics.
Participating devices then select a small sample of recent user emails and compute their embeddings. Each device then decides which of the synthetic embeddings is closest to these samples. Using differential privacy, Apple can then learn the most-frequently selected synthetic embeddings across all devices, without learning which synthetic embedding was selected on any given device.
These most-frequently selected synthetic embeddings can then be used to generate training or testing data, or we can run additional curation steps to further refine the dataset. For example, if the message about playing tennis is one of the top embeddings, a similar message replacing “tennis” with “soccer” or another sport could be generated and added to the set for the next round of curation (see Figure 1). This process allows us to improve the topics and language of our synthetic emails, which helps us train our models to create better text outputs in features like email summaries, while protecting privacy.
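Under the hood, the workflow Apple describes amounts to three steps: embed each synthetic message, have each opted-in device find the synthetic embedding closest to its own recent emails, and aggregate those per-device selections with differential privacy. The Python sketch below illustrates that flow under simplified assumptions; the hashed-trigram embed function, the device_report helper, the EPSILON value, and the use of randomized response as the local differential-privacy mechanism are all illustrative stand-ins, not Apple's actual implementation.

```python
# Toy sketch of the pipeline described above: synthetic-message embeddings,
# on-device nearest-neighbour selection, and a local differential-privacy
# aggregation step. The hashed-trigram "embedding" and the randomized-response
# mechanism are illustrative stand-ins, not Apple's actual implementation.
import hashlib
import math
import random
from collections import Counter

DIM = 64  # embedding dimensionality (hypothetical)

def embed(text: str) -> list[float]:
    """Map text to a crude fixed-size vector via hashed character trigrams."""
    vec = [0.0] * DIM
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Server side: synthetic messages generated without any knowledge of user data.
synthetic = [
    "Would you like to play tennis tomorrow at 11:30AM?",
    "Your package has shipped and will arrive Friday.",
    "Reminder: dentist appointment next Tuesday at 9AM.",
]
synthetic_embeddings = [embed(m) for m in synthetic]

EPSILON = 2.0  # local-DP privacy budget (illustrative value)

def device_report(local_emails: list[str]) -> int:
    """On-device step: pick the synthetic embedding closest to the user's
    recent emails, then randomize the answer (randomized response) so the
    server never learns any individual device's true selection."""
    local_vecs = [embed(e) for e in local_emails]
    scores = [sum(cosine(s, v) for v in local_vecs) for s in synthetic_embeddings]
    true_choice = max(range(len(synthetic)), key=lambda i: scores[i])

    k = len(synthetic)
    p_truth = math.exp(EPSILON) / (math.exp(EPSILON) + k - 1)
    if random.random() < p_truth:
        return true_choice
    return random.choice([i for i in range(k) if i != true_choice])

# Server side: aggregate the noisy reports. Frequently selected synthetic
# embeddings emerge from the histogram even though each report is randomized.
reports = Counter()
for _ in range(10_000):  # simulated participating devices
    emails = random.choice([
        ["Tennis on Saturday?", "Bring your racket at 11"],
        ["Tracking number attached", "Delivery expected Friday"],
    ])
    reports[device_report(emails)] += 1

for idx, count in reports.most_common():
    print(f"{count:5d}  {synthetic[idx]}")
```

Because each device only reports a randomized index, the server can recover which synthetic embeddings are most frequently selected in aggregate while remaining unable to tell which one any individual device actually chose, which is the privacy property the quoted post relies on.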
While the company is aware of the limitations of synthetic data, it explains that the new technique will allow it to better understand overall trends without compromising user privacy or collecting the contents of anyone's emails. Bloomberg also claims that Apple will roll out the new technique in an upcoming beta of iOS 18.5 and macOS 15.5. You can check out Apple's full post on the matter for more details.