It was very early in the morning. I stood in the immigration queue at Singapore airport, and the line was moving slowly.

I began to fiddle with my phone and turned to Facebook. In my tired boredom, I clicked on an app that promised to tell me which Star Wars character I was most like. I paused momentarily to consider the ramifications of handing over some of my profile information to a trivial game. However, I quickly shrugged off this concern. I wanted to know the answer. I’m Chewbacca. Excellent. My micro-transaction – trading some of my own information to find out I’m Wookiee-like – seemed harmless enough. But was there more to it?

Apps and the masses – how do algorithms learn about me?

The Cambridge Analytica controversy has put the subject of data misuse firmly in the spotlight. The company allegedly harvested millions of Facebook profiles, unbeknownst to the users, via a personality quiz. One whistleblower claims the firm used this harvested data for political ends – and that its use may even have helped swing Brexit and the outcome of the 2016 US presidential election. Whether it did or not remains unanswered, but there is certainly a debate to be had around this topic.

In my view this scandal is a small drop in the proverbial data ocean. Personal information is widely used to feed the machine learning models that are trying to learn about personal decision making.

The more data, and the more detailed the information is, the better these models can perform.

However, the process of using personal information to learn is not as intrusive as some may think. As people, we suffer from many behavioural biases, especially when it comes to assessing abstract concepts like probabilities. One of these is the ‘availability heuristic’: the tendency to judge the probability of events by how easily examples come to mind. This puts the emphasis on individual and case-specific experience.

For example:

 I knew Sarah was in financial trouble because she posted photos of expensive jewellery. I think all people that post pictures of jewellery are likely to be in financial trouble.  

My experience with Sarah’s personal information is therefore key to my assessment, and her idiosyncrasies form the basis of my judgement about others as well.

Algorithms work very differently. They try to find common ‘factors’ amongst a vast population of individuals’ actions. For instance, an algorithm trying to forecast financial problems may use information about postings (including jewellery) along with thousands of other indicators as it tries to ascertain whether there is a link between the two. ‘Sarah’ here is just one tiny component in a tumble-dryer of data. Her idiosyncratic information, which is not observed in others, is quickly dismissed as mere noise.

Sarah’s information is effectively ‘washed’ with many others, until only a fraction of her unique characteristics are incorporated into a model to forecast financial distress.
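This ‘washing’ can be illustrated with a toy simulation (all numbers and variable names here are invented for illustration, not drawn from any real model): each person’s observed signal is a small shared factor plus large personal noise, and averaging across many people leaves only the shared factor behind.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: each person's observed behaviour is a small
# shared 'distress' factor plus large idiosyncratic noise of their own.
common_signal = 0.3  # the weak, genuinely shared link (illustrative)

population = []
for _ in range(100_000):
    idiosyncratic = random.gauss(0, 1)  # Sarah-specific quirks: pure noise
    population.append(common_signal + idiosyncratic)

# Any one individual's observation is dominated by their own noise...
print(f"one individual:  {population[0]:+.2f}")

# ...but averaging across the whole population washes the noise out,
# leaving an estimate close to the shared factor alone.
print(f"population mean: {statistics.mean(population):+.3f}")
```

Any single record tells you mostly about that person’s quirks; only at scale does the common factor emerge, which is why models value volume over any one profile.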


Which algorithm am I helping? 

Artificial intelligence models, like personalised recommendation systems, are generally considered a good thing. Suggestions of new television shows to watch and books to read that fit an individual’s interests are time-saving and value-adding. The cost is that emails, voice recordings and other personal information are used to train these large-scale models, in order to create a better algorithm for everyone to enjoy. Amazon’s Echo and Google Home arguably fall into this category.

However, it starts to get less comfortable when someone’s data is used to build an algorithm to influence, manipulate or limit them – for example, when an algorithm is used to alter someone’s perception of a politically charged topic, or to assess an individual’s suitability for employment based on their sexuality or personality.

There is a very fine line between ‘soft’ selling desirable products and using data to start to assess or judge – especially if these judgements become the basis of critical decisions.

Alternate data in asset management

Individuals’ data is used as an input into the new breed of alternate (big) datasets that are becoming wildly popular in asset management.[1]

Individuals’ clicks, footsteps, shopping choices, tweets, texts, app usage and emails are collected and used to measure consumer behaviour in the economy. Unlike direct marketing, alternate data vendors aggregate this information to provide indications of broad consumer sentiment and of the demand for specific products and services. Examples include datasets of aggregate credit-card transactions, measures of consumer sentiment derived from news and social media, and counts of foot traffic in malls and of holiday traffic derived from satellite imagery.
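A minimal sketch of that kind of aggregation, assuming a hypothetical vendor holding raw transaction records (the IDs, categories and the MIN_USERS threshold are all invented for illustration): individual spending is rolled up by category, and any aggregate built from too few people is suppressed so no single person is recoverable from the published index.

```python
from collections import defaultdict

# Hypothetical raw records: (person_id, category, amount). A real vendor
# would hold millions of these; this handful is purely illustrative.
transactions = [
    ("u1", "travel", 120.0), ("u2", "travel", 80.0), ("u3", "travel", 200.0),
    ("u1", "dining", 35.0),  ("u4", "dining", 50.0),
    ("u5", "jewellery", 900.0),  # a single spender: too identifiable
]

MIN_USERS = 3  # suppress any aggregate built from fewer individuals

totals = defaultdict(float)
users = defaultdict(set)
for person, category, amount in transactions:
    totals[category] += amount
    users[category].add(person)

# Publish only category-level spend, dropping thin groups entirely.
index = {cat: totals[cat] for cat in totals if len(users[cat]) >= MIN_USERS}
print(index)  # only 'travel' clears the threshold
```

The minimum-group-size rule is the crude ancestor of the privacy techniques vendors actually apply; the point is that the buyer of the index sees demand by category, not Sarah’s jewellery purchase.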

These datasets can then be used by asset managers to make asset allocation decisions and to forecast the prospects of sectors and the revenues of industries and companies. As an example, we examined a comprehensive list of companies that supply alternate datasets and services for use in asset management. Of the 267 companies listed, we sought to flag those that were likely to contain individuals’ data, most often in aggregated form.[2]

In total, roughly 40% of the companies selling alternate data relied on some form of information about individuals. Data harvested from public social media and geo-location (where the phone records its location) were the most prominent categories, followed by credit-card transactions and app usage.

Seeking Permission

The Cambridge Analytica scandal centres on data acquired without direct permission from users – information that was allegedly ‘scraped’ through a large-scale data collection exercise.

Even in my ‘Star Wars’ app example, where I did give permission to a small app, I wouldn’t have imagined that my data could serve a politically motivated algorithm. But then, I didn’t read the fine print; I relied on the assumption that if my data was collected through a social media platform, it would be ‘contained’ within that medium.

That was a naïve assumption.

Data is perhaps the ultimate ‘soft’ commodity, and can be bought and sold to fuel AI models with any range of intentions, from identity theft to innocent movie recommendations. Without strict controls over usage, and personalised permissions, it is nearly impossible to know how your data will be used and by whom; as such, the default position may be to restrict all usage of personal data. But without a ready supply of personal data, much of the otherwise insightful work of machine learning models on human behaviour, language, decision making and commerce will be severely limited.

Restricting personal data is also difficult because it relies on political boundaries to make those restrictions effective. For example, data collected in India may be legal to analyse within that jurisdiction, but illegal to use for research purposes within the US. This could create a strange data black market, or simply force data vendors to ‘anonymise’ datasets to comply with local regulation.
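What that ‘anonymising’ step often amounts to, in its simplest form, is something like the sketch below: drop the direct identifiers and replace the user ID with a salted one-way hash. The record layout, field names and salt here are all hypothetical, and strictly speaking this is pseudonymisation rather than true anonymisation – regimes such as the GDPR treat the two very differently – so real compliance requires far more than this.

```python
import hashlib

SALT = b"rotate-this-secret"  # hypothetical per-release salt

def anonymise(record: dict) -> dict:
    """Strip direct identifiers and pseudonymise the user ID."""
    pseudonym = hashlib.sha256(SALT + record["user_id"].encode()).hexdigest()[:12]
    return {
        "user": pseudonym,            # stable token, not the real identity
        "category": record["category"],
        "amount": record["amount"],
        # name, email and location are deliberately not carried over
    }

raw = {"user_id": "sarah@example.com", "name": "Sarah",
       "category": "jewellery", "amount": 900.0, "location": "Singapore"}
print(anonymise(raw))
```

Because the pseudonym is stable across records, spending patterns can still be linked per person – which is precisely why regulators debate whether such data is ‘anonymous’ at all.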

The demand for models built on personal data is only likely to grow. In a highly competitive industry like asset management, truly unique investment insights are extremely valuable and hard to come by. Fast-moving hedge funds, seeing their fees under siege, are rapidly embracing alternate data sources to create value for their clients[3], and often concern themselves with the usefulness of the data more than its source. But given the recent news flow, as an industry that is embracing the principles of environmental, social and governance criteria (or ‘responsible investing’), how much should asset owners and asset managers worry about their data sources?

Should individuals have a say in what their data is used for? Is a dataset created from volunteer contributors more ethical than one that is scraped or taken without explicit consent? Could we be witnessing the beginnings of something like an ‘ethically sourced’ dataset?