In October 2018 I submitted my dissertation at City, University of London. This was the final project for my MSc in Data Science.
This research project aims to predict stock price trends (buy or sell) of FTSE 100 companies after a news item has been released, by explaining stock returns through sentiment analysis. To do so, it uses two state-of-the-art Deep Neural Networks, Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN), as well as a novel approach to determine the ground truth of the sentiment. Both models use pre-trained word embeddings and news headlines as input data to analyse the sentiment of a piece of news. Contrary to other academic projects, this thesis considers only news headlines instead of full-text articles, which makes it harder to obtain significant results from such limited data. Predicting stock price movements is always challenging: the Efficient Market Hypothesis suggests that no model can predict stock returns with an accuracy above 50%, since the stock market is efficient and follows a random walk.
The results are compared with a baseline model and demonstrate the predictive power of Neural Networks. Whereas a standard bag-of-words approach with a conventional machine learning algorithm had no predictive power, the Deep Neural Network models achieved an accuracy of nearly 60%. When the input data was broken down to industry and company level, the models classified the news correctly with an accuracy of 66%. This thesis not only suggests that Deep Neural Networks outperform conventional machine learning models, but also that the selection of the input data matters greatly, since training on sector-specific news increased the accuracy significantly.
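The ground-truth idea described above can be illustrated with a small sketch: each headline is labelled "buy" or "sell" depending on the sign of the stock's return after the news is published. The column names, prices and return window below are hypothetical, not the actual dissertation data.

```python
import pandas as pd

# Hypothetical data: headlines with closing prices around publication time
news = pd.DataFrame({
    "headline": ["Firm A beats earnings forecast", "Firm B issues profit warning"],
    "price_before": [100.0, 50.0],   # close before the news item
    "price_after": [103.0, 47.5],    # close after the chosen return window
})

# Label the sentiment ground truth by the sign of the post-news return
news["return"] = news["price_after"] / news["price_before"] - 1
news["label"] = (news["return"] > 0).map({True: "buy", False: "sell"})

print(news[["headline", "label"]])
```

The labelled headlines would then serve as training targets for the LSTM and CNN classifiers.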
Introduction to this Neural Network coursework (Coding: Matlab):
The Taiwanese economy faced a major credit card debt crisis in 2006. Many banks wanted to increase their market share and, in doing so, provided credit cards to unqualified customers (Chou, 2006). Consequently, many customers defaulted and consumer finance confidence decreased heavily. Furthermore, one of the primary drivers of the 2008 financial crisis was the granting of loans to people whose risk profiles were too high (Charpignon, Horel and Tixier, 2012). To prevent a high number of defaults, which could cause another financial crisis, it is important to develop models which assist credit managers in determining the risk of customers.
The purpose of this research paper is to critically evaluate two neural computing models which classify customers, based on their personal information (such as age, marital status or education), into two binary states: default and non-default. Predicting these two states helps banks to increase their market share and prevent the losses that occur when customers default. The chosen models are Support Vector Machines (SVM) and Neural Networks (NN). Many papers have already discussed the relationship between customers' personal information and their default behaviour, suggesting that there is a strong relationship (Lu, Wang and Yoon, 2017).
This paper is organised as follows. Section 2 provides a critical review of SVM and NN techniques. Section 3 delivers information about the dataset used. Section 4 describes the methodology for this research. The results and critical evaluation are presented in Section 5. Finally, the paper ends with a conclusion in Section 6.
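The two model families can be sketched in a few lines of Python (the coursework itself was implemented in Matlab); the synthetic data and hyperparameters below are purely illustrative, not those of the actual paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the credit card dataset: numeric features such as
# age, marital status or education, with a binary default / non-default target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM with an RBF kernel vs. a small feed-forward Neural Network
svm = SVC(kernel="rbf").fit(X_train, y_train)
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                   random_state=0).fit(X_train, y_train)

print("SVM accuracy:", svm.score(X_test, y_test))
print("NN accuracy:", nn.score(X_test, y_test))
```

Comparing held-out accuracy like this mirrors the evaluation set-up the paper describes, just on toy data.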
In the Big Data module, Benedikt W. and I built a financial product recommendation tool with PySpark. To recommend products, we developed a random forest model with over 90% accuracy.
Lessons learned and future work:
– Data imputation is important to avoid losing too many observations
– Recommendations by Age or Gender could raise potential ethical issues
– Grid search is computationally intensive but did not result in a significant performance increase
– Only considering personal data without looking at the product portfolio made our model perform worse
– Results showed that prediction for imbalanced datasets (i.e. a sparse product distribution) is problematic
– For unpopular products, accuracy is automatically high when the model simply never recommends them
– Future work could tackle this sparsity using SMOTE
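The SMOTE idea mentioned in the last point can be sketched briefly: synthetic minority-class samples are created by interpolating between pairs of existing minority examples. This is a simplified version (random pairs instead of a proper k-nearest-neighbour search), and the data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(minority, n_new):
    """Create synthetic samples by interpolating between random pairs
    of minority-class points (simplified SMOTE sketch, no k-NN search)."""
    idx = rng.integers(0, len(minority), size=(n_new, 2))
    a, b = minority[idx[:, 0]], minority[idx[:, 1]]
    gap = rng.random((n_new, 1))          # interpolation factor in [0, 1]
    return a + gap * (b - a)

minority = rng.normal(size=(20, 3))       # hypothetical minority-class features
synthetic = smote_like(minority, n_new=50)
print(synthetic.shape)
```

In practice one would use a tested implementation such as imbalanced-learn's `SMOTE` rather than rolling one's own.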
We also have a poster of our work, contact me if you are interested.
The goal of the Information Retrieval module coursework was to evaluate online and web search. To do so, we had to build a facet analysis and a search strategy, and evaluate different online and web search engines.
Title: The use of technology in sports. Description: Identify documents and websites that are related to the use of technology in sports and discuss how technology can improve strategy, efficiency and/or effectiveness. Narrative: The focus of this topic lies on how technology is used in sports and how it makes sports competition more competitive. A relevant document can discuss several sports but has to describe the use of technology and how it improves competition. Documents and websites that only discuss the general usage of technology in sports are not relevant (e.g. increasing social media presence to sell more tickets for a game).
During my “Data Visualization” course I built an interactive tool to explore the German Bundesliga. Coding: R; main packages: ggplot2 and Shiny.
In football, the players in one league are very diverse. Although they share the same sport and play in the same country, they do not have the same background: they differ in skills, age and nationality, and they occupy different positions on the pitch. A visualisation of the German Football League (Bundesliga) helps to understand the underlying data. The football dataset can answer several research questions:
– How do the different football positions (Goalkeeper, Defender, Midfielder, Forward) differ from one another?
– What are the key drivers for a player’s football performance or wage?
– How do different clubs differ from one another in terms of nationalities?
– Which are the best players for every position?
It can furthermore:
– Find players according to criteria a club is interested in (e.g. young or cheap players)
– Search for similar players and show their differences
Contact me if you are interested in this code or want a video demonstration.
Taxis, and especially Yellow Cabs, are an important part of New York City. Yellow Cabs have been driving in the city since the early 20th century (NPR, 2007). In May 2011, Uber announced that it would also start operating in New York (Uber, 2011). However, like many other cities, New York is currently facing severe traffic congestion. This year, New York’s mayor released a plan to reduce problems related to car traffic (Nir, 2017). To improve urban mobility, it is important to understand when and how people travel within the city. Therefore, New York’s transportation companies have to publish their data (Flegenheimer, 2015).
Public transportation datasets can help to understand urban mobility and to plan transit services, so a city can improve its bus routes or add bicycle lanes where necessary (Li, 2016). This report analyses Yellow Cab data from January to March 2015, which is available on New York City’s official homepage (NYC, 2015). In addition, a dataset from Uber covering the same period is used to gain a more complete understanding of New York’s taxi landscape (FiveThirtyEight, 2016). The Yellow Cab dataset provides much more information than the Uber one, which is why this report mainly focuses on it: the Uber dataset only contains pickup dates and locations, whereas the Yellow Cab dataset delivers more detailed information about pick-ups and drop-offs, including how many passengers travelled in a taxi. The Uber dataset is therefore used as an add-on to compare both services. While the raw Yellow Cab dataset has over 38 million trips, Uber has nearly 6.5 million rides in the same period.
Taxi data analysis can help to improve the lives of citizens, political decisions and policy-making (Ferreira et al, 2013). However, due to the size and complexity of the data, it is hard to perform comparative analyses. The goal of this report is therefore to get an overview of taxi trip behaviour in New York. By answering the questions of when and where traffic occurs, it is possible to find interactions between neighbourhoods. Consequently, this analysis detects where most traffic occurs and which routes are used most often. These insights reveal places where traffic caused by taxis can be reduced. Often, heavy use of taxis implies poor public transport (Jiang et al, 2015). Therefore, the data helps to discover where public transport could be improved and thus reduce traffic congestion in cities.
The first part of this report investigates New York’s traffic behaviour using taxis’ GPS data. In a next step, more complex visualisations map the data to highlight typical routes. The insights deliver details on where and when most pickups take place. The dataset was also merged with weather data (NOAA, 2017) to understand whether there are any relationships between taxi rides and weather conditions. To map the coordinates of the pick-ups and drop-offs, the dataset was furthermore merged with census data from New York (FCC, 2017), which provides the coordinates with the corresponding neighbourhood names.
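The weather merge described above essentially boils down to a date join. The sketch below shows the idea with pandas; the column names and values are hypothetical, not the actual NYC TLC or NOAA schema.

```python
import pandas as pd

# Hypothetical taxi trips (the real dataset has millions of rows)
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2015-01-01 08:15", "2015-01-01 17:40", "2015-01-02 09:05"]),
    "passenger_count": [1, 2, 1],
})

# Hypothetical daily weather observations (cf. the NOAA data)
weather = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "precipitation_mm": [0.0, 4.2],
})

# Join each trip to that day's weather record via the calendar date
trips["date"] = trips["pickup_datetime"].dt.normalize()
merged = trips.merge(weather, on="date", how="left")
print(merged[["pickup_datetime", "precipitation_mm"]])
```

The census join for neighbourhood names works analogously, with a spatial lookup on pick-up and drop-off coordinates instead of a date key.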
This report was a Visual Analytics coursework during my MSc Data Science. Let me know if you want to read the entire report.
The objective of the group assignment was to create a classifier. My group built a classifier to predict credit card default with Random Forest and naïve Bayes.
– Default detection has always played a crucial part for banks issuing credit cards
– Defaults on credit card bills can result in major financial losses
– Predicting individual customers’ credit risk reduces financial damage and uncertainty
– Card-issuing banks in Taiwan over-issued cash and credit cards to unqualified applicants
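The comparison between the two classifiers can be sketched as follows; the synthetic, imbalanced data below merely stands in for the Taiwanese credit card dataset, and the hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the credit card default data: imbalanced binary
# target (non-default vs. default), as in the real dataset
X, y = make_classification(n_samples=600, n_features=12, weights=[0.8, 0.2],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

print("Random forest accuracy:", rf.score(X_test, y_test))
print("Naive Bayes accuracy:", nb.score(X_test, y_test))
```

On imbalanced targets like default prediction, metrics such as recall or AUC are usually more informative than raw accuracy.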
Cryptocurrencies are currencies or digital assets which do not depend on governments or central banks (Kurihara and Fukushima, 2017). The cryptocurrency market is unregulated and operates 24 hours a day, 7 days a week; it is thus different from stock markets.
The cryptocurrency market currently has over 1,300 different coins and tokens (Coinmarketcap, 2017). Coins can be considered currencies, while tokens are more similar to stocks or assets; tokens run on a platform provided by a blockchain. Nevertheless, all coins and tokens are considered cryptocurrencies, and many investors do not differentiate between the two but view them as equal investments.
Two years ago, many of the current cryptocurrencies did not yet exist. The most famous cryptocurrency is Bitcoin, which distributed its first coins in 2009. Many other cryptocurrencies, however, only emerged during Initial Coin Offerings (ICOs) this year or last.
The cryptocurrency market is evolving very fast and becoming increasingly mainstream. Consequently, it is interesting to take a closer look at this market, since there is currently a research gap for this relatively young field. Although the market is young, it is already worth over 400 billion US Dollars (Coinmarketcap, 2017).
This report was written during my MSc Data Science course at City, University of London.