How to win a data science competition

Course goals
Week 1
  • Intro to competitions & Recap
  • Feature preprocessing & extraction
Week 2
  • EDA
  • Validation
  • Data leaks (what they are, how to find them, data leak exploration)
Week 3
  • Metrics
  • Mean-encodings
Week 4
  • Advanced features (t-SNE)
  • Hyperparameter optimization
  • Ensembles (bagging, stacking)
Week 5
  • Final project
  • Winning solutions

Competition concepts
Model: feature transforms are included
Submission: usually you are asked to submit only your predictions
Evaluation: how good your model is, computed as (predictions, right answers) -> score
  • Accuracy
  • Logistic loss
  • AUC
  • RMSE
  • MAE
  • PB
  • LB
  • Analyze data
  • Fit model
  • Submit
  • See public score
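As a rough illustration of "(predictions, right answers) -> score", the sketch below computes the metrics listed above with scikit-learn. The arrays are made-up toy values, not data from any competition.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, log_loss, mean_absolute_error,
                             mean_squared_error, roc_auc_score)

# Toy binary classification: right answers and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.6, 0.4, 0.9])
y_label = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

accuracy = accuracy_score(y_true, y_label)  # needs hard labels
logloss = log_loss(y_true, y_prob)          # needs probabilities
auc = roc_auc_score(y_true, y_prob)         # depends only on the ordering

# Toy regression: right answers and predictions.
r_true = np.array([3.0, 5.0, 2.0])
r_pred = np.array([2.0, 5.0, 4.0])
rmse = np.sqrt(mean_squared_error(r_true, r_pred))
mae = mean_absolute_error(r_true, r_pred)

print(accuracy, logloss, auc, rmse, mae)
```

Accuracy works on hard labels, while logistic loss and AUC work on predicted probabilities or scores, so the same model can look quite different under each metric.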
Real life vs. competitions
Real World ML Pipeline
It’s a complicated process that includes:
  1. Understanding of business problem
  2. Problem formalization
  3. Data collecting
  4. Data preprocessing
  5. Modeling (which label fits the model)
  6. Way to evaluate model in real life
  7. Way to deploy model
Things we need to care about:
  • Real-world problems are quite complicated
  • Competitions are a great way to learn
  • But they don’t address the questions of formalization, deployment and testing
Philosophy of competitions
  • It is all about data and making things work, not about the algorithms themselves
            - Everyone can and will tune classic approaches
            - We need some insights to win
  • Sometimes no ML is needed at all
Do not limit yourself
  • Heuristics
  • Manual data analysis
Do not be afraid of:
  • Complex solutions
  • Advanced feature engineering
  • Huge calculations
Be creative:
  • It’s OK to modify or hack existing algorithms, or even to design a completely new algorithm
  • Do not be afraid of reading source code and changing it

Families of ML algorithms
  • Linear model
  • Tree-based: Decision Tree, Random Forest (RF), GBDT
  • kNN-based methods
  • Neural Networks
  • No Free Lunch Theorem
Conclusion:
  • There is no ‘silver bullet’ algorithm.
  • Linear models split the space into 2 subspaces.
  • Tree-based methods split the space into boxes.
  • k-NN methods rely heavily on how the ‘closeness’ of points is measured.
  • Feed-forward NNs produce a smooth non-linear decision boundary.
  • The most powerful methods are GBDT and NN, but you shouldn’t underestimate the others.
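To make the point about decision boundaries concrete, here is a minimal sketch on a synthetic XOR-like dataset (an assumption chosen for illustration, not taken from the course). A single linear boundary cannot separate the classes, while boxes (trees) and neighbourhoods (k-NN) can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: class 1 when both coordinates have the same sign.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "linear": LogisticRegression(),                  # one separating line
    "tree": DecisionTreeClassifier(random_state=0),  # axis-aligned boxes
    "knn": KNeighborsClassifier(n_neighbors=5),      # local closeness
}

# Fit each model and record its accuracy on the held-out part.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On other datasets the ranking can easily flip, which is the practical meaning of "no silver bullet".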

Hardware/Software setup
Most competitions (except image-based ones) can be solved on:
  • High-end laptop
  • 16+ GB RAM
  • 4+ cores
Quite good setup:
  • Tower PC
  • 32+ GB RAM
  • 6+ cores
RAM: if you can keep the data in memory, everything will be much easier
Cores: the more cores you have, the faster the experiments you can do
Storage: an SSD is crucial if you work with huge image datasets
Cloud resources:
  • Amazon AWS(spot option!)
  • Microsoft Azure
  • Google Cloud
Most competitors use the Python data science software stack
Basic stack:
  • numpy
  • pandas
  • scikit-learn
  • matplotlib
  • IPython
  • Jupyter
Special packages:
  • xgboost
  • lightgbm
  • keras
  • tsne
External tools:
  • libfm
  • libffm
  • fast_rgf
Conclusion:
  • Anaconda works out of the box
  • The proposed setup is not the only one, but it is the most common
  • Don’t overestimate the role of hardware/software

Feature preprocessing and generation with respect to models
Main topics:
  1. Feature preprocessing
  2. Feature generation
  3. Their dependence on a model type
Features: numeric, categorical, ordinal, datetime, coordinates
Missing values.
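Why do we need to care that different features have different types?
  • Strong connection between preprocessing and our model
  • Common feature generation methods for each feature type
Feature preprocessing
Feature preprocessing depends on the model we are going to use.
Suppose the relationship between a feature and the target is non-linear, so a linear model does not fit well here. To improve the linear model's predictions, we use one-hot encoding.
Compared with the previous encoding, this one performs better. A random forest (RF), however, does not need one-hot encoding.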
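A minimal sketch of this effect, on hypothetical toy data: a feature takes the values 1, 2, 3 and the target is 1 for the values 1 and 3 but 0 for the value 2, so the dependence is non-monotonic. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy feature with values 1, 2, 3; target is 0 only for the middle value.
x = np.tile([1, 2, 3], 100).reshape(-1, 1)
y = (x.ravel() != 2).astype(int)

# Linear model on the raw numbers: one threshold cannot isolate the value 2.
raw_acc = LogisticRegression().fit(x, y).score(x, y)

# Linear model on one-hot columns: each value gets its own weight.
x_onehot = OneHotEncoder().fit_transform(x)
onehot_acc = LogisticRegression().fit(x_onehot, y).score(x_onehot, y)

# Random forest on the raw numbers: two splits isolate the value 2.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf_acc = rf.fit(x, y).score(x, y)

print(raw_acc, onehot_acc, rf_acc)
```

The scores are measured on the training data, which is enough here because the point is what each model can represent, not how well it generalizes.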
Feature generation
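Understanding the model helps us create useful features.
Example: predict next week's apple sales. We have several months of historical sales as the training set, and we believe the data has a non-linear relationship.
One way to help the model capture this information is to add a "number of days passed" feature; a linear model can then capture the relationship through this feature.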
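The sketch below shows why the choice of feature depends on the model. It assumes, purely for illustration, a steady upward trend in made-up daily sales and uses "days passed" as the only feature. A linear model can extend the trend to next week, while a decision tree can only repeat values it has already seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
days = np.arange(90)  # "days passed" feature for 90 days of history
# Made-up sales: assumed upward trend plus noise.
sales = 100 + 2.0 * days + rng.normal(0, 5, size=days.size)

X_train = days.reshape(-1, 1)
X_next = np.arange(90, 97).reshape(-1, 1)  # the following week

linear_pred = LinearRegression().fit(X_train, sales).predict(X_next)
tree_pred = DecisionTreeRegressor(random_state=0).fit(X_train, sales).predict(X_next)

print(linear_pred.round(1))  # keeps following the trend
print(tree_pred.round(1))    # constant: the tree cannot extrapolate
```

For a tree-based model one would therefore look for other generated features, since it cannot predict values outside the range seen in training.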
Feature generation:
  • Feature preprocessing is often necessary
  • Feature generation is a powerful technique
  • Preprocessing and generation pipelines depend on the model type

Numeric features
  • Preprocessing
    • Tree-based models
    • Non-tree-based models
  • Feature generation
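Different scalings result in different model quality (the scaling is a hyperparameter you need to optimize)
  • Min-max scaling to [0, 1]: the advantage is that the shape of the distribution does not change
  • Standardization (mean = 0, std = 1)
  • Outliers: they strongly affect the model, which matters most for linear models
    A common fix, often used for financial data, is to clip values to the 1st and 99th percentiles (also called winsorization)
  • Rank
  • Log transform
  • Raising to a power < 1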
np.sqrt(x + 2/3)
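The outlier-related transforms above can be sketched as follows. The data is synthetic (uniform values plus one extreme outlier), and `scipy.stats.rankdata` is one of several ways to compute ranks.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.RandomState(0)
# Synthetic feature: 999 ordinary values and one extreme outlier.
x = np.concatenate([rng.uniform(1, 100, size=999), [10000.0]])

# Clipping to the 1st and 99th percentiles (winsorization).
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: the spacing between sorted values becomes equal,
# so the outlier ends up just one step away from its neighbour.
x_rank = rankdata(x)

# Log and power (< 1) transforms pull large values closer to the rest.
x_log = np.log1p(x)
x_sqrt = np.sqrt(x + 2 / 3)

print(x.max(), x_clipped.max(), x_rank.max(), x_log.max(), x_sqrt.max())
```

When a rank transform is applied to train data, the same mapping has to be applied to test data as well, for example by ranking the concatenated train and test values.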
Feature Generation
Ways to proceed:
  • Prior knowledge
  • EDA
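For example, a generated feature can capture how people perceive a numeric value; it can also be used to tell humans from robots (by setting a threshold). (The YouTube video is in English without subtitles, so this part was hard to note down in English.)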
  • Numeric feature preprocessing is different for tree and non-tree models
    • Tree-based models don’t depend on scaling
    • Non-tree-based models hugely depend on scaling
  • Most often used preprocessings are:
    • MinMaxScaler - to [0,1]
    • StandardScaler - to mean == 0, std == 1
    • Rank - sets spaces between sorted values to be equal
    • np.log(1+x) and np.sqrt(1+x)
  • Feature generation is powered by:
    • Prior knowledge
    • Exploratory data analysis
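A short sketch of the two scalers named above, using scikit-learn on a made-up matrix whose two columns live on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 5000.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

In practice the scaler is usually fitted on the train data and then applied to the test data with `transform`, so both are scaled in the same way.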

Categorical and ordinal features
Ordinal features:
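  • Ticket class: 1, 2, 3
  • Driver’s license: A, B, C, D
  • Education: kindergarten, school, undergraduate, bachelor, master, doctoral
The easiest way to preprocess an ordinal feature is to map it to numbers (label encoding); it can be used with k-NN, etc.
Linear models will be confused...
One problem: if there are a large number of labels, the resulting values will all be similar and performance will drop. Keep in mind that the methods above suit tree-based models and are very problematic for linear models. For linear models we can use one-hot encoding.
If one-hot encoding takes too much memory, we should consider using a sparse matrix (which stores only the non-zero entries in memory).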
Categorical features
  • Values in ordinal features are sorted in some meaningful order
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to their frequencies
  • Label and frequency encodings are often used for tree-based models
  • One-hot encoding is often used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN
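The encodings summarized above can be sketched with pandas and scikit-learn. The column names `pclass` and `embarked` and the values are hypothetical, chosen only to have one ordinal and one categorical feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one ordinal and one categorical feature.
df = pd.DataFrame({
    "pclass": [1, 2, 3, 3, 2, 3],
    "embarked": ["S", "C", "Q", "S", "S", "C"],
})

# Label encoding: map each category to an integer code (alphabetical order).
df["embarked_label"] = df["embarked"].astype("category").cat.codes

# Frequency encoding: map each category to its share of rows.
freq = df["embarked"].value_counts(normalize=True)
df["embarked_freq"] = df["embarked"].map(freq)

# One-hot encoding; scikit-learn returns a sparse matrix by default,
# which stores only the non-zero entries.
onehot = OneHotEncoder().fit_transform(df[["embarked"]])

# Interaction of two categorical features: concatenate the values,
# then encode the combined feature like any other categorical one.
df["pclass_embarked"] = df["pclass"].astype(str) + "_" + df["embarked"]

print(df)
print(onehot.shape, onehot.nnz)
```

The interaction column would normally be one-hot encoded in turn before being passed to a linear model or k-NN.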

Datetime and coordinates
  • Periodicity
        Day number in week, month, season, year, second, minute, hour.
        Very simple: we can add features such as second, minute, hour, day of the week, day of the month or day of the year (this captures repeating patterns in the data).
  • Time since
        Row-independent moment
            For example: since 00:00:00 UTC, 1 January 1970;
            The number of days to these events
        Row-dependent important moment
            Number of days left until the next holiday / time passed since the last holiday
            Sales is the target; date is the original feature; the others are generated features
  • Difference between dates
        Subtract one datetime feature from another (datetime_feature_1 - datetime_feature_2)
Coordinates:
  • Interesting places from train/test data or additional data
  • Centers of clusters
  • Aggregated statistics
For tree-based methods we can add slightly rotated coordinates as new features; this helps the model make more precise splits on a map.
We usually add rotations by 45 or 22.5 degrees.
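A minimal sketch of the datetime and coordinate features from this section, using pandas and numpy. The column names (`date`, `last_purchase`, `x`, `y`) and all values are hypothetical, and `rotate` is a helper written for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-07", "2017-02-15"]),
    "last_purchase": pd.to_datetime(["2016-12-30", "2017-01-01", "2017-02-01"]),
    "x": [1.0, 0.0, 2.0],
    "y": [0.0, 1.0, 2.0],
})

# Periodicity: position of the date inside repeating cycles.
df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
df["day_of_month"] = df["date"].dt.day
df["month"] = df["date"].dt.month

# Time since a row-independent moment (1 January 1970).
df["days_since_epoch"] = (df["date"] - pd.Timestamp("1970-01-01")).dt.days

# Difference between two datetime features.
df["days_since_last_purchase"] = (df["date"] - df["last_purchase"]).dt.days


def rotate(x, y, degrees):
    """Rotate points (x, y) counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    return (x * np.cos(theta) - y * np.sin(theta),
            x * np.sin(theta) + y * np.cos(theta))


# Rotated coordinates let a tree split along diagonal lines on the map.
df["x_rot45"], df["y_rot45"] = rotate(df["x"], df["y"], 45)

print(df)
```

Row-dependent features, such as days until the next holiday, need an extra calendar of events and are not shown here.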

Handling Missing Values
Minus 1:
Fillna approaches
  • -999, -1, etc
  • Mean, median
  • Reconstruct value
If we want to generate features, we must be careful with missing values
Treating values that are not present in the train data
We can use frequency encoding to replace them.
Treating values that are not present in the train data
  • The choice of method to fill NaN depends on the situation
  • Usual way to deal with missing values is to replace them with -999, mean or median
  • Missing values may already have been replaced with something by the organizers
  • Binary feature ‘isnull’ can be beneficial
  • In general, avoid filling NaNs before feature generation
  • XGBoost can handle NaN
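The points above can be sketched with pandas as follows. The frames and column names (`temp`, `city`) are made up, and computing frequencies over train and test together is an assumption about how frequency encoding is applied to categories missing from train.

```python
import numpy as np
import pandas as pd

# Made-up data: "temp" has missing values, and city "C" appears only in test.
train = pd.DataFrame({"temp": [20.0, np.nan, 24.0, 22.0],
                      "city": ["A", "B", "A", "A"]})
test = pd.DataFrame({"temp": [np.nan, 21.0],
                     "city": ["C", "B"]})

# Binary indicator first, so the fact that a value was missing is kept.
train["temp_isnull"] = train["temp"].isnull().astype(int)

# Two common fills: an out-of-range constant, or the median.
train["temp_fill_const"] = train["temp"].fillna(-999)
train["temp_fill_median"] = train["temp"].fillna(train["temp"].median())

# Frequency encoding over train and test together, so a category that is
# absent from train ("C") still gets an encoding.
freq = pd.concat([train["city"], test["city"]]).value_counts()
train["city_freq"] = train["city"].map(freq)
test["city_freq"] = test["city"].map(freq)

print(train)
print(test)
```

The original `temp` column is left untouched, so features generated later still see the NaNs rather than the filled values.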

Feature extraction from texts and images
Text-only competitions: use search engines to find similar texts (this was the case in the Allen Institute competition)
Image competitions: CNN
Common features + text/images
  • Bag of words
    • TF-IDF
  • Embeddings (~word2vec)
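As a small illustration of bag of words and TF-IDF, the sketch below uses scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings on three made-up sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up example texts.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag of words: one column per word, values are word counts.
bow = CountVectorizer()
X_counts = bow.fit_transform(texts)
vocab = bow.vocabulary_  # maps each word to its column index

# TF-IDF: counts reweighted so that words occurring in many documents
# matter less; by default each row is scaled to unit L2 norm.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)

print(sorted(vocab))
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```

Both matrices are sparse and can be stacked next to the other features before fitting a model.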