ML strategy

In this post, I covered: how to spend your time, metrics for ML algorithms, comparing to human-level performance, error analysis, handling mismatched training and dev/test sets, transfer learning, and end-to-end deep learning.

How to spend your time

Say you are designing a spam classifier: how should you spend your time to improve its accuracy?

There are many options (e.g. collect more data, engineer better features, try a different algorithm), but it is difficult to tell which of them will be most helpful.

Chains of assumptions in ML

  1. Fit training set well on cost function
    • Bigger Network
    • Better optimization algorithm
    • Different NN architecture/hyperparameters searching
  2. Fit dev set well on cost function
    • Regularization
    • Bigger training set
  3. Fit test set well on cost function
    • Bigger dev set (Maybe over-tuned on dev set)
  4. Performs well in real world
    • Change dev set or cost function

Data For Machine Learning

It’s not who has the best algorithm that wins. It’s who has the most data.

Metrics for ML algorithm

Think outside the box: don't be constrained to a single pre-defined error metric; define a metric that captures what you really want the system to do.

Apply orthogonalization to this problem:

First define the metric (place the target), then worry separately about how to do well on that metric (aim at the target).

Single number evaluation metric

It’s really a trade-off between precision and recall.

There is a metric called the “F1 score”, which is essentially the harmonic mean of precision (P) and recall (R).

\[ F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} \]
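
A minimal sketch of computing this metric in Python (the function name and example numbers are my own, for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifiers: A has high precision but low recall, B is more balanced.
print(f1_score(0.95, 0.40))  # ~0.56
print(f1_score(0.80, 0.70))  # ~0.75 -> B wins on the single-number metric
```

The harmonic mean punishes whichever of P and R is small, so a classifier cannot score well by sacrificing one of them entirely.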

Two tips to quickly select the better algorithm:

  1. Use a well-defined dev set.
  2. Use a single real-number evaluation metric.

Satisficing and Optimizing metric

Take accuracy and running time for example: you want to maximize accuracy subject to runningTime ≤ 100 ms.

In this case, accuracy becomes the optimizing metric while running time becomes the satisficing metric.
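
Here is a tiny sketch of that selection rule (the candidate models and their numbers are hypothetical): keep only the models that satisfy the satisficing constraint, then pick the best one on the optimizing metric.

```python
# Hypothetical candidates: (name, accuracy, running time in ms).
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),  # best accuracy, but violates the running-time constraint
]

MAX_RUNTIME_MS = 100  # satisficing metric: runningTime <= 100 ms

feasible = [c for c in candidates if c[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda c: c[1])  # optimizing metric: accuracy
print(best)  # ('B', 0.92, 95)
```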

Setting up Train/Dev/Test distributions

Dev/Test sets

Make sure your dev set and test set come from the same distribution. Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

Old ways of splitting data:

70% train - 30% test

60% train - 20% dev - 20% test

But with a very large amount of data, the following split becomes reasonable:

98% train - 1% dev - 1% test

Set your test set to be big enough to give high confidence in the overall performance of your system.
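
A minimal sketch of such a split (assuming all the data already comes from the distribution you care about; the 98/1/1 fractions follow the text above):

```python
import numpy as np

def split_indices(n_examples: int, seed: int = 0):
    """Shuffle example indices and split them 98% train / 1% dev / 1% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_dev = n_test = n_examples // 100
    n_train = n_examples - n_dev - n_test
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```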

Comparing Human-Level performance

Why Human-Level?

As long as ML is worse than humans, you can:

  • Get labeled data from humans.
  • Gain insight from manual error analysis (why did a person get this right?).
  • Do a better analysis of bias and variance.

Bayes Optimal Error

Nothing can surpass Bayes error.

**But we can treat human-level error as a proxy (estimate) for Bayes error.** This lets us estimate the avoidable bias.

Denote avoidable bias as \( \text{bias} \), algorithm variance as \( \text{variance} \), human-level error as \( h \), current training error as \( e_{\text{train}} \), and current dev error as \( e_{\text{dev}} \):

\[ \text{bias} = e_{\text{train}} - h \qquad \text{variance} = e_{\text{dev}} - e_{\text{train}} \]
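
As a quick numeric sketch (the error rates below are made up), the larger of the two gaps tells you where to focus:

```python
def diagnose(human_err: float, train_err: float, dev_err: float) -> str:
    """Compare avoidable bias against variance and suggest a focus."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    if avoidable_bias >= variance:
        return f"focus on bias (avoidable bias {avoidable_bias:.1%} >= variance {variance:.1%})"
    return f"focus on variance (variance {variance:.1%} > avoidable bias {avoidable_bias:.1%})"

print(diagnose(human_err=0.010, train_err=0.080, dev_err=0.100))  # bias problem
print(diagnose(human_err=0.075, train_err=0.080, dev_err=0.100))  # variance problem
```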

Error Analysis

Error analysis helps you evaluate whether it is worth your effort to focus on a specific source of your algorithm's errors.

Intuition: manually examine a sample (say ~100) of misclassified dev-set examples and count how many fall into a given error category; that fraction is a ceiling on how much fixing the category can reduce the total error.

This can also be applied to evaluating multiple ideas in parallel, by calculating the percentage of the total error that each specific source accounts for.

For example, when developing a cat recognition algorithm you might tabulate error sources such as dogs mistaken for cats, great cats (lions, tigers, etc.), blurry images, and incorrectly labeled examples.
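
A small sketch of the counting exercise (the tags and counts are hypothetical):

```python
from collections import Counter

# Hypothetical tags from manually reviewing 100 misclassified dev-set images;
# one image may carry several tags.
tags = ["dog"] * 8 + ["great cat"] * 43 + ["blurry"] * 61 + ["mislabeled"] * 6

n_reviewed = 100
for category, count in Counter(tags).most_common():
    # This fraction is a ceiling on how much fixing the category can help.
    print(f"{category:>10}: {count / n_reviewed:.0%} of reviewed errors")
```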

Cleaning up incorrectly labeled data

DL algorithms are quite robust to random labeling errors, but relatively vulnerable to systematic errors (e.g. one class consistently mislabeled).

The first step is usually to carry out an error analysis to determine whether correcting the labels is worth the effort.

Once you decide to correct the incorrectly labeled examples, remember to:

  • Apply the same correction process to both the dev and test sets, so they stay on the same distribution.
  • Consider examining the examples your algorithm got right as well as the ones it got wrong.
  • Accept that the training set may end up on a slightly different distribution from the corrected dev/test sets.

Handling Skewed Classes

Take cancer diagnosis as an example, where predicting y = 1 means we predict the patient has cancer. If only a small fraction of patients actually have cancer, a classifier that always predicts y = 0 achieves very low error while being useless, so plain accuracy is a misleading metric for skewed classes.

Error Metrics for skewed classes

Precision: of all patients for whom we predicted \( y = 1 \), what fraction actually has cancer?
Recall: of all patients that actually have cancer, what fraction did we correctly detect as having cancer?
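
In terms of true positives (TP), false positives (FP), and false negatives (FN):

\[ P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \]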

Trading off Precision and Recall

For logistic regression, we can raise the decision threshold to make the cancer diagnosis more confident (i.e. we only predict y = 1 when we are very confident), which increases precision but lowers recall; lowering the threshold does the opposite.

In this case, the F1 score becomes a good single-number option for comparing the trade-offs.
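
A minimal sketch of the threshold trade-off (the predicted probabilities and labels are made up):

```python
import numpy as np

def precision_recall(probs, labels, threshold):
    """Precision and recall when we predict y = 1 only if prob >= threshold."""
    preds = probs >= threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)

probs = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])

print(precision_recall(probs, labels, 0.5))  # higher threshold: precision 0.75, recall 0.75
print(precision_recall(probs, labels, 0.3))  # lower threshold: precision ~0.67, recall 1.0
```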

Tips for Building a new DL system

Build your first system quickly, then iterate. Don't overthink the problem and make the system too complicated; build something that actually works.

  1. Set up dev/test set and metric.
  2. Build the initial system quickly.
  3. Use Bias/Variance analysis & Error analysis to prioritize next steps.

Mismatched training and dev/test set

For example, you are building a cat recognition system for a mobile app, and you have 200,000 images from the web but only 10,000 images from the mobile app.

Options:

  1. Shuffle all 210,000 images together and split randomly. The dev/test sets then consist mostly of web images, which is not the distribution you actually care about.
  2. (Recommended) Put all the web images, plus some mobile images, in the training set, and build the dev/test sets only from mobile-app images, so you are aiming at the distribution you care about.

Training-dev set: same distribution as the training set, but not used for training. Comparing errors on it lets you determine whether you are facing an avoidable-bias problem, a variance problem, or a data-mismatch problem.
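
A sketch of reading the four error numbers (human-level, training, training-dev, dev); the values here are arbitrary and the largest gap names the dominant problem:

```python
def dominant_problem(human, train, train_dev, dev):
    """Return the largest gap in the error ladder (all inputs are error rates)."""
    gaps = {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }
    return max(gaps, key=gaps.get), gaps

problem, gaps = dominant_problem(human=0.01, train=0.02, train_dev=0.025, dev=0.10)
print(problem)  # 'data mismatch' dominates in this hypothetical case
```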

We can use artificial data synthesis or data augmentation to generate more data than we currently have to train the algorithm, but be careful: if you synthesize only a tiny subset of the space of possible examples, the model can overfit to the synthesized data.
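
A minimal sketch of simple augmentation on image arrays (NumPy only; the flip-and-noise recipe is an illustrative choice, not a prescription from this post):

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly flipped copy of an image with a little pixel noise."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    out = out + rng.normal(0.0, 0.02, out.shape)  # small Gaussian noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))   # hypothetical 64x64 RGB image with values in [0, 1]
augmented = augment(image, rng)
```

Vary the augmentations enough to cover the space of real inputs; if every synthesized example looks alike, the model can overfit to that narrow slice.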

Transfer Learning

Transfer learning makes sense when:

  • Task A and task B have the same type of input x.
  • You have a lot more data for task A than for task B.
  • Low-level features learned from task A could be helpful for task B.
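
A hedged sketch of the usual recipe in PyTorch (assuming a recent torchvision is available; the 2-class target task and hyperparameters are made up): reuse a network pretrained on task A as a feature extractor and retrain only a new output layer for task B.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from an ImageNet-pretrained ResNet-18 (task A: lots of data).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers; their low-level features are reused for task B.
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer with a new, trainable head for the small task B
# (here a hypothetical 2-class problem).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```

With more task-B data, one could instead unfreeze and fine-tune all layers with a small learning rate.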

End-to-End Deep Learning

Instead of going through a hand-designed pipeline, end-to-end deep learning bypasses all the intermediate stages and maps the input directly to the output, which typically requires a large amount of data.

Concise as it is, sometimes it is better to use a step-by-step pipeline, which can make the problem clearer.

Pros:

  • Lets the data speak, rather than forcing the model to reflect human preconceptions.
  • Requires less hand-designing of components.

Cons:

  • May need a very large amount of (input, output) data.
  • Excludes potentially useful hand-designed components and prior knowledge.

Sometimes the hand-designed categories or so-called steps are just an idealized decomposition of the task, and forcing the model through them can significantly limit the possible outputs, since the fixed layering reduces its freedom.
