One of my last posts was "I don't do machine learning". Now I realize that I am actually doing it (more on this below). But what was I doing previously? I would say I was doing data learning.
How is machine learning different from other methods?
After reading a lot, I think the difference lies in the focus: a regression (or any other procedure that models something) can be used for both machine learning and data learning.
When I do a linear regression in data learning, I focus on which variables carry more weight and on improving the adjusted R². I learn from the data what is happening.
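In R, that data-learning focus might look like this minimal sketch (mtcars here is only an illustrative dataset, not the data from this post):

```r
# Data learning: fit a linear regression and inspect what the data says.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

summary(fit)$coefficients   # which variables carry more weight (and are significant)?
summary(fit)$adj.r.squared  # how well does the model describe this data?
```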
When I do machine learning, I focus on which regression is better with the data I have (or with new data). I learn from the data which models are better. That model can be a simple regression or a complicated deep neural network.
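The machine-learning focus, by contrast, compares candidate models on data they were not fitted on. A minimal sketch, again with mtcars as an illustrative choice:

```r
# Machine learning: compare candidate models on held-out data.
set.seed(42)
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

m1 <- lm(mpg ~ wt, data = train)
m2 <- lm(mpg ~ wt + hp + qsec, data = train)

# The held-out error, not the in-sample fit, decides which model is "better".
rmse <- function(model, data) sqrt(mean((data$mpg - predict(model, data))^2))
c(simple = rmse(m1, test), larger = rmse(m2, test))
```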
Both work on the same data, but the goal is different. Of course, nowadays, when it is not too computationally expensive, we want both: once we select "the best" model, we want to know what is happening.
Since in machine learning the focus is on the models, we need to check whether the model generalizes, using for instance cross-validation, bootstrapping, or validation on another dataset.
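For instance, a hand-rolled 5-fold cross-validation in base R (no extra packages assumed) could look like this:

```r
# 5-fold cross-validation: each fold is held out once while the rest trains.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_rmse <- sapply(1:k, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  held_out <- mtcars[folds == i, ]
  sqrt(mean((held_out$mpg - predict(fit, held_out))^2))
})

mean(cv_rmse)  # the averaged fold error estimates how well the model generalizes
```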
We can see the difference in the recent case of Amazon scrapping the machine-learning tool it used to screen job candidates. They used machine learning to choose a model for how they picked candidates, but the data learning was deficient: they didn't understand the assumptions the model made until well after they had started using it. Only when they finally looked closely at the model and discovered the assumptions it made did they find its misbehavior.
Note that the more complex a model is, the more difficult it will be to understand it from the data. Also, some might say that machine learning is just a new word for statistics (and one could say the same of data learning!), but statistics has two branches: description and inference. Machine learning and data learning can each be thought of as covering both: describing the data, and inferring new models/data points.
Now let's dive a bit into my case:
I am using RGCCA to find the relationship between the gut microbiome and the transcriptome. I am trying to understand/learn what happens in this relationship. My labmates are interested in knowing which microorganisms cause Crohn's disease, or which genes cause it and should be followed up and analyzed further. Knowing the genes they interact with would help to design a drug to cure patients, or at least alleviate their symptoms and prevent relapse and recurrence.
After analyzing two models I realized that I didn't know which model was "the best"! I had simply assumed it; hence, I am now interested in how to model the data.
Now I have two new problems:
At first (naively) I thought: let's try all the relationship models and select "the best" one. However, this falls into the well-known over-fitting trap, so I am now evaluating each model with a leave-one-out strategy, as sketched below.
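A minimal leave-one-out sketch in base R, using mtcars as an illustrative stand-in for the real multi-block data (the same loop would score each candidate RGCCA design):

```r
# Leave-one-out: refit n times, each time predicting the sample left out.
n <- nrow(mtcars)
loo_sqerr <- sapply(1:n, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit, mtcars[i, ]))^2
})
sqrt(mean(loo_sqerr))  # leave-one-out RMSE for this candidate model
```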
Second, I have too many models! When I say all models, I mean that there are 55648 possible symmetric design matrices for 5 blocks of data (using only three weights per position). Evaluating all of them would require a lot of CPU time, so I am now looking into ways to reduce it. Basically, I plan to evaluate 10% of the total number of possible combinations, use a linear model to try to assess which designs could be better, then test another few thousand and check; a sketch of this plan follows.
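A sketch of that plan, assuming the three weights per position are 0, 0.5 and 1 (my assumption) and a hypothetical evaluate_design() that fits one design and returns its leave-one-out score. (The unconstrained count here is 3^10 = 59049; the 55648 valid models of the post presumably exclude some designs.)

```r
# A symmetric 5x5 design matrix with a fixed diagonal has
# choose(5, 2) = 10 free off-diagonal positions.
weights <- c(0, 0.5, 1)                          # three weights per position (assumed values)
designs <- expand.grid(rep(list(weights), 10))   # all unconstrained combinations

set.seed(7)
sampled <- designs[sample(nrow(designs), round(0.1 * nrow(designs))), ]

# evaluate_design() is hypothetical: it should build the design matrix,
# fit the RGCCA model, and return its leave-one-out score.
sampled$score <- apply(sampled, 1, evaluate_design)

# Meta-model: which positions of the design matrix drive the score?
meta <- lm(score ~ ., data = sampled)
summary(meta)
```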
This also highlights the two aspects: I can select a perfect model for this data, but if I use it with other data my results will likely be invalid. So I need to validate both the model I use and the assumptions of the model.
Summary
Two activities can be defined around data: learning about models (machine learning) and learning about the data through the models (data learning). As seen in my case, machine learning doesn't appear out of the blue; it appears naturally when you question your model (and yourself).