As part of my Interview with Data Scientists project, I recently caught up with Rosaria, who is an active member of the Data Mining community.
Bio: Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing.
She is currently based in Zurich (Switzerland).
There is no such thing as a perfect project! However close you get to perfection, at some point you need to stop, either because
the time is over, or the money is over, or you just need to have a productive solution. I am sure I could go back to all my past projects
and find something to improve in each of them!
This is actually one of the biggest issues in a data analytics project: when do we stop? Of course, you need to identify some basic
deliverables in the project's initial phase,
without which the project is not satisfactorily completed.
But once you have passed these deliverable
milestones, when do you stop?
What is the right compromise between perfection and resource investment?
In addition, every few years some new technology becomes available that could help
re-engineer your old projects, for speed or accuracy or both.
So, even the most perfect project solution can surely be improved after a few years thanks to new technologies. This is, for example, the case
with the new big data platforms. Most of my old projects would now benefit from a
big data based speed-up. This could help
speed up the training and deployment of old models, create more complex data analytics models, and better optimize model parameters.
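As a rough illustration of the kind of parameter optimization that cheaper parallel compute makes feasible, here is a minimal Python sketch using scikit-learn's grid search; the dataset and parameter grid are made up for the example, not taken from any project discussed here.

```python
# Illustrative only: a wider hyperparameter search becomes affordable once
# training can be parallelized; GridSearchCV on a toy dataset stands in
# for that idea here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# On real project data, a grid like this is exactly the workload that a
# big data / parallel platform speeds up.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,  # use all available cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```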
What advice would you give to students in the Sciences?
Use your time to learn! Data Science is a relatively new discipline that combines old
knowledge, such as statistics and machine learning, with
newer wisdom, like big data platforms and parallel computation. Not many people
know everything here, really! So, take your time to learn what you
do not know yet from the experts in that area.
Combining a few different pieces of data science knowledge probably
already makes you unique in the data science landscape.
The more pieces of different knowledge you have, the bigger the advantage for you
in the data science ecosystem!
One way to get easy hands-on experience across a range of application fields is to
explore the Kaggle challenges.
Kaggle has a number of interesting challenges running every month, and who knows,
you might even win some money!
This answer is related to the previous one, since my advice to
young data scientists stems from my earlier experience and failures.
My early background is in machine learning.
So, when I took my first steps in the
data science world many years ago, I thought that knowledge of
machine learning algorithms was all I needed. I wish!
I had to learn that data science is the sum of many different skills,
including data collection and data cleaning and transformation. The latter,
for example, is highly underestimated! In all data science projects I have seen (not only mine), the data processing part takes way more than 50% of the resources!
Then there are data visualization and data presentation. An ingenious solution is worth nothing if the executives and stakeholders do not understand the results by means of
a clear and compact representation! And so on. I wish I had taken more time early on to learn from colleagues with different skill sets than mine.
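To give a feel for the cleaning and transformation work mentioned above, here is a small, entirely made-up pandas example; real projects involve the same steps at much larger scale.

```python
# Made-up example of typical cleaning/transformation steps: duplicates,
# inconsistent column names, missing values, and string-encoded numbers.
import pandas as pd

raw = pd.DataFrame({
    "Customer ID": ["001", "002", "003", "003"],
    "signup_date": ["2015-01-03", "2015-02-10", None, "2015-03-01"],
    "revenue": ["1,200", "980", "n/a", "1,200"],
})

df = (
    raw.drop_duplicates(subset="Customer ID")          # drop duplicate records
       .rename(columns={"Customer ID": "customer_id"}) # normalize column names
)
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["revenue"] = pd.to_numeric(
    df["revenue"].str.replace(",", ""), errors="coerce"  # "n/a" becomes NaN
)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
print(df)
```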
Do you really need big data? Sometimes customers ask for a big data platform just because. Then, when you investigate deeper, you realize
that they really do not have, and do not want to have, such a big amount of data to take care of every day. A nice traditional DWH (Data Warehouse) solution is definitely enough for them.
Sometimes, though, a big data solution is really needed, or at least it will be needed soon.
Probably, the variety of applications. The whole knowledge of data collection, data warehousing, data analytics, data visualization, results inspection and
presentation is transversal to a number of application fields. You would be surprised at how many different applications can be designed using a variation of the same
data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.
I always propose a first pilot/investigation mini-project at the very beginning. This is for me to get a better idea of the application specs, of the data set, and yes also
of the customer. This is a crucial phase, though short.
During this part, in fact, I can take the measure of the project in terms of needed time and resources, and
the customer and I can study each other and adjust our expectations about input data and final results.
This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to
produce the requested results.
Once this part is successful and expectations have been adjusted on both sides, the real project can start.
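As a toy sketch of what that first pilot pass might look like in Python (the file name and target column here are hypothetical, chosen just for illustration):

```python
# Toy pilot pass: load a sample, inspect it, and try a quick baseline model
# to check that the requested results are even feasible.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical input file and target column, for illustration only.
sample = pd.read_csv("customer_data.csv", nrows=10_000)

# Descriptive / visual investigation of the sample
print(sample.describe(include="all"))
print(sample.isna().mean().sort_values(ascending=False).head())

# First tentative analysis: a simple numeric-features-only baseline
X = sample.select_dtypes("number").drop(columns=["churned"]).fillna(0)
y = sample["churned"]
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```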
How do you deal with stakeholders and executives, and manage cultural challenges? What advice do you have about this?
Ah … I am really not a very good example for dealing with
stakeholders and executives and successfully managing cultural
challenges!
Usually, I rely on external collaborators to handle this part for me, also because of time constraints.
I see myself as a technical professional, with little time for talking and convincing. That is unfortunate, because this is a big part of every data analytics project.
However, when I have to deal with it myself,
I let the facts speak for me: final or
intermediate results of current and past projects.
This is the easiest way to convince stakeholders that the project is worth the time and the money.
Just in case, though, I always keep at hand a set of slides
with previous accomplishments to present to executives if and when needed.
My latest project was about anomaly detection in industry. I found it a very interesting problem to solve, where skills and expertise have to meet creativity.
In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to let happen.
What you have is a data set of records of normal functioning of the machine, transactions, system, or whatever it is you are observing.
The challenge then is to predict anomalies before they happen and without previous historical examples. That is where the creativity comes in.
Traditional machine learning algorithms need a twist in application to provide an adequate solution for this problem.
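To make that twist concrete, here is a minimal sketch of one common approach (my own illustration, not necessarily the method used in the project): train a model on normal data only, then flag anything it considers too far from normal.

```python
# Sketch of one-class anomaly detection: the model sees only normal data
# at training time; at prediction time, -1 marks a suspected anomaly.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 3))  # normal operation only

detector = OneClassSVM(nu=0.05)  # nu ~ tolerated fraction of outliers
detector.fit(normal)

new_readings = np.vstack([
    rng.normal(size=(5, 3)),  # more normal-looking readings
    [[6.0, -5.0, 7.0]],       # an obviously abnormal reading
])
print(detector.predict(new_readings))  # +1 = normal, -1 = flagged anomaly
```

Isolation forests and autoencoder reconstruction error are other common choices for the same one-class setup.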