    9 Tips to become a better data scientist

    Over the years I have worked on many data science projects. I remember how easy it was to get lost and waste a lot of energy going in the wrong direction. In time, I learned what makes me more effective. This list is my best attempt to sum it up:

    1. Build a working pipeline first

    While it’s tempting to start with the cool stuff, you want to make sure that you don't lose time later on small technical things like loading the data, feature extraction and so on. I like to start with a very basic pipeline, but one that works, i.e., I can run it end to end and get results. Later I expand each part while keeping the pipeline working.
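    As a rough illustration, a first working pipeline can be as small as the sketch below. Everything in it is hypothetical: the toy dataset, the single "has_free" feature and the majority-vote "model" are placeholders I made up, chosen only so that every stage exists and the whole thing runs end to end.

```python
from collections import Counter


def load_data():
    # Hypothetical toy dataset of (text, label) pairs; a real loader goes here later.
    return [
        ("free prize now", 1),
        ("meeting at noon", 0),
        ("win free cash", 1),
        ("lunch tomorrow", 0),
        ("free tickets inside", 1),
        ("project status update", 0),
    ]


def extract_features(text):
    # Deliberately crude: a single feature saying whether the word "free" appears.
    return {"has_free": "free" in text.split()}


def train(rows):
    # Baseline "model": remember the majority label seen for each feature value.
    votes = {}
    for features, label in rows:
        votes.setdefault(features["has_free"], Counter())[label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def predict(model, features, default=0):
    # Fall back to a default label for feature values never seen in training.
    return model.get(features["has_free"], default)


def run_pipeline():
    # Load -> featurize -> split -> train -> evaluate, all in one call.
    data = load_data()
    rows = [(extract_features(text), label) for text, label in data]
    train_rows, test_rows = rows[:4], rows[4:]
    model = train(train_rows)
    correct = sum(predict(model, features) == label for features, label in test_rows)
    return correct / len(test_rows)


accuracy = run_pipeline()
```

    Each function is a stub that can later be replaced by something better (a real loader, richer features, an actual model) while `run_pipeline()` keeps working.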

    2. Start simple and complicate one thing at a time

    Once you have a working pipeline, start expanding and improving it. Take it step by step. It is very important to understand what caused what. If you introduce too many changes at once, it will be hard to tell how each change affected the whole model. Keep each update as simple and clean as possible. Not only will it be easier to understand its effect, but it will also be easier to refactor once you come up with another idea.

    3. Question everything

    Now you have a lot on your hands: you have a working pipeline and you have already made some changes that improved your results. It’s important to understand why. If you added a new feature and it helped the model generalize better, why? If it didn't, why not? Maybe your model is slower than before; why is that? Are you sure each of your features/modules does what you think it does? If not, what happened?

    These kinds of questions should pop into your head while you’re working. To end up with a really great result, you must understand everything that happens in your model.

    4. Experiment a lot and experiment fast

    After you have questioned everything, you are stuck with… well, a lot of questions. The best way to answer them is to experiment. If you have followed along this far, you already have a working pipeline and nicely written code, so conducting an experiment shouldn't take much of your time. Ideally, you’ll be able to run more than one experiment at a time. This will help you answer your questions and improve your intuition about what works and what does not.

    Things to experiment with: adding/removing features, changing hyperparameters, changing architectures, adding/removing data and so on.
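    One possible way to keep experiments cheap is a small harness that takes a baseline configuration, changes one setting at a time and records the result of each run. The sketch below is only an assumption about how that could look: the validation rows, the feature names "a" and "b", the threshold "model" and the helper names `run_experiment` and `sweep` are all invented for illustration.

```python
# Made-up validation rows with two integer features and a binary label.
VALIDATION = [
    {"a": 1, "b": 0, "label": 0},
    {"a": 2, "b": 1, "label": 0},
    {"a": 3, "b": 3, "label": 1},
    {"a": 4, "b": 0, "label": 0},
    {"a": 5, "b": 4, "label": 1},
    {"a": 6, "b": 2, "label": 1},
    {"a": 2, "b": 5, "label": 1},
    {"a": 1, "b": 1, "label": 0},
]


def run_experiment(config):
    """Return validation accuracy for one configuration (toy threshold model)."""
    correct = 0
    for row in VALIDATION:
        score = sum(row[name] for name in config["features"])
        prediction = int(score >= config["threshold"])
        correct += prediction == row["label"]
    return correct / len(VALIDATION)


def sweep(baseline, variations):
    """Vary one setting at a time against the baseline; return runs sorted by accuracy."""
    results = [("baseline", None, run_experiment(baseline))]
    for key, values in variations.items():
        for value in values:
            config = {**baseline, key: value}  # exactly one setting differs
            results.append((key, value, run_experiment(config)))
    return sorted(results, key=lambda run: run[2], reverse=True)


baseline = {"features": ("a",), "threshold": 4}
variations = {"threshold": [3, 5], "features": [("a", "b")]}

for key, value, accuracy in sweep(baseline, variations):
    print(f"{key:>10} {str(value):>12} accuracy={accuracy:.3f}")
```

    Because every run differs from the baseline in exactly one setting, any change in accuracy can be attributed to that setting.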

    5. Prioritize and Focus

    At this point, you have done a lot of work. You have a lot of questions, some answers, some other tasks and probably some new ideas for improving your model (or even for working on something entirely different).

    But not all of these are equally important. You have to understand which direction is the most beneficial for you. Maybe you came up with a brilliant idea that slightly improved your model but also made it much more complicated and slower. Should you continue in this direction? It depends on your goal. If your goal is to publish a state-of-the-art solution, maybe you should. But if your goal is to deploy a fast and decent model to production, then you can probably invest your time in something else. Remember your final goal when working, and try to understand which tasks/experiments will get you closer to it.

    6. Believe in your metrics

    As discussed, understanding what’s working and what is not is very important. But how do you know when something works? You evaluate your results against some validation/test data and get a metric, and you have to be able to believe that metric! There may be reasons not to. It could be the wrong metric, for example: if your data is unbalanced, accuracy can be misleading, and if your final solution must be very precise, you may care more about precision than recall. Your metric must reflect the goal you’re trying to achieve. Another reason not to believe your metric is test data that is dirty or noisy. Maybe you got the data from somewhere on the web and you don’t know exactly what’s in it.

    A reliable metric lets you advance fast, but it is just as important that the metric reflects your goals. In data science, it is easy to convince ourselves that our model is good while in reality it does very little.
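    To make the accuracy pitfall concrete, here is a small sketch that computes accuracy, precision and recall by hand on unbalanced labels. The labels and the two sets of "model" predictions are made up for illustration, and the `evaluate` helper is my own; reporting 0.0 when a denominator is empty is just one convention.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Convention chosen here: report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up unbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class.
always_negative = [0] * 100

# Model B wrongly flags 8 negatives but catches 4 of the 5 positives.
some_signal = [1] * 8 + [0] * 87 + [1] * 4 + [0]

report_a = evaluate(y_true, always_negative)
report_b = evaluate(y_true, some_signal)

print("A:", report_a)
print("B:", report_b)
```

    Model A scores higher on accuracy (0.95 versus 0.91) even though it never finds a single positive example, which only precision and recall reveal.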

    7. Work to publish/deploy

    Feedback is an essential part of any work, and data science is no exception. When you work knowing that your code will be reviewed by someone else, you’ll write much better code. When you work knowing that you’ll need to explain it to someone else, you’ll understand it much better. The outlet doesn't have to be a fancy journal, a conference or company production code. If you’re working on a personal project, make it open source, write a post about it, send it to your friends, show it to the world!

    Not all feedback will be positive, but you’ll be able to learn from it and improve over time.

    8. Read a lot and keep updated

    I’m probably not the first one to suggest keeping up with recent advances in order to be effective, so instead of talking about it, I’ll just tell you how I do it: good old newsletters! I find them very useful because it’s essentially someone who keeps up with the most recent literature, picks the best stuff and sends it to you!

    9. Be curious

    While reading about the newest and coolest, don't limit yourself to the one area you’re interested in. Try to explore other (but related) areas as well. This can be beneficial in a few ways: you may find that a technique that works in one domain is very useful in yours, you’ll improve your ability to understand complex ideas, and you may find another domain that interests you, letting you expand your data skills and knowledge.

    Conclusion

    You’ll get much better results and enjoy the process more if you’re effective. While all of the topics above are important, if I had to choose one, it would be 'Prioritize and Focus'. For me, all the other topics eventually lead to this one. The key to success is to work on the right thing.

    Author: Dima Shulga

    Source: Towards Data Science