Author: Adam G. Dobrakowski
Redaction: Zuzanna Kwiatkowska
In the Data Science literature, we can find quite a few articles that describe how to do exploratory data analysis (EDA) from a technical point of view. However, usually, there is no information on where to get inspiration for making hypotheses in such an EDA.
That is why in this post, I would like to share my thoughts on how to approach searching for such inspiration. As always, I will be relying heavily on my own experiences. If you have any ideas that I haven’t included here, be sure to let me know!
Domain experts and the client
If you are creating a solution for an external client, they will be the first and most valuable source of hypotheses. If they come to you with an idea for a certain algorithm, they probably already have some thoughts on the topic. They might not be fully aware of them, though, so your task will be to skillfully ask questions that will let you discover the client's needs.
Before the project
Let’s assume that you are to analyse bank customer data to build a model assessing their credit score. Before you start your analysis, it will be good to spend a few hours talking to a person representing the bank. With their help, you can understand the expectations and where the idea for such an algorithm came from.
- Maybe many years of the bank’s work experience have already revealed some patterns or dependencies that led to the idea for complete automation?
- What is the main problem to be solved by the algorithm?
- Why is the current solution not enough?
Also, get to know how the client thinks such an algorithm should work. What do they think you should pay attention to the most? Which data will be crucial, and what dependencies between them should the model be able to detect to return the correct answer?
You also need to determine how the credit score is assessed at this point. When talking to specialists who deal with this on a daily basis, you can ask what factors they take into account and what rules they follow. Which cases are easy and typical, and which are difficult? How often do such difficult cases occur, and what are the reasons for them?
Finally, ask directly – what is not known at the moment, and what are the current hypotheses? What would the client like to learn from the data? Are there any issues that have been bothering them for a long time?
Each question can give rise to many different hypotheses, which you will then test.
Note that the questions are so universal that they should also work in other fields, such as healthcare, where your system is supposed to support the work of doctors. In this new scenario, they will be the specialists from whom you need to gather domain knowledge.
If you are building your own product based on your own idea, you can ask yourself these questions. What intuition do you have about the world around you that tells you that your solution has a chance to work? How could you refine this intuition to make specific hypotheses?
During the project
In Data Science, it is very important to stay in constant contact with the client (preferably with domain experts) during the project, not only before it starts. Thanks to this, you can present the results of your work and collect ongoing feedback.
You will often find that your analyses only confirm what was already known (this is good news: it usually means the analysis has been performed well). Still, you may also produce valuable observations that show something new. It will be very difficult to tell which is which without consulting experts.
Such conversations also allow you to jointly propose further hypotheses based on what has already been established.
Determining the state of knowledge and SOTA algorithms in a given field
This means conducting thorough research to answer questions such as:
- Has anyone already done similar things?
- What were the solutions used?
- What has been achieved and what has not been achieved?
- What observations are repeated, and where are the discrepancies?
If you observe a generally accepted opinion in a given field, it is worth turning it into a hypothesis and testing it against your data. In the banking example from the beginning of this post, it may be a trivial hypothesis, e.g. whether people with higher earnings are more punctual in paying their loan instalments.
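Such a hypothesis can be checked in a few lines of code. Below is a minimal sketch with a simulated dataset standing in for the bank's data; the column names (`income`, `late_payment`) and the effect baked into the simulation are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical loan data -- in a real project this would come from the bank.
rng = np.random.default_rng(42)
n = 1_000
income = rng.lognormal(mean=10, sigma=0.5, size=n)
# Simulate the effect we hypothesise: higher income -> lower chance of a late payment.
p_late = 1 / (1 + np.exp((income - np.median(income)) / income.std()))
late_payment = rng.random(n) < p_late

df = pd.DataFrame({"income": income, "late_payment": late_payment})

# Compare the income distributions of punctual vs. late payers.
on_time = df.loc[~df["late_payment"], "income"]
late = df.loc[df["late_payment"], "income"]
stat, p_value = stats.mannwhitneyu(on_time, late, alternative="greater")

print(f"median income (on time): {on_time.median():.0f}")
print(f"median income (late):    {late.median():.0f}")
print(f"Mann-Whitney p-value:    {p_value:.4g}")
```

A non-parametric test like Mann-Whitney is a reasonable default here because incomes are typically heavily skewed; on real data you would of course also control for confounders before drawing conclusions.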
Keep in mind, however, that in the more advanced literature these basic assumptions are often skipped because they are taken for granted. Therefore, when examining the reasoning presented in any article, pay close attention to the tacit use of such assumptions. Another problem is that the best algorithms in some industries, e.g. AI-based investment assistants, are proprietary, and their mode of operation is not disclosed. In these cases, you especially need the ability to read between the lines.
It is worth noting that already at this stage, you can come up with very interesting conclusions that may not be entirely in line with what is commonly believed in a given field. Then it is worth taking a closer look at them – maybe you are discovering previously unknown areas?
Also, consider aspects where there is no agreement in the literature. If some people say the world works this way and others say it doesn’t, you have the perfect hypothesis to test against your data.
Initial data analysis
If you already have a dataset to work with, you will definitely start by performing an initial, general analysis (if it is tabular data, I recommend using the Pandas Profiling library, which I described in this post).
Such an initial glance will show you what features you have, how diverse they are, and the basic relationships between them, such as correlations. While looking at this general information, you will usually want to check various additional things that can be a precursor to further hypotheses, e.g.:
- Why is there a correlation between certain features?
- Why are some features less and others more diverse?
- What do NULLs in data mean?
- Where do the outliers come from?
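Each of these checks is a one-liner in pandas. The sketch below runs them on a toy DataFrame; the column names and numbers are made up and only stand in for a real dataset:

```python
import numpy as np
import pandas as pd

# A toy dataset standing in for the bank data (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "income": rng.lognormal(10, 0.5, size=500),
})
df["loan_amount"] = df["income"] * rng.uniform(0.5, 3.0, size=500)
df.loc[rng.choice(500, size=25, replace=False), "income"] = np.nan  # simulate NULLs

# 1. Which features correlate, and how strongly?
corr = df.corr(numeric_only=True)

# 2. How diverse is each feature? (coefficient of variation; near zero = near-constant)
spread = df.std(numeric_only=True) / df.mean(numeric_only=True)

# 3. Where are the NULLs, and how many?
null_counts = df.isna().sum()

# 4. Which values look like outliers? (simple 1.5*IQR rule on income)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(corr.round(2))
print(null_counts)
print(f"{len(outliers)} income outliers by the 1.5*IQR rule")
```

The numbers themselves are not the point: each surprising value (an unexpected correlation, a column full of NULLs, a cluster of outliers) is an invitation to ask "why?" and formulate a hypothesis.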
Even if you are not an expert in a given field, you can make many hypotheses based on your own experience of observing the functioning of the world, economy or human behaviour.
For example, you don’t have to be a pharmacist to figure out that there will be a strong seasonality in the sale of some drugs, and it is worth taking this factor into account if you are building a model to predict the volume of drug sales.
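The seasonality hypothesis is easy to verify numerically before any modelling. Here is a small sketch on simulated monthly sales of a winter-peaking medication; the numbers and the sinusoidal pattern are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales of a flu medication over three years.
months = pd.date_range("2020-01-01", periods=36, freq="MS")
rng = np.random.default_rng(1)
# A yearly sinusoidal pattern peaking in winter, plus noise.
seasonal = 1000 + 400 * np.cos(2 * np.pi * months.month / 12)
sales = pd.Series(seasonal + rng.normal(0, 50, size=36), index=months)

# Averaging by calendar month makes the seasonality visible in numbers.
monthly_profile = sales.groupby(sales.index.month).mean()

winter_avg = monthly_profile[[12, 1, 2]].mean()
summer_avg = monthly_profile[[6, 7, 8]].mean()
print(f"winter average: {winter_avg:.0f}, summer average: {summer_avg:.0f}")
```

If the winter and summer averages differ markedly, the month of the year clearly carries signal and belongs in your sales-prediction model as a feature.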
I think the most effective way of working on a Data Science project is in a team of several people. The best form of cooperation is when everyone gets a task scoped for roughly one day, and you hold daily status meetings to verify progress.
Such meetings are also an opportunity to discuss the conclusions reached so far and brainstorm the next steps. There is a good chance that it will be much easier to make further hypotheses together than when you try to do it alone.
If the nature of your work requires independent projects, at least try to find a person with whom you can discuss your progress once a week, show your results, and think together about what to do next. One of the worst things in Data Science is working alone on a dataset for weeks, with no opportunity to discuss what you're doing. It may sound brutal, but in such a situation your work will most likely be wasted and useful to no one, because you have no way to verify whether what you do meets the business assumptions.
Use an agile approach
Surely you are familiar with the concept of agile (if not, you can read about it, e.g. here).
I have already written about the need for regular cooperation with the client, which is part of this philosophy. However, when the purpose of your data analysis is to build a model, it is very important to deliver a working version of your solution as soon as possible and put it in front of the target users. Believe me, there will be problems that you would never run into when testing the model in a laboratory.
This goes a bit beyond EDA itself, but if the goal is to build a practical solution, remember that reality will quickly verify your analysis. Hence, there is no point in testing hundreds of hypotheses. Make sure that the model already provides some value, and move on to real tests (preferably in the form of A/B tests, which you can read about in my blog post).
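Evaluating such a real test often comes down to comparing two conversion-style rates. Below is a minimal sketch of a one-sided two-proportion z-test; all the numbers are invented for illustration:

```python
import math
from scipy.stats import norm

# Made-up A/B test results: conversions and visitors per variant.
conversions_a, visitors_a = 120, 2400   # current model
conversions_b, visitors_b = 150, 2400   # new model

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

# Two-proportion z-test for "variant B converts better than A".
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = norm.sf(z)  # one-sided p-value

print(f"A: {p_a:.2%}, B: {p_b:.2%}, z = {z:.2f}, p = {p_value:.4f}")
```

In practice you would also decide the sample size and significance level before the experiment starts, not after looking at the results.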
Remember that the verification of hypotheses is the foundation of Data Science. It is necessary both in scientific research and in applied Machine Learning model implementation. Therefore, if you want to create one or the other successfully, you must be able to approach both hypothesising and verifying them in a smart way.
In this article, I showed you six approaches that can be useful for making hypotheses. I hope at least some of them are new to you. If you have any approaches that I haven't written about, I'm waiting for a DM from you!