Data science is a discipline that looks for patterns in complex datasets to build models that predict what may happen in the future and/or explain systems. Data science combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, frequently for business purposes. To perform data science well, you need a combination of skills and expertise.
Skills and Expertise Involved in Data Science
Without math and statistics knowledge, data science models can be misused, and results can be misinterpreted.
Data Science Lifecycle
A data science lifecycle typically has six phases:
- Ideation: This is where the project is initially scoped, value/ROI is identified, and a go/no-go decision is made about whether to invest in a data science project.
- Data acquisition and exploration: This is where promising data sources are identified, data are captured and explored, and then data are prepared (cleansed, combined, etc.).
- Research and development: Once the data are prepared, the team is able to test hypotheses, experiment with different models, and assess the results.
- Validation: This is a quality assurance phase, where a model is validated technically, ethically, and functionally by the business before it is put into production. Companies in highly regulated industries often have whole teams of people who are responsible for model validation and, in some cases, appoint a Chief Compliance Officer.
- Delivery: This is where the model is published by connecting it to the live data that it uses to make predictions. Web applications, integrated APIs, dashboards, reports, and even automated emails are popular form factors.
- Monitoring: Data and behavior change over time, so it is important to monitor data drift, model accuracy, usage, performance, and value. That way, problems can be identified early and models can be improved.
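
To make the data drift check in the monitoring phase more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method this article prescribes: it uses the Population Stability Index (PSI), one commonly used drift measure, to compare a feature's values at training time with the values seen in production. The function name `population_stability_index`, the `bins` parameter, the 1e-6 floor, and the 0.25 alert threshold are all hypothetical choices made for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift.

    Bin edges come from the baseline sample (an assumption of this sketch);
    live values outside that range are counted in the first or last bin.
    """
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant baseline: avoid dividing by zero below

    def bucket_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    base_shares = bucket_shares(baseline)
    live_shares = bucket_shares(live)
    return sum(
        (l - b) * math.log(l / b) for b, l in zip(base_shares, live_shares)
    )


if __name__ == "__main__":
    # Synthetic data for illustration only.
    training_values = [i / 100 for i in range(1000)]
    production_values = [v + 3 for v in training_values]  # shifted upward

    psi = population_stability_index(training_values, production_values)
    # 0.25 is a frequently quoted rule-of-thumb threshold, assumed here.
    if psi > 0.25:
        print("Possible data drift: review the model and its inputs.")
```

In practice a check like this would run on a schedule for each input feature, alongside the accuracy, usage, and performance tracking listed above. The threshold should be tuned to the use case rather than taken from this sketch.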