Data Science CoLab

Making Data Work for HHS

Introductory Level Modules Objectives

Students will have acquired skills and applied them to real-life problems, demonstrating that they understand how to clean and manipulate large datasets, combine data from different systems, draw deeper insights, and reach data-informed conclusions. These introductory topics may be treated as optional for participants with advanced experience.

Introduction to Open-Source Programming and Analytics Tools: Learn the basic operations of an integrated development environment (IDE) and a version control tool (e.g., Git), along with other administrative tasks. Use built-in Python data structures and external libraries (e.g., Pandas) to read from and write to files and to manipulate data. Run basic SQL commands from Python (e.g., with SQLite) and combine multiple data sources.
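As an illustration of the last objective in this module, here is a minimal sketch (not prescribed course material) of combining two data sources with SQL from Python. It uses only the standard-library csv and sqlite3 modules. The table names, column names, sample values, and the helper functions load_csv and total_cost_by_state are hypothetical and chosen purely for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for two separate source systems.
PATIENTS_CSV = """patient_id,state
1,MD
2,VA
3,MD
"""
VISITS_CSV = """patient_id,cost
1,120.0
1,80.0
2,200.0
"""


def load_csv(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_cost_by_state(patients, visits):
    """Load both sources into an in-memory SQLite database and join them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (patient_id INTEGER, state TEXT)")
    conn.execute("CREATE TABLE visits (patient_id INTEGER, cost REAL)")
    # Named placeholders pull values from each row dictionary.
    conn.executemany("INSERT INTO patients VALUES (:patient_id, :state)", patients)
    conn.executemany("INSERT INTO visits VALUES (:patient_id, :cost)", visits)
    rows = conn.execute(
        """
        SELECT p.state, SUM(v.cost)
        FROM patients AS p
        JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.state
        ORDER BY p.state
        """
    ).fetchall()
    conn.close()
    return rows


result = total_cost_by_state(load_csv(PATIENTS_CSV), load_csv(VISITS_CSV))
print(result)
```

In practice the two sources would be files or database exports rather than inline strings, and a library such as Pandas could replace the manual loading step.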

Intermediate Open-Source Programming and Analytics Tools: Perform web scraping and consume APIs and services. Perform string manipulations, calculations, and basic plotting (e.g., Matplotlib). Build functions that combine the above and perform basic software engineering tasks (e.g., debugging, unit testing). Understand the basics of object-oriented programming (e.g., classes, inheritance, method overriding).

Predictive Analytics: Understand the goals of data analytics (i.e., prediction, inference). Understand the probability and statistics concepts needed to build linear regression models. Using Python built-in or external libraries (e.g., Scikit-learn), build linear regression and classification models by splitting, training, and testing on various datasets. Evaluate regression models using goodness-of-fit metrics (e.g., RMSE), and understand how adding or removing features (independent variables) affects a model's explanatory power. Evaluate classification models using performance metrics (e.g., prediction accuracy, sensitivity, specificity).
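The split, train, test, and evaluate workflow named in this module can be sketched as follows. This is a minimal standard-library illustration on synthetic data, not the course's implementation. In the module itself a library such as Scikit-learn would normally do this work. The helper names train_test_split, fit_line, and rmse, the seeds, and the synthetic relationship y = 2x + 1 are assumptions made for the example.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root mean squared error of the fitted line on the given points."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in range(40)]

train_rows, test_rows = train_test_split(data)
intercept, slope = fit_line(train_rows)
test_rmse = rmse(test_rows, intercept, slope)
print(f"slope={slope:.3f} intercept={intercept:.3f} test RMSE={test_rmse:.3f}")
```

Computing RMSE on the held-out rows rather than the training rows is what makes the metric an estimate of how the model performs on data it has not seen.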

Data Visualization: Understand visualization theory. Understand the basic HTML and JavaScript required for advanced visualization. Using libraries such as D3, build visualization products at various stages of the workflow (e.g., data exploration, model building, evaluation).

Intermediate Level Modules Objectives

Students will have acquired and applied skills demonstrating that they know how to efficiently train, test, and evaluate predictive models.

Introductory Machine Learning: Understand the bias-variance tradeoff, overfitting, and other modeling considerations. Build more complex regression and classification models (e.g., logistic regression, naive Bayes). Using Python libraries (e.g., TextBlob), perform text analysis and build classification models. Perform more advanced evaluation (e.g., ROC curves, AUC, cross-validation).
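To make the AUC metric mentioned in this module concrete, here is a minimal sketch that computes it directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name roc_auc and the sample labels and scores are hypothetical. A library routine would normally be used instead of this quadratic-time loop.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both classes are present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores for illustration.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(example_labels, example_scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is often preferred over raw accuracy when classes are imbalanced.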

Advanced Machine Learning: Build more advanced tree-based regression and classification models and employ techniques such as bootstrapping, boosting, and random forests. Understand how to deal with imbalanced data and how multi-class classification models are built and evaluated. Using deep learning, support vector machines, and other advanced techniques, build models on the same dataset and evaluate them. Perform more advanced natural language processing tasks. Understand how digital signals can be parsed and applied to modeling. Understand how unsupervised learning can be applied to different cases and build models employing unsupervised learning (e.g., principal component analysis).