2026 Summer Intern - ECD Clinical Insight and Automation AI Modeling

Genentech
United States, California, South San Francisco
Feb 03, 2026
The Position

Department Summary

Within the Clinical Insight and Automation (CI&A) team of the Early Clinical Development (ECD) department at Roche/Genentech, we develop quantitative and AI-driven methods that accelerate study design, evidence generation, and decision-making. This internship will contribute to an applied research effort on generative modeling and causal inference for creating high-fidelity synthetic clinical data, with an emphasis on producing reproducible results by leveraging conditional deep generative models and open-source foundation models. It will also contribute to a use case for an AI-driven root cause analysis solution for clinical data queries.

This internship position is located in South San Francisco, on-site.

The Opportunity

We're seeking a PhD student who is excited to pursue publication-quality machine learning research at the intersection of generative modeling, causal inference, and healthcare data. In this role, you will collaborate with scientists, analysts, and engineers to develop and rigorously evaluate novel methods and benchmarking protocols, along with the accompanying reproducibility artifacts.

Key Responsibilities:

  • Identify and prepare appropriate clinical datasets, define generation targets, and implement reproducible preprocessing pipelines.

  • Develop and compare modern generative modeling approaches for patient-level outcomes and trajectories (e.g., diffusion, transformer, and latent-variable models) conditioned on baseline covariates and study design assumptions, as well as unsupervised clustering/topic models to identify clinically meaningful patterns.

  • Incorporate causal inference considerations (confounding control, covariate balance, estimands) and quantify how synthetic controls impact downstream treatment-effect estimation. Perform manual "gold standard" labeling to create high-quality training datasets.

  • Design rigorous evaluation protocols for fidelity and utility, including distributional similarity, calibration/uncertainty, fairness and subgroup robustness, and privacy-risk checks, with ablations and sensitivity analyses.

  • Build end-to-end experiment infrastructure (training and evaluation scripts, configuration management, and experiment tracking) to support reproducibility and efficient iteration.

  • Co-prepare a conference-quality manuscript, figures, and supplementary materials, including the paper checklist and (where appropriate) anonymized code/data artifacts consistent with reproducibility and ethics expectations.

  • Communicate progress through regular updates; deliver a final technical report, curated repository, and presentation to cross-functional stakeholders.

  • Develop and execute Python scripts to ingest, clean, and normalize large volumes of unstructured clinical query text and patient-level datasets.

  • Help translate technical AI findings into a "Recommendations Matrix" that suggests specific site training or system improvements for stakeholders.

Program Highlights

  • Intensive 12-week, full-time (40 hours per week) paid internship.

  • Program start dates are in May/June 2026.

  • A stipend, based on location, will be provided to help alleviate costs associated with the internship.

  • Ownership of challenging and impactful business-critical projects.

  • Work with some of the most talented people in the biotechnology industry.

Who You Are

Required Education
You meet one of the following criteria:

  • Must be pursuing a PhD (enrolled student).

Required Majors: Computer Sciences, Artificial Intelligence, Computational Sciences, or a related field with a focus on machine learning systems or similar.

Required Skills:

  • Programming proficiency in Python; hands-on experience with PyTorch and scientific computing libraries (NumPy, Pandas).

  • Experience developing and training deep learning models, with familiarity in modern generative modeling (e.g., diffusion models, VAEs, autoregressive/transformer models).

  • Strong understanding of statistical machine learning and experimental methodology (ablations, error analysis, and appropriate statistical evaluation).

  • Foundational understanding of causal inference and counterfactual reasoning (e.g., confounding, estimands, treatment-effect estimation) and how these considerations interact with modeling choices.

  • Demonstrated technical writing skills (e.g., research reports or papers); comfort preparing conference-style manuscripts in LaTeX and presenting results to technical audiences.

  • Commitment to reproducible research and software engineering best practices (version control, documentation, experiment tracking), with the ability to package artifacts for peer review; collaborative communication skills.

Preferred Knowledge, Skills, and Qualifications

  • Excellent communication, collaboration, and interpersonal skills.

  • Complements our culture and the standards that guide our daily behavior & decisions: Integrity, Courage, and Passion.

  • Prior publication or strong experience preparing submissions for top-tier ML venues (e.g., NeurIPS/ICML/ICLR), including familiarity with reproducibility expectations (paper checklist, artifact preparation).

  • Experience working with healthcare/biomedical datasets (EHR, claims, or clinical trial data); familiarity with data standards (OMOP, FHIR) is a plus.

  • Knowledge of synthetic data evaluation and privacy risk assessment (e.g., memorization tests, membership inference, differential privacy).

  • Familiarity with causal ML topics such as causal representation learning, domain adaptation, and externally controlled trial methodology.

  • Experience with scalable training environments (GPUs, distributed computing) and modern ML tooling (Docker, experiment tracking platforms).

  • Experience leveraging foundation models (LLMs) or structured prompting to incorporate domain knowledge into ML workflows is beneficial, but not required.

  • Ability to query and extract data from relational databases using SQL and hands-on experience with NLP frameworks and clustering algorithms.

  • Familiarity with clinical trial operations, Electronic Data Capture (EDC) systems, or the regulatory landscape of the pharmaceutical industry is beneficial, but not required.

Relocation benefits are not available for this job posting.

The expected salary for this position based on the primary location of California is $50.00 per hour. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. This position also qualifies for paid holiday time off benefits.

Genentech is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including but not limited to discrimination on the basis of Protected Veteran status or status as an individual with a disability, consistent with all federal, state, or local laws.

If you have a disability and need an accommodation in relation to the online application process, please contact us by completing the Accommodations for Applicants form.
