Scalable ETL Pipeline for Health Data Ingestion: Application to the Brazilian Unified Health System (SUS)

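Authors

SHIMAOKA, A. M.; SALVADOR, M. E.; DUARTE, J. M.; SILVA JUNIOR, A. C. da; LOPES, L. R.; BANDIERA-PAIVA, P.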

DOI:

https://doi.org/10.21728/asklepion.2026v5n1e-132

Keywords:

ETL, Health information systems, Data ingestion, Health data integration, Public health

Abstract

The growing availability of health data within the Brazilian Unified Health System (SUS) increases the potential for data-driven analysis. However, large datasets also introduce challenges related to volume, structure, and data integration. This study develops and evaluates an automated Extract, Transform, Load (ETL) pipeline for the ingestion and preparation of data from the Ambulatory Information System (SIA-SUS). The architecture uses cloud computing infrastructure to support scalable processing. The study follows the Design Science Research approach, which focuses on the development and evaluation of technological artifacts. A pilot experiment processes data from January 2024 for three Brazilian states, comprising approximately 3.2 million ambulatory records. Each scenario is executed five times to estimate operational variability. Results show stable pipeline performance across scenarios. The extraction stage accounts for the largest share of total execution time, and throughput remains relatively consistent despite differences in data volume. Linear regression between the number of processed records and execution time yields a coefficient of determination of R² = 0.996, which indicates an approximately linear relationship between data volume and processing time. The pipeline demonstrates operational feasibility and scalability potential. The architecture reduces the complexity of preparing large SUS datasets and supports the development of analytical environments for public health data.
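As an illustration of the scalability analysis described in the abstract, the following minimal sketch fits an ordinary least squares line to execution time as a function of processed records and computes R² and per-scenario throughput. It is not the authors' code: the function name `fit_line` and the sample values are hypothetical and do not reproduce the study's measurements.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r2), where r2 is the coefficient of
    determination, 1 - SS_res / SS_tot.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# HYPOTHETICAL data: one (records, seconds) pair per scenario.
# These values are illustrative only and are NOT taken from the study.
records = [200_000, 1_000_000, 2_000_000]
seconds = [65.0, 310.0, 640.0]

slope, intercept, r2 = fit_line(records, seconds)

# Throughput in records per second for each scenario
throughput = [r / s for r, s in zip(records, seconds)]

print(f"seconds per record (slope): {slope:.6f}")
print(f"R^2: {r2:.4f}")
print("throughput (records/s):", [round(t) for t in throughput])
```

An R² close to 1 together with roughly constant throughput across scenarios is what an approximately linear relationship between data volume and processing time would look like in such an analysis.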

Published

2026-04-22

How to Cite

SHIMAOKA, A. M.; SALVADOR, M. E.; DUARTE, J. M.; SILVA JUNIOR, A. C. da; LOPES, L. R.; BANDIERA-PAIVA, P. Scalable ETL Pipeline for Health Data Ingestion: Application to the Brazilian Unified Health System (SUS). Asklepion: Informação em Saúde, Rio de Janeiro, RJ, v. 5, n. 1, p. e–132, 2026. DOI: 10.21728/asklepion.2026v5n1e-132. Available at: https://revistaasklepion.emnuvens.com.br/asklepion/article/view/132. Accessed: 25 Apr. 2026.