Course duration: 5.66 hours

In this course, data engineer and technical writer Thalia Barrera offers an introductory yet comprehensive overview of data lakes. Learn about key concepts like data lake architecture, operation, and integration with existing data systems. Delve into how data lakes are integral to AI and machine learning workflows. Go over the differences between data lakes, data warehouses, and databases. Explore various data formats and their applicability in a data lake environment. Use included hands-on exercises to practice setting up a basic data lake and performing simple data operations. When you finish this course, you will be equipped to make informed decisions about implementing and managing data lakes in your organization.

Topics include:
  • Define and understand the basic architecture of data lakes and lakehouses, including the distinction between them and how each addresses different needs in data management and analysis.
  • Analyze complex real-world data using advanced analytics, build machine learning models, and develop generative AI applications by connecting to and querying data lakes or lakehouses.
  • Apply both SQL and Python to extract insights from structured and unstructured data stored in a data lakehouse, demonstrating the flexibility of the available query languages and tools.
  • Utilize open-source technologies such as Dremio, MinIO, and Apache Iceberg to build and manage a scalable and efficient data lakehouse environment (a minimal MinIO sketch follows this list).
  • Incorporate advanced data management techniques such as indexing, vector embeddings, and the use of vector databases like Chroma to enhance query performance and support sophisticated data operations like retrieval-augmented generation (see the Chroma sketch after this list).
  • Implement best practices in data governance, including role-based and attribute-based access controls, to ensure data privacy, security, and compliance within the lakehouse architecture.
  • Demonstrate proficiency in using business intelligence tools like Apache Superset for creating interactive dashboards and visualizations that support data-driven decision-making.
  • Leverage innovative frameworks like LangChain to simplify the integration of large language models and other components, facilitating the development of complex language-based AI applications.
  • Execute end-to-end data pipelines using orchestration tools such as Dagster, ensuring that data flows smoothly between different stages of processing and transformation.
  • Explore the capabilities of generative AI applications by developing a sales copilot that provides contextually accurate responses to user queries, showcasing the potential of combining large language models with a data lakehouse.
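
To make the hands-on angle concrete, here is a minimal sketch of a "simple data operation" against a data lake's object store, assuming a local MinIO server; the endpoint, credentials, bucket, and file names are placeholders, not taken from the course itself:

    from minio import Minio

    # Connect to an assumed local MinIO instance with placeholder credentials.
    client = Minio(
        "localhost:9000",
        access_key="minioadmin",
        secret_key="minioadmin",
        secure=False,  # plain HTTP is typical for a local dev setup
    )

    # Create a bucket to serve as the lake's raw zone, then land a file in it.
    if not client.bucket_exists("raw-zone"):
        client.make_bucket("raw-zone")

    client.fput_object("raw-zone", "sales/2024/orders.parquet", "orders.parquet")

In a lakehouse setup along the lines the course describes, files landed this way would typically be registered as Apache Iceberg tables and queried through an engine such as Dremio.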
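
And a similarly minimal sketch of the vector-database idea behind retrieval-augmented generation, using Chroma's in-memory client; the collection name, documents, and query are hypothetical stand-ins for content drawn from a lakehouse:

    import chromadb

    # In-memory Chroma instance; documents are embedded with the default
    # embedding function when added.
    client = chromadb.Client()
    collection = client.create_collection("product-docs")

    collection.add(
        ids=["doc1", "doc2"],
        documents=[
            "Our flagship plan includes 24/7 support.",
            "The starter plan is limited to five users.",
        ],
    )

    # Retrieve the closest document for a user question; in a RAG pipeline
    # this context would be passed to a large language model.
    results = collection.query(
        query_texts=["Which plan has round-the-clock support?"],
        n_results=1,
    )
    print(results["documents"])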