Glue Crawler ทำหน้าที่อะไร

Glue Crawler สแกนข้อมูลใน S3, RDS, DynamoDB หรือ JDBC Source แล้วสร้าง Table Definition ใน Glue Data Catalog อัตโนมัติ ทำให้สามารถ Query ข้อมูลด้วย Athena หรือ Redshift Spectrum ได้ทันทีโดยไม่ต้องกำหนด Schema เอง

DNS Management เกี่ยวข้องกับ ETL อย่างไร

DNS Management (Route 53) ใช้จัดการ Endpoint ของ Data Pipeline เช่น Custom Domain สำหรับ API Gateway ที่ Trigger Glue Job, VPC Endpoint สำหรับ Glue ที่อยู่ใน Private Subnet และ DNS Resolution สำหรับ JDBC Connection ไปยัง Database

AWS Glue ราคาเท่าไร

Glue คิดตาม DPU-hour (Data Processing Unit) โดย 1 DPU = 4 vCPU + 16 GB RAM ราคาประมาณ $0.44/DPU-hour Crawler คิด $0.44/DPU-hour เช่นกัน Data Catalog ฟรี 1 ล้าน Object แรกต่อเดือน

Cloud

AWS Glue ETL DNS Management

Q: AWS Glue คืออะไร

AWS Glue เป็น Fully Managed Serverless ETL (Extract, Transform, Load) Service ของ AWS ช่วยสร้าง Data Pipeline สำหรับดึงข้อมูลจากแหล่งต่างๆ แปลงรูปแบบ แล้วโหลดเข้า Data Lake หรือ Data Warehouse โดยไม่ต้องจัดการ Server

Q: ETL คืออะไร

ETL ย่อมาจาก Extract (ดึงข้อมูลจากแหล่งต้นทาง) Transform (แปลงรูปแบบ ทำความสะอาด จัดโครงสร้าง) Load (โหลดเข้าระบบปลายทาง เช่น Data Warehouse) เป็นกระบวนการพื้นฐานของ Data Engineering

AWS Glue ETL DNS Management | SiamCafe Blog

โดย อ. บอมกิตติทัศน์เจริญพนาสิทธิ์ | อัปเดต 24 ก. พ. 2026 | อ่าน 15 นาที

ETL คืออะไร — พื้นฐาน Data Pipeline
AWS Glue คืออะไร — Serverless ETL ของ AWS
สถาปัตยกรรม AWS Glue — Component ทั้งหมด
Glue Data Catalog — Metadata Store กลาง
Glue Crawler — สแกนและสร้าง Schema อัตโนมัติ
Glue Job — สร้าง ETL Pipeline ด้วย PySpark
Glue Studio — Visual ETL Designer
Glue Workflow — Orchestrate หลาย Job
Glue Triggers และ Scheduling
DNS Management กับ Data Pipeline
Route 53 — DNS Service ของ AWS
VPC Endpoint สำหรับ Glue
Monitoring และ Troubleshooting
Cost Optimization
Best Practices และสรุป

ETL คืออะไร — พื้นฐาน Data Pipeline

ETL ย่อมาจาก Extract Transform Load เป็นกระบวนการพื้นฐานของ Data Engineering ที่ใช้ย้ายข้อมูลจากแหล่งต้นทาง (Source) ไปยังระบบปลายทาง (Destination) โดยผ่านการแปลงรูปแบบระหว่างทาง

Extract (ดึง) — ดึงข้อมูลจากแหล่งต้นทางเช่น Database (MySQL, PostgreSQL), API, File (CSV, JSON, Parquet), Log, Streaming Data (Kafka, Kinesis)
Transform (แปลง) — ทำความสะอาดข้อมูลลบ Duplicate จัดโครงสร้างใหม่คำนวณ Metric รวมข้อมูลจากหลายแหล่งแปลง Format เช่น JSON เป็น Parquet
Load (โหลด) — โหลดข้อมูลที่แปลงแล้วเข้าระบบปลายทางเช่น Data Warehouse (Redshift, BigQuery), Data Lake (S3), Search Engine (Elasticsearch)

ในยุค 2026 มีแนวคิดใหม่คือ ELT (Extract Load Transform) ที่โหลดข้อมูลดิบเข้า Data Lake ก่อนแล้วค่อย Transform ทีหลังด้วย SQL Engine เช่น Athena หรือ Redshift Spectrum เหมาะกับ Schema-on-Read ที่ไม่ต้องกำหนดโครงสร้างล่วงหน้า

AWS Glue คืออะไร — Serverless ETL ของ AWS

AWS Glue เป็น Fully Managed Serverless ETL Service ที่ AWS เปิดตัวในปี 2017 ช่วยสร้าง Data Pipeline โดยไม่ต้องจัดการ Server ระบบจะ Provision Compute Resource อัตโนมัติตาม Workload จ่ายเฉพาะเวลาที่ Job รันจริง

จุดเด่นของ AWS Glue ได้แก่ Serverless ไม่ต้องจัดการ Infrastructure, Auto-scaling ปรับ Resource ตาม Data Size, Built-in PySpark/Scala Engine สำหรับ Big Data, Data Catalog เป็น Metadata Store กลาง, Crawler สแกน Schema อัตโนมัติ, Visual ETL Designer (Glue Studio) สร้าง Pipeline ด้วย Drag-and-Drop และ Integration กับ S3, RDS, Redshift, DynamoDB, Kinesis และ Service อื่นๆของ AWS

สถาปัตยกรรม AWS Glue — Component ทั้งหมด

Component	หน้าที่	ตัวอย่างการใช้งาน
Data Catalog	เก็บ Metadata ของ Database, Table, Schema	Table Definition สำหรับ S3 Data Lake
Crawler	สแกน Data Source สร้าง Table ใน Catalog	สแกน S3 Bucket หา Parquet Files
ETL Job	รัน PySpark/Python Script แปลงข้อมูล	แปลง CSV เป็น Parquet, Clean Data
Glue Studio	Visual ETL Designer แบบ Drag-and-Drop	สร้าง Pipeline ไม่ต้องเขียนโค้ด
Workflow	Orchestrate หลาย Crawler/Job	Crawler → Job → Job แบบ Sequence
Trigger	กำหนดเวลาหรือเงื่อนไขรัน Job	รันทุกวันตี 2, รันเมื่อ S3 มีไฟล์ใหม่
Connection	เก็บข้อมูลเชื่อมต่อ Database	JDBC Connection ไปยัง RDS MySQL
Dev Endpoint	สภาพแวดล้อมพัฒนาและทดสอบ	Jupyter Notebook สำหรับ Debug

Glue Data Catalog — Metadata Store กลาง

Glue Data Catalog เป็น Centralized Metadata Repository ที่เก็บข้อมูลเกี่ยวกับ Table, Database, Schema, Partition ทั้งหมดทำหน้าที่เหมือน Hive Metastore แต่เป็น Managed Service ไม่ต้องดูแลเอง Data Catalog ถูกใช้โดย Service อื่นๆของ AWS เช่น Athena (Query S3 ด้วย SQL), Redshift Spectrum (Join S3 กับ Redshift), EMR (Spark/Hive) และ Lake Formation (Data Governance)

# สร้าง Database ใน Glue Catalog ด้วย Boto3
import boto3

glue = boto3.client('glue', region_name='ap-southeast-1')

# สร้าง Database
glue.create_database(
 DatabaseInput={
 'Name': 'ecommerce_datalake',
 'Description': 'E-commerce Data Lake',
 'LocationUri': 's3://my-datalake/ecommerce/'
 }
)

# สร้าง Table
glue.create_table(
 DatabaseName='ecommerce_datalake',
 TableInput={
 'Name': 'orders',
 'StorageDescriptor': {
 'Columns': [
 {'Name': 'order_id', 'Type': 'string'},
 {'Name': 'customer_id', 'Type': 'string'},
 {'Name': 'amount', 'Type': 'double'},
 {'Name': 'status', 'Type': 'string'},
 {'Name': 'created_at', 'Type': 'timestamp'}
 ],
 'Location': 's3://my-datalake/ecommerce/orders/',
 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
 'SerdeInfo': {
 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
 }
 },
 'PartitionKeys': [
 {'Name': 'year', 'Type': 'string'},
 {'Name': 'month', 'Type': 'string'}
 ],
 'TableType': 'EXTERNAL_TABLE'
 }
)

Glue Crawler — สแกนและสร้าง Schema อัตโนมัติ

Crawler สแกนข้อมูลใน Data Source แล้วสร้างหรืออัปเดต Table Definition ใน Data Catalog อัตโนมัติไม่ต้องกำหนด Schema ด้วยมือเหมาะกับ Data Lake ที่มีข้อมูลหลายร้อย Table

# สร้าง Crawler ด้วย AWS CLI
aws glue create-crawler \
 --name orders-crawler \
 --role arn:aws:iam::123456789:role/GlueServiceRole \
 --database-name ecommerce_datalake \
 --targets '{
 "S3Targets": [
 {
 "Path": "s3://my-datalake/ecommerce/orders/",
 "Exclusions": ["**.tmp", "**.log"]
 }
 ]
 }' \
 --schedule "cron(0 2 * * ? *)" \
 --schema-change-policy '{
 "UpdateBehavior": "UPDATE_IN_DATABASE",
 "DeleteBehavior": "LOG"
 }' \
 --configuration '{
 "Version": 1.0,
 "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}
 }'

# รัน Crawler
aws glue start-crawler --name orders-crawler

# ดู Status
aws glue get-crawler --name orders-crawler --query 'Crawler.State'

Glue Job — สร้าง ETL Pipeline ด้วย PySpark

Glue Job คือหัวใจของ ETL Pipeline รัน PySpark หรือ Python Script บน Serverless Spark Cluster สำหรับ Transform ข้อมูล

# glue_etl_job.py — ตัวอย่าง Glue ETL Job
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# 1. Extract — อ่านจาก Glue Catalog
orders_df = glueContext.create_dynamic_frame.from_catalog(
 database="ecommerce_datalake",
 table_name="raw_orders"
)

# 2. Transform
# แปลงเป็น Spark DataFrame
df = orders_df.toDF()

# ลบ Duplicate
df = df.dropDuplicates(['order_id'])

# กรองเฉพาะ Status ที่ต้องการ
df = df.filter(df.status.isin(['completed', 'shipped']))

# คำนวณ Total Amount
from pyspark.sql.functions import sum, count, col, year, month
summary = df.groupBy(
 year('created_at').alias('year'),
 month('created_at').alias('month')
).agg(
 count('order_id').alias('total_orders'),
 sum('amount').alias('total_revenue')
)

# แปลงกลับเป็น DynamicFrame
output = DynamicFrame.fromDF(summary, glueContext, "output")

# 3. Load — เขียนลง S3 เป็น Parquet
glueContext.write_dynamic_frame.from_options(
 frame=output,
 connection_type="s3",
 connection_options={
 "path": "s3://my-datalake/ecommerce/monthly_summary/",
 "partitionKeys": ["year", "month"]
 },
 format="parquet"
)

job.commit()

Glue Studio — Visual ETL Designer

Glue Studio เป็น Visual Interface สำหรับสร้าง ETL Pipeline แบบ Drag-and-Drop ไม่ต้องเขียนโค้ดเหมาะกับ Data Analyst หรือ Business User ที่ไม่ถนัด PySpark

ขั้นตอนใช้งาน Glue Studio คือเปิด AWS Console → Glue Studio → Create Job → Visual with a source and target ลาก Source Node (เช่น S3, Catalog) → เพิ่ม Transform Node (Filter, Join, Aggregate, Rename) → ลาก Target Node (S3, Redshift, Catalog) เชื่อม Node ด้วยลูกศรแล้ว Configure แต่ละ Node ทดสอบด้วย Data Preview แล้วกด Run

Glue Workflow — Orchestrate หลาย Job

Glue Workflow ใช้เรียง Crawler และ Job หลายตัวเป็น Pipeline เดียว

# สร้าง Workflow
aws glue create-workflow --name daily-etl-pipeline

# เพิ่ม Trigger: เริ่มต้น (Schedule)
aws glue create-trigger \
 --name start-daily \
 --workflow-name daily-etl-pipeline \
 --type SCHEDULED \
 --schedule "cron(0 2 * * ? *)" \
 --actions '[{"CrawlerName": "orders-crawler"}]'

# เพิ่ม Trigger: หลัง Crawler เสร็จ → รัน ETL Job
aws glue create-trigger \
 --name after-crawl \
 --workflow-name daily-etl-pipeline \
 --type CONDITIONAL \
 --predicate '{"Conditions": [{"CrawlerName": "orders-crawler", "CrawlState": "SUCCEEDED"}]}' \
 --actions '[{"JobName": "etl-transform-orders"}]'

# เพิ่ม Trigger: หลัง ETL Job เสร็จ → รัน Summary Job
aws glue create-trigger \
 --name after-transform \
 --workflow-name daily-etl-pipeline \
 --type CONDITIONAL \
 --predicate '{"Conditions": [{"JobName": "etl-transform-orders", "State": "SUCCEEDED"}]}' \
 --actions '[{"JobName": "etl-monthly-summary"}]'

Glue Triggers และ Scheduling

Trigger มี 3 แบบ Scheduled รันตามเวลา (Cron Expression), Conditional รันเมื่อ Job/Crawler อื่นเสร็จ (ใช้ใน Workflow) และ On-demand รันด้วยมือหรือ API Call สำหรับ Event-driven ETL ใช้ EventBridge Rule + Lambda + Glue Start Job API เมื่อมีไฟล์ใหม่ใน S3

DNS Management กับ Data Pipeline

DNS Management เข้ามาเกี่ยวข้องกับ Data Pipeline ในหลายจุดข้อแรกคือ JDBC Connection ที่ Glue Job เชื่อมต่อ Database ผ่าน DNS Name เช่น mydb.xxxx.ap-southeast-1.rds.amazonaws.com ถ้า DNS Resolution ล้มเหลว Job จะ Fail ข้อสองคือ Custom Domain สำหรับ API ที่ Trigger ETL เช่น etl-api.example.com ชี้ไปยัง API Gateway ข้อสามคือ VPC DNS Resolution สำหรับ Glue Job ที่รันใน VPC ต้อง Resolve DNS ของ Service ภายในได้และข้อสี่คือ Cross-account Access ที่ต้อง Resolve DNS ของ Resource ใน Account อื่นผ่าน Route 53 Private Hosted Zone

Route 53 — DNS Service ของ AWS

Amazon Route 53 เป็น Managed DNS Service ของ AWS ที่ให้บริการ DNS Resolution, Domain Registration และ Health Check

# สร้าง Hosted Zone
aws route53 create-hosted-zone \
 --name example.com \
 --caller-reference $(date +%s)

# สร้าง A Record ชี้ไป API Gateway
aws route53 change-resource-record-sets \
 --hosted-zone-id Z1234567890 \
 --change-batch '{
 "Changes": [{
 "Action": "CREATE",
 "ResourceRecordSet": {
 "Name": "etl-api.example.com",
 "Type": "A",
 "AliasTarget": {
 "HostedZoneId": "Z2FDTNDATAQYW2",
 "DNSName": "d-xxxx.execute-api.ap-southeast-1.amazonaws.com",
 "EvaluateTargetHealth": true
 }
 }
 }]
 }'

# Private Hosted Zone สำหรับ VPC
aws route53 create-hosted-zone \
 --name internal.example.com \
 --caller-reference $(date +%s) \
 --vpc VPCRegion=ap-southeast-1, VPCId=vpc-12345 \
 --hosted-zone-config PrivateZone=true

VPC Endpoint สำหรับ Glue

เมื่อ Glue Job รันใน VPC (เพื่อเข้าถึง RDS, ElastiCache) จะไม่มี Internet Access ต้องสร้าง VPC Endpoint สำหรับเรียกใช้ AWS Service

# สร้าง VPC Endpoint สำหรับ S3
aws ec2 create-vpc-endpoint \
 --vpc-id vpc-12345 \
 --service-name com.amazonaws.ap-southeast-1.s3 \
 --route-table-ids rtb-12345

# สร้าง VPC Endpoint สำหรับ Glue
aws ec2 create-vpc-endpoint \
 --vpc-id vpc-12345 \
 --service-name com.amazonaws.ap-southeast-1.glue \
 --vpc-endpoint-type Interface \
 --subnet-ids subnet-abc123 \
 --security-group-ids sg-12345 \
 --private-dns-enabled

Monitoring และ Troubleshooting

CloudWatch Metrics — ดู Job Duration, DPU Usage, Bytes Read/Written
CloudWatch Logs — Glue Job Output Logs และ Error Logs
Glue Job Run Insights — ดู Spark UI สำหรับ Debug Performance
Job Bookmarks — ติดตามว่า Process ข้อมูลถึงไหนแล้วไม่ Process ซ้ำ

ปัญหาที่พบบ่อยได้แก่ Job Timeout เพราะ Data มากเกินไปแก้ด้วยเพิ่ม DPU หรือ Partition Data ให้เล็กลง, OutOfMemory Error แก้ด้วยเพิ่ม Worker Type เป็น G.2X (32 GB RAM), Connection Timeout ไป Database แก้ด้วยตรวจ Security Group และ VPC Endpoint และ Crawler ไม่เจอ Data แก้ด้วยตรวจ S3 Path และ IAM Permission

Cost Optimization

ใช้ DPU ที่เหมาะสม — เริ่มจาก 2 DPU แล้วเพิ่มตาม Job Duration อย่าตั้งสูงเกิน
ใช้ Flex Execution — ราคาถูกกว่า Standard 35% แต่อาจเริ่มช้ากว่าเหมาะกับ Non-urgent Job
Auto Scaling — Glue 4.0 รองรับ Auto Scaling Worker ลด DPU เมื่อ Workload ต่ำ
Partition Data — Partition ตาม Date ทำให้ Crawler และ Job อ่านเฉพาะ Partition ที่ต้องการ
ใช้ Job Bookmarks — Process เฉพาะข้อมูลใหม่ไม่ Process ทั้งหมดซ้ำ
Columnar Format — ใช้ Parquet/ORC แทน CSV ลด Read I/O และ Compression

Best Practices และสรุป

ใช้ Data Catalog เป็น Single Source of Truth — ทุก Service อ่าน Schema จาก Catalog เดียวกัน
Partition Data ตาม Query Pattern — เช่น Partition ตาม year/month/day
ใช้ Parquet Format — ประหยัดพื้นที่ 80% เทียบกับ CSV และ Query เร็วกว่ามาก
Version Control ETL Script — เก็บ Glue Job Script ใน Git ไม่ใช่แก้ใน Console
ใช้ Job Bookmarks — ป้องกัน Duplicate Processing
Monitor ทุก Job Run — ตั้ง CloudWatch Alarm เมื่อ Job Fail หรือ Duration เกิน Threshold
จัดการ DNS อย่างเป็นระบบ — ใช้ Route 53 Private Hosted Zone สำหรับ Internal Endpoint
ใช้ VPC Endpoint — สำหรับ Glue Job ที่รันใน VPC ลด NAT Gateway Cost

AWS Glue เป็นเครื่องมือที่ทรงพลังสำหรับสร้าง Data Pipeline แบบ Serverless เมื่อรวมกับ DNS Management ที่เหมาะสมจะได้ระบบ Data Infrastructure ที่ Scalable, Cost-effective และ Maintainable ติดตามบทความใหม่ๆได้ที่ SiamCafe.net

บ

อ. บอมกิตติทัศน์เจริญพนาสิทธิ์
IT Infrastructure Expert | Thaiware Award | ประสบการณ์กว่า 25 ปีด้าน Network, Linux, Cloud & AI — ผู้ก่อตั้ง SiamCafe.net Since 2000-2026

Q: AWS Glue คืออะไร

Fully Managed Serverless ETL Service ของ AWS สร้าง Data Pipeline ดึง/แปลง/โหลดข้อมูลโดยไม่ต้องจัดการ Server

AWS Glue ETL DNS Management

ETL คืออะไร — พื้นฐาน Data Pipeline

AWS Glue คืออะไร — Serverless ETL ของ AWS

สถาปัตยกรรม AWS Glue — Component ทั้งหมด

Glue Data Catalog — Metadata Store กลาง

Glue Crawler — สแกนและสร้าง Schema อัตโนมัติ

Glue Job — สร้าง ETL Pipeline ด้วย PySpark

Glue Studio — Visual ETL Designer

Glue Workflow — Orchestrate หลาย Job

Glue Triggers และ Scheduling

DNS Management กับ Data Pipeline

Route 53 — DNS Service ของ AWS

VPC Endpoint สำหรับ Glue

Monitoring และ Troubleshooting

Cost Optimization

Best Practices และสรุป

Q: AWS Glue คืออะไร

Q: ETL คืออะไร

Q: Glue Crawler ทำหน้าที่อะไร

Q: DNS Management เกี่ยวกับ ETL อย่างไร

Q: AWS Glue ราคาเท่าไร

AWS Glue ETL DNS Management

ETL คืออะไร — พื้นฐาน Data Pipeline

AWS Glue คืออะไร — Serverless ETL ของ AWS

สถาปัตยกรรม AWS Glue — Component ทั้งหมด

Glue Data Catalog — Metadata Store กลาง

Glue Crawler — สแกนและสร้าง Schema อัตโนมัติ

Glue Job — สร้าง ETL Pipeline ด้วย PySpark

Glue Studio — Visual ETL Designer

Glue Workflow — Orchestrate หลาย Job

Glue Triggers และ Scheduling

DNS Management กับ Data Pipeline

Route 53 — DNS Service ของ AWS

VPC Endpoint สำหรับ Glue

Monitoring และ Troubleshooting

Cost Optimization

Best Practices และสรุป

Q: AWS Glue คืออะไร

Q: ETL คืออะไร

Q: Glue Crawler ทำหน้าที่อะไร

Q: DNS Management เกี่ยวกับ ETL อย่างไร

Q: AWS Glue ราคาเท่าไร

บทความที่เกี่ยวข้อง