SiamCafe.net Blog
Technology

Vector Database Pinecone Load Testing Strategy: Performance Testing for AI Search

2026-03-15 · อ. บอม — SiamCafe.net · 1,138 words

What Is a Vector Database, and Why Load Test Pinecone?

A vector database is a database built specifically to store and search high-dimensional vectors (embeddings). Instead of exact-match queries it performs similarity search, which makes it a core component of AI/ML applications such as semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation).

Pinecone is a managed vector database that takes the operational work off your hands. Key properties: fully managed (no infrastructure to maintain), scales to billions of vectors, low latency (single-digit milliseconds), metadata filtering, namespaces for multi-tenancy, and both serverless and pod-based deployment options.

Load testing a vector database before production answers several questions: how the system handles concurrent queries, how latency behaves under different loads, the optimal batch size for upsert operations, projected cost for your usage pattern, and capacity requirements for future growth.
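Each of those questions ultimately reduces to timing operations under load and summarizing the samples. A minimal, self-contained sketch of that measurement loop (the stub workload here stands in for a real Pinecone query call):

```python
import time
import random

def measure_latencies(operation, n=200):
    """Time n calls of `operation` and return p50/p95/p99 in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Clamp the index so p99 never runs past the end of the list
    pct = lambda p: samples[min(len(samples) - 1, int(len(samples) * p))]
    return {"p50_ms": pct(0.50), "p95_ms": pct(0.95), "p99_ms": pct(0.99)}

# Stub workload: in a real test, replace with index.query(vector=..., top_k=10)
stats = measure_latencies(lambda: time.sleep(random.uniform(0.001, 0.005)))
print(stats)
```

The same pattern scales up to the full framework shown later in this post.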

Setting Up Pinecone for Load Testing

Set up a Pinecone index and seed it with test data for benchmarking:

# === Pinecone Load Testing Setup ===

# 1. Install dependencies
pip install pinecone-client numpy locust aiohttp

# 2. Create Pinecone Index
cat > setup_pinecone.py << 'PYEOF'
#!/usr/bin/env python3
"""Setup Pinecone index for load testing"""
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create serverless index
pc.create_index(
    name="load-test-index",
    dimension=1536,  # OpenAI ada-002 embedding size
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

print("Index created: load-test-index")

# Seed with test data
import numpy as np

index = pc.Index("load-test-index")

# Generate and upsert test vectors
batch_size = 100
total_vectors = 100000

for i in range(0, total_vectors, batch_size):
    vectors = []
    for j in range(batch_size):
        vec_id = f"vec-{i+j}"
        values = np.random.rand(1536).tolist()
        metadata = {
            "category": f"cat-{(i+j) % 50}",
            "source": f"doc-{(i+j) % 1000}",
            "timestamp": 1700000000 + (i+j),
        }
        vectors.append({"id": vec_id, "values": values, "metadata": metadata})
    
    index.upsert(vectors=vectors, namespace="benchmark")
    
    if (i + batch_size) % 10000 == 0:
        print(f"Upserted {i + batch_size}/{total_vectors} vectors")

stats = index.describe_index_stats()
print(f"Index stats: {stats}")
PYEOF

# 3. Configuration
cat > load_test_config.yaml << 'EOF'
pinecone:
  api_key_env: "PINECONE_API_KEY"
  index_name: "load-test-index"
  namespace: "benchmark"
  dimension: 1536

load_test:
  scenarios:
    query_only:
      users: [10, 50, 100, 200, 500]
      duration_seconds: 60
      top_k: 10
      
    upsert_only:
      users: [5, 10, 25, 50]
      batch_sizes: [10, 50, 100, 500]
      duration_seconds: 60
      
    mixed_workload:
      query_pct: 80
      upsert_pct: 20
      users: [50, 100, 200]
      duration_seconds: 120

  targets:
    p50_latency_ms: 50
    p99_latency_ms: 200
    error_rate_pct: 0.1
    throughput_qps: 500
EOF

echo "Pinecone setup complete"

Building a Load Testing Framework

A Python load-testing framework for Pinecone (the latencies below are simulated, so the script runs without a live index):

#!/usr/bin/env python3
# pinecone_loadtest.py — Pinecone Load Testing Framework
import logging
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("loadtest")

class PineconeLoadTester:
    """Load testing framework for Pinecone vector database"""
    
    def __init__(self, dimension=1536):
        self.dimension = dimension
        self.results = []
        self.errors = 0
    
    def generate_query_vector(self):
        """Generate a random query vector"""
        return [random.random() for _ in range(self.dimension)]
    
    def simulate_query(self, query_id):
        """Simulate a query operation"""
        start = time.time()
        try:
            # Simulate Pinecone query latency
            vector = self.generate_query_vector()
            # In real test: index.query(vector=vector, top_k=10, namespace="benchmark")
            latency = random.uniform(5, 50)  # Simulated ms
            time.sleep(latency / 1000)
            
            elapsed = (time.time() - start) * 1000
            return {"query_id": query_id, "latency_ms": round(elapsed, 2), "status": "ok", "matches": 10}
        except Exception as e:
            self.errors += 1
            return {"query_id": query_id, "latency_ms": 0, "status": "error", "error": str(e)}
    
    def simulate_upsert(self, batch_id, batch_size=100):
        """Simulate an upsert operation"""
        start = time.time()
        try:
            vectors = [
                {"id": f"batch-{batch_id}-{i}", "values": self.generate_query_vector()}
                for i in range(batch_size)
            ]
            # In real test: index.upsert(vectors=vectors, namespace="benchmark")
            latency = random.uniform(20, 200)  # Simulated ms
            time.sleep(latency / 1000)
            
            elapsed = (time.time() - start) * 1000
            return {"batch_id": batch_id, "latency_ms": round(elapsed, 2), "status": "ok", "vectors": batch_size}
        except Exception as e:
            self.errors += 1
            return {"batch_id": batch_id, "latency_ms": 0, "status": "error", "error": str(e)}
    
    def run_query_load_test(self, num_queries, concurrency):
        """Run query load test"""
        results = []
        start_time = time.time()
        
        with ThreadPoolExecutor(max_workers=concurrency) as executor:
            futures = {executor.submit(self.simulate_query, i): i for i in range(num_queries)}
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
        
        elapsed = time.time() - start_time
        latencies = sorted(r["latency_ms"] for r in results if r["status"] == "ok")
        errors = sum(1 for r in results if r["status"] != "ok")  # per-run, not cumulative
        
        return {
            "test_type": "query",
            "total_queries": num_queries,
            "concurrency": concurrency,
            "duration_seconds": round(elapsed, 2),
            "throughput_qps": round(num_queries / elapsed, 1),
            "latency": {
                "p50_ms": round(latencies[len(latencies)//2], 2) if latencies else 0,
                "p95_ms": round(latencies[min(len(latencies)-1, int(len(latencies)*0.95))], 2) if latencies else 0,
                "p99_ms": round(latencies[min(len(latencies)-1, int(len(latencies)*0.99))], 2) if latencies else 0,
                "avg_ms": round(sum(latencies)/len(latencies), 2) if latencies else 0,
            },
            "errors": errors,
            "error_rate": round(errors / num_queries * 100, 3),
        }

# Run load tests
tester = PineconeLoadTester(dimension=1536)

# Test 1: Query load test
print("=== Query Load Test ===")
for concurrency in [10, 50, 100]:
    result = tester.run_query_load_test(num_queries=500, concurrency=concurrency)
    print(f"\n  Concurrency={concurrency}:")
    print(f"    QPS: {result['throughput_qps']}")
    print(f"    P50: {result['latency']['p50_ms']}ms, P99: {result['latency']['p99_ms']}ms")
    print(f"    Errors: {result['error_rate']}%")

Performance Benchmarking

Benchmark Pinecone over its REST API using Locust:

# === Locust Load Test for Pinecone ===

cat > locustfile.py << 'PYEOF'
#!/usr/bin/env python3
"""Locust load test for Pinecone vector database"""
from locust import HttpUser, task, between, events
import numpy as np
import json
import time
import os

PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "test-key")
INDEX_HOST = os.environ.get("PINECONE_HOST", "load-test-index-xxxxx.svc.aped-xxxx.pinecone.io")
DIMENSION = 1536

class PineconeUser(HttpUser):
    wait_time = between(0.1, 0.5)
    host = f"https://{INDEX_HOST}"
    
    def on_start(self):
        self.headers = {
            "Api-Key": PINECONE_API_KEY,
            "Content-Type": "application/json",
        }
    
    @task(8)
    def query_vector(self):
        """Query similar vectors (weight 8, ~73% of traffic)"""
        vector = np.random.rand(DIMENSION).tolist()
        payload = {
            "vector": vector,
            "topK": 10,
            "namespace": "benchmark",
            "includeMetadata": True,
        }
        
        with self.client.post(
            "/query",
            json=payload,
            headers=self.headers,
            name="query",
            catch_response=True,
        ) as response:
            if response.status_code == 200:
                data = response.json()
                if len(data.get("matches", [])) > 0:
                    response.success()
                else:
                    response.failure("No matches returned")
            else:
                response.failure(f"Status {response.status_code}")
    
    @task(2)
    def upsert_vectors(self):
        """Upsert a batch of vectors (weight 2, ~18% of traffic)"""
        vectors = []
        for i in range(10):
            vectors.append({
                "id": f"locust-{int(time.time()*1000)}-{i}",
                "values": np.random.rand(DIMENSION).tolist(),
                "metadata": {"source": "locust", "timestamp": int(time.time())},
            })
        
        payload = {
            "vectors": vectors,
            "namespace": "benchmark",
        }
        
        with self.client.post(
            "/vectors/upsert",
            json=payload,
            headers=self.headers,
            name="upsert",
            catch_response=True,
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Status {response.status_code}")
    
    @task(1)
    def query_with_filter(self):
        """Query with a metadata filter (weight 1, ~9% of traffic)"""
        vector = np.random.rand(DIMENSION).tolist()
        payload = {
            "vector": vector,
            "topK": 5,
            "namespace": "benchmark",
            "filter": {
                "category": {"$eq": f"cat-{np.random.randint(0, 50)}"},
            },
            "includeMetadata": True,
        }
        
        self.client.post(
            "/query",
            json=payload,
            headers=self.headers,
            name="query_filtered",
        )
PYEOF

# Run Locust
# locust -f locustfile.py --headless -u 100 -r 10 --run-time 5m
# locust -f locustfile.py (web UI at localhost:8089)

echo "Locust load test configured"

Scaling and Optimization

Strategies for scaling and optimizing Pinecone:

#!/usr/bin/env python3
# optimization.py — Pinecone Optimization Strategies
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("optimize")

class PineconeOptimizer:
    def __init__(self):
        pass
    
    def optimization_strategies(self):
        return {
            "query_optimization": {
                "reduce_dimension": {
                    "description": "Reduce embedding dimension (1536 → 512 via PCA or Matryoshka embeddings)",
                    "impact": "2-3x faster queries, 60% less storage",
                    "tradeoff": "Slight recall reduction (2-5%)",
                },
                "reduce_top_k": {
                    "description": "Request only the top_k results you actually need (100 → 10)",
                    "impact": "Significant latency reduction",
                },
                "metadata_indexing": {
                    "description": "Index only the metadata fields you filter on frequently",
                    "impact": "Faster filtered queries",
                },
                "namespace_partitioning": {
                    "description": "Partition data into namespaces per use case or tenant",
                    "impact": "Smaller search space, faster queries",
                },
                "batch_queries": {
                    "description": "Combine multiple queries into a single batch request",
                    "impact": "Reduce HTTP overhead, higher throughput",
                },
            },
            "upsert_optimization": {
                "batch_size": {
                    "recommendation": "100-500 vectors per batch",
                    "too_small": "< 10 vectors = high HTTP overhead",
                    "too_large": "> 1000 vectors = timeout risk",
                },
                "parallel_upsert": {
                    "description": "Upsert multiple batches in parallel",
                    "max_concurrent": "10-20 parallel requests",
                },
                "async_upsert": {
                    "description": "Use asyncio for non-blocking upserts",
                    "impact": "3-5x throughput improvement",
                },
            },
            "infrastructure": {
                "serverless_vs_pods": {
                    "serverless": "Auto-scaling, pay per usage, good for variable loads",
                    "pods": "Dedicated resources, predictable latency, good for steady loads",
                    "recommendation": "Start serverless, switch to pods if latency critical",
                },
                "pod_sizing": {
                    "s1": "Up to 1M vectors, development",
                    "p1": "Up to 5M vectors, production (balanced)",
                    "p2": "Up to 20M vectors, high-performance (lower latency)",
                },
                "replicas": {
                    "description": "Add replicas to increase read throughput",
                    "impact": "2x replicas = 2x read throughput",
                    "cost": "Linear cost increase",
                },
            },
        }
    
    def capacity_planning(self, num_vectors, dimension, queries_per_sec, growth_rate_monthly_pct):
        """Estimate Pinecone capacity and cost"""
        storage_per_vector_kb = dimension * 4 / 1024  # float32
        total_storage_gb = num_vectors * storage_per_vector_kb / (1024 * 1024)
        
        # Serverless read-unit estimate (rough; actual read units depend on index size)
        read_units_per_query = 5  # approximate
        monthly_read_units = queries_per_sec * 86400 * 30 * read_units_per_query
        
        # Growth projection (12 months)
        projections = []
        current = num_vectors
        for month in range(12):
            current = int(current * (1 + growth_rate_monthly_pct / 100))
            projections.append({"month": month + 1, "vectors": current})
        
        return {
            "current": {
                "vectors": num_vectors,
                "storage_gb": round(total_storage_gb, 2),
                "queries_per_sec": queries_per_sec,
                "est_monthly_read_units": monthly_read_units,
            },
            "12_month_projection": projections[-1],
            "recommendation": "p1.x2" if num_vectors > 1000000 else "serverless",
        }

optimizer = PineconeOptimizer()
plan = optimizer.capacity_planning(
    num_vectors=5000000,
    dimension=1536,
    queries_per_sec=200,
    growth_rate_monthly_pct=10,
)
print(f"Capacity Plan:")
print(f"  Current: {plan['current']['vectors']:,} vectors, {plan['current']['storage_gb']} GB")
print(f"  12-month: {plan['12_month_projection']['vectors']:,} vectors")
print(f"  Recommendation: {plan['recommendation']}")

strategies = optimizer.optimization_strategies()
print(f"\nQuery Optimizations:")
for name, info in strategies["query_optimization"].items():
    print(f"  {name}: {info.get('description', info.get('recommendation', ''))}")
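The parallel-upsert strategy from the list above can be sketched with a thread pool. The upsert call here is a stub so the snippet runs standalone; in a real test you would call index.upsert(vectors=batch, namespace="benchmark") inside upsert_batch:

```python
from concurrent.futures import ThreadPoolExecutor

def upsert_batch(batch):
    # Stub: in a real test, call index.upsert(vectors=batch, namespace="benchmark")
    return len(batch)

def parallel_upsert(vectors, batch_size=100, max_workers=10):
    """Split vectors into batches and upsert them concurrently."""
    batches = [vectors[i:i + batch_size] for i in range(0, len(vectors), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(upsert_batch, batches))
    return sum(results)  # total vectors upserted

fake_vectors = [{"id": f"v-{i}", "values": [0.0]} for i in range(1050)]
print(parallel_upsert(fake_vectors))  # 1050 vectors, sent as 11 batches
```

With batch_size=100 and max_workers=10 this matches the 100-500 vectors per batch and 10-20 concurrent requests recommended above.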

Monitoring and Cost Analysis

Collect load test results and analyze cost:

#!/usr/bin/env python3
# pinecone_monitor.py — Pinecone Monitoring & Cost Analysis
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("monitor")

class PineconeMonitor:
    def __init__(self):
        pass
    
    def load_test_results(self):
        return {
            "test_summary": {
                "date": "2024-06-15",
                "index": "production-embeddings",
                "vectors": 5000000,
                "dimension": 1536,
            },
            "scenarios": [
                {
                    "name": "Light Load (50 QPS)",
                    "concurrency": 10,
                    "qps": 50,
                    "p50_ms": 8.2,
                    "p95_ms": 15.5,
                    "p99_ms": 28.3,
                    "error_rate": 0.0,
                    "status": "PASS",
                },
                {
                    "name": "Medium Load (200 QPS)",
                    "concurrency": 50,
                    "qps": 200,
                    "p50_ms": 12.5,
                    "p95_ms": 32.1,
                    "p99_ms": 58.7,
                    "error_rate": 0.02,
                    "status": "PASS",
                },
                {
                    "name": "Heavy Load (500 QPS)",
                    "concurrency": 100,
                    "qps": 485,
                    "p50_ms": 25.3,
                    "p95_ms": 85.2,
                    "p99_ms": 165.8,
                    "error_rate": 0.15,
                    "status": "PASS",
                },
                {
                    "name": "Stress Test (1000 QPS)",
                    "concurrency": 200,
                    "qps": 820,
                    "p50_ms": 52.1,
                    "p95_ms": 210.5,
                    "p99_ms": 450.2,
                    "error_rate": 2.3,
                    "status": "FAIL (p99 > 200ms)",
                },
            ],
            "cost_analysis": {
                "serverless": {
                    "monthly_read_units": 25920000,
                    "monthly_write_units": 500000,
                    "storage_gb": 30,
                    "estimated_monthly": "$85-150",
                },
                "pod_p1_x2": {
                    "pods": 2,
                    "replicas": 2,
                    "estimated_monthly": "$140",
                    "note": "Predictable latency, dedicated resources",
                },
            },
        }

monitor = PineconeMonitor()
results = monitor.load_test_results()
print(f"Load Test Results — {results['test_summary']['date']}")
print(f"Index: {results['test_summary']['vectors']:,} vectors, dim={results['test_summary']['dimension']}")

print(f"\nScenarios:")
for s in results["scenarios"]:
    print(f"  {s['name']}:")
    print(f"    QPS={s['qps']}, P50={s['p50_ms']}ms, P99={s['p99_ms']}ms, Errors={s['error_rate']}% [{s['status']}]")

cost = results["cost_analysis"]
print(f"\nCost Analysis:")
print(f"  Serverless: {cost['serverless']['estimated_monthly']}/month")
print(f"  Pod (p1.x2): {cost['pod_p1_x2']['estimated_monthly']}/month")

FAQ: Frequently Asked Questions

Q: How do Pinecone, Weaviate, and Milvus compare?

A: Pinecone is fully managed: no infrastructure to maintain, predictable performance, and a choice of pay-per-use (serverless) or fixed (pod) pricing. It suits teams without spare resources for running infra. Weaviate is open source: self-host it for free or use the managed cloud. It is schema-aware (stores objects alongside their vectors) and ships vectorization modules built in, which suits teams that want control over their infrastructure. Milvus is open source and high-performance: it scales to billions of vectors and supports multiple index types (IVF, HNSW, DiskANN), which suits large-scale production but requires more tuning. In short: pick Pinecone for a managed service without an ops team, Weaviate for self-hosting plus a schema, and Milvus for maximum performance and scale.

Q: How does embedding dimension affect performance?

A: Higher dimension means higher latency, more storage, and higher cost, in exchange for (usually) better accuracy/recall. Common sizes: OpenAI text-embedding-3-small is 1536 dims by default but supports 512 (Matryoshka); text-embedding-3-large is 3072 dims by default with 1024/256 options; Cohere embed-v3 is 1024 dims. In practice 512-1024 dims often deliver recall close to 1536+, so benchmark recall@10 at different dimensions and pick the smallest one with acceptable recall; the performance and cost savings are substantial. Matryoshka embeddings (OpenAI v3) are designed to be truncated to fewer dimensions with minimal recall loss.
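The storage side of that tradeoff is easy to quantify: each float32 component costs 4 bytes, so halving the dimension halves raw vector storage. A quick sketch (raw vectors only, excluding metadata and index overhead):

```python
def storage_gb(num_vectors, dimension, bytes_per_value=4):
    """Approximate raw float32 vector storage in GB (no metadata/index overhead)."""
    return num_vectors * dimension * bytes_per_value / (1024 ** 3)

# Compare dimensions for a 5M-vector index
for dim in (3072, 1536, 512):
    print(f"dim={dim}: {storage_gb(5_000_000, dim):.1f} GB")
```

At 5 million vectors, dropping from 1536 to 512 dims cuts raw storage from roughly 28.6 GB to about 9.5 GB.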

Q: What should a load test actually test?

A: Cover at least 4 scenarios. (1) Query throughput: the QPS at which P99 latency stays acceptable (e.g. under 200ms). (2) Upsert throughput: vectors per second, and whether heavy upserting degrades query latency. (3) Mixed workload: an 80/20 query/upsert split to mimic a production traffic pattern. (4) Spike test: a 5-10x traffic spike to confirm the system copes (e.g. a flash sale). Metrics to track: latency (P50, P95, P99), throughput (QPS), error rate, and recall@K (accuracy can drop under load). Tools: Locust (Python, flexible), k6 (Go, fast), Artillery (Node.js, YAML config).
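Recall@K, mentioned above, is simply the overlap between the IDs a query returns under load and the ground-truth IDs from an exhaustive (brute-force) search over the same data. A minimal sketch:

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors that the system actually returned."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    hits = sum(1 for vid in retrieved_ids[:k] if vid in truth)
    return hits / min(k, len(truth))

truth = ["a", "b", "c", "d", "e"]        # from brute-force search
returned = ["a", "c", "x", "e", "y"]     # from Pinecone under load
print(recall_at_k(returned, truth, k=5))  # 0.6 — found 3 of the 5 true neighbors
```

Computing this at each load level during the test reveals whether accuracy degrades as QPS climbs.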

Q: Serverless or pod-based: which should I choose?

A: Choose serverless when traffic is variable (spiky), you are still prototyping, you want auto-scaling without management overhead, or you cannot yet predict capacity; the tradeoff is possible cold-start latency on the first query after an idle period. Choose pod-based when traffic is steady and predictable, you need consistent latency with no cold starts, throughput is consistently high, or you must guarantee a performance SLA; the tradeoff is that you size the pods yourself and pay for them even when idle. Most teams should start with serverless, then migrate to pods once traffic stabilizes and latency requirements tighten.
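That rule of thumb can be written down as a toy decision helper. This is a hypothetical heuristic encoding the advice above, not an official Pinecone guideline; the threshold of 100 sustained QPS is an illustrative assumption:

```python
def suggest_deployment(traffic_variability, needs_strict_latency_sla, steady_qps):
    """Toy heuristic for choosing serverless vs pods.
    traffic_variability: 'spiky' or 'steady'; steady_qps: sustained queries/sec.
    Encodes this article's rule of thumb; the 100 QPS cutoff is illustrative."""
    if needs_strict_latency_sla:
        return "pods"  # consistent latency, no cold starts
    if traffic_variability == "steady" and steady_qps >= 100:
        return "pods"  # high, predictable load justifies dedicated resources
    return "serverless"  # variable or low traffic: pay per use, auto-scale

print(suggest_deployment("spiky", False, 20))   # serverless
print(suggest_deployment("steady", True, 300))  # pods
```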
