How to Improve API Performance in Spring Boot Applications
Overview
This article is a thorough, practical, and technical guide to improving API performance in Spring Boot applications. It covers the performance problem space, theoretical foundations, measurement and profiling, concrete optimizations at every layer (network, serialization, application, persistence, JVM and container), relevant Spring-specific features and code examples, testing and monitoring, common pitfalls, and future trends.
Use this as both a reference and a checklist for diagnosing and improving real-world API performance.
Contents
- Background & history
- Key performance concepts
- Measurement and benchmarking (how to find bottlenecks)
- Request/response path: where to optimize
- Network & HTTP-level optimizations
- Serialization & payload size
- Concurrency, thread pools & async models
- Reactive/non-blocking architectures (WebFlux, R2DBC)
- Database and persistence optimizations (JPA/Hibernate)
- Caching strategies (in-memory, distributed)
- JVM and GC tuning for containers
- Container & deployment considerations
- Observability, profiling & performance testing
- Resilience & scaling patterns
- Practical checklists and example configs
- Future trends
- Conclusion
Background & history
Spring Boot simplified Spring application development by bundling dependencies, auto-configuration, embedded servers, and sensible defaults. As microservices and cloud-native deployments became common, API performance became a central concern, measured in latency, throughput, tail latency, memory footprint, and cold-start time.
Historically, bottlenecks were often obvious (slow DB queries, insufficient threads). As systems became distributed and high-scale, performance tuning required thinking holistically: non-blocking I/O, reactive programming, connection pool tuning, serialization formats, JVM behavior inside containers, service mesh latency, observability and automated optimization.
Key performance concepts
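- Latency vs. throughput vs. tail latency. Reducing median latency is useful, but reducing 95th/99th-percentile latency often matters more for user experience, because a page or workflow that issues several API calls is likely to hit at least one slow one.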
- Blocking vs Non-blocking I/O. Thread-per-request (blocking) fits many workloads; non-blocking (reactive) reduces thread overhead for highly concurrent I/O-bound services.
- CPU-bound vs. I/O-bound. The tuning strategies differ: CPU-bound workloads benefit from parallelism only up to roughly the number of available cores, while I/O-bound workloads benefit from asynchronous I/O or from more threads than cores.
- Backpressure and resource exhaustion. Always consider limits (thread pools, DB connections) and protect them with bounded queues/bulkheads to avoid cascading failures.
- Measurement first. Optimize based on measurement and profiling, not on guesswork.
Measurement and benchmarking
Before optimizing, measure. You need meaningful, repeatable metrics and a controlled test harness.
Tools:
- Load generators: k6, Gatling, JMeter, wrk, hey.
- APM/Tracing: Elastic APM, New Relic, Datadog, Jaeger, Zipkin.
- Metrics: Micrometer -> Prometheus + Grafana.
- Profilers: Java Flight Recorder (JFR), VisualVM, async-profiler, YourKit. Honorable mention: BPF-based tools for syscalls and I/O.
- OS tools: iostat, sar, vmstat, netstat, ss, tcpdump.
Important metrics:
- Throughput (requests/sec)
- Latencies: p50/p95/p99, server-side histograms
- CPU utilization, GC pauses, GC time
- Heap usage, allocation rate
- Thread counts and thread states
- Database connection pool usage and wait time
- Network queues, socket states
- Error rates & timeouts
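To make the p50/p95/p99 metrics above concrete, here is a minimal sketch of a nearest-rank percentile calculation over recorded latencies. The class name LatencyPercentiles and the sample values are illustrative assumptions; in a real service Micrometer timers and histograms would normally do this for you.

```java
import java.util.Arrays;

/** Illustrative helper (not a Micrometer API): nearest-rank percentiles over latency samples. */
final class LatencyPercentiles {

    /** Returns the p-th percentile (0 < p <= 100) using the nearest-rank method. */
    static long percentile(long[] latenciesMillis, double p) {
        if (latenciesMillis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Nine fast requests and one slow outlier (hypothetical data).
        long[] samples = {12, 15, 11, 14, 13, 12, 16, 15, 13, 480};
        System.out.println("p50=" + percentile(samples, 50)); // p50=13
        System.out.println("p99=" + percentile(samples, 99)); // p99=480
    }
}
```

The median hides the outlier entirely, which is why dashboards should track the high percentiles as well.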
A profiling workflow:
- Establish a baseline with controlled load.
- Collect metrics and traces during tests.
- Use flame graphs / CPU sampling to find hotspots.
- Validate suspected fixes with A/B tests.
Request/response path — where to optimize
Typical request lifecycle:
- TCP handshake / TLS negotiation (client → load balancer → service)
- Web server receives HTTP request (Tomcat/Jetty/Netty)
- Request parsed and mapped to controller
- Controller logic, service layer, DB access, external calls
- Serialization to JSON/Protobuf and response writing
- OS network stack sends response to client
Possible optimization points:
- Network & TLS setup (Keep-Alive, HTTP/2)
- Connection handling (Tomcat / Netty config)
- Request parsing, filters, security, and interceptors
- Controller and service CPU cost or blocking waits
- DB queries and remote I/O latency
- Serialization/deserialization costs and payload size
- Thread scheduling and GC pauses
- Downstream services and caches
Network and HTTP-level optimizations
- Keep-Alive and connection reuse: enable and tune keep-alive timeouts so clients reuse sockets.
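- HTTP/2: multiplexes requests over a single connection, which reduces connection churn and removes HTTP-level head-of-line blocking (TCP-level head-of-line blocking remains); enable it at the load balancer and the server (negotiated via ALPN).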
- Compression: enable gzip/deflate (for JSON payloads) but balance CPU cost vs bandwidth. Enable selective compression for text responses.
- TLS session resumption & TLS offload: minimize TLS handshake cost; consider LB/TLS offload.
- Content negotiation and caching headers: use Cache-Control, ETag, Last-Modified for cacheable responses.
- Use a CDN for static assets and caching reverse proxies (e.g., Varnish, Nginx) for cacheable API responses where appropriate.
- Limit request and response payload sizes; validate incoming payload early (e.g., via request size limit).
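The ETag-based caching mentioned above can be sketched in plain Java as follows. This is a simplified illustration, not Spring's implementation: EtagSupport and its methods are hypothetical names, and the comparison assumes If-None-Match carries a single tag (real headers may carry a list or "*"). In a Spring MVC application, the framework's ShallowEtagHeaderFilter covers a similar use case.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper showing the idea behind ETag validation. */
final class EtagSupport {

    /** Builds a strong ETag (quoted hex SHA-256) from the response body. */
    static String etagFor(String body) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    /** Returns 304 when the client's If-None-Match value equals the current ETag, else 200. */
    static int statusFor(String ifNoneMatchHeader, String currentEtag) {
        return currentEtag.equals(ifNoneMatchHeader) ? 304 : 200;
    }
}
```

A 304 response skips serialization and transfer of the body, although the server in this sketch still has to produce the body to hash it.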
Spring Boot server-level configuration examples
Enable response compression (application.yml):
```yaml
server:
  compression:
    enabled: true
    mime-types: application/json,text/html,text/xml,text/plain,application/javascript
    min-response-size: 1024
```
Adjust embedded Tomcat connector (application.yml):
```yaml
server:
  tomcat:
    threads:
      max: 200
      min-spare: 10
    accept-count: 100
    max-connections: 10000
    connection-timeout: 20000
```
Serialization and payload size
- Reduce payload size:
  - Use projections and DTOs: return only necessary fields.
  - Paginate and limit results to avoid sending huge lists.
  - Compress text-based payloads (gzip).
  - Use compact serialization formats for internal/microservice communication (Protobuf, Avro, MessagePack).
- Optimize JSON serialization:
  - Jackson tuning: avoid default typing, and use the Afterburner module (bytecode generation) for faster POJO serialization; on Java 11+ its successor, the Blackbird module, is the usual choice. Feature flags such as FAIL_ON_UNKNOWN_PROPERTIES change behavior rather than speed, so set them for correctness, not performance.
  - Use immutable/primitive-friendly DTOs and avoid deep object graphs.
  - Consider alternative libraries (Gson, or compile-time generated serializers such as DSL-JSON) only if profiling shows serialization is a hotspot.
- Reuse ObjectMappers: configure a single, reusable ObjectMapper bean; avoid creating new mappers per request.
- For large streaming data, use streaming APIs or reactive streaming to avoid full buffering.
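The effect of trimming fields and compressing text payloads can be measured with a small JDK-only sketch. PayloadSize, itemsJson, and the field names are assumptions made up for this illustration; the JSON is built by hand only to keep the example free of dependencies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper comparing payload sizes before and after trimming and gzip. */
final class PayloadSize {

    /** Compresses the given bytes with gzip and returns the compressed bytes. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // gz is closed (and flushed) at this point
    }

    /** Builds a JSON array of n items; verbose=true adds fields a list view may not need. */
    static String itemsJson(int n, boolean verbose) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i).append(",\"name\":\"item-").append(i).append('"');
            if (verbose) {
                sb.append(",\"description\":\"long text the list view never shows\",\"internalNotes\":\"n/a\"");
            }
            sb.append('}');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        byte[] full = itemsJson(200, true).getBytes(StandardCharsets.UTF_8);
        byte[] trimmed = itemsJson(200, false).getBytes(StandardCharsets.UTF_8);
        System.out.println("full=" + full.length + " trimmed=" + trimmed.length
                + " full+gzip=" + gzip(full).length + " trimmed+gzip=" + gzip(trimmed).length);
    }
}
```

Running it against your own representative payloads gives a quick estimate of whether trimming, compression, or both are worth the CPU cost.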
Jackson Afterburner example (configuration):
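```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.module.afterburner.AfterburnerModule;
import org.springframework.context.annotation.Bean;

// Declare inside a @Configuration class; requires the jackson-module-afterburner dependency.
@Bean
public ObjectMapper objectMapper() {
    ObjectMapper mapper = new ObjectMapper();
    // Afterburner generates bytecode to replace reflective property access.
    mapper.registerModule(new AfterburnerModule());
    // Serialize beans without properties as {} instead of throwing.
    mapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
    return mapper;
}
```
Concurrency, thread pools & async models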
- Blocking (Servlet/WebMVC) model: uses a thread per request. Tune server thread pools (Tomcat/Jetty) and application thread pools for blocking tasks (DB calls, file I/O).
- Asynchronous processing: use @Async, CompletableFuture, or Spring’s DeferredResult/Callable to free request threads for other work. Use bounded thread pools with sensible queue sizes.
- Reactive model (non-blocking): use WebFlux (Reactor Netty) and reactive libraries for highly concurrent, I/O-bound workloads—only if all layers can be non-blocking (DB, HTTP clients).
- Thread sizing rules: for blocking workloads, threads ~= cores * (1 + W/C), where W = average waiting time, C = average compute time. Too many threads cause context switching and memory pressure.
- Never block Netty event loop threads; if you must block, offload to a bounded scheduler.
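The thread sizing rule above, threads ~= cores * (1 + W/C), can be worked through with a small helper. ThreadPoolSizing and the sample timings are illustrative assumptions; treat the result as a starting point for load testing rather than a final setting.

```java
/** Hypothetical helper applying the sizing rule threads ~= cores * (1 + W/C). */
final class ThreadPoolSizing {

    /** Returns the suggested thread count, rounded up and never below 1. */
    static int suggestedThreads(int cores, double waitMillis, double computeMillis) {
        if (cores < 1 || waitMillis < 0 || computeMillis <= 0) {
            throw new IllegalArgumentException("cores >= 1, wait >= 0, compute > 0 required");
        }
        return Math.max(1, (int) Math.ceil(cores * (1 + waitMillis / computeMillis)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Example: each request waits ~90 ms on the database and computes for ~10 ms.
        System.out.println("suggested threads: " + suggestedThreads(cores, 90, 10));
    }
}
```

With 4 cores, 90 ms of waiting and 10 ms of compute per request, the rule gives 4 * (1 + 9) = 40 threads; a purely CPU-bound task (W = 0) gets one thread per core.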
Example: configure a bounded Executor for @Async
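```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@EnableAsync
@Configuration
public class AsyncConfig {
    @Bean(name = "taskExecutor")
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor ex = new ThreadPoolTaskExecutor();
        ex.setCorePoolSize(20);
        ex.setMaxPoolSize(50);    // extra threads start only once the queue is full
        ex.setQueueCapacity(500); // bounded queue protects memory under overload
        ex.setThreadNamePrefix("api-async-");
        ex.initialize();
        return ex;
    }
}
```
Reactive/non-blocking architectures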
- Spring WebFlux (Project Reactor) provides non-blocking I/O and backpressure-aware flows. Use when you have many concurrent I/O-bound requests or long-lived connections (SSE, WebSocket).
- To benefit from reactive stacks end-to-end, use non-blocking drivers: R2DBC for relational databases, reactive MongoDB, reactive HTTP clients (WebClient).
- Benefits: fewer threads, lower memory footprint, higher concurrency for I/O-bound workloads.
- Costs: complexity in programming model, maturity and feature gaps in reactive drivers, debugging difficulty, and sometimes non-trivial migration from blocking code.
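Backpressure is easier to reason about with a small example. The sketch below uses the JDK's java.util.concurrent.Flow API and SubmissionPublisher rather than Project Reactor, purely so that it runs without extra dependencies; Reactor follows the same Reactive Streams contract of subscribers signalling demand. BackpressureDemo and OneAtATimeSubscriber are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

/** JDK-only illustration of demand-driven (backpressure-aware) delivery. */
final class BackpressureDemo {

    /** Subscriber that requests one item at a time, so the publisher cannot outrun it. */
    static final class OneAtATimeSubscriber implements Flow.Subscriber<Integer> {
        final List<Integer> received = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1); // signal demand for exactly one item
        }

        @Override public void onNext(Integer item) {
            received.add(item);
            subscription.request(1); // ask for the next item only after handling this one
        }

        @Override public void onError(Throwable t) { done.countDown(); }

        @Override public void onComplete() { done.countDown(); }
    }

    /** Publishes 1..count and returns the items in the order the subscriber received them. */
    static List<Integer> run(int count) throws InterruptedException {
        OneAtATimeSubscriber subscriber = new OneAtATimeSubscriber();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) {
                publisher.submit(i); // blocks while the subscriber's buffer is full
            }
        } // close() signals onComplete once buffered items have been delivered
        if (!subscriber.done.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("subscriber did not complete in time");
        }
        return subscriber.received;
    }
}
```

The producer slows down to the consumer's pace instead of piling items into an unbounded queue, which is the property that keeps memory stable under load.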
Simple WebFlux controller:
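```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

// ReactiveRepository, Item, and ItemDto are application-specific types (not shown).
@RestController
public class ReactiveController {
    private final ReactiveRepository repo; // Reactor-based repository

    public ReactiveController(ReactiveRepository repo) { this.repo = repo; }

    @GetMapping("/items")
    public Flux<ItemDto> getItems() {
        return repo.findAll()
                   .map(this::toDto); // toDto(Item) is a mapping method you supply
    }
}
```
Database & persistence optimizations (JPA/Hibernate)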
Databases are often the largest source of latency. Key techniques:
- Minimize round trips: use joins, fetch joins, and proper queries instead of lazy loading that triggers N+1 queries.
- Use projections and DTO queries to fetch only needed columns instead of full entities.
- Use pagination to limit result sizes.
- Use database indexes and query plan analysis (EXPLAIN ANALYZE) to find slow queries.
- Connection pool tuning: use HikariCP (default in Spring Boot), set maximum pool size based on workload and DB capacity.
- Use prepared statements, batching, and bulk operations for many writes.
- Hibernate second-level cache or query cache for read-heavy scenarios (use cautiously, watch cache invalidation).
- Tune JDBC batch settings and hibernate.jdbc.batch_size for bulk inserts/updates.
- Monitor DB metrics: slow query log, connection pool wait times, lock timeouts.
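To see why N+1 access patterns hurt, the following toy sketch counts round trips against an in-memory stand-in for a database. FakeDb and its methods are invented for illustration and do not correspond to any JPA or JDBC API; the point is only the difference in query counts.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration of N+1 queries versus a single batched fetch. */
final class RoundTripDemo {

    /** In-memory stand-in for a database that counts the queries it receives. */
    static final class FakeDb {
        int queries = 0;
        private final Map<Long, List<String>> childrenByParent = new TreeMap<>();

        FakeDb(int parents) {
            for (long id = 1; id <= parents; id++) {
                childrenByParent.put(id, List.of("child-" + id + "-a", "child-" + id + "-b"));
            }
        }

        List<Long> findParentIds() {
            queries++;
            return new ArrayList<>(childrenByParent.keySet());
        }

        List<String> findChildren(long parentId) {
            queries++;
            return childrenByParent.get(parentId);
        }

        /** One query for many parents, like a join or an IN (...) clause. */
        Map<Long, List<String>> findChildrenIn(Collection<Long> parentIds) {
            queries++;
            Map<Long, List<String>> result = new HashMap<>();
            for (Long id : parentIds) {
                result.put(id, childrenByParent.get(id));
            }
            return result;
        }
    }

    /** Lazy loading: one query for the parents plus one per parent for its children. */
    static int queriesWithLazyLoading(int parents) {
        FakeDb db = new FakeDb(parents);
        for (long id : db.findParentIds()) {
            db.findChildren(id);
        }
        return db.queries;
    }

    /** Batched fetch: one query for the parents plus one for all children. */
    static int queriesWithBatchFetch(int parents) {
        FakeDb db = new FakeDb(parents);
        db.findChildrenIn(db.findParentIds());
        return db.queries;
    }
}
```

For 100 parents the lazy version issues 101 queries and the batched version issues 2; with a network round trip per query, that difference usually dominates response time.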
Avoid N+1 example (JPA fetch join):
```java
@Query("select p from Parent p join fetch p.children where p.id = :id")
Optional<Parent> findWithChildren(@Param("id") Long id);
```
HikariCP basic config (application.yml):
```yaml
spring:
  datasource:
    url: jdbc:postgresql://db:5432/mydb
    username: user
    password: pass
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 30000
      idle-timeout: 600000
      max-lifetime: 1800000
```
Caching strategies
- In-process caches (Caffeine): extremely fast for per-instance cache with eviction and TTL.
- Distributed caches (Redis, Memcached): for cross-instance caching, session storage, rate-limiting, and shared caches.
- Spring Cache abstraction: use annotations @Cacheable/@CachePut/@CacheEvict and plug in implementations (Caffeine, Redis).
- HTTP-level caching: use ETag/Last-Modified and Cache-Control to allow downstream caches and browsers to avoid unnecessary requests.
- Query caching: Hibernate 2nd-level cache can help read-mostly tables, but introduces cache coherence complexity.
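The core idea behind an in-process cache with a TTL can be sketched with JDK classes alone. TtlCache below is a hypothetical, deliberately minimal stand-in: unlike Caffeine it has no size bound or eviction policy, and the clock is injected only so that expiry can be exercised without sleeping.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Minimal TTL cache sketch; use Caffeine or a similar library in production. */
final class TtlCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttl;
    private final LongSupplier clock; // e.g. System::nanoTime, with ttl in the same unit

    TtlCache(long ttl, LongSupplier clock) {
        this.ttl = ttl;
        this.clock = clock;
    }

    /** Returns the cached value, invoking the loader only on a miss or after expiry. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, existing) ->
                existing != null && existing.expiresAt - now > 0
                        ? existing                                  // still fresh: keep it
                        : new Entry<>(loader.apply(k), now + ttl)); // miss or expired: reload
        return entry.value;
    }
}
```

Because compute runs the loader while holding the map's lock for that key, concurrent callers asking for the same key trigger a single load instead of a stampede; the loader must not call back into the same cache.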
Caffeine configuration example:
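```java
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import org.springframework.cache.CacheManager;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;

// Declare inside a @Configuration class annotated with @EnableCaching.
@Bean
public CacheManager cacheManager() {
    CaffeineCacheManager cm = new CaffeineCacheManager("items");
    cm.setCaffeine(Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(10)) // TTL per entry
            .maximumSize(10_000));                    // bound memory use
    return cm;
}
```
JVM and GC tuning for containers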
- Choose a GC suited to your latency goals: G1 (the default collector since JDK 9) is a good general-purpose choice; ZGC and Shenandoah target very low pause times on large heaps.
- Do not rely on default memory assumptions in containers: container awareness (-XX:+UseContainerSupport) is enabled by default since JDK 10 and was backported to 8u191, so the main decision is setting -XX:MaxRAMPercentage to size the heap relative to the container memory limit.
- Control metaspace and direct memory if using native libraries or high thread counts.
- Typical flags to start with:
  - -Xms and -Xmx set to the same value to avoid heap resizing overhead.
  - -XX:+UseG1GC -XX:MaxGCPauseMillis=200 for many services.
  - -XX:MaxRAMPercentage=75.0 for containerized apps (adjust). An explicit -Xmx takes precedence over it, so pick one sizing approach.
- Monitor GC logs (use -Xlog:gc* for JDK9+ or -XX:+PrintGCDetails for older) and tune accordingly.
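A quick way to confirm that the JVM actually sees the container limits and flags you intended is to print what the runtime reports. The sketch below uses only standard JDK management APIs; the class name JvmLimitsReport is an assumption for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Prints the CPU count, max heap, and active collectors as seen by this JVM. */
final class JvmLimitsReport {

    static List<String> describe() {
        Runtime rt = Runtime.getRuntime();
        List<String> lines = new ArrayList<>();
        lines.add("availableProcessors=" + rt.availableProcessors());
        lines.add("maxHeapMiB=" + rt.maxMemory() / (1024 * 1024));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add("collector=" + gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Running it inside the container with the production flags shows whether the heap size matches the memory limit and which collector is in use.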
Container & deployment considerations
- Right-size CPU and memory in containers; a CPU limit set too low causes throttling, and oversubscribed hosts show up as CPU steal time.
- Set resource limits to avoid noisy neighbor effects and to enable proper scheduling.
- Use horizontal scaling (replicas) rather than pushing a single JVM to extreme thread counts.
- Use layered jars to enable efficient image caching in Docker with Spring Boot.
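- Use JVM startup optimizations or consider GraalVM native images (Spring Native, since superseded by the built-in AOT and native support in Spring Boot 3) to reduce startup time and memory footprint for serverless or short-lived functions. Trade-offs: native builds cut startup latency and memory but have longer build times, restrictions around reflection and dynamic class loading, and may deliver lower peak throughput than a warmed-up JIT.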
- NUMA and CPU pinning may matter at very high throughput.
Observability, profiling & performance testing
- Instrument with Micrometer (Spring Boot integrates) and export to Prometheus/Grafana.
- Capture metrics: request latencies, error rates, GC pause times, allocations, DB wait times, thread pool usage.
- Use distributed tracing (OpenTelemetry, Zipkin, Jaeger) to find hotspots across services.
- Load-test realistic user patterns: ramp up/down, bursts, different payloads, caching warm-up, and long-running tests to see memory growth.
- Use flame graphs and CPU sampling for hotspots; allocation profiling for high object churn.
Resilience & scaling patterns
- Circuit breakers (Resilience4j) to fail fast on slow downstream services.
- Bulkheads (bounded thread pools, semaphores) to isolate failures and preserve throughput in other parts.
- Timeouts on all downstream calls and DB operations.
- Rate limiting at the gateway or service level to protect resources.
- Backpressure-aware endpoints (reactive) or client-side rate limiting.
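A semaphore-based bulkhead, one of the patterns listed above, fits in a few lines of plain Java. This is a minimal sketch under stated assumptions, not the Resilience4j implementation: the Bulkhead class and its execute method are hypothetical names, and rejected calls go straight to a caller-supplied fallback.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Minimal bulkhead: caps concurrent calls to a dependency and fails fast when full. */
final class Bulkhead {

    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback without queueing. */
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // reject immediately instead of tying up a thread
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving each downstream dependency its own bulkhead means a slow service can exhaust only its own permits rather than every request thread in the application.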
Practical checklist (quick start)
- Measure baseline: metrics, traces, and synthetic load.
- Tune connection pools: DB (Hikari), HTTP client pools (Apache HttpClient or WebClient).
- Optimize slow database queries: EXPLAIN, indexes, projections, fetch joins.
- Add caching for expensive reads (Caffeine/Redis).
- Reduce payloads: remove fields, paginate, and compress responses.
- Reuse expensive objects (ObjectMapper) and enable Jackson Afterburner if appropriate.
- Limit and tune server threads; add proper timeouts so threads don’t hang.
- Implement timeouts and retries with backoff for external calls (WebClient with Reactor Retry, Resilience4j).
- Profile CPU and memory; tune JVM/G1GC or evaluate low-pause GC if necessary.
- Add tracing and dashboards to track p50/p95/p99 latency over time.
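The "retries with backoff" item can be illustrated with a small, dependency-free sketch. Retry, backoffMillis, and withRetry are hypothetical names; production code would typically rely on Resilience4j or Reactor's retry operators and add jitter, which is omitted here for brevity.

```java
import java.util.concurrent.Callable;

/** Minimal retry helper with capped exponential backoff (no jitter). */
final class Retry {

    /** Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long delay = baseMillis << Math.min(attempt - 1, 20); // cap the shift to limit growth
        return Math.min(delay, maxMillis);
    }

    /** Calls the task up to maxAttempts times, sleeping between failed attempts. */
    static <T> T withRetry(Callable<T> call, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // all attempts failed: surface the final error
    }
}
```

Retries should only wrap idempotent calls and should sit inside the overall request timeout, otherwise they amplify load on a dependency that is already struggling.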
Concrete examples & code snippets
- WebClient with connection pooling and timeouts
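```java
import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.time.Duration;
import org.springframework.context.annotation.Bean;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

@Bean
public WebClient webClient() {
    // HttpClient.create() uses Reactor Netty's default pooled connection provider.
    HttpClient httpClient = HttpClient.create()
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000)
        .responseTimeout(Duration.ofSeconds(5))
        .doOnConnected(conn ->
            conn.addHandlerLast(new ReadTimeoutHandler(5))    // seconds
                .addHandlerLast(new WriteTimeoutHandler(5))); // seconds

    return WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .build();
}
```
- Database batching with Spring Data JPA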
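```properties
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```
Also ensure your entities use an ID generation strategy compatible with batching: Hibernate disables JDBC insert batching for GenerationType.IDENTITY, so prefer sequence-based generation where the database supports it.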
- Avoid N+1 with DTO projection:
```java
@Query("select new com.example.dto.UserDto(u.id, u.name, a.city) from User u join u.address a where u.id = :id")
UserDto findUserDto(@Param("id") Long id);
```
- Enable Micrometer metrics (application.yml):
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics,logfile
  # Spring Boot 2.x property; Boot 3.x uses management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true
```
Performance testing examples
- Use k6 script to simulate realistic traffic (ramping users, payload sizes).
- Warm caches before measuring read-heavy scenarios.
- Run tests for long enough to capture GC cycles and slow-path behavior.
Common pitfalls and anti-patterns
- Optimizing the wrong thing (not measuring).
- Large unbounded thread pools and unbounded queues leading to OOM and high GC.
- Blocking within reactive stacks (blocks Netty event loops).
- Caching without eviction strategy or staleness handling.
- Using huge entity graphs (JPA) and returning entities directly from controllers — prefer DTOs.
- Creating a new ObjectMapper per request, or otherwise allocating many short-lived objects on hot paths, which increases allocation pressure and GC work.
Future trends & implications
- Reactive and non-blocking I/O adoption will grow, but only where the whole stack supports non-blocking paradigms.
- Native images (GraalVM) and AOT-compiled Spring (Spring Native / Spring AOT) reduce startup times and memory footprints, useful for serverless and microservices with many small instances.
- WebTransport/HTTP/3 might influence latency patterns in future network stacks.
- Observability and automated tuning (AI-driven optimizers) will increasingly propose configuration changes based on real-time telemetry.
- Hardware acceleration (e.g., specialized TLS offload, eBPF-based observability) will be leveraged more in high-scale deployments.
Conclusion
Improving API performance in Spring Boot is a multi-layer task: measure first, then apply targeted optimizations at the network, application, serialization, database, caching, JVM, and deployment layers. Use Spring Boot features (Hikari, WebClient, caching, WebFlux) wisely, match the concurrency model to your workload, and always validate changes under realistic load.
A pragmatic approach:
- Start with instrumentation and a baseline.
- Fix the highest-impact bottlenecks (DB queries, caches).
- Tune resource constraints (connection pools, thread pools).
- Optimize serialization and network settings.
- Consider reactive or native images only when they fit the full stack and requirements.
Further reading and tools
- Spring Boot reference docs (Server, WebFlux, Caching)
- Micrometer and Prometheus docs
- HikariCP tuning documentation
- Project Reactor and WebFlux documentation
- JVM tuning guides (G1GC, ZGC)
- async-profiler and Flame Graphs tutorials
- k6, Gatling, wrk benchmarking guides