Many real-world applications, ranging from news recommendation and online advertising to personalized healthcare, are naturally modeled by the contextual-bandit protocol, in which a learner repeatedly observes a context, takes an action, and accrues a reward. A fundamental question in such settings is: given a new version of the system (i.e., a new policy), what is its expected reward? Online A/B testing offers a generic way to answer this question through controlled randomized trials. However, online experimentation is slow, can only be run for a small number of new policies, has high engineering cost, and can substantially harm users when the new policy is of low quality. Overcoming these shortcomings motivates offline A/B testing, also known as off-policy evaluation (OPE), which does not require a new online experiment for every policy we want to evaluate, but instead reuses past data we already have. At the core of this methodology lies the design of counterfactual estimators that accurately evaluate the performance of a new policy using only logged data of past behavior.

In this talk, I will present my recent work on off-policy evaluation, including the discovery of a general family of counterfactual estimators and a new optimization-based framework for designing estimators that obtains a better bias-variance tradeoff than the doubly robust estimator in finite samples. Beyond off-policy evaluation, I will also briefly introduce the estimator-selection problem in OPE. Finally, I will survey some of my recent work in off-policy learning: how do we use logged data to safely learn the best policy to deploy in the future?
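To make the idea of a counterfactual estimator concrete, here is a minimal sketch of the classic inverse propensity scoring (IPS) estimator on synthetic logged bandit data. This is an illustration of standard OPE machinery, not the specific estimators discussed in the talk; all data, policies, and variable names below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged data (contexts omitted for simplicity): the logging policy
# pi_0 chose actions with known propensities, and rewards were observed.
n, n_actions = 10_000, 3
logging_probs = np.array([0.5, 0.3, 0.2])      # logging policy pi_0(a)
actions = rng.choice(n_actions, size=n, p=logging_probs)
true_means = np.array([0.1, 0.5, 0.9])         # hidden expected reward per action
rewards = rng.binomial(1, true_means[actions])

# A new target policy pi we want to evaluate without an online experiment.
target_probs = np.array([0.1, 0.1, 0.8])

# IPS: reweight each logged reward by pi(a) / pi_0(a).
# Unbiased when the logged propensities are correct, but the importance
# weights can inflate variance -- the tradeoff that motivates doubly
# robust and related estimators.
weights = target_probs[actions] / logging_probs[actions]
ips_estimate = float(np.mean(weights * rewards))

true_value = float(np.dot(target_probs, true_means))  # ground truth = 0.78
print(f"IPS estimate: {ips_estimate:.3f}, true value: {true_value:.3f}")
```

With enough logged samples the IPS estimate concentrates around the target policy's true value, even though that policy was never deployed.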