Closing the Loop Between Vision and Language